Posts

A Brief Introduction into Model Quantization

Llama. Lots of popular packages and models have "llama" in the name (I'll be using a model and package that have llama in their name!) (Flickr user Katherine).

Introduction

Have you ever wondered how people get 20B-parameter models working on their laptops? They quantize the model. Quantization lowers the precision of the model weights (for example, from 32-bit floating point to 16-bit floating point). Although you lose some precision, researchers have found that this loss is minimal; accuracy still rivals that of the pre-quantized model, and you get the added bonus of smaller storage. Researchers are even pushing the limit to 1-bit (Wang et al.) and 1.58-bit (Ma et al.) LLMs.

Different Quantizations

Terminology

Q1, Q2, Q3, etc. The number next to the Q refers to how many bits the model will be quantized to. For example, Q6 means 6-bit, Q8 is 8-bit, and Q1 is 1-bit.

_K, _0, _1 This tells us how the model rounds. Type-0 ...
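As a minimal sketch of what quantization does, here is symmetric 8-bit quantization in NumPy; the weight values are made up for illustration, and real quantizers (like the Q-types above) use block-wise scales and cleverer rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=1000).astype(np.float32)  # pretend fp32 weights

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto integers [-127, 127]
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)   # stored as 1 byte per weight
dequant = q.astype(np.float32) * scale          # reconstructed for compute

# Round-trip error is bounded by half a quantization step
print("max abs error:", np.abs(weights - dequant).max())
print("storage per weight: 4 bytes -> 1 byte")
```

The same idea extends down to Q6, Q4, and below: fewer bits means a coarser grid, larger rounding error, and smaller files.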

What is a Multimodal LLM?

Introduction

Multimodality is the ability of a model to process several different types of inputs. For example, ChatGPT can accept text input as well as image input as one combined input. You could also have a model take in .pdf, .md, and plain text as inputs (yes, these count as different modalities). This lets users present an idea in its rawest form, such as an image, instead of describing it; with a description, you could hit hurdles both in explaining the idea properly and in making sure the model interpreted the description correctly.

How do Multimodal LLMs work?

Typically, these modalities are "fused" together. Here's an architecture diagram (image from NVIDIA). In this case, we have an encoder for each specific modality (e.g. video encoder, image encoder, audio encoder). This encodes the inputs into embeddings the LLM can understand and use. After encoding each modality, we fuse the embeddings together. Then, the output is projected and ...
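As a rough illustration of the fusion step, here is a hedged NumPy sketch: the "encoders" are stand-ins (random projections, not real models), and the shared embedding size of 64 is an assumption, but the shape of the idea — encode each modality, then concatenate into one sequence — is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders: each maps raw input to the shared embedding size
def encode_text(tokens):
    return rng.normal(size=(len(tokens), 64))

def encode_image(patches):
    return rng.normal(size=(len(patches), 64))

text_emb = encode_text(["a", "photo", "of", "a", "llama"])  # 5 text tokens
image_emb = encode_image(range(16))                          # 16 image patches

# "Fusion" by concatenation along the sequence axis, so the LLM attends to both
fused = np.concatenate([text_emb, image_emb], axis=0)
print(fused.shape)  # (21, 64): one combined token sequence
```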

AI and The Arts

Image created by FLUX.1 Kontext by Black Forest Labs.

Introduction

As AI models grow larger and larger, companies have to find new ways to get more training data. The more diverse, rich, human-made content these companies can find, the better their models get at imitating us. What do these companies do with these datasets? They feed them to huge multimodal models (models that can support several different types of inputs) to make art, music, text, and much more.

But what does this mean for the creator? Sometimes, companies will copyright-strike music from artists even when it is in the public domain, simply because the company owns a recording of it. In that case, can artists and musicians copyright-strike AI for stealing their art?

How does AI Art work, and how do you detect it?

In essence, a model takes a text input, creates a noisy image, and then tries to denoise it into a final picture. This process is called diffusion, the technique behind models like Stable Diffusion. Take FLUX.1 Kontext for example...

Explaining Active Speaker Detection (ASD)

Example of Active Speaker Detection (ground-truth labeled image). Credit: "LoCoNet: Long-Short Context Network for Active Speaker Detection" (Link).

Introduction

Imagine a system that can model how people work and interact. One interaction to model is detecting who spoke in a given scene/frame. This way, models not only learn how people interact (e.g. two people talking, one waiting for the other, or both talking over each other), but can also be used for human-model interaction, speaker diarization (segmenting audio and identifying who spoke when), and much more.

Model Architectures

In this section, we will discuss two models.

TalkNet (Tao et al. 2021)

TalkNet's goal is to capture long-term context features, as previous models focused only on short-term context features and small temporal segments. Its solution uses two cross-attention mechanisms, swapping the queries between the audio and visual features. Then, you concatenate ...
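The swapped-query cross-attention described above can be sketched roughly as follows. This is a simplification under assumed shapes (25 frames, 128-dim features); the real TalkNet adds temporal encoders, multi-head attention, and a classifier on top:

```python
import numpy as np

def cross_attn(q, kv):
    # Scaled dot-product attention: queries from one modality, keys/values from the other
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the other modality's frames
    return w @ kv

rng = np.random.default_rng(0)
audio = rng.normal(size=(25, 128))    # 25 audio frames (hypothetical features)
visual = rng.normal(size=(25, 128))   # 25 video frames (hypothetical features)

a2v = cross_attn(audio, visual)   # audio queries attend to visual features
v2a = cross_attn(visual, audio)   # visual queries attend to audio features (queries swapped)
fused = np.concatenate([a2v, v2a], axis=-1)   # concatenated per-frame features
print(fused.shape)  # (25, 256)
```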

A Mathematical Explanation of Gradient Descent

Example of a close-to-optimal gradient descent for a quadratic function. Credit: Dylan.

Introduction

Although I recently wrote an article explaining gradient descent, I feel a mathematical explanation would help you understand not only how it works, but also how the update functions are derived.

Explanation

First, we need a way to calculate how far off the model is. For our example, we will use the L2 loss function. Let N represent the total number of data points, i the index of the data point we are on, yᵢ the expected output, and ŷᵢ the output predicted by the model.

Let's look more closely at ŷ. Since ŷ is the predicted value, it is the function whose parameters need to change. Our goal is to change these parameters enough to minimize the loss. For our example, let's set ŷ to be a linear function. The values a and b are our parameters; these are the values we will have to change within f(x) to chang...
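To make the setup above concrete, here is a minimal sketch of gradient descent on the mean L2 loss for a linear model ŷ = a·x + b. The toy data and the true parameters (a = 2, b = 1) are assumptions for illustration; the gradients are the analytic derivatives of the loss with respect to a and b:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (assumed for illustration), with small noise
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + np.random.default_rng(0).normal(0, 0.05, size=50)

a, b, lr = 0.0, 0.0, 0.5   # random-ish start and a learning rate
for _ in range(500):
    y_hat = a * x + b                              # model prediction ŷ
    # Gradients of the mean L2 loss (1/N) Σ (yᵢ - ŷᵢ)²
    grad_a = (-2 / len(x)) * np.sum((y - y_hat) * x)
    grad_b = (-2 / len(x)) * np.sum(y - y_hat)
    a -= lr * grad_a                               # step against the gradient
    b -= lr * grad_b

print(a, b)  # should land near a = 2, b = 1
```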

Top 3 Breakthroughs in Computer Vision in 2024

Introduction

Over the past few years, AI has become popular; however, no year matches the pace of 2024. It was an extremely crucial year for computer vision, with new models and breakthroughs occurring rapidly. Today, I'll be talking about the 3 biggest advancements in the computer vision field during 2024.

3. Vision Language Modeling

Simply put, Vision Language Models, or VLMs for short, are Large Language Models (LLMs) that take an image and/or text as input. Think of ChatGPT, where you upload an image and then type out a question you have about the image; the model can understand both and output an answer. These are also what you would call "multimodal models," meaning models that can understand multiple different types of sources/inputs.

Although Vision Language Models have existed for a while, they have recently become better and more accurate. However, this doesn't mean it's all sunshine and rainbows: there are still several barri...

Intro to Gradient Descent

Credit: Medium

Introduction

In a neural network, there are several possibilities for what the output could be. For example, let's say we are training a neural network to guess a person's tone, going from 0 (chill) through 0.5 (decent mood) to 1.0 (angry). We have to train the network to guess that tone somehow! So, we assign random values to each neuron, because we don't yet know what function accurately guesses somebody's tone. This means that in the final/output layer, our probability distribution could look essentially random. This is a problem, since the network is just guessing and hasn't been able to come to a conclusion. So, we need something to train the neural network and tell it how far off it is. That's where gradient descent comes in.

How Does it Work?

First, we find the cost function, which is a fancy term for the squared difference between the output value and the expected output value, summed over all output neurons. For our example, it would look like f(x) = (0.5-1.0)^2 + ...
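The cost in the excerpt can be computed directly. The first term (0.5 - 1.0)² comes from the example above; the remaining output and expected values here are hypothetical fill-ins:

```python
# Squared-error cost over an output layer: sum of (output - expected)² per neuron
outputs = [0.5, 0.3, 0.2]    # network's (random-ish) outputs; last two are hypothetical
expected = [1.0, 0.0, 0.0]   # expected values for each output neuron

cost = sum((o - e) ** 2 for o, e in zip(outputs, expected))
print(round(cost, 2))  # (0.5-1.0)² + (0.3-0.0)² + (0.2-0.0)² = 0.38
```

Gradient descent then nudges the network's parameters in whichever direction shrinks this number.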