Posts

A Brief Introduction into Model Quantization

Llama. Lots of popular packages and models have "llama" in the name (I'll be using a model and package that have llama in their name!) (Flickr user Katherine).

Introduction

Have you ever wondered how people get 20B-parameter models working on their laptops? They quantize the model. Quantization lowers the precision of the model weights (for example, from 32-bit floating point to 16-bit floating point). Although you lose some precision, researchers have found that this loss is minimal; accuracy still rivals that of the pre-quantized model, and you get the added bonus of smaller storage. Researchers are even pushing the limit to 1-bit (Wang et al.) and 1.58-bit (Ma et al.) LLMs.

Different Quantizations

Terminology

Q1, Q2, Q3, etc. The number next to the Q refers to how many bits the model will be quantized to. For example, Q6 means 6-bit, Q8 is 8-bit, and Q1 is 1-bit.

_K, _0, _1 This tells us how the model rounds. Type-0 ...
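As a minimal sketch of what quantization does, here is symmetric 8-bit quantization in NumPy; the weight values are made up for illustration, and real quantizers (like the Q-types above) use block-wise scales and cleverer rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=1000).astype(np.float32)  # pretend fp32 weights

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto integers [-127, 127]
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)   # stored as 1 byte per weight
dequant = q.astype(np.float32) * scale          # reconstructed for compute

# Round-trip error is bounded by half a quantization step
print("max abs error:", np.abs(weights - dequant).max())
print("storage per weight: 4 bytes -> 1 byte")
```

The same idea extends down to Q6, Q4, and below: fewer bits means a coarser grid, larger rounding error, and smaller files.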

What is a Multimodal LLM?

Introduction

Multimodality is the ability of a model to process several different types of inputs. For example, ChatGPT can accept text input as well as image input as one combined input. You could also have a model take in .pdf, .md, and plain text as inputs (yes, these count as different modalities). This lets users present an idea in its rawest form, such as an image, instead of describing it; with a description, you could hit hurdles both in explaining the idea properly and in making sure the model interpreted the description correctly.

How do Multimodal LLMs work?

Typically, these modalities are "fused" together. Here's an architecture diagram (image from NVIDIA). In this case, we have an encoder for each specific modality (e.g. video encoder, image encoder, audio encoder). This encodes the inputs into embeddings the LLM can understand and use. After encoding each modality, we fuse the embeddings together. Then, the output is projected and ...
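As a rough illustration of the fusion step, here is a hedged NumPy sketch: the "encoders" are stand-ins (random projections, not real models), and the shared embedding size of 64 is an assumption, but the shape of the idea — encode each modality, then concatenate into one sequence — is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders: each maps raw input to the shared embedding size
def encode_text(tokens):
    return rng.normal(size=(len(tokens), 64))

def encode_image(patches):
    return rng.normal(size=(len(patches), 64))

text_emb = encode_text(["a", "photo", "of", "a", "llama"])  # 5 text tokens
image_emb = encode_image(range(16))                          # 16 image patches

# "Fusion" by concatenation along the sequence axis, so the LLM attends to both
fused = np.concatenate([text_emb, image_emb], axis=0)
print(fused.shape)  # (21, 64): one combined token sequence
```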

AI and The Arts

Image created by FLUX.1 Kontext by Black Forest Labs.

Introduction

As AI models grow larger and larger, companies have to find new ways to get more training data. The more diverse, rich, human-made content these companies can find, the better their models get at imitating us. What do these companies do with these datasets? They feed them to huge multimodal models (models that can support several different types of inputs) to make art, music, text, and much more.

But what does this mean for the creator? Sometimes, companies will copyright-strike music from artists even when it is in the public domain, simply because the company owns a recording of it. In that case, can artists and musicians copyright-strike AI for stealing their art?

How does AI Art work, and how do you detect it?

In essence, a model takes a text input, creates a noisy image, and then tries to denoise it into a final picture. This process is called diffusion, the technique behind models like Stable Diffusion. Take FLUX.1 Kontext for example...

Explaining Active Speaker Detection (ASD)

Example of Active Speaker Detection (ground-truth labeled image). Credit: "LoCoNet: Long-Short Context Network for Active Speaker Detection" (Link).

Introduction

Imagine a system that can model how people work and interact. One interaction to model is detecting who spoke in a given scene/frame. This way, models not only learn how people interact (e.g. two people talking, one waiting for the other, or both talking over each other), but can also be used for human-model interaction, speaker diarization (segmenting audio and identifying who spoke when), and much more.

Model Architectures

In this section, we will discuss two models.

TalkNet (Tao et al. 2021)

TalkNet's goal is to capture long-term context features, as previous models focused only on short-term context features and small temporal segments. Its solution uses two cross-attention mechanisms, swapping the queries between the audio and visual features. Then, you concatenate ...
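The swapped-query cross-attention described above can be sketched roughly as follows. This is a simplification under assumed shapes (25 frames, 128-dim features); the real TalkNet adds temporal encoders, multi-head attention, and a classifier on top:

```python
import numpy as np

def cross_attn(q, kv):
    # Scaled dot-product attention: queries from one modality, keys/values from the other
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the other modality's frames
    return w @ kv

rng = np.random.default_rng(0)
audio = rng.normal(size=(25, 128))    # 25 audio frames (hypothetical features)
visual = rng.normal(size=(25, 128))   # 25 video frames (hypothetical features)

a2v = cross_attn(audio, visual)   # audio queries attend to visual features
v2a = cross_attn(visual, audio)   # visual queries attend to audio features (queries swapped)
fused = np.concatenate([a2v, v2a], axis=-1)   # concatenated per-frame features
print(fused.shape)  # (25, 256)
```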

A Mathematical Explanation of Gradient Descent

Example of a close-to-optimal gradient descent for a quadratic function. Credit: Dylan.

Introduction

Although I recently wrote an article explaining gradient descent, I feel a mathematical explanation would help you understand not only how it works, but also how the update functions are derived.

Explanation

First, we need a way to calculate how far off the model is. For our example, we will use the L2 loss function. Let N represent the total number of data points, i the index of the data point we are on, yᵢ the expected output, and ŷᵢ the output predicted by the model.

Let's look more closely at ŷ. Since ŷ is the predicted value, it is the function whose parameters need to change. Our goal is to change these parameters enough to minimize the loss. For our example, let's set ŷ to be a linear function. The values a and b are our parameters; these are the values we will have to change within f(x) to chang...
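To make the setup above concrete, here is a minimal sketch of gradient descent on the mean L2 loss for a linear model ŷ = a·x + b. The toy data and the true parameters (a = 2, b = 1) are assumptions for illustration; the gradients are the analytic derivatives of the loss with respect to a and b:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (assumed for illustration), with small noise
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + np.random.default_rng(0).normal(0, 0.05, size=50)

a, b, lr = 0.0, 0.0, 0.5   # random-ish start and a learning rate
for _ in range(500):
    y_hat = a * x + b                              # model prediction ŷ
    # Gradients of the mean L2 loss (1/N) Σ (yᵢ - ŷᵢ)²
    grad_a = (-2 / len(x)) * np.sum((y - y_hat) * x)
    grad_b = (-2 / len(x)) * np.sum(y - y_hat)
    a -= lr * grad_a                               # step against the gradient
    b -= lr * grad_b

print(a, b)  # should land near a = 2, b = 1
```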

Top 3 Breakthroughs in Computer Vision in 2024

Introduction

Over the past few years, AI has become popular; however, no year matches the pace of 2024. It was an extremely crucial year for computer vision, with new models and breakthroughs occurring rapidly. Today, I'll be talking about the 3 biggest advancements in the computer vision field during 2024.

3. Vision Language Modeling

Simply put, Vision Language Models, or VLMs for short, are Large Language Models (LLMs) that take an image and/or text as input. Think of ChatGPT, where you upload an image and then type out a question you have about the image; the model can understand both and output an answer. These are also what you would call "multimodal models," meaning models that can understand multiple different types of sources/inputs.

Although Vision Language Models have existed for a while, they have recently become better and more accurate. However, this doesn't mean it's all sunshine and rainbows: there are still several barri...

Intro to Gradient Descent

Credit: Medium

Introduction

In a neural network, there are several possibilities for what the output could be. For example, let's say we are training a neural network to guess a person's tone, going from 0 (chill) through 0.5 (decent mood) to 1.0 (angry). We have to train the network to guess that tone somehow! So, we assign random values to each neuron, because we don't yet know what function accurately guesses somebody's tone. This means that in the final/output layer, our probability distribution could look essentially random. This is a problem, since the network is just guessing and hasn't been able to come to a conclusion. So, we need something to train the neural network and tell it how far off it is. That's where gradient descent comes in.

How Does it Work?

First, we find the cost function, which is a fancy term for the squared difference between the output value and the expected output value, summed over all output neurons. For our example, it would look like f(x) = (0.5-1.0)^2 + ...
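The cost in the excerpt can be computed directly. The first term (0.5 - 1.0)² comes from the example above; the remaining output and expected values here are hypothetical fill-ins:

```python
# Squared-error cost over an output layer: sum of (output - expected)² per neuron
outputs = [0.5, 0.3, 0.2]    # network's (random-ish) outputs; last two are hypothetical
expected = [1.0, 0.0, 0.0]   # expected values for each output neuron

cost = sum((o - e) ** 2 for o, e in zip(outputs, expected))
print(round(cost, 2))  # (0.5-1.0)² + (0.3-0.0)² + (0.2-0.0)² = 0.38
```

Gradient descent then nudges the network's parameters in whichever direction shrinks this number.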