Llama. Lots of popular packages and models have "llama" in the name (I'll be using a model and package that have llama in their name!) (Flickr user Katherine).

Introduction

Have you ever wondered how people get 20B parameter models working on their laptops? Well, they have to quantize the model. Quantization is when you lower the precision of the model weights (for example, from 32-bit floating point to 16-bit floating point, or further down to a handful of bits). Although you lose some precision, researchers have found that the loss in output quality is often small, and a quantized model's accuracy can still rival the pre-quantized model. There is also the added bonus of needing less storage.

Researchers are even pushing the limit to 1-bit (Wang et al.) and 1.58-bit (Ma et al.) LLMs.
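To get a feel for why fewer bits matter, here is a rough back-of-the-envelope estimate: storage is approximately the number of parameters times the bits per weight, divided by 8 to get bytes. This is only a sketch. The function name `model_size_gb` is my own, and it ignores file metadata and the extra scale values that real quantized files store.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage in GB (1 GB = 1e9 bytes), ignoring any file overhead."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 20e9  # the 20B-parameter model from the introduction

for bits in (32, 16, 8, 4, 1.58, 1):
    print(f"{bits:>5} bits per weight -> {model_size_gb(n_params, bits):6.2f} GB")
```

The odd-looking 1.58 comes from ternary weights: if each weight can only be -1, 0, or 1, it carries log2(3) ≈ 1.58 bits of information.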

Different Quantizations

Terminology

Q1, Q2, Q3, etc.

The number next to the Q refers to how many bits the model will be quantized to. For example, Q6 means 6 bit, Q8 is 8 bit, and Q1 is 1 bit.
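Another way to read the number: with n bits, each quantized weight can only take one of 2^n distinct values. A tiny sketch (the helper name `levels` is mine, purely for illustration):

```python
def levels(bits: int) -> int:
    """Number of distinct values an integer with `bits` bits can represent."""
    return 2 ** bits


for name, bits in [("Q1", 1), ("Q4", 4), ("Q6", 6), ("Q8", 8)]:
    print(f"{name}: {bits} bits -> {levels(bits)} possible values per weight")
```

So every bit you drop halves the number of values each weight can be rounded to.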

_K, _0, _1

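This suffix tells us how the quantized values are mapped back to the original weights, which in turn decides how the weights get rounded.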

Type-0 ("_0") uses the following formula,

w = d * q,

where q is a quantized value and d is the scaling value to transform q into the desired weight (w).
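Here is a minimal sketch of the Type-0 idea in plain Python. It is a simplification under my own assumptions, not llama.cpp's actual code: I pick d so that the largest weight in the block maps to the largest signed integer, and the function names are made up for this example.

```python
def quantize_type0(block, bits=4):
    """Return (d, q): one scale d and signed integers q such that w ≈ d * q."""
    q_max = 2 ** (bits - 1) - 1               # largest signed value, e.g. 7 for 4 bits
    d = max(abs(w) for w in block) / q_max    # scale: the largest weight maps to ±q_max
    if d == 0:                                # block of all zeros
        return 0.0, [0] * len(block)
    q = [round(w / d) for w in block]         # nearest integer for every weight
    return d, q


def dequantize_type0(d, q):
    """Rebuild approximate weights with w = d * q."""
    return [d * qi for qi in q]


block = [0.12, -0.5, 0.33, 0.7, -0.07]
d, q = quantize_type0(block)
restored = dequantize_type0(d, q)

print("scale d  :", d)
print("integers :", q)
print("restored :", [round(w, 3) for w in restored])
```

Only the small integers in q and the single scale d need to be stored for the block. Each restored weight is off by at most half of d.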

Type-1 ("_1") uses the following formula,

w = d * q + m,

where m is the minimum value in the block (the group of weight values that get quantized together).
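A matching sketch for Type-1, again simplified and with hypothetical function names. Here I assume q is an unsigned integer and that d spreads the available values evenly between the block's minimum and maximum.

```python
def quantize_type1(block, bits=4):
    """Return (d, m, q): scale d, block minimum m, unsigned integers q with w ≈ d * q + m."""
    steps = 2 ** bits - 1                  # highest unsigned value, e.g. 15 for 4 bits
    m = min(block)                         # offset: the smallest weight in the block
    d = (max(block) - m) / steps           # scale: spread the steps across the block's range
    if d == 0:                             # every weight is identical
        return 0.0, m, [0] * len(block)
    q = [round((w - m) / d) for w in block]
    return d, m, q


def dequantize_type1(d, m, q):
    """Rebuild approximate weights with w = d * q + m."""
    return [d * qi + m for qi in q]


block = [0.2, 0.5, 0.8, 1.1, 0.38]
d, m, q = quantize_type1(block)
restored = dequantize_type1(d, m, q)

print("scale d  :", d)
print("minimum m:", m)
print("integers :", q)
print("restored :", [round(w, 3) for w in restored])
```

The extra m costs a little storage per block, but it helps when the weights in a block are not centered around zero, because none of the integer values are wasted on a range the block never uses.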

"K" stands for a different way of distributing the bits. The model weights are divided into blocks, and each block is divided again into sub-blocks, which are then normalized separately (Shatokhin).
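As far as I understand it, the idea is a two-level scaling: every sub-block gets its own scale, and those scales are themselves stored as small integers relative to one scale for the whole block. The toy sketch below only illustrates that idea. The sizes, bit widths, and function names are my own choices and not llama.cpp's real layout.

```python
def quantize_super_block(weights, sub_block_size=4, weight_bits=4, scale_bits=6):
    """Toy two-level quantization: per-sub-block scales, themselves stored as integers."""
    w_max = 2 ** (weight_bits - 1) - 1     # largest signed weight value, e.g. 7
    s_max = 2 ** scale_bits - 1            # largest integer scale, e.g. 63
    subs = [weights[i:i + sub_block_size]
            for i in range(0, len(weights), sub_block_size)]
    # The floating-point scale each sub-block would ideally like to use.
    ideal = [max(abs(w) for w in sb) / w_max for sb in subs]
    # One float scale for the whole block; sub-block scales become small integers.
    super_d = max(ideal) / s_max
    if super_d == 0:
        return 0.0, [0] * len(subs), [[0] * len(sb) for sb in subs]
    q_scales = [max(1, round(s / super_d)) for s in ideal]
    q_weights = []
    for sb, qs in zip(subs, q_scales):
        d = super_d * qs                   # the scale this sub-block actually uses
        q_weights.append([max(-w_max, min(w_max, round(w / d))) for w in sb])
    return super_d, q_scales, q_weights


def dequantize_super_block(super_d, q_scales, q_weights):
    """Rebuild approximate weights from the two levels of scales."""
    out = []
    for qs, qw in zip(q_scales, q_weights):
        out.extend(super_d * qs * q for q in qw)
    return out


weights = [0.02, -0.01, 0.03, 0.015, 0.9, -0.4, 0.7, 0.2]
super_d, q_scales, q_weights = quantize_super_block(weights)
restored = dequantize_super_block(super_d, q_scales, q_weights)

print("integer scales :", q_scales)
print("integer weights:", q_weights)
print("restored       :", [round(w, 3) for w in restored])
```

In this example the first four weights are much smaller than the last four. With a single scale for all eight, the small ones would all round to zero. With a scale per sub-block, they survive.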

Trying it out

I used llama.cpp to quantize my models. For this post, I will be working with Llama 3.1 8B Instruct. Below is the prompt:

> write me a paragraph talking about the copenhagen interpretation on Quantum Mechanics.

BF16

This is the normal model, without any changes. It uses 16 bits per weight and takes around 16.1GB of storage. Here is its answer,

> The Copenhagen interpretation, introduced by Niels Bohr and Werner Heisenberg in the 1920s, is one of the earliest and most widely accepted interpretations of Quantum Mechanics. At its core, the Copenhagen interpretation suggests that particles exist in a state of superposition, meaning they can exist in multiple states simultaneously, until they are observed or measured. Upon measurement, the act of observation itself causes the superposition to collapse, and the particle "chooses" one specific state to exist in. This interpretation implies that the act of observation is not just a passive recording of a particle's state, but an active participation in the creation of reality, which has led to discussions about the role of the observer in the Quantum world. Despite its simplicity, the Copenhagen interpretation remains a fundamental and influential concept in Quantum Mechanics, shaping our understanding of the strange and counterintuitive nature of the physical world.

It was able to take in the prompt at a speed of 29.8 tokens per second (t/s), while generating its answer at a speed of 5.2 t/s.

Q4_K_M

This model uses 4 bits and the "K" scheme described earlier, in its Medium ("_M") variant. It takes only around 30% of the original model's size, at around 4.92GB of storage. Here is its answer,

> The Copenhagen interpretation is one of the earliest and most widely accepted interpretations of Quantum Mechanics, proposed by Niels Bohr and Werner Heisenberg in the 1920s. At its core, the interpretation posits that a quantum system's properties exist in a state of superposition, with multiple possible outcomes simultaneously existing until a measurement is made. Upon measurement, the act of observation itself causes the system to "collapse" into one of the possible outcomes, effectively choosing a specific state from the superposition. This fundamental aspect of the Copenhagen interpretation is often referred to as "wave function collapse," which challenges the classical notion of determinism and instead introduces an element of randomness and uncertainty into the world of quantum physics.

It was able to take in the prompt at an insane speed of 1298.8 t/s, while generating its answer at a speed of 61.1 t/s. This answer went a bit more in depth into wave function collapse. The BF16 model gave a more general answer: it talks about "active participation" and ends with a sentence summarizing the importance of the Copenhagen interpretation in Quantum Mechanics.

Personally, I prefer the Q4_K_M answer because it feels like it gets to the point a lot quicker than the BF16 version. Although I asked a general question, I got an in-depth answer. However, it might be concerning that the Q4_K_M model immediately pigeonholes itself into wave function collapse rather than sticking to the broader prompt. Still, the model performs well for its size.

Q3_K_L

This model uses 3 bits and the same "K" scheme, in its Large ("_L") variant. It takes about the same storage space as Q4_K_M, at around 4.31GB. Here is its answer,

> The Copenhagen interpretation, formulated by Niels Bohr and Werner Heisenberg, is a widely-held interpretation of Quantum Mechanics. It posits that a quantum system remains in a superposition of states until it is observed or measured. The act of observation itself causes the system to collapse into one definite state, known as wave function collapse. This interpretation suggests that the act of measurement is not just a passive observation, but an active process that influences the outcome. According to the Copenhagen interpretation, the observer plays a crucial role in determining the state of the system, which has led to criticisms and debates about the role of the observer in the measurement process. The Copenhagen interpretation remains one of the most influential and widely accepted interpretations of Quantum Mechanics, but its subjectivity and lack of a clear, empirical basis have led to alternative interpretations and ongoing discussions within the physics community.

It was able to take in the prompt at a speed of 830.9 t/s, while generating its answer at a speed of 48.4 t/s.

This answer is a little more vague than both the Q4_K_M and BF16 answers. In addition, a few details are missing (the 1920s date and an explanation of superposition). However, it does mention the interpretation's subjectivity and lack of a clear empirical basis, which neither of the previous model versions brought up. Unfortunately, I don't know much about Quantum Mechanics, so I won't make a judgement on the accuracy or correctness of these responses.
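To put the three runs side by side, here is a small sketch that uses only the sizes and speeds reported above. The "effective bits per weight" is my own rough estimate: it assumes the BF16 file is exactly 16 bits per weight and that file size scales with bits per weight.

```python
BF16_SIZE_GB = 16.1
BF16_BITS = 16

# name: (file size in GB, prompt speed in t/s, generation speed in t/s)
runs = {
    "BF16":   (16.1,   29.8,  5.2),
    "Q4_K_M": (4.92, 1298.8, 61.1),
    "Q3_K_L": (4.31,  830.9, 48.4),
}


def effective_bits(size_gb: float) -> float:
    """Rough bits per weight, scaled from the BF16 file size."""
    return size_gb / BF16_SIZE_GB * BF16_BITS


base_gen_speed = runs["BF16"][2]
for name, (size_gb, prompt_tps, gen_tps) in runs.items():
    print(f"{name:7} {size_gb:5.2f} GB  "
          f"~{effective_bits(size_gb):4.1f} bits/weight  "
          f"{size_gb / BF16_SIZE_GB:4.0%} of BF16 size  "
          f"{gen_tps / base_gen_speed:4.1f}x generation speed")
```

By this estimate both quantized files land above their nominal 4 and 3 bits per weight. One plausible reason is the extra scale data stored next to the weights, which would also fit with Q3_K_L being only slightly smaller than Q4_K_M.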

A Beginner's look into Artificial Intelligence

A Brief Introduction into Model Quantization
