A Brief Introduction to Model Quantization
Llama. Lots of popular packages and models have "llama" in the name (I'll be using a model and package that have llama in their name!) (Flickr user Katherine).

Introduction

Have you ever wondered how people get 20B-parameter models running on their laptops? The answer is quantization. Quantization reduces the numerical precision of a model's weights, for example from 32-bit floating point down to 16-bit floating point or even lower-bit formats. Although some precision is lost, researchers have found that this loss is minimal: accuracy still rivals that of the pre-quantized model. There is also the added bonus of much smaller storage. Researchers are even pushing the limit to 1-bit (Wang et al.) and 1.58-bit (Ma et al.) LLMs.

Different Quantizations

Terminology

Q1, Q2, Q3, etc.

The number after the Q refers to how many bits the model will be quantized to. For example, Q6 means 6-bit, Q8 means 8-bit, and Q1 means 1-bit.

_K, _0, _1

This suffix tells us how the model rounds its weights. Type-0 ...
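To get a rough sense of what those bit widths buy you, here is a small back-of-the-envelope sketch of weight storage for a 20B-parameter model at different quantization levels. The function name `weight_size_gb` is my own; the numbers are illustrative arithmetic only, since real quantized files (e.g. GGUF) also store metadata and per-block scale factors, making them somewhat larger.

```python
# Rough storage estimate for a model's weights at various bit widths.
# Pure arithmetic sketch: params * bits / 8 bytes, reported in GB (10^9 bytes).

def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weight tensor in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n = 20e9  # the 20B-parameter model from the introduction
for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q1", 1)]:
    print(f"{label}: ~{weight_size_gb(n, bits):.1f} GB")
# FP32: ~80.0 GB, FP16: ~40.0 GB, Q8: ~20.0 GB, Q4: ~10.0 GB, Q1: ~2.5 GB
```

This is why quantization matters for laptops: going from FP32 to Q4 cuts the weights by 8x, bringing a 20B model from 80 GB down to roughly 10 GB.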