Quantization Fundamentals
These notes are based entirely on the Quantization Fundamentals course.
For the code, you can check out this link.
Quantization helps reduce the size of a model with little or no degradation in quality.
Handling Big Models
Current model compression techniques:
- Pruning: remove connections that do not improve the model.
- Knowledge Distillation: train a smaller model (student) using the original model (teacher). Cons: you need enough hardware to fit both the teacher and the student.
Options to quantize:
- You can quantize the weights.
- You can quantize the activations that propagate through the layers of the neural network.
Idea: store the parameters of the model in lower precision (see the sketch below).
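A rough sketch of why this helps, using only tensor sizes (the shapes are illustrative, and real quantization also needs the scale/zero-point parameters covered later):

```python
import torch

# The same layer's weights stored in FP32 vs. INT8 (illustrative shapes).
w_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
w_int8 = torch.randint(-128, 128, (1024, 1024), dtype=torch.int8)

print(w_fp32.element_size() * w_fp32.nelement() / 1e6, "MB")  # ~4.19 MB
print(w_int8.element_size() * w_int8.nelement() / 1e6, "MB")  # ~1.05 MB, 4x smaller
```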
Data Types and Sizes
Integer
Unsigned Integer (8-bit): range is [0, 255], i.e. [0, 2^n - 1] (all 8 bits are used to represent the number)
Signed Integer (8-bit): range is [-128, 127], i.e. [-2^(n-1), 2^(n-1) - 1] (7 bits are used to represent the number and the 8th bit represents the sign; 0: positive, 1: negative)
| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 8-bit signed integer | torch.int8 | |
| 8-bit unsigned integer | torch.uint8 | |
| 16-bit signed integer | torch.int16 | torch.short |
| 32-bit signed integer | torch.int32 | torch.int |
| 64-bit signed integer | torch.int64 | torch.long |
- You can use the code below to get more information about each integer type.
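`torch.iinfo` reports the numeric limits of an integer dtype:

```python
import torch

# 8-bit signed integers: min = -128, max = 127.
print(torch.iinfo(torch.int8))
# 8-bit unsigned integers: min = 0, max = 255.
print(torch.iinfo(torch.uint8))
```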
Floating Point
3 components of a floating-point number:
- Sign: positive/negative (always 1 bit)
- Exponent (range): affects the representable range of the number
- Fraction (precision): affects the precision of the number
- FP32, BF16, FP16, and FP8 are floating-point formats, each with a specific number of bits for the exponent and the fraction.
- FP32
- Sign: 1 bit
- Exponent(range): 8 bit
- Fraction(precision): 23 bit
- Total: 32 bit
- BF16
- Sign: 1 bit
- Exponent(range): 8 bit
- Fraction(precision): 7 bit
- Total: 16 bit
- FP16
- Sign: 1 bit
- Exponent(range): 5 bit
- Fraction(precision): 10 bit
- Total: 16 bit
Comparison Of Data Types
| Data Type | Precision | Approx. Maximum |
|---|---|---|
| FP32 | Best | ~10^38 |
| FP16 | Better | ~10^4 |
| BF16 | Good | ~10^38 |
| Data Type | torch.dtype | torch.dtype alias |
|---|---|---|
| 16-bit floating point | torch.float16 | torch.half |
| 16-bit brain floating point | torch.bfloat16 | |
| 32-bit floating point | torch.float32 | torch.float |
| 64-bit floating point | torch.float64 | torch.double |
```python
import torch

print("By default, python stores float data in fp64")
value = 1/3

tensor_fp64 = torch.tensor(value, dtype=torch.float64)
tensor_fp32 = torch.tensor(value, dtype=torch.float32)
tensor_fp16 = torch.tensor(value, dtype=torch.float16)
tensor_bf16 = torch.tensor(value, dtype=torch.bfloat16)

# Print with 60 decimal places to see how precision degrades as the fraction bits shrink.
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

# torch.finfo reports the numeric limits of a floating-point dtype.
print(torch.finfo(torch.bfloat16))
```

PyTorch Downcasting
When a value in a higher-precision data type is converted to a lower-precision data type, it results in a loss of data.
Advantages:
- Reduced memory footprint
- Increased compute and speed (Depends on the hardware)
Disadvantages:
- Less precise
Use case:
- Mixed precision training (see the sketch below)
  - Do the computation in lower precision (FP16/BF16/FP8)
  - Store and update the weights in higher precision (FP32)
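A toy sketch of mixed-precision training with `torch.autocast`; it assumes a CUDA GPU, and the tiny model, data, and optimizer are purely illustrative:

```python
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # weights stay in FP32
scaler = torch.cuda.amp.GradScaler()                      # loss scaling, needed for FP16

x, y = torch.randn(8, 16).cuda(), torch.randn(8, 4).cuda()
with torch.autocast(device_type="cuda", dtype=torch.float16):   # compute in FP16
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()   # scale the loss so small FP16 gradients don't underflow
scaler.step(optimizer)          # unscale and update the FP32 master weights
scaler.update()
```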
Loading Models by data type
- Pick a target dtype, e.g. `target_dtype = torch.float16` or `torch.bfloat16`, then cast the model with `model = model.to(target_dtype)`.
- Shorthands: `model = model.half()` for FP16, `model = model.bfloat16()` for BF16.
- Prefer bfloat16 over float16 when running PyTorch on CPU.
- FP32 is the default dtype in PyTorch.
- For Hugging Face models, `model.get_memory_footprint()/1e+6` gives the model size in MB.
- `torch.set_default_dtype(desired_dtype)` lets you load the model directly in the desired dtype instead of loading it in full precision and then downcasting it.
- Set the default dtype back to float32 afterwards to avoid unexpected behavior.
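A minimal sketch of loading a model directly in BF16; it assumes the Hugging Face `transformers` library, and the checkpoint name and printed sizes are illustrative:

```python
import torch
from transformers import AutoModel

# The default load is FP32; passing torch_dtype loads the weights directly in BF16.
model_fp32 = AutoModel.from_pretrained("bert-base-uncased")
model_bf16 = AutoModel.from_pretrained("bert-base-uncased", torch_dtype=torch.bfloat16)

print(model_fp32.get_memory_footprint() / 1e6)  # roughly 440 MB
print(model_bf16.get_memory_footprint() / 1e6)  # roughly half of that
```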
Quantization Theory
Quantization refers to the process of mapping a large set of values to a smaller set of values.
How do we convert the FP32 weights to INT8 without losing too much information?
- It is done using a linear mapping, defined by two parameters:
  - s = scale
  - z = zero point
How do we get back our original tensor from the quantized tensor?
- We cannot recover the original tensor exactly, but dequantization applies the same linear relationship that was used to quantize it, giving a close approximation.
Quantize Using Quanto Library
- `from quanto import quantize, freeze`
- `quantize(model, weights=desired_dtype, activations=desired_dtype)`
- `freeze(model)`
- `quantize` creates an intermediate state of the model
- after calling `freeze`, we get the quantized weights (see the sketch below)
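A minimal sketch of the Quanto workflow; the checkpoint name is illustrative, and the exact dtype arguments can differ between quanto versions (newer releases ship the same API as `optimum-quanto`):

```python
import torch
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")  # illustrative checkpoint

quantize(model, weights=torch.int8, activations=None)  # intermediate state: weights not yet replaced
freeze(model)                                          # weights are now actually stored in int8
```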
Uses of the Intermediate State
- Calibration
  - Calibrate the model when quantizing its activations.
  - The range of activation values depends on the input (e.g. different input text generates different activations).
  - The min/max of the activation ranges are used to perform linear quantization.
  - How to get the min and max range of the activations? (a plain-PyTorch sketch appears after this list)
    - Gather sample input data.
    - Run inference.
    - Calculate the min/max of the activations.
  - Result: better quantized activations.
- Quantization Aware Training (QAT)
  - Train in a way that controls how the model performs once it is quantized.
  - The intermediate state holds both the quantized and the unquantized weights.
  - Use the quantized version of the model in the forward pass (e.g. BF16).
  - Update the original, unquantized model weights during back-propagation (e.g. FP32). (See the toy sketch after this list.)
- In lesson L4 of the course there is a function in helper.py to calculate the model size.
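A plain-PyTorch sketch of min/max calibration (this is not the Quanto API; the tiny model and the use of forward hooks are illustrative):

```python
import torch

ranges = {}  # per-layer (min, max) of the observed activations

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        if name in ranges:
            lo, hi = min(lo, ranges[name][0]), max(hi, ranges[name][1])
        ranges[name] = (lo, hi)
    return hook

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_hook(make_hook(name))

for _ in range(8):              # gather sample input data
    model(torch.randn(4, 16))   # run inference
print(ranges)                   # min/max per layer, used to pick the scale and zero point
```

And a toy sketch of the QAT idea using a straight-through estimator (a simplification, not a full training recipe): the forward pass sees the (de)quantized weights, while the gradients update the FP32 copy.

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        q = torch.clamp(torch.round(w / scale), -128, 127)  # snap to the int8 grid
        return q * scale                                     # dequantized weights used in the forward pass
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                             # straight-through: gradient flows to the FP32 weights

w = torch.randn(8, 8, requires_grad=True)                    # original FP32 weights
scale = w.detach().abs().max() / 127
x = torch.randn(4, 8)

loss = (x @ FakeQuant.apply(w, scale).T).sum()
loss.backward()
print(w.grad.shape)                                          # the FP32 copy receives the update
```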
Linear Quantization
Even though it looks very simple, it is used in many SOTA quantization methods:
- AWQ: Activation-aware Weight Quantization
- GPTQ: GPT Quantized
- BNB: BitsandBytes Quantization
- Simple idea: linear mapping
- r = s(q - z)
- where
  - r: original value (e.g. in FP32)
  - s: scale (e.g. in FP32)
  - q: quantized value (e.g. in INT8)
  - z: zero point (e.g. in INT8)
- How do we get the scale and zero point? (see the worked sketch below)
  - s = (r_max - r_min) / (q_max - q_min)
  - z = int(round(q_min - r_min / s))
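A worked sketch of these formulas on a single tensor (illustrative code, not a library API):

```python
import torch

def linear_quantize(r, dtype=torch.int8):
    q_min, q_max = torch.iinfo(dtype).min, torch.iinfo(dtype).max   # -128, 127
    r_min, r_max = r.min().item(), r.max().item()
    s = (r_max - r_min) / (q_max - q_min)           # scale
    z = int(round(q_min - r_min / s))               # zero point
    q = torch.clamp(torch.round(r / s) + z, q_min, q_max).to(dtype)
    return q, s, z

r = torch.randn(4, 4)                               # original FP32 tensor
q, s, z = linear_quantize(r)
r_hat = s * (q.to(torch.float32) - z)               # dequantize: r ≈ s(q - z)
print((r - r_hat).abs().max())                      # small quantization error
```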
Quantization of LLMs
Recent SOTA quantization methods:
- LLM.INT8 (only 8-bit)
  - Decomposes the matrix multiplication into two parts: the outlier part in float16 and the non-outlier part in int8.
- QLoRA (only 4-bit)
  - Quantizes the base model and fine-tunes the adapters.
- AWQ
- GPTQ
- SmoothQuant
More recent SOTA methods for 2-bit quantization:
- QuIP#
- HQQ
- AQLM
All are open-source
Some quantization methods require calibration (as discussed above).
Some quantization methods require adjustments:
Many of these methods were applied to LLMs, but they can be applied to other types of models by making a few adjustments to the quantization method.
Some methods can be applied without making adjustments:
- Linear quantization
- LLM.INT8
- QLoRA
- HQQ
Other approaches are data-dependent.
There are contributors on Hugging Face (e.g. TheBloke) who publish quantized versions of popular models.
Check out the Hugging Face Open LLM Leaderboard to see how these quantized models perform.
Benefits of fine-tuning a quantized model:
- Recover the accuracy lost during quantization
- Tailor your model for specific use-cases and applications
Fine-tune with Quantization Aware Training (QAT)
- Not compatible with Post Training Quantization (PTQ) techniques.
- The linear quantization method is an example of PTQ.
- PEFT + QLoRA
  - QLoRA quantizes the pre-trained base weights to 4-bit precision.
  - For computation, those weights are dequantized to match the precision of the LoRA adapter weights (e.g. BF16).
  - This allows the model to add the activations of the pre-trained weights and the adapter weights.
  - The sum of the two activations is fed as the input to the next layer of the network. (A minimal sketch follows below.)
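A minimal sketch of PEFT + QLoRA, assuming the `transformers`, `bitsandbytes`, and `peft` libraries (it typically needs a CUDA GPU; the checkpoint name and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store the frozen base weights in 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantize to BF16 for the computation
)
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)

lora_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)    # adds trainable BF16 LoRA adapters on top
model.print_trainable_parameters()           # only the adapter weights are trainable
```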