# Quantization

FastDeploy supports quantized inference at multiple precisions, including FP8, INT8, INT4, and 2-bit. Weights, activations, and KVCache tensors can each use a different inference precision, covering scenarios such as low cost, low latency, and long context.

## 1. Precision Support List

| Quantization Method | Weight Precision | Activation Precision | KVCache Precision | Online/Offline | Supported Hardware |
|---------|---------|---------|------------|---------|---------|
| [WINT8](online_quantization.md#1-wint8--wint4) | INT8 | BF16 | BF16 | Online | GPU, XPU |
| [WINT4](online_quantization.md#1-wint8--wint4) | INT4 | BF16 | BF16 | Online | GPU, XPU |
| [Block-wise FP8](online_quantization.md#2-block-wise-fp8) | Block-wise static FP8 | Token-wise dynamic FP8 | BF16 | Online | GPU |
| [WINT2](wint2.md) | 2-bit | BF16 | BF16 | Offline | GPU |
| MixQuant | INT4/INT8 | INT8/BF16 | INT8/BF16 | Offline | GPU, XPU |

**Notes**

1. **Quantization Method**: Corresponds to the `quantization` field in the quantization configuration file.
2. **Online/Offline Quantization**: Distinguishes *when* the weights are quantized.
   - **Online Quantization**: Weights are quantized after being loaded into the inference engine (a weight-only INT8 sketch appears at the end of this page).
   - **Offline Quantization**: Weights are quantized offline before inference and stored in low-bit numerical types; at inference time, the pre-quantized low-bit values are loaded directly.
3. **Dynamic/Static Quantization**: Distinguishes how the quantization coefficients for activations are obtained (a minimal sketch contrasting the two appears at the end of this page).
   - **Static Quantization**: Quantization coefficients are computed and stored before inference, and the pre-computed coefficients are loaded at inference time. Because the coefficients stay fixed (static) during inference, this is called static quantization.
   - **Dynamic Quantization**: Quantization coefficients for the current batch are computed on the fly during inference. Because the coefficients change dynamically during inference, this is called dynamic quantization.

## 2. Model Support List

| Model Name | Supported Quantization Precision |
|---------|---------|
| ERNIE-4.5-300B-A47B | WINT8, WINT4, Block-wise FP8, MixQuant |

## 3. Quantization Precision Terminology

FastDeploy names quantization precisions in the following format:

```
{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}
```

Examples:

- **W8A8C8**: W = weights, A = activations, C = CacheKV; a bare 8 defaults to INT8
- **W8A8C16**: a bare 16 defaults to BF16; otherwise same as above
- **W4A16C16 / WInt4 / weight-only int4**: a bare 4 defaults to INT4
- **WNF4A8C8**: NF4 refers to the 4-bit NormalFloat numerical type
- **Wfp8Afp8**: both weights and activations are FP8
- **W4Afp8**: weights are INT4, activations are FP8
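To make the naming scheme above concrete, here is a small illustrative Python parser. It is not part of FastDeploy; it simply expands names like `W8A8C16` into per-tensor precisions under the defaulting rules listed above:

```python
import re

def parse_precision_name(name):
    """Expand a name like 'W8A8C16' into per-tensor precisions.

    Bare bit-widths use the documented defaults (8 -> INT8, 4 -> INT4,
    16 -> BF16); explicitly typed tokens (fp8, nf4, ...) pass through.
    """
    tensor = {"W": "weights", "A": "activations", "C": "kv_cache"}
    default = {"8": "INT8", "4": "INT4", "16": "BF16"}
    out = {}
    # Each component is a tensor letter (W/A/C) followed by an optional
    # type name and a bit-width, e.g. 'W8', 'Afp8', 'WNF4'.
    for key, val in re.findall(r"([WAC])([A-Za-z]*\d+)", name, flags=re.IGNORECASE):
        out[tensor[key.upper()]] = default.get(val, val.upper())
    return out

print(parse_precision_name("W8A8C8"))    # {'weights': 'INT8', 'activations': 'INT8', 'kv_cache': 'INT8'}
print(parse_precision_name("Wfp8Afp8"))  # {'weights': 'FP8', 'activations': 'FP8'}
print(parse_precision_name("W4Afp8"))    # {'weights': 'INT4', 'activations': 'FP8'}
```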
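The static/dynamic distinction from the notes in Section 1 can be illustrated with a minimal NumPy sketch. This is not FastDeploy code: the calibrated scale value is hypothetical, and per-tensor abs-max scaling stands in for whatever scheme a real kernel uses. The only difference between the two paths is where the scale comes from:

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric INT8 quantization with a given scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Static quantization: the scale was computed offline on calibration data
# and is loaded as a constant; it never changes at inference time.
CALIBRATED_SCALE = 0.05  # hypothetical value from a calibration run

def static_quant(activations):
    return quantize_int8(activations, CALIBRATED_SCALE), CALIBRATED_SCALE

# Dynamic quantization: the scale is recomputed from the live batch
# (here: per-tensor abs-max), so it changes with every forward pass.
def dynamic_quant(activations):
    scale = np.abs(activations).max() / 127.0
    return quantize_int8(activations, scale), scale

batch = np.random.randn(4, 8).astype(np.float32)
_, s_static = static_quant(batch)
_, s_dynamic = dynamic_quant(batch)
print(f"static scale={s_static}, dynamic scale={s_dynamic:.4f}")
```

Static quantization avoids the runtime cost of computing scales but depends on calibration data being representative; dynamic quantization adapts to each batch at a small extra cost per forward pass.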
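Similarly, the weight-only methods in Section 1 (WINT8/WINT4) quantize only the weights while activations stay in BF16. The NumPy sketch below shows the online-quantization idea for the INT8 case: weights are quantized once at load time and dequantized around the matmul. It is illustrative only; float32 stands in for BF16, and real kernels fuse the dequantization rather than materializing full-precision weights:

```python
import numpy as np

def quantize_weight_int8(w):
    """Per-output-channel symmetric INT8 quantization, applied once at load time."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per output channel
    return np.round(w / scale).astype(np.int8), scale

def wint8_matmul(x, q_w, scale):
    """Activations stay in high precision; weights are dequantized on the fly."""
    return x @ (q_w.astype(np.float32) * scale).T

w = np.random.randn(16, 32).astype(np.float32)  # [out_features, in_features]
q_w, scale = quantize_weight_int8(w)            # done once when weights are loaded
x = np.random.randn(2, 32).astype(np.float32)   # activations remain high precision
y = wint8_matmul(x, q_w, scale)
print(np.abs(y - x @ w.T).max())                # small quantization error
```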