# WINT2 Quantization
Weights are compressed offline using the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The weights are stored as INT8, with four 2-bit weights packed into each INT8 value, i.e. 2 bits per weight. Activations are not quantized. During inference, the weights are decoded and dequantized on the fly to BF16, and all computation is performed in BF16.
- **Supported Hardware**: GPU
- **Supported Architecture**: MoE architecture
This method relies on convolutional coding, which uses overlapping bits to map 2-bit values into a larger representation space, so the quantized weights retain more information from the original data while the stored values are compressed to an extremely low 2 bits per weight. The general principle is illustrated in the figure below:
![Convolutional Coding Quantization diagram](./wint2.png)
CCQ WINT2 is generally used in resource-constrained, low-barrier deployment scenarios. Taking ERNIE-4.5-300B-A47B as an example, the weights are compressed to 89 GB, which allows single-card deployment on a 141 GB H20.
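To make the storage format concrete, the following is a minimal sketch of the 2-bit packing idea: it packs four 2-bit codes into one stored byte (the on-disk INT8) and unpacks them again. It is illustrative only; the codes, the scale, and the simple symmetric code-to-weight mapping are placeholder assumptions, and the real CCQ decode uses the coding tables and scales shipped with the checkpoint rather than this mapping.
```python
# Illustrative only: pack four 2-bit codes into one byte (the stored INT8),
# then unpack and map them to floats. Not the FastDeploy kernel or CCQ decoder.
codes = [0, 3, 1, 2]                               # four 2-bit codes, each in 0..3
packed = 0
for i, c in enumerate(codes):
    packed |= (c & 0b11) << (2 * i)                # 4 codes -> 1 stored byte
assert 0 <= packed <= 255

unpacked = [(packed >> (2 * i)) & 0b11 for i in range(4)]
assert unpacked == codes

scale = 0.05                                       # placeholder per-group scale
weights = [(c - 1.5) * scale for c in unpacked]    # toy symmetric mapping, not CCQ decoding
print(weights)
```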
## Executing WINT2 Offline Inference
- To run a TP2/TP4 model, change the `model_name_or_path` and `tensor_parallel_size` parameters accordingly.
```python
from fastdeploy import LLM, SamplingParams

model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle"
prompts = ["Analyze three poems by Li Bai"]

sampling_params = SamplingParams(temperature=0.7, top_p=0, max_tokens=128)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, use_cudagraph=True)
outputs = llm.generate(prompts, sampling_params)
print(outputs)
```
## Run WINT2 Inference Service
- To run a TP2/TP4 model, change the `--model` and `--tensor-parallel-size` parameters accordingly.
```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --use-cudagraph \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 256
```
## Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write me a poem about large language model."}
    ],
    "stream": true
  }'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Write me a poem about large language model."},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
By specifying `--model baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle`, the offline-quantized WINT2 model is downloaded automatically from AIStudio. The model's config.json already contains the WINT2 quantization configuration, so there is no need to pass `--quantization` when starting the inference service.
Example of quantization configuration in the model's config.json file:
```json
"quantization_config": {
    "dense_quant_type": "wint8",
    "moe_quant_type": "w4w2",
    "quantization": "wint2",
    "moe_quant_config": {
        "moe_w4_quant_config": {
            "quant_type": "wint4",
            "quant_granularity": "per_channel",
            "quant_start_layer": 0,
            "quant_end_layer": 6
        },
        "moe_w2_quant_config": {
            "quant_type": "wint2",
            "quant_granularity": "pp_acc",
            "quant_group_size": 64,
            "quant_start_layer": 7,
            "quant_end_layer": 53
        }
    }
}
```
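Read together, this configuration says the dense layers use WINT8 while the MoE layers are split: layers 0-6 are kept at WINT4 and layers 7-53 use WINT2. The snippet below is a small, hypothetical helper (plain Python, not part of FastDeploy's API) that interprets the fragment this way; treating `quant_start_layer`/`quant_end_layer` as an inclusive range is an assumption.
```python
import json

# Hypothetical helper: map an MoE layer index to its quant type based on the
# quantization_config fragment shown above. Not FastDeploy internals.
cfg = json.loads("""
{
  "dense_quant_type": "wint8",
  "moe_quant_config": {
    "moe_w4_quant_config": {"quant_type": "wint4", "quant_start_layer": 0, "quant_end_layer": 6},
    "moe_w2_quant_config": {"quant_type": "wint2", "quant_start_layer": 7, "quant_end_layer": 53}
  }
}
""")

def moe_quant_type(layer_idx: int) -> str:
    for sub_cfg in cfg["moe_quant_config"].values():
        # Assumes quant_start_layer / quant_end_layer bound an inclusive range.
        if sub_cfg["quant_start_layer"] <= layer_idx <= sub_cfg["quant_end_layer"]:
            return sub_cfg["quant_type"]
    return cfg["dense_quant_type"]

print(moe_quant_type(3))   # wint4 (early MoE layers keep 4 bits)
print(moe_quant_type(30))  # wint2
```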
- For more deployment tutorials, please refer to [get_started](../get_started/ernie-4.5.md);
- For more model descriptions, please refer to [Supported Model List](../supported_models.md).
## WINT2 Performance
Comparison of WINT2 and WINT4 scores on the ERNIE-4.5-300B-A47B model:

| Test Set | Dataset Size | WINT4 | WINT2 |
|----------|--------------|-------|-------|
| IFEval   | 500          | 88.17 | 85.95 |
| BBH      | 6511         | 94.43 | 90.06 |
| DROP     | 9536         | 91.17 | 89.32 |
| CMMLU    | 11477        | 89.92 | 86.55 |