mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2025-12-24 13:28:13 +08:00
Some checks failed
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
Publish Job / publish_pre_check (push) Has been cancelled
Publish Job / print_publish_pre_check_outputs (push) Has been cancelled
Publish Job / FD-Clone-Linux (push) Has been cancelled
Publish Job / Show Code Archive Output (push) Has been cancelled
Publish Job / BUILD_SM8090 (push) Has been cancelled
Publish Job / BUILD_SM8689 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled
Publish Job / Run FD Image Build (push) Has been cancelled
Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled
Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
Publish Job / Run Base Tests (push) Has been cancelled
Publish Job / Run Accuracy Tests (push) Has been cancelled
Publish Job / Run Stable Tests (push) Has been cancelled
CI Images Build / FD-Clone-Linux (push) Has been cancelled
CI Images Build / Show Code Archive Output (push) Has been cancelled
CI Images Build / CI Images Build (push) Has been cancelled
CI Images Build / BUILD_SM8090 (push) Has been cancelled
CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled
CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
CI Images Build / Run Base Tests (push) Has been cancelled
CI Images Build / Run Accuracy Tests (push) Has been cancelled
CI Images Build / Run Stable Tests (push) Has been cancelled
CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled
2.4 KiB
2.4 KiB
在线量化
在线量化是指推理引擎在加载 BF16 权重后对权重做量化,而不是加载离线量化好的低精度权重。FastDeploy 支持将 BF16 在线量化到多种精度,包括:INT4, INT8 和 FP8.
1. WINT8 & WINT4
仅将权重在线量化为 INT8 或 INT4,推理时即时地将权重反量化为 BF16 后与激活进行计算。
- 量化粒度:仅支持 channel-wise 粒度的量化;
- 支持硬件:GPU,XPU
- 支持结构:MoE 结构,Dense Linear
启动WINT8或WINT4推理服务
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8183 --metrics-port 8182 \
--tensor-parallel-size 8 \
--quantization wint8 \
--max-model-len 32768 \
--max-num-seqs 32
- 通过指定
--model baidu/ERNIE-4.5-300B-A47B-Paddle可自动从AIStudio下载模型。FastDeploy依赖Paddle格式的模型,更多说明参考支持模型列表。 - 通过设置
--quantization为wint8或wint4选择在线 INT8/INT4 量化。 - 部署 ERNIE-4.5-300B-A47B-Paddle WINT8 最少需要 80G 8卡, WINT4 则需要 80GB 4卡。
- 更多部署教程请参考get_started.
2. Block-wise FP8
加载 BF16 模型,将权重以 128X128 block-wise 的粒度在线量化为 FP8 数值类型。推理时,激活会动态、即时地做 token-wise FP8 量化。
- FP8规格:float8_e4m3fn
- 支持硬件:Hopper GPU 架构
- 支持结构:MoE 结构,Dense Linear
启动Block-wise FP8推理服务
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8183 --metrics-port 8182 \
--tensor-parallel-size 8 \
--quantization block_wise_fp8 \
--max-model-len 32768 \
--max-num-seqs 32
- 通过指定
--model baidu/ERNIE-4.5-300B-A47B-Paddle可自动从AIStudio下载模型。FastDeploy依赖Paddle格式的模型,更多说明参考支持模型列表。 - 通过设置
--quantization为block_wise_fp8选择在线 Block-wise FP8 量化。 - 部署 ERNIE-4.5-300B-A47B-Paddle Block-wise FP8 最少需要 80G * 8卡。
- 更多部署教程请参考get_started