update docs (#3420)

JYChen
2025-08-15 13:00:08 +08:00
committed by GitHub
parent cca96ab1e4
commit 562e01c979
12 changed files with 66 additions and 18 deletions

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
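Once the service is up, a quick way to verify it is to hit the OpenAI-compatible endpoint. A minimal sketch, assuming the service was launched with `--port 8180` as in the quickstart (adjust host and port to your setup):
```bash
# Smoke test against the OpenAI-compatible API; the port is an assumption,
# match it to the --port you launched the server with.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-0.3B-Paddle",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```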
@@ -77,8 +77,8 @@ Add the following lines to the startup parameters
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
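As an illustration of notes 2 and 3, a multi-GPU launch with CUDAGraph enabled might look like the sketch below (the TP size is illustrative, not a recommendation for this model):
```bash
# Illustrative sketch: with TP>1, CUDAGraph requires custom all-reduce (note 2),
# and max-model-len must stay at or below 32768 (note 3).
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --use-cudagraph \
    --enable-custom-all-reduce
```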
#### 2.2.6 Rejection Sampling
**Idea:**

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
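For instance, switching quantization only changes the `--quantization` flag; a sketch of the same launch with `block_wise_fp8`, assuming a Hopper GPU:
```bash
# Same launch as above but with block-wise FP8 quantization (Hopper required).
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --tensor-parallel-size 1 \
    --quantization block_wise_fp8 \
    --max-model-len 32768 \
    --max-num-seqs 128
```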
@@ -87,8 +87,8 @@ Add the following lines to the startup parameters
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
#### 2.2.6 Rejection Sampling
**Idea:**
@@ -111,6 +111,7 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +121,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--gpu-memory-utilization 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,6 +132,7 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -139,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--gpu-memory-utilization 0.85 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
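After both services start, you can check that each one is alive through its metrics port; a sketch, assuming the ports configured above and a Prometheus-style `/metrics` route (the route is an assumption, adjust to your build):
```bash
# Hypothetical liveness check via the metrics ports set above
# (7014 = prefill, 8014 = decode); the /metrics path is an assumption.
curl -sf http://localhost:7014/metrics > /dev/null && echo "prefill up"
curl -sf http://localhost:8014/metrics > /dev/null && echo "decode up"
```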

View File

@@ -22,12 +22,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the follo
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
@@ -99,6 +99,7 @@ export FD_SAMPLING_CLASS=rejection
**How to enable:** Take a single machine with 8 GPUs running a 1P1D deployment (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` must be specified to set each node's role, and the two nodes' GPUs and logs are isolated via the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
```
export FD_LOG_DIR="log_prefill"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -111,6 +112,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports is set to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
@@ -124,5 +126,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
--splitwise-role "decode"
```
#### 2.2.8 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures sequences of CUDA operations into a graph structure to achieve efficient execution and optimization of GPU tasks. The core idea of CUDAGraph is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

View File

@@ -1,4 +1,7 @@
# Optimal Deployment
- [ERNIE-4.5-0.3B-Paddle](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

View File

@@ -21,6 +21,7 @@ Specify `--model baidu/ERNIE-4.5-300B-A47B-Paddle` during deployment to automati
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -16,6 +16,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the startup command's configuration options, refer to the [Parameter Description](../parameters.md):
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-0.3B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -77,8 +77,8 @@ CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. By capturing sequences of CUDA ops
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported for now, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not yet supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.5 Rejection Sampling
**Idea:**

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-21B-A3B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -87,8 +87,8 @@ CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. By capturing sequences of CUDA ops
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported for now, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not yet supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.6 Rejection Sampling
**Idea:**
@@ -111,6 +111,7 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +121,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--gpu-memory-utilization 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,6 +132,7 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -139,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--gpu-memory-utilization 0.85 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \

View File

@@ -22,12 +22,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-300B-A47B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -100,6 +100,7 @@ export FD_SAMPLING_CLASS=rejection
**How to enable:** Take a single machine with 8 GPUs running a 1P1D deployment (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` must be specified to set each node's role, and the two nodes' GPUs and logs are isolated via the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
```
export FD_LOG_DIR="log_prefill"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -112,6 +113,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports is set to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
@@ -125,5 +127,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
--splitwise-role "decode"
```
#### 2.2.8 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures sequences of CUDA operations into a graph structure to achieve efficient execution and optimization of GPU tasks. The core idea of CUDAGraph is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
## 3. FAQ
If you encounter any problems during use, you can consult the [FAQ](./FAQ.md).

View File

@@ -1,4 +1,7 @@
# Best Practices
- [ERNIE-4.5-0.3B-Paddle](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

View File

@@ -21,6 +21,7 @@
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -17,6 +17,7 @@
After installing FastDeploy, execute the following command in the terminal to start the service. For the startup command's configuration options, refer to the [Parameter Description](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \