update docs (#3420)

JYChen
2025-08-15 13:00:08 +08:00
committed by GitHub
parent cca96ab1e4
commit 562e01c979
12 changed files with 66 additions and 18 deletions

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
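Once the service is up, a quick way to verify it is to hit the OpenAI-compatible endpoint. A minimal sketch, assuming the service was launched with `--port 8180` as in the quickstart (adjust host and port to your setup):
```bash
# Smoke test against the OpenAI-compatible API; the port is an assumption,
# match it to the --port you launched the server with.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-0.3B-Paddle",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```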
@@ -77,8 +77,8 @@ Add the following lines to the startup parameters
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
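As an illustration of notes 2 and 3, a multi-GPU launch with CUDAGraph enabled might look like the sketch below (the TP size is illustrative, not a recommendation for this model):
```bash
# Illustrative sketch: with TP>1, CUDAGraph requires custom all-reduce (note 2),
# and max-model-len must stay at or below 32768 (note 3).
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --use-cudagraph \
    --enable-custom-all-reduce
```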
#### 2.2.6 Rejection Sampling
**Idea:**

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
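For instance, switching quantization only changes the `--quantization` flag; a sketch of the same launch with `block_wise_fp8`, assuming a Hopper GPU:
```bash
# Same launch as above but with block-wise FP8 quantization (Hopper required).
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --tensor-parallel-size 1 \
    --quantization block_wise_fp8 \
    --max-model-len 32768 \
    --max-num-seqs 128
```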
@@ -87,8 +87,8 @@ Add the following lines to the startup parameters
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
#### 2.2.6 Rejection Sampling
**Idea:**
@@ -111,6 +111,7 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +121,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--gpu-memory-utilization 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,6 +132,7 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -139,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--gpu-memory-utilization 0.85 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
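After both services start, you can check that each one is alive through its metrics port; a sketch, assuming the ports configured above and a Prometheus-style `/metrics` route (the route is an assumption, adjust to your build):
```bash
# Hypothetical liveness check via the metrics ports set above
# (7014 = prefill, 8014 = decode); the /metrics path is an assumption.
curl -sf http://localhost:7014/metrics > /dev/null && echo "prefill up"
curl -sf http://localhost:8014/metrics > /dev/null && echo "decode up"
```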

View File

@@ -22,12 +22,12 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the follo
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance/accuracy trade-offs. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires a Hopper GPU).
@@ -99,6 +99,7 @@ export FD_SAMPLING_CLASS=rejection
**How to enable:** Take a single machine with 8 GPUs running a 1P1D deployment (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` must be specified to set each node's role, and the two nodes' GPUs and logs are isolated via the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
```
export FD_LOG_DIR="log_prefill"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -111,6 +112,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports is set to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
@@ -124,5 +126,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
--splitwise-role "decode"
```
#### 2.2.8 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures sequences of CUDA operations into a graph structure to achieve efficient execution and optimization of GPU tasks. The core idea of CUDAGraph is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

View File

@@ -1,4 +1,7 @@
# Optimal Deployment
- [ERNIE-4.5-0.3B-Paddle](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

View File

@@ -21,6 +21,7 @@ Specify `--model baidu/ERNIE-4.5-300B-A47B-Paddle` during deployment to automati
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -16,6 +16,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the startup command's configuration options, refer to the [Parameter Description](../parameters.md):
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-0.3B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -77,8 +77,8 @@ CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. By capturing sequences of CUDA ops
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported for now, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not yet supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.5 Rejection Sampling
**Idea:**

View File

@@ -25,12 +25,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-21B-A3B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -87,8 +87,8 @@ CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. By capturing sequences of CUDA ops
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, only single-GPU inference is supported for now, i.e. `--tensor-parallel-size 1`.
3. When CUDAGraph is enabled, enabling `Chunked Prefill` and `Prefix Caching` at the same time is not yet supported.
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.6 Rejection Sampling
**Idea:**
@@ -111,6 +111,7 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +121,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--gpu-memory-utilization 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,6 +132,7 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -139,7 +141,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--gpu-memory-utilization 0.85 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \

View File

@@ -22,12 +22,12 @@ The minimum number of GPUs required to deploy ERNIE-4.5-300B-A47B at each quantization precision on the following hardware
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--kv-cache-ratio 0.75 \
--max-num-seqs 128
```
Where:
@@ -100,6 +100,7 @@ export FD_SAMPLING_CLASS=rejection
**How to enable:** Take a single machine with 8 GPUs running a 1P1D deployment (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` must be specified to set each node's role, and the two nodes' GPUs and logs are isolated via the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
```
export FD_LOG_DIR="log_prefill"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -112,6 +113,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports is set to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
@@ -125,5 +127,20 @@ python -m fastdeploy.entrypoints.openai.api_server \
--splitwise-role "decode"
```
#### 2.2.8 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures sequences of CUDA operations into a graph structure to achieve efficient execution and optimization of GPU tasks. The core idea of CUDAGraph is to encapsulate a series of GPU compute and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need adjusting in memory-constrained scenarios. For detailed parameter tuning, refer to the configuration descriptions in [GraphOptimizationBackend](../parameters.md).
2. When CUDAGraph is enabled, multi-GPU inference with TP>1 also requires specifying `--enable-custom-all-reduce`.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not yet supported.
## 3. FAQ
If you encounter any problems during use, you can consult the [FAQ](./FAQ.md).

View File

@@ -1,4 +1,7 @@
# Best Practices
- [ERNIE-4.5-0.3B-Paddle](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

View File

@@ -21,6 +21,7 @@
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -17,6 +17,7 @@
After installing FastDeploy, execute the following command in the terminal to start the service. For the startup command's configuration options, refer to the [Parameter Description](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \