[Docs] Release 2.1 docs and fix some description (#3424)

This commit is contained in:
ming1753
2025-08-15 14:27:19 +08:00
committed by GitHub
parent fbb6dcb9e4
commit d4e3a20300
14 changed files with 73 additions and 29 deletions

View File

@@ -1,4 +1,4 @@
English | [简体中文](README_CN.md)
<p align="center">
<a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
</p>
@@ -23,9 +23,10 @@ English | [简体中文](README_CN.md)
</p>
--------------------------------------------------------------------------------
-# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+# FastDeploy 2.1: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
## News
**[2025-08] 🔥 Released FastDeploy v2.1:** Introduces a brand-new KV Cache scheduling strategy, extends PD disaggregation and CUDA Graph support to more models, adds enhanced support for hardware platforms such as Kunlun and Hygon, and comprehensively optimizes the performance of both the service and the inference engine.
**[2025-07] The "FastDeploy 2.0 Inference Deployment Hands-on Review" campaign is now live!** Complete tasks such as deploying the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merchandise (bone-china mugs and more) and generous prizes! 🎁 Feedback is welcome~ 📌[Sign-up](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
@@ -75,13 +76,13 @@ Learn how to use FastDeploy through our documentation:
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅| 128K |
## Advanced Usage

View File

@@ -1,5 +1,4 @@
[English](README.md) | 简体中文
<p align="center">
<a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
</p>
@@ -24,9 +23,10 @@
</p>
--------------------------------------------------------------------------------
-# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+# FastDeploy 2.1: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
## News
**[2025-08] 🔥 FastDeploy v2.1 released:** A brand-new KV Cache scheduling strategy; PD disaggregation and CUDA Graph support extended to more models; enhanced support for more hardware platforms such as Kunlun and Hygon; comprehensive performance optimizations for both the service and the inference engine.
**[2025-07] The "FastDeploy 2.0 Inference Deployment Hands-on Review" campaign is now live!** Complete tasks such as deploying the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merchandise (bone-china mugs and more) and generous prizes! 🎁 Feedback is welcome~ 📌[Sign-up](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
@@ -41,7 +41,6 @@
- **Advanced acceleration techniques**: speculative decoding, Multi-Token Prediction (MTP), and chunked prefill
- 🖥️ **Multi-hardware support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more
## Requirements
- Operating system: Linux
@@ -73,13 +72,13 @@ FastDeploy supports deployment on **NVIDIA GPU**, **Kunlunxin XPU
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅| 128K |
## Advanced Usage

View File

@@ -27,7 +27,6 @@ Installation process reference documentation [FastDeploy GPU Install](../get_sta
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -47,7 +46,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -64,6 +62,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`; otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance. If you have further precision or performance requirements, please continue reading below.
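Once a service from the examples above is up, a quick way to sanity-check it is an OpenAI-style chat request. This is a minimal sketch: the port matches `--port 8180` from the launch commands, the `/v1/chat/completions` path follows the OpenAI-compatible API convention of `fastdeploy.entrypoints.openai.api_server`, and the message content is illustrative:

```shell
# Send a simple chat request to the locally deployed service
# (assumes the server above is running on port 8180).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Introduce yourself in one sentence."}
        ],
        "stream": false
      }'
```

If the service is healthy, the response is OpenAI-style JSON whose `choices` array holds the generated message.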
### 2.2 Advanced: How to Achieve Better Performance
@@ -109,6 +110,15 @@ An example is a set of configurations that can run stably while also delivering
- If slightly higher precision is required, you may try WINT8.
- Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling**: `FD_SAMPLING_CLASS=rejection`
- **Description**: Rejection sampling draws samples from an easy-to-sample proposal distribution, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
- **Recommendation**: This is a relatively aggressive optimization that affects output quality, and we are still fully validating its impact. If you have high performance requirements and can accept potential compromises in results, you may try enabling this strategy.
> **Attention hyperparameter**: `FLAGS_max_partition_size=1024`
- **Description**: A hyperparameter of the default Append Attention backend. Our tests on common datasets show that setting it to 1024 significantly improves decoding speed, especially in long-text scenarios.
- **Recommendation**: This will be replaced by an automatic tuning mechanism in the future. If you have high performance requirements, you may try enabling it.
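Taken together, a tuned launch might export these variables before starting the server. This is a sketch under the assumptions above: the sampling and attention settings are the opt-in values just described, and whether they help depends on your workload:

```shell
# Opt-in performance knobs described above; all are ordinary
# environment variables read by FastDeploy at startup.
export ENABLE_V1_KVCACHE_SCHEDULER=1   # required for 2.1+ (see note above)
export FD_SAMPLING_CLASS=rejection     # faster sampling, may affect output quality
export FLAGS_max_partition_size=1024   # Append Attention decode speedup for long text
```

Launch `python -m fastdeploy.entrypoints.openai.api_server ...` in the same shell so the server process inherits the variables.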
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.

View File

@@ -24,7 +24,6 @@ Installation process reference documentation [FastDeploy GPU Install](../get_sta
**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
@@ -42,6 +41,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`; otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance. If you have further precision or performance requirements, please continue reading below.
### 2.2 Advanced: How to Achieve Better Performance
@@ -87,6 +88,15 @@ An example is a set of configurations that can run stably while also delivering
- If slightly higher precision is required, you may try wint8.
- Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling**: `FD_SAMPLING_CLASS=rejection`
- **Description**: Rejection sampling draws samples from an easy-to-sample proposal distribution, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
- **Recommendation**: This is a relatively aggressive optimization that affects output quality, and we are still fully validating its impact. If you have high performance requirements and can accept potential compromises in results, you may try enabling this strategy.
> **Attention hyperparameter**: `FLAGS_max_partition_size=1024`
- **Description**: A hyperparameter of the default Append Attention backend. Our tests on common datasets show that setting it to 1024 significantly improves decoding speed, especially in long-text scenarios.
- **Recommendation**: This will be replaced by an automatic tuning mechanism in the future. If you have high performance requirements, you may try enabling it.
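As above, the optional knobs are plain environment variables, so a tuned launch for this model simply exports them before starting the server (a sketch; the values are the opt-in settings described above, and their benefit depends on your workload):

```shell
# Opt-in performance knobs described above; FastDeploy reads
# these environment variables at startup.
export ENABLE_V1_KVCACHE_SCHEDULER=1   # required for 2.1+ (see note above)
export FD_SAMPLING_CLASS=rejection     # faster sampling, may affect output quality
export FLAGS_max_partition_size=1024   # Append Attention decode speedup for long text
```

Run the `api_server` launch command from the same shell so the variables are inherited.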
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.

View File

@@ -23,6 +23,7 @@ Execute the following command to start the service. For parameter configurations
>💡 **Note**: Since the model parameter size is 424B-A47B, on an 80G * 8 GPU machine, specify ```--quantization wint4``` (wint8 is also supported).
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -3,6 +3,7 @@
FastDeploy currently supports installation on the following hardware platforms:
- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Hygon DCU Installation](hygon_dcu.md)
- [Kunlun XPU Installation](kunlunxin_xpu.md)
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)

View File

@@ -20,7 +20,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:

View File

@@ -19,6 +19,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \

View File

@@ -9,9 +9,9 @@
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |
### 1.2 Install fastdeploy
@@ -26,7 +26,6 @@
**Example 1:** Deploying a 32K-context service on a single RTX 4090 GPU
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -46,7 +45,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
**Example 2:** Deploying a 128K-context service on dual H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -63,6 +61,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`; otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance.
If you have further precision or performance requirements, please continue reading below.
### 2.2 Advanced: How to Achieve Better Performance
@@ -110,6 +110,15 @@ python -m fastdeploy.entrypoints.openai.api_server \
- If slightly higher precision is required, try WINT8.
- Only use BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling**: `FD_SAMPLING_CLASS=rejection`
- **Description**: Rejection sampling draws samples from an easy-to-sample proposal distribution, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
- **Recommendation**: This is a relatively aggressive optimization that affects output quality, and we are still fully validating its impact. If you have high performance requirements and can accept its effect on results, you may try enabling it.
> **Attention hyperparameter**: `FLAGS_max_partition_size=1024`
- **Description**: A hyperparameter of the default Append Attention backend. Our tests on common datasets show that setting it to 1024 significantly improves decoding speed, especially in long-text scenarios.
- **Recommendation**: This will be replaced by an automatic tuning mechanism in the future. If you have high performance requirements, you may try enabling it.
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.

View File

@@ -23,8 +23,6 @@
### 2.1 Basics: Starting the Service
**Example 1:** Deploying a 128K-context service on 8x H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
@@ -41,6 +39,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`; otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance.
If you have further precision or performance requirements, please continue reading below.
### 2.2 Advanced: How to Achieve Better Performance
@@ -87,6 +87,15 @@ python -m fastdeploy.entrypoints.openai.api_server \
- If slightly higher precision is required, try WINT8.
- Only use BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling**: `FD_SAMPLING_CLASS=rejection`
- **Description**: Rejection sampling draws samples from an easy-to-sample proposal distribution, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
- **Recommendation**: This is a relatively aggressive optimization that affects output quality, and we are still fully validating its impact. If you have high performance requirements and can accept its effect on results, you may try enabling it.
> **Attention hyperparameter**: `FLAGS_max_partition_size=1024`
- **Description**: A hyperparameter of the default Append Attention backend. Our tests on common datasets show that setting it to 1024 significantly improves decoding speed, especially in long-text scenarios.
- **Recommendation**: This will be replaced by an automatic tuning mechanism in the future. If you have high performance requirements, you may try enabling it.
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.

View File

@@ -23,6 +23,7 @@
**Note:** Since the model parameter size is 424B-A47B, on an 80G * 8 GPU machine you must specify ```--quantization wint4``` (wint8 can also be deployed).
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \

View File

@@ -1,8 +1,9 @@
-# FastDeploy Installation Guide
+# FastDeploy Installation
-FastDeploy currently supports installation on the following hardware platforms:
+FastDeploy supports the following hardware platforms:
- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Hygon DCU Installation](hygon_dcu.md)
- [Kunlunxin XPU Installation](kunlunxin_xpu.md)
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)

View File

@@ -23,7 +23,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
First install paddlepaddle-gpu. For detailed installation instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
``` shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then install fastdeploy. **Do not install from the PyPI source**; install via the following methods instead:
@@ -64,7 +64,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu. For detailed installation instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
``` shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:

View File

@@ -19,6 +19,7 @@
After installing FastDeploy, execute the following command in the terminal to start the service. For startup command configuration, refer to [Parameter Description](../parameters.md)
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \