[Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction (#4944)

* [Docs] Improve reasoning_out docs

* [Docs] add ERNIE-4.5-VL-28B-A3B-Thinking instruction

---------

Co-authored-by: liqinrui <liqinrui@baidu.com>
Author: LiqinruiG
Date: 2025-11-11 11:40:52 +08:00
Committed by: GitHub
Parent: c0a4e2b63b
Commit: 75294bcfb1
4 changed files with 247 additions and 2 deletions


@@ -0,0 +1,122 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md)
# ERNIE-4.5-VL-28B-A3B-Thinking
## 1. Environment Preparation
### 1.1 Support Status
The minimum number of GPUs required to deploy the model on each type of hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |
### 1.2 Install FastDeploy
For the installation process, refer to the [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md) documentation.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 32 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }' \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 16384 \
--quantization wint4
```
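Once the service is up, it can be queried through its OpenAI-compatible chat completions endpoint. The sketch below is only an illustrative request against the configuration above: the image URL is a placeholder, and the `model` field is assumed to match the value passed to `--model`.
```shell
# Illustrative request to the service started above (listening on --port 8180).
# The image URL is a placeholder; replace it with a reachable image.
curl -s http://localhost:8180/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
      {"type": "text", "text": "Describe this image and explain your reasoning."}
    ]
  }]
}'
```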
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 128 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }' \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 16384 \
--quantization wint4
```
The examples above are configurations that run stably while also delivering relatively good performance. If you have further requirements for precision or performance, continue reading the sections below.
### 2.2 Advanced: How to Achieve Better Performance
#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context Length**
- **Parameters** `--max-model-len`
- **Description** Controls the maximum context length that the model can process.
- **Recommendation** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
> **Maximum sequence count**
- **Parameters** `--max-num-seqs`
- **Description** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
- **Recommendation** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
> **Multi-image and multi-video input**
- **Parameters** `--limit-mm-per-prompt`
- **Description** Our model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request and ensure efficient resource utilization.
- **Recommendation** We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
> **Available GPU memory ratio during initialization**
- **Parameters** `--gpu-memory-utilization`
- **Description** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
- **Recommendation** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
#### 2.2.2 Chunked Prefill
- **Parameters** `--enable-chunked-prefill`
- **Description** Enabling `chunked prefill` reduces peak GPU memory usage and improves service throughput. It is **enabled by default** since version 2.2; for earlier versions it must be enabled manually (refer to the 2.1 best practices documentation).
- **Relevant configurations**:
`--max-num-batched-tokens` limits the maximum number of tokens per chunk. In multi-modal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step may exceed this value. A setting of 384 is recommended, as in the sketch below.
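As a minimal sketch of applying this recommendation, only the chunking-related flags of the launch commands from Section 2.1 need to change; the model path and remaining flags stay as in those examples.
```shell
# Illustrative fragment: explicitly enable chunked prefill (already the default since 2.2)
# and cap each prefill chunk at the recommended 384 tokens.
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--enable-chunked-prefill \
--max-num-batched-tokens 384
```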
#### 2.2.3 **Quantization precision**
- **Parameters** `--quantization`
- **Supported precision types**
- WINT4 (Suitable for most users)
- WINT8
- BFLOAT16 (When the `--quantization` parameter is not set, BFLOAT16 is used by default.)
- **Recommendation**
- Unless you have extremely stringent precision requirements, we strongly recommend using WINT4 quantization. This will significantly reduce memory consumption and increase throughput.
- If slightly higher precision is required, you may try WINT8.
- Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
- **Description** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
- **Recommendation** This is a relatively aggressive optimization that affects output quality, and we are still validating its impact comprehensively. If you have high performance requirements and can accept a potential impact on results, you may try enabling it.
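As a sketch, the variable can simply be exported in the shell before launching the service with one of the commands from Section 2.1.
```shell
# Enable rejection sampling for this shell session, then start the service as in Section 2.1.
export FD_SAMPLING_CLASS=rejection
```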
## 3. FAQ
### 3.1 Out of Memory
If the service reports "Out of Memory" during startup, please try the following solutions:
1. Ensure no other processes are occupying GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce context length and maximum sequence count as needed;
4. Increase the number of GPU cards for deployment (e.g., 2 or 4 cards) by modifying the parameter `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
If the service starts normally but later reports insufficient memory, try:
1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`;
2. Increase the number of deployment cards (parameter adjustment as above).


@@ -50,7 +50,7 @@ These models accept multi-modal inputs (e.g., images and text).
|Models|DataType|Example HF Model|
|-|-|-|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[quick start](./get_started/ernie-4.5-vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[quick start](./get_started/quick_start_vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[quick start](./get_started/ernie-4.5-vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[quick start](./get_started/quick_start_vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Thinking<br>&emsp;[quick start](./get_started/ernie-4.5-vl-thinking.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md);
| PaddleOCR-VL |BF16/WINT4/WINT8| PaddlePaddle/PaddleOCR-VL<br>&emsp; [best practice](./best_practices/PaddleOCR-VL-0.9B.md) ;|
| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|


@@ -0,0 +1,123 @@
[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md)
# ERNIE-4.5-VL-28B-A3B-Thinking
## 1. Environment Preparation
### 1.1 Support Status
The minimum number of GPUs required to deploy the model on each type of hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |
### 1.2 Install FastDeploy
For the installation process, refer to the [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md) documentation.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 32 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }' \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 16384 \
--quantization wint4
```
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 128 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }' \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 16384 \
--quantization wint4
```
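To illustrate multi-image input under the `--limit-mm-per-prompt` setting used above, a request such as the following sketch could be sent to the service; the image URLs are placeholders and the `model` field is assumed to match the value passed to `--model`.
```shell
# Illustrative multi-image request to the service started above (listening on --port 8180).
curl -s http://localhost:8180/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/page1.jpg"}},
      {"type": "image_url", "image_url": {"url": "https://example.com/page2.jpg"}},
      {"type": "text", "text": "Compare these two images and summarize the differences."}
    ]
  }]
}'
```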
The launch examples above are configurations that run stably while also delivering relatively good performance.
If you have further requirements for precision or performance, continue reading the sections below.
### 2.2 Advanced: How to Achieve Better Performance
#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context length**
- **Parameters:** `--max-model-len`
- **Description:** Controls the maximum context length that the model can process.
- **Recommendation:** Longer context lengths reduce throughput; set this based on actual needs. `ERNIE-4.5-VL-28B-A3B-Thinking` supports a context length of up to **128K** (131,072).
⚠️ Note: Longer context lengths significantly increase GPU memory requirements. Make sure your hardware resources are sufficient before setting a longer context.
> **Maximum number of sequences**
- **Parameters:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle; the supported range is 1 to 256.
- **Recommendation:** If you do not know the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request is significantly lower than 256, we suggest setting it to a smaller value slightly above that average to further reduce GPU memory usage and optimize service performance.
> **Multi-image and multi-video input**
- **Parameters** `--limit-mm-per-prompt`
- **Description** Our model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request and ensure efficient resource utilization.
- **Recommendation** We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
> **Available GPU memory ratio during initialization**
- **Parameters:** `--gpu-memory-utilization`
- **Description:** Controls the GPU memory available for FastDeploy service initialization. The default is 0.9, i.e., 10% of the memory is reserved as headroom.
- **Recommendation:** We recommend the default value of 0.9. If an out-of-memory error occurs during stress testing, you may try lowering this value.
#### 2.2.2 Chunked Prefill
- **Parameters:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` reduces peak GPU memory usage and improves service throughput. It is **enabled by default** since version 2.2; for earlier versions it must be enabled manually (refer to the 2.1 best practices documentation).
- **Relevant configurations**:
`--max-num-batched-tokens` limits the maximum number of tokens per chunk. In multi-modal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step may exceed this value. A setting of 384 is recommended.
#### 2.2.3 **Quantization precision**
- **Parameters:** `--quantization`
- **Supported precision types:**
- WINT4 (suitable for most users)
- WINT8
- BFLOAT16 (used by default when the `--quantization` parameter is not set)
- **Recommendation:**
- Unless you have extremely strict precision requirements, we recommend WINT4 quantization. This significantly reduces memory consumption and increases throughput.
- If slightly higher precision is required, try WINT8.
- Only consider BFLOAT16 if your application scenario demands the highest precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
- **Recommendation:** This is a relatively aggressive optimization that affects output quality, and we are still validating its impact comprehensively. If you have high performance requirements and can accept a potential impact on results, you may try enabling it.
## 3. FAQ
### 3.1 Out of Memory (OOM)
If the service reports insufficient GPU memory during startup, please try the following:
1. Ensure no other processes are occupying GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce the context length and the maximum number of sequences as needed;
4. Increase the number of GPUs used for deployment (e.g., 2 or 4 GPUs) by setting `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
If the service starts normally but reports insufficient memory at runtime, try the following:
1. Lower the GPU memory ratio available at initialization by adjusting `--gpu-memory-utilization`;
2. Increase the number of deployment GPUs (same parameter adjustment as above).


@@ -48,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
|模型|DataType|模型案例|
|-|-|-|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[快速部署](./get_started/ernie-4.5-vl.md) &emsp; [最佳实践](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[快速部署](./get_started/quick_start_vl.md) &emsp; [最佳实践](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[快速部署](./get_started/ernie-4.5-vl.md) &emsp; [最佳实践](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[快速部署](./get_started/quick_start_vl.md) &emsp; [最佳实践](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Thinking<br>&emsp;[快速部署](./get_started/ernie-4.5-vl-thinking.md)&emsp; [最佳实践](./best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md) ;
| PaddleOCR-VL |BF16/WINT4/WINT8| PaddlePaddle/PaddleOCR-VL<br>&emsp; [最佳实践](./best_practices/PaddleOCR-VL-0.9B.md) ;|
| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|