diff --git a/README.md b/README.md
index 8ddb61add..0e635bf46 100644
--- a/README.md
+++ b/README.md
@@ -68,6 +68,7 @@ Learn how to use FastDeploy through our documentation:
 - [Offline Inference Development](./docs/offline_inference.md)
 - [Online Service Deployment](./docs/online_serving/README.md)
 - [Full Supported Models List](./docs/supported_models.md)
+- [Optimal Deployment](./docs/optimal_deployment/README.md)
 
 ## Supported Models
diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
new file mode 100644
index 000000000..e79a7158c
--- /dev/null
+++ b/docs/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,123 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+
+The minimum number of GPUs required to deploy on each type of hardware is listed below:
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:|:------:|
+| A30 [24G] | 2 | 2 | 4 |
+| L20 [48G] | 1 | 1 | 2 |
+| H20 [144G] | 1 | 1 | 1 |
+| A100 [80G] | 1 | 1 | 1 |
+| H800 [80G] | 1 | 1 | 1 |
+
+### 1.2 Install FastDeploy
+
+For installation instructions, see the [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md) documentation.
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format; make sure to download a model whose name carries the `-Paddle` suffix.
+> - Passing a model name triggers an automatic download. If the model has already been downloaded, you can pass the absolute path of the download location instead.
+
+## 2. How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 32K context service on a single RTX 4090 GPU
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 1 \
+    --max-model-len 32768 \
+    --max-num-seqs 256 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.9 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+**Example 2:** Deploying a 128K context service on dual H800 GPUs
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 2 \
+    --max-model-len 131072 \
+    --max-num-seqs 256 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.9 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+The examples above are configurations that run stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters:** `--max-model-len`
+- **Description:** Controls the maximum context length the model can process.
+- **Recommendation:** Longer context lengths reduce throughput. Adjust based on actual needs; the maximum supported context length is **128K** (131,072).
+
+  ⚠️ Note: Longer context lengths significantly increase GPU memory requirements. Make sure your hardware resources are sufficient before configuring a longer context.
+> **Maximum sequence count**
+- **Parameters:** `--max-num-seqs`
+- **Description:** Controls the maximum number of sequences the service can handle; supports values from 1 to 256.
+- **Recommendation:** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average is significantly lower than 256, set it to a value slightly above the average to further reduce GPU memory usage and optimize service performance.
+
+> **Multi-image and multi-video input**
+- **Parameters:** `--limit-mm-per-prompt`
+- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request and ensure efficient resource utilization.
+- **Recommendation:** We recommend limiting a single prompt to **100 images and 100 videos** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters:** `--gpu-memory-utilization`
+- **Description:** Controls the fraction of GPU memory available to FastDeploy for service initialization. The default is 0.9, i.e. 10% of memory is kept in reserve.
+- **Recommendation:** Use the default value of 0.9. If an "out of memory" error occurs during stress testing, try lowering this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters:** `--enable-chunked-prefill`
+- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations:**
+
+  `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk. In multimodal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step can exceed this value. The recommended setting is 384.
+
+#### 2.2.3 Quantization Precision
+- **Parameters:** `--quantization`
+
+- **Supported precision types:**
+  - WINT4 (suitable for most users)
+  - WINT8
+  - BFLOAT16 (used by default when `--quantization` is not set)
+
+- **Recommendation:**
+  - Unless you have extremely stringent precision requirements, we strongly recommend WINT4 quantization. It significantly reduces memory consumption and increases throughput.
+  - If slightly higher precision is required, try WINT8.
+  - Only consider BFLOAT16 if your application demands the utmost precision, as it requires significantly more GPU memory.
+
+## 3. FAQ
+**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.
+
+### 3.1 Out of Memory
+If the service reports "Out of Memory" during startup, try the following:
+1. Ensure no other processes are occupying GPU memory;
+2. Use WINT4/WINT8 quantization and enable chunked prefill;
+3. Reduce the context length and maximum sequence count as needed;
+4. Increase the number of GPUs used for deployment (e.g., 2 or 4) by setting `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Lowering the initial GPU memory utilization ratio via `--gpu-memory-utilization`;
+2. Increasing the number of deployment GPUs (same parameter adjustment as above).
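+
+Once the service is up, you can send a quick multimodal request to verify it end to end. The snippet below is a minimal sketch, assuming the server exposes the standard OpenAI-compatible `/v1/chat/completions` route on the `--port` used in the examples above (8180); the image URL is a placeholder, and the `model` field is assumed to match the name or path passed to `--model`.
+
+```shell
+# Hypothetical smoke test against the OpenAI-compatible endpoint assumed above.
+curl -s http://localhost:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
+    "messages": [{
+      "role": "user",
+      "content": [
+        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
+        {"type": "text", "text": "Describe this image."}
+      ]
+    }]
+  }'
+```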
diff --git a/docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
new file mode 100644
index 000000000..899ce425a
--- /dev/null
+++ b/docs/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -0,0 +1,99 @@
+
+# ERNIE-4.5-VL-424B-A47B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+The minimum number of GPUs required to deploy on each type of hardware is listed below:
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:|:------:|
+| H20 [144G] | 8 | 8 | 8 |
+| A100 [80G] | 8 | 8 | - |
+| H800 [80G] | 8 | 8 | - |
+
+### 1.2 Install FastDeploy
+
+For installation instructions, see the [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md) documentation.
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format; make sure to download a model whose name carries the `-Paddle` suffix.
+> - Passing a model name triggers an automatic download. If the model has already been downloaded, you can pass the absolute path of the download location instead.
+
+## 2. How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 128K context service on 8x H800 GPUs
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 8 \
+    --max-model-len 131072 \
+    --max-num-seqs 16 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.8 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+
+The example above is a configuration that runs stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters:** `--max-model-len`
+- **Description:** Controls the maximum context length the model can process.
+- **Recommendation:** Longer context lengths reduce throughput. Adjust based on actual needs; the maximum supported context length is **128K** (131,072).
+
+  ⚠️ Note: Longer context lengths significantly increase GPU memory requirements. Make sure your hardware resources are sufficient before configuring a longer context.
+> **Maximum sequence count**
+- **Parameters:** `--max-num-seqs`
+- **Description:** Controls the maximum number of sequences the service can handle; supports values from 1 to 256.
+- **Recommendation:** For a 128K context on a single node with 80G GPUs, we recommend setting it to **16**, as in the example above.
+
+> **Multi-image and multi-video input**
+- **Parameters:** `--limit-mm-per-prompt`
+- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request and ensure efficient resource utilization.
+- **Recommendation:** We recommend limiting a single prompt to **100 images and 100 videos** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters:** `--gpu-memory-utilization`
+- **Description:** Controls the fraction of GPU memory available to FastDeploy for service initialization. The default is 0.9, i.e. 10% of memory is kept in reserve.
+- **Recommendation:** For a 128K context we recommend 0.8, as in the example above. If an "out of memory" error occurs during stress testing, try lowering this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters:** `--enable-chunked-prefill`
+- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations:**
+
+  `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk. In multimodal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step can exceed this value. The recommended setting is 384.
+
+#### 2.2.3 Quantization Precision
+- **Parameters:** `--quantization`
+
+- **Supported precision types:**
+  - WINT4 (suitable for most users)
+  - WINT8
+  - BFLOAT16 (used by default when `--quantization` is not set)
+
+- **Recommendation:**
+  - Unless you have extremely stringent precision requirements, we strongly recommend WINT4 quantization. It significantly reduces memory consumption and increases throughput.
+  - If slightly higher precision is required, try WINT8.
+  - Only consider BFLOAT16 if your application demands the utmost precision, as it requires significantly more GPU memory.
+
+## 3. FAQ
+**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.
+
+### 3.1 Out of Memory
+If the service reports "Out of Memory" during startup, try the following:
+1. Ensure no other processes are occupying GPU memory;
+2. Use WINT4/WINT8 quantization and enable chunked prefill;
+3. Reduce the context length and maximum sequence count as needed.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Lowering the initial GPU memory utilization ratio via `--gpu-memory-utilization`.
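+
+As a quick way to act on item 1 above, you can inspect what is currently holding GPU memory before launching the service. This is a generic sketch using standard `nvidia-smi` queries rather than anything FastDeploy-specific; the reported processes and values will differ per machine.
+
+```shell
+# Show per-GPU memory usage, then the compute processes currently holding memory.
+nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
+nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
+```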
diff --git a/docs/optimal_deployment/README.md b/docs/optimal_deployment/README.md new file mode 100644 index 000000000..c3ab9875c --- /dev/null +++ b/docs/optimal_deployment/README.md @@ -0,0 +1,4 @@ +# Optimal Deployment + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) +- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md) diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md new file mode 100644 index 000000000..5888fd6d7 --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-28B-A3B-Paddle.md @@ -0,0 +1,124 @@ + +# ERNIE-4.5-VL-28B-A3B-Paddle + +## 一、环境准备 +### 1.1 支持情况 +在下列硬件上部署所需要的最小卡数如下: +| 设备[显存] | WINT4 | WINT8 | BFLOAT16 | +|:----------:|:----------:|:------:| :------:| +| A30 [24G] | 2 | 2 | 4 | +| L20 [48G] | 1 | 1 | 2 | +| H20 [144G] | 1 | 1 | 1 | +| A100 [80G] | 1 | 1 | 1 | +| H800 [80G] | 1 | 1 | 1 | + +### 1.2 安装fastdeploy + +安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md) + +> ⚠️ 注意事项 +> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型 +> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径 + +## 二、如何使用 +### 2.1 基础:启动服务 + **示例1:** 4090上单卡部署32K上下文的服务 +```shell +export ENABLE_V1_KVCACHE_SCHEDULER=1 + +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 1 \ + --max-model-len 32768 \ + --max-num-seqs 32 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 384 \ + --quantization wint4 \ + --enable-mm +``` + **示例2:** H800上双卡部署128K上下文的服务 +```shell +export ENABLE_V1_KVCACHE_SCHEDULER=1 + +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --tensor-parallel-size 2 \ + --max-model-len 131072 \ + --max-num-seqs 128 \ + --limit-mm-per-prompt '{"image": 100, "video": 100}' \ + --reasoning-parser ernie-45-vl \ + --gpu-memory-utilization 0.9 \ + --enable-chunked-prefill \ + --max-num-batched-tokens 384 \ + --quantization wint4 \ + --enable-mm +``` +示例是可以稳定运行的一组配置,同时也能得到比较好的性能。 +如果对精度、性能有进一步的要求,请继续阅读下面的内容。 +### 2.2 进阶:如何获取更优性能 + +#### 2.2.1 评估应用场景,正确设置参数 +> **上下文长度** +- **参数:** `--max-model-len` +- **描述:** 控制模型可处理的最大上下文长度。 +- **推荐:** 更长的上下文会导致吞吐降低,根据实际情况设置,`ERNIE-4.5-VL-28B-A3B-Paddle`最长支持**128k**(131072)长度的上下文。 + + ⚠️ 注:更长的上下文会显著增加GPU显存需求,设置更长的上下文之前确保硬件资源是满足的。 +> **最大序列数量** +- **参数:** `--max-num-seqs` +- **描述:** 控制服务可以处理的最大序列数量,支持1~256。 +- **推荐:** 如果您不知道实际应用场景中请求的平均序列数量是多少,我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256,我们建议设置为一个略大于平均值的较小值,以进一步降低显存占用,优化服务性能。 + +> **多图、多视频输入** +- **参数**:`--limit-mm-per-prompt` +- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。 +- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。 + +> **初始化时可用的显存比例** +- **参数:** `--gpu-memory-utilization` +- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。 +- **推荐:** 推荐使用默认值0.9。如果服务压测时提示显存不足,可以尝试调低该值。 + +#### 2.2.2 Chunked Prefill +- **参数:** `--enable-chunked-prefill` +- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。 + +- **其他相关配置**: + + `--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。我们推荐设置为384。 + +#### 2.2.3 **量化精度** +- **参数:** `--quantization` + +- **已支持的精度类型:** + - WINT4 (适合大多数用户) + - WINT8 + - BFLOAT16 (未设置 `--quantization` 
参数时,默认使用BFLOAT16)
+
+- **推荐:**
+  - 除非您有极其严格的精度要求,否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
+  - 若需要稍高的精度,可尝试WINT8。
+  - 仅当您的应用场景对精度有极致要求时,才尝试使用BFLOAT16,因为它需要更多显存。
+
+## 三、常见问题FAQ
+**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
+
+### 3.1 显存不足(OOM)
+如果服务启动时提示显存不足,请尝试以下方法:
+1. 确保无其他进程占用显卡显存;
+2. 使用WINT4/WINT8量化,开启chunked prefill;
+3. 酌情降低上下文长度和最大序列数量;
+4. 增加部署卡数,使用2卡或4卡部署,即修改参数 `--tensor-parallel-size 2` 或 `--tensor-parallel-size 4`。
+
+如果服务可以正常启动,运行时提示显存不足,请尝试以下方法:
+1. 酌情降低初始化时可用的显存比例,即调整参数 `--gpu-memory-utilization` 的值;
+2. 增加部署卡数,参数修改同上。
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
new file mode 100644
index 000000000..032e7d37a
--- /dev/null
+++ b/docs/zh/optimal_deployment/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -0,0 +1,99 @@
+
+# ERNIE-4.5-VL-424B-A47B-Paddle
+
+## 一、环境准备
+### 1.1 支持情况
+在下列硬件上部署所需要的最小卡数如下:
+| 设备[显存] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:|:------:|
+| H20 [144G] | 8 | 8 | 8 |
+| A100 [80G] | 8 | 8 | - |
+| H800 [80G] | 8 | 8 | - |
+
+### 1.2 安装fastdeploy
+
+安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ 注意事项
+> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
+> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径
+
+## 二、如何使用
+### 2.1 基础:启动服务
+ **示例1:** H800上8卡部署128K上下文的服务
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --tensor-parallel-size 8 \
+    --max-model-len 131072 \
+    --max-num-seqs 16 \
+    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+    --reasoning-parser ernie-45-vl \
+    --gpu-memory-utilization 0.8 \
+    --enable-chunked-prefill \
+    --max-num-batched-tokens 384 \
+    --quantization wint4 \
+    --enable-mm
+```
+示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
+如果对精度、性能有进一步的要求,请继续阅读下面的内容。
+### 2.2 进阶:如何获取更优性能
+
+#### 2.2.1 评估应用场景,正确设置参数
+> **上下文长度**
+- **参数:** `--max-model-len`
+- **描述:** 控制模型可处理的最大上下文长度。
+- **推荐:** 更长的上下文会导致吞吐降低,根据实际情况设置,`ERNIE-4.5-VL-424B-A47B-Paddle` 最长支持**128k**(131072)长度的上下文。
+
+> **最大序列数量**
+- **参数:** `--max-num-seqs`
+- **描述:** 控制服务可以处理的最大序列数量,支持1~256。
+- **推荐:** 128k场景下,80G显存的单机我们建议设置为**16**。
+
+> **多图、多视频输入**
+- **参数**:`--limit-mm-per-prompt`
+- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。
+- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。
+
+> **初始化时可用的显存比例**
+- **参数:** `--gpu-memory-utilization`
+- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。
+- **推荐:** 128k长度的上下文时推荐使用0.8。如果服务压测时提示显存不足,可以尝试调低该值。
+
+#### 2.2.2 Chunked Prefill
+- **参数:** `--enable-chunked-prefill`
+- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。
+
+- **其他相关配置**:
+
+  `--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。推荐设置为384。
+
+#### 2.2.3 **量化精度**
+- **参数:** `--quantization`
+
+- **已支持的精度类型:**
+  - WINT4 (适合大多数用户)
+  - WINT8
+  - BFLOAT16 (未设置 `--quantization` 参数时,默认使用BFLOAT16)
+
+- **推荐:**
+  - 除非您有极其严格的精度要求,否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
+  - 若需要稍高的精度,可尝试WINT8。
+  - 仅当您的应用场景对精度有极致要求时,才尝试使用BFLOAT16,因为它需要更多显存。
+
+## 三、常见问题FAQ
+**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
+
+### 3.1 显存不足(OOM)
+如果服务启动时提示显存不足,请尝试以下方法:
+1. 确保无其他进程占用显卡显存;
+2. 使用WINT4/WINT8量化,开启chunked prefill;
+3. 酌情降低上下文长度和最大序列数量。
+
+如果服务可以正常启动,运行时提示显存不足,请尝试以下方法:
+1. 
酌情降低初始化时可用的显存比例,即调整参数 `--gpu-memory-utilization` 的值。 diff --git a/docs/zh/optimal_deployment/README.md b/docs/zh/optimal_deployment/README.md new file mode 100644 index 000000000..b4e7401a0 --- /dev/null +++ b/docs/zh/optimal_deployment/README.md @@ -0,0 +1,4 @@ +# 最佳实践 + +- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md) +- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)