From 1ef38b1563dec0962a124b214c88c7c518223862 Mon Sep 17 00:00:00 2001 From: JYChen Date: Thu, 31 Jul 2025 17:21:55 +0800 Subject: [PATCH] [doc] best practice for eb45 text models (#3002) * [doc] best practice for eb45 text models * fix docs --- .../ERNIE-4.5-0.3B-Paddle.md | 93 +++++++++++ .../ERNIE-4.5-21B-A3B-Paddle.md | 149 ++++++++++++++++++ .../ERNIE-4.5-300B-A47B-Paddle.md | 127 +++++++++++++++ docs/optimal_deployment/FAQ.md | 37 +++++ .../ERNIE-4.5-0.3B-Paddle.md | 93 +++++++++++ .../ERNIE-4.5-21B-A3B-Paddle.md | 149 ++++++++++++++++++ .../ERNIE-4.5-300B-A47B-Paddle.md | 128 +++++++++++++++ docs/zh/optimal_deployment/FAQ.md | 37 +++++ 8 files changed, 813 insertions(+) create mode 100644 docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md create mode 100644 docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md create mode 100644 docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md create mode 100644 docs/optimal_deployment/FAQ.md create mode 100644 docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md create mode 100644 docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md create mode 100644 docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md create mode 100644 docs/zh/optimal_deployment/FAQ.md diff --git a/docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md new file mode 100644 index 000000000..66cbb8a16 --- /dev/null +++ b/docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md @@ -0,0 +1,93 @@ +# ERNIE-4.5-0.3B +## Environmental Preparation +### 1.1 Hardware requirements +The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following hardware for each quantization is as follows: +| | WINT8 | WINT4 | FP8 | +|-----|-----|-----|-----| +|H800 80GB| 1 | 1 | 1 | +|A800 80GB| 1 | 1 | / | +|H20 96GB| 1 | 1 | 1 | +|L20 48GB| 1 | 1 | 1 | +|A30 40GB| 1 | 1 | / | +|A10 24GB| 1 | 1 | / | + +**Tips:** +1. To modify the number of deployment GPUs, specify `--tensor-parallel-size 2` in starting command. +2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory. + +### 1.2 Install fastdeploy +- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md). + +- Model Download,For detail, please refer to [Supported Models](../supported_models.md). **Please note that models with Paddle suffix need to be used for Fastdeploy**: + +## 2.How to Use +### 2.1 Basic: Launching the Service +Start the service by following command: +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-0.3B-Paddle \ + --tensor-parallel-size 1 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed). +- `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency. 
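+
+Once the service is up, it exposes an OpenAI-compatible HTTP API that can be queried for a quick sanity check. The sketch below is a minimal example and assumes the service was launched with `--port 8180` added to the command above (adjust the host and port to your deployment); depending on the FastDeploy version, a `model` field matching the served model name may also need to be included in the request body.
+```bash
+curl -s http://127.0.0.1:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "messages": [{"role": "user", "content": "Introduce the ERNIE model in one sentence."}],
+        "stream": false
+      }'
+```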
+
+For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
+
+### 2.2 Advanced: How to get better performance
+#### 2.2.1 Correctly set parameters that match the application scenario
+Based on the application scenario, evaluate the average input length, average output length, and maximum context length:
+- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 tokens and the average output length is 30000 tokens, 32768 is a recommended setting.
+- **Enable the service management global block**
+
+```
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+```
+
+#### 2.2.2 Prefix Caching
+**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for multiple requests that share the same prefix. For details, refer to [Prefix Caching](../features/prefix_caching.md).
+
+**How to enable:**
+Add the following lines to the startup parameters. `--enable-prefix-caching` turns on prefix caching, and `--swap-space` adds a CPU cache (in GB) on top of the GPU cache; adjust the size to the actual machine.
+```
+--enable-prefix-caching
+--swap-space 50
+```
+
+#### 2.2.3 Chunked Prefill
+**Idea:** This strategy splits prefill-stage requests into small sub-chunks and executes them in batches mixed with decode requests. This better balances the compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).
+
+**How to enable:** Add the following line to the startup parameters
+```
+--enable-chunked-prefill
+```
+
+#### 2.2.4 CUDAGraph
+**Idea:**
+CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures a sequence of CUDA operations into a graph structure, encapsulating a series of GPU computation and memory operations into a single re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
+
+**How to enable:**
+Add the following line to the startup parameters
+```
+--use-cudagraph
+```
+Notes:
+1. Usually no additional parameters are needed, but CUDAGraph introduces some extra GPU memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed tuning, see the [GraphOptimizationBackend](../parameters.md) configuration parameter descriptions.
+2. When CUDAGraph is enabled, only single-GPU inference is supported, i.e. `--tensor-parallel-size 1`.
+3. When CUDAGraph is enabled, enabling `Chunked Prefill` or `Prefix Caching` at the same time is not supported.
+
+#### 2.2.5 Rejection Sampling
+**Idea:**
+Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby increasing sampling speed, which gives a noticeable improvement for small models.
+ +**How to enable:** +Add the following environment variables before starting +``` +export FD_SAMPLING_CLASS=rejection +``` + +## FAQ +If you encounter any problems during use, you can refer to [FAQ](./FAQ.md). diff --git a/docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md new file mode 100644 index 000000000..50029db81 --- /dev/null +++ b/docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md @@ -0,0 +1,149 @@ +# ERNIE-4.5-21B-A3B +## Environmental Preparation +### 1.1 Hardware requirements +The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the following hardware for each quantization is as follows: +| | WINT8 | WINT4 | FP8 | +|-----|-----|-----|-----| +|H800 80GB| 1 | 1 | 1 | +|A800 80GB| 1 | 1 | / | +|H20 96GB| 1 | 1 | 1 | +|L20 48GB| 1 | 1 | 1 | +|A30 40GB| 2 | 1 | / | +|A10 24GB| 2 | 1 | / | + +**Tips:** +1. To modify the number of deployment GPUs, specify `--tensor-parallel-size 2` in starting command. +2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory. + +### 1.2 Install fastdeploy and prepare the model +- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md). + +- Model Download,For detail, please refer to [Supported Models](../supported_models.md). **Please note that models with Paddle suffix need to be used for Fastdeploy**: + +## 2.How to Use +### 2.1 Basic: Launching the Service +Start the service by following command: +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --tensor-parallel-size 1 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed). +- `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency. + +For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。 + +### 2.2 Advanced: How to get better performance +#### 2.2.1 Correctly set parameters that match the application scenario +Evaluate average input length, average output length, and maximum context length +- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768 +- **Enable the service management global block** + +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +#### 2.2.2 Prefix Caching +**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md) + +**How to enable:** +Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine. 
+``` +--enable-prefix-caching +--swap-space 50 +``` + +#### 2.2.3 Chunked Prefill +**Idea:** This strategy is adopted to split the prefill stage request into small-scale sub-chunks, and execute them in batches mixed with the decode request. This can better balance the computation-intensive (Prefill) and memory-intensive (Decode) operations, optimize GPU resource utilization, reduce the computational workload and memory usage of a single Prefill, thereby reducing the peak memory usage and avoiding the problem of insufficient memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md) + +**How to enable:** Add the following lines to the startup parameters +``` +--enable-chunked-prefill +``` + +#### 2.2.4 MTP (Multi-Token Prediction) +**Idea:** +By predicting multiple tokens at once, the number of decoding steps is reduced to significantly speed up the generation speed, while maintaining the generation quality through certain strategies. For details, please refer to [Speculative Decoding](../features/speculative_decoding.md)。 + +**How to enable:** +Add the following lines to the startup parameters +``` +--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' +``` + +#### 2.2.5 CUDAGraph +**Idea:** +CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It achieves efficient execution and optimization of GPU tasks by capturing CUDA operation sequences into a graph structure. The core idea of CUDAGraph is to encapsulate a series of GPU computing and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, reducing kernel startup latency, and improving overall computing performance. + +**How to enable:** +Add the following lines to the startup parameters +``` +--use-cudagraph +``` +Notes: +1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../parameters.md) for related configuration parameter descriptions +2. When CUDAGraph is enabled, only single-card inference is supported, that is, `--tensor-parallel-size 1` +3. When CUDAGraph is enabled, it is not supported to enable `Chunked Prefill` and `Prefix Caching` at the same time + +#### 2.2.6 Rejection Sampling +**Idea:** +Rejection sampling is to generate samples from a proposal distribution that is easy to sample, avoiding explicit sorting to increase the sampling speed, which has a significant improvement on small-sized models. + +**How to enable:** +Add the following environment variables before starting +``` +export FD_SAMPLING_CLASS=rejection +``` + +#### 2.2.7 Disaggregated Deployment +**Idea:** Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency. + +**How to enable:** Take the deployment of a single machine with 8 GPUs and 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment method, `--splitwise-role` is required to specify the role of the node. And the GPUs and logs of the two nodes are isolated through the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`. 
+``` +# prefill +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export INFERENCE_MSG_QUEUE_ID=1315 +export FLAGS_max_partition_size=2048 +export FD_ATTENTION_BACKEND=FLASH_ATTN +export FD_LOG_DIR="prefill_log" + +quant_type=block_wise_fp8 +export FD_USE_DEEP_GEMM=0 + +python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --max-model-len 131072 \ + --max-num-seqs 20 \ + --num-gpu-blocks-override 40000 \ + --quantization ${quant_type} \ + --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \ + --port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \ + --cache-queue-port 7015 \ + --splitwise-role "prefill" \ +``` +``` +# decode +export CUDA_VISIBLE_DEVICES=4,5,6,7 +export INFERENCE_MSG_QUEUE_ID=1215 +export FLAGS_max_partition_size=2048 +export FD_LOG_DIR="decode_log" + +quant_type=block_wise_fp8 +export FD_USE_DEEP_GEMM=0 + +python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --max-model-len 131072 \ + --max-num-seqs 20 \ + --quantization ${quant_type} \ + --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \ + --port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \ + --cache-queue-port 8015 \ + --innode-prefill-ports 7013 \ + --splitwise-role "decode" +``` + +## FAQ +If you encounter any problems during use, you can refer to [FAQ](./FAQ.md). diff --git a/docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md b/docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md new file mode 100644 index 000000000..a7eb9499c --- /dev/null +++ b/docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md @@ -0,0 +1,127 @@ +# ERNIE-4.5-300B-A47B +## Environmental Preparation +### 1.1 Hardware requirements +The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the following hardware for each quantization is as follows: +| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 | +|-----|-----|-----|-----|-----|-----| +|H800 80GB| 8 | 4 | 8 | 2 | 4 | +|A800 80GB| 8 | 4 | / | 2 | 4 | + +**Tips:** +1. To modify the number of deployment GPUs, specify `--tensor-parallel-size 4` in starting command. +2. Since only 4-GPSs quantization scale is provided, the W4A8 model needs to be deployed on 4 GPUs. +3. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory. + +### 1.2 Install fastdeploy +- Installation: For detail, please refer to [Fastdeploy Installation](../get_started/installation/README.md). + +- Model Download,For detail, please refer to [Supported Models](../supported_models.md). **Please note that models with Paddle suffix need to be used for Fastdeploy**: + +## 2.How to Use +### 2.1 Basic: Launching the Service +Start the service by following command: +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-300B-A47B-Paddle \ + --tensor-parallel-size 8 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed). +- `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency. 
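+
+The launched service also exposes an OpenAI-compatible HTTP API, so responses can be returned in one shot or streamed token by token. Below is a minimal streaming sketch; it assumes the service was launched with `--port 8180` added to the command above (adjust the host and port to your deployment; a `model` field matching the served model name may also be required depending on the FastDeploy version).
+```bash
+# Tokens are returned incrementally as server-sent events ("data: ..." lines).
+curl -s -N http://127.0.0.1:8180/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "messages": [{"role": "user", "content": "Write a short poem about the sea."}],
+        "stream": true
+      }'
+```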
+ +For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。 + +### 2.2 Advanced: How to get better performance +#### 2.2.1 Correctly set parameters that match the application scenario +Evaluate average input length, average output length, and maximum context length +- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768 +- **Enable the service management global block** + +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +#### 2.2.2 Prefix Caching +**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md) + +**How to enable:** +Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine. +``` +--enable-prefix-caching +--swap-space 50 +``` + +#### 2.2.3 Chunked Prefill +**Idea:** This strategy is adopted to split the prefill stage request into small-scale sub-chunks, and execute them in batches mixed with the decode request. This can better balance the computation-intensive (Prefill) and memory-intensive (Decode) operations, optimize GPU resource utilization, reduce the computational workload and memory usage of a single Prefill, thereby reducing the peak memory usage and avoiding the problem of insufficient memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md) + +**How to enable:** Add the following lines to the startup parameters +``` +--enable-chunked-prefill +``` + +#### 2.2.4 MTP (Multi-Token Prediction) +**Idea:** +By predicting multiple tokens at once, the number of decoding steps is reduced to significantly speed up the generation speed, while maintaining the generation quality through certain strategies. For details, please refer to [Speculative Decoding](../features/speculative_decoding.md)。 + +**How to enable:** +Add the following lines to the startup parameters +``` +--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' +``` + +#### 2.2.5 W4A8C8 Quantization +**Idea:** +Quantization can achieve model compression, reduce GPU memory usage and speed up inference. To achieve better inference results, per-channel symmetric 4-bit quantization is used for MoE weights. static per-tensor symmetric 8-bit quantization is used for activation. And static per-channel symmetric 8-bit quantization is used for KVCache. + +**How to enable:** +Just specify the corresponding model name in the startup command, `baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle` +``` +--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle +``` + +#### 2.2.6 Rejection Sampling +**Idea:** +Rejection sampling is to generate samples from a proposal distribution that is easy to sample, avoiding explicit sorting to increase the sampling speed, which has a significant improvement on small-sized models. 
+
+**How to enable:**
+Add the following environment variable before starting
+```
+export FD_SAMPLING_CLASS=rejection
+```
+
+#### 2.2.7 Disaggregated Deployment
+**Idea:** Deploying Prefill and Decode separately can, in certain scenarios, improve hardware utilization, effectively increase throughput, and reduce end-to-end latency.
+
+**How to enable:** Take a single machine with 8 GPUs deployed as 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` is required to specify the role of each node, and the GPUs and logs of the two nodes are isolated through the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
+```
+export FD_LOG_DIR="log_prefill"
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8180 --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --cache-queue-port 8183 \
+    --tensor-parallel-size 4 \
+    --quantization wint4 \
+    --splitwise-role "prefill"
+```
+```
+export FD_LOG_DIR="log_decode"
+export CUDA_VISIBLE_DEVICES=4,5,6,7
+# Note that innode-prefill-ports must be set to the engine-worker-queue-port of the Prefill service
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8184 --metrics-port 8185 \
+    --engine-worker-queue-port 8186 \
+    --cache-queue-port 8187 \
+    --tensor-parallel-size 4 \
+    --quantization wint4 \
+    --innode-prefill-ports 8182 \
+    --splitwise-role "decode"
+```
+
+## FAQ
+If you encounter any problems during use, you can refer to the [FAQ](./FAQ.md).
diff --git a/docs/optimal_deployment/FAQ.md b/docs/optimal_deployment/FAQ.md
new file mode 100644
index 000000000..71e80ce05
--- /dev/null
+++ b/docs/optimal_deployment/FAQ.md
@@ -0,0 +1,37 @@
+# FAQ
+## 1. CUDA out of memory
+1. When starting the service:
+- Check the minimum number of deployment GPUs for the model and quantization method; if the requirement is not met, increase the number of deployment GPUs.
+- If CUDAGraph is enabled, try to reserve more GPU memory for CUDAGraph by lowering `gpu_memory_utilization`, or reduce the GPU memory usage of CUDAGraph by reducing `max_num_seqs` and setting `cudagraph_capture_sizes`.
+
+2. During service operation:
+- Check whether the log contains information similar to the following. If so, it is usually caused by insufficient output blocks, and you need to reduce `kv-cache-ratio`:
+```
+need_block_len: 1, free_list_len: 0
+step max_id: 2, max_num: 133, encoder block len: 24
+recover seq_id: 2, free_list_len: 144, used_list_len: 134
+need_block_len: 1, free_list_len: 0
+step max_id: 2, max_num: 144, encoder_block_len: 24
+```
+
+It is recommended to enable the service management global block by adding the following environment variable before starting the service.
+```
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+```
+
+## 2. Poor model performance
+1. First, check whether the output length meets expectations and whether the slowdown is caused by overly long decoding. If the output is indeed long, check whether the log contains information similar to the following.
If so, it is usually caused by insufficient output blocks and you need to reduce `kv-cache-ratio` +``` +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 133, encoder block len: 24 +recover seq_id: 2, free_list_len: 144, used_list_len: 134 +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 144, encoder_block_len: 24 +``` + +It is also recommended to enable the service management global block. You need add environment variables before starting the service. +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +2. Check whether the KVCache blocks allocated by the automatic profile are as expected. If the automatic profile is affected by the fluctuation of video memory and may result in less allocation, you can manually set the `num_gpu_blocks_override` parameter to expand the KVCache block. diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md new file mode 100644 index 000000000..4533a6fee --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md @@ -0,0 +1,93 @@ +# ERNIE-4.5-0.3B +## 一、环境准备 +### 1.1 支持情况 +ERNIE-4.5-0.3B 各量化精度,在下列硬件上部署所需要的最小卡数如下: +| | WINT8 | WINT4 | FP8 | +|-----|-----|-----|-----| +|H800 80GB| 1 | 1 | 1 | +|A800 80GB| 1 | 1 | / | +|H20 96GB| 1 | 1 | 1 | +|L20 48GB| 1 | 1 | 1 | +|A30 40GB| 1 | 1 | / | +|A10 24GB| 1 | 1 | / | + +**注:** +1. 在启动命令后指定`--tensor-parallel-size 1` 即可修改部署卡数 +2. 表格中未列出的硬件,可根据显存大小进行预估是否可以部署 + +### 1.2 安装fastdeploy +- 安装请参考[Fastdeploy Installation](../get_started/installation/README.md)完成安装。 + +- 模型下载,请参考[支持模型列表](../supported_models.md)。**请注意使用Fastdeploy部署需要Paddle后缀的模型** + +## 二、如何使用 +### 2.1 基础:启动服务 +通过下列命令启动服务 +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-0.3B-Paddle \ + --tensor-parallel-size 1 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +其中: +- `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。 +- `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。 + +更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。 + +### 2.2 进阶:如何获取更优性能 +#### 2.2.1 评估应用场景,正确设置参数 +结合应用场景,评估平均输入长度、平均输出长度、最大上下文长度。例如,平均输入长度为1000,输出长度为30000,那么建议设置为 32768 +- 根据最大上下文长度,设置`max-model-len` +- **启用服务管理全局 Block** +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +#### 2.2.2 Prefix Caching +**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果(KV Cache),避免重复计算,从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md) + +**启用方式:** +在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上,额外开启CPU缓存,大小为GB,应根据机器实际情况调整。 +``` +--enable-prefix-caching +--swap-space 50 +``` + +#### 2.2.3 Chunked Prefill +**原理:** 采用分块策略,将预填充(Prefill)阶段请求拆解为小规模子任务,与解码(Decode)请求混合批处理执行。可以更好地平衡计算密集型(Prefill)和访存密集型(Decode)操作,优化GPU资源利用率,减少单次Prefill的计算量和显存占用,从而降低显存峰值,避免显存不足的问题。 具体请参考[Chunked Prefill](../features/chunked_prefill.md) + +**启用方式:** 在启动参数下增加即可 +``` +--enable-chunked-prefill +``` + +#### 2.2.4 CUDAGraph +**原理:** +CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操作序列捕获(capture)为图结构(graph),实现 GPU 任务的高效执行和优化。CUDAGraph 的核心思想是将一系列 GPU 计算和内存操作封装为一个可重复执行的图,从而减少 CPU-GPU 通信开销、降低内核启动延迟,并提升整体计算性能。 + +**启用方式:** +在启动命令中增加 +``` +--use-cudagraph +``` +注: +1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../parameters.md) 相关配置参数说明 +2. 开启CUDAGraph时,暂时只支持单卡推理,即`--tensor-parallel-size 1` +3. 
开启CUDAGraph时,暂时不支持同时开启`Chunked Prefill`和`Prefix Caching` + +#### 2.2.5 拒绝采样 +**原理:** +拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,对小尺寸的模型有较明显的提升。 + +**启用方式:** +启动前增加下列环境变量 +``` +export FD_SAMPLING_CLASS=rejection +``` + +## 三、常见问题FAQ +如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。 diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md new file mode 100644 index 000000000..9c975662f --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md @@ -0,0 +1,149 @@ +# ERNIE-4.5-21B-A3B +## 一、环境准备 +### 1.1 支持情况 +ERNIE-4.5-21B-A3B 各量化精度,在下列硬件上部署所需要的最小卡数如下: +| | WINT8 | WINT4 | FP8 | +|-----|-----|-----|-----| +|H800 80GB| 1 | 1 | 1 | +|A800 80GB| 1 | 1 | / | +|H20 96GB| 1 | 1 | 1 | +|L20 48GB| 1 | 1 | 1 | +|A30 40GB| 2 | 1 | / | +|A10 24GB| 2 | 1 | / | + +**注:** +1. 在启动命令后指定`--tensor-parallel-size 2` 即可修改部署卡数 +2. 表格中未列出的硬件,可根据显存大小进行预估是否可以部署 + +### 1.2 安装fastdeploy +- 安装,请参考[Fastdeploy Installation](../get_started/installation/README.md)完成安装。 + +- 模型下载,请参考[支持模型列表](../supported_models.md)。**请注意使用Fastdeploy部署需要Paddle后缀的模型** + +## 二、如何使用 +### 2.1 基础:启动服务 +通过下列命令启动服务 +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --tensor-parallel-size 1 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +其中: +- `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。 +- `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。 + +更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。 + +### 2.2 进阶:如何获取更优性能 +#### 2.2.1 评估应用场景,正确设置参数 +结合应用场景,评估平均输入长度、平均输出长度、最大上下文长度。例如,平均输入长度为1000,输出长度为30000,那么建议设置为 32768 +- 根据最大上下文长度,设置`max-model-len` +- **启用服务管理全局 Block** +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +#### 2.2.2 Prefix Caching +**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果(KV Cache),避免重复计算,从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md) + +**启用方式:** +在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上,额外开启CPU缓存,大小为GB,应根据机器实际情况调整。 +``` +--enable-prefix-caching +--swap-space 50 +``` + +#### 2.2.3 Chunked Prefill +**原理:** 采用分块策略,将预填充(Prefill)阶段请求拆解为小规模子任务,与解码(Decode)请求混合批处理执行。可以更好地平衡计算密集型(Prefill)和访存密集型(Decode)操作,优化GPU资源利用率,减少单次Prefill的计算量和显存占用,从而降低显存峰值,避免显存不足的问题。 具体请参考[Chunked Prefill](../features/chunked_prefill.md) + +**启用方式:** 在启动参数下增加即可 +``` +--enable-chunked-prefill +``` + +#### 2.2.4 MTP (Multi-Token Prediction) +**原理:** +通过一次性预测多个Token,减少解码步数,以显著加快生成速度,同时通过一定策略保持生成质量。具体请参考[投机解码](../features/speculative_decoding.md)。 + +**启用方式:** +在启动参数下增加即可 +``` +--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' +``` + +#### 2.2.5 CUDAGraph +**原理:** +CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操作序列捕获(capture)为图结构(graph),实现 GPU 任务的高效执行和优化。CUDAGraph 的核心思想是将一系列 GPU 计算和内存操作封装为一个可重复执行的图,从而减少 CPU-GPU 通信开销、降低内核启动延迟,并提升整体计算性能。 + +**启用方式:** +在启动命令中增加 +``` +--use-cudagraph +``` +注: +1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../parameters.md) 相关配置参数说明 +2. 开启CUDAGraph时,暂时只支持单卡推理,即`--tensor-parallel-size 1` +3. 
开启CUDAGraph时,暂时不支持同时开启`Chunked Prefill`和`Prefix Caching` + +#### 2.2.6 拒绝采样 +**原理:** +拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,对小尺寸的模型有较明显的提升。 + +**启用方式:** +启动前增加下列环境变量 +``` +export FD_SAMPLING_CLASS=rejection +``` + +#### 2.2.7 分离式部署 +**原理:** 分离式部署的核心思想是将Prefill 和 Decode 分开部署,在一定场景下可以提高硬件利用率,有效提高吞吐,降低整句时延。具体请参考分离式部署 + +**启用方式:** 以单机8GPU,1P1D(各4GPU)部署为例,与默认的混合式部署方式相比, 需要`--splitwise-role`指定节点的角色。并通过环境变量`FD_LOG_DIR`和`CUDA_VISIBLE_DEVICES`将两个节点的GPU 和日志隔离开 +``` +# prefill +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export INFERENCE_MSG_QUEUE_ID=1315 +export FLAGS_max_partition_size=2048 +export FD_ATTENTION_BACKEND=FLASH_ATTN +export FD_LOG_DIR="prefill_log" + +quant_type=block_wise_fp8 +export FD_USE_DEEP_GEMM=0 + +python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --max-model-len 131072 \ + --max-num-seqs 20 \ + --num-gpu-blocks-override 40000 \ + --quantization ${quant_type} \ + --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \ + --port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \ + --cache-queue-port 7015 \ + --splitwise-role "prefill" \ +``` +``` +# decode +export CUDA_VISIBLE_DEVICES=4,5,6,7 +export INFERENCE_MSG_QUEUE_ID=1215 +export FLAGS_max_partition_size=2048 +export FD_LOG_DIR="decode_log" + +quant_type=block_wise_fp8 +export FD_USE_DEEP_GEMM=0 + +python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \ + --max-model-len 131072 \ + --max-num-seqs 20 \ + --quantization ${quant_type} \ + --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \ + --port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \ + --cache-queue-port 8015 \ + --innode-prefill-ports 7013 \ + --splitwise-role "decode" +``` + +## 三、常见问题FAQ +如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。 diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md new file mode 100644 index 000000000..e91d9b176 --- /dev/null +++ b/docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md @@ -0,0 +1,128 @@ +# ERNIE-4.5-300B-A47B +## 一、环境准备 +### 1.1 支持情况 +ERNIE-4.5-300B-A47B各量化精度,在下列硬件上部署所需要的最小卡数如下: +| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 | +|-----|-----|-----|-----|-----|-----| +|H800 80GB| 8 | 4 | 8 | 2 | 4 | +|A800 80GB| 8 | 4 | / | 2 | 4 | + +**注:** +1. 在启动命令后指定`--tensor-parallel-size 4`即可修改部署卡数 +2. 由于仅提供4卡量化scale,W4A8模型需部署在4卡 +3. 
表格中未列出的硬件,可根据显存大小进行预估是否可以部署 + +### 1.2 安装fastdeploy +- 安装,请参考[Fastdeploy Installation](../get_started/installation/README.md)完成安装。 + +- 模型下载,请参考[支持模型列表](../supported_models.md)。**请注意使用Fastdeploy部署需要Paddle后缀的模型** + +## 二、如何使用 +### 2.1 基础:启动服务 +通过下列命令启动服务 +```bash +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-300B-A47B-Paddle \ + --tensor-parallel-size 8 \ + --quantization wint4 \ + --max-model-len 32768 \ + --kv-cache-ratio 0.75 \ + --max-num-seqs 128 +``` +其中: +- `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。 +- `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。 + +更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。 + +### 2.2 进阶:如何获取更优性能 +#### 2.2.1 评估应用场景,正确设置参数 +结合应用场景,评估平均输入长度、平均输出长度、最大上下文长度 +- 根据最大上下文长度,设置`max-model-len`。例如,平均输入长度为1000,输出长度为30000,那么建议设置为 32768 +- **启用服务管理全局 Block** + +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +#### 2.2.2 Prefix Caching +**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果(KV Cache),避免重复计算,从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md) + +**启用方式:** +在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上,额外开启CPU缓存,大小为GB,应根据机器实际情况调整。 +``` +--enable-prefix-caching +--swap-space 50 +``` + +#### 2.2.3 Chunked Prefill +**原理:** 采用分块策略,将预填充(Prefill)阶段请求拆解为小规模子任务,与解码(Decode)请求混合批处理执行。可以更好地平衡计算密集型(Prefill)和访存密集型(Decode)操作,优化GPU资源利用率,减少单次Prefill的计算量和显存占用,从而降低显存峰值,避免显存不足的问题。 具体请参考[Chunked Prefill](../features/chunked_prefill.md) + +**启用方式:** 在启动参数下增加即可 +``` +--enable-chunked-prefill +``` + +#### 2.2.4 MTP (Multi-Token Prediction) +**原理:** +通过一次性预测多个Token,减少解码步数,以显著加快生成速度,同时通过一定策略保持生成质量。具体请参考[投机解码](../features/speculative_decoding.md)。 + +**启用方式:** +在启动参数下增加即可 +``` +--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' +``` + +#### 2.2.5 W4A8C8量化 +**原理:** +量化可以实现模型的压缩,减少显存占用并加快推理计算速度。对模型MOE部分权重使用per-channel对称4比特量化,激活使用静态per-tensor对称8比特量化,KVCache使用静态per-channel对称8比特量化。以实现更优的推理效果。 + +**启用方式:** +需要在启动命令中指定对应的模型名称,`baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle` +``` +--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle +``` + +#### 2.2.6 拒绝采样 +**原理:** +拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,对小尺寸的模型有较明显的提升。 + +**启用方式:** +启动前增加下列环境变量 +``` +export FD_SAMPLING_CLASS=rejection +``` + +#### 2.2.7 分离式部署 +**原理:** 分离式部署的核心思想是将Prefill 和 Decode 分开部署,在一定场景下可以提高硬件利用率,有效提高吞吐,降低整句时延。具体请参考分离式部署 + +**启用方式:** 以单机8GPU,1P1D(各4GPU)部署为例,与默认的混合式部署方式相比, 需要`--splitwise-role`指定节点的角色。并通过环境变量`FD_LOG_DIR`和`CUDA_VISIBLE_DEVICES`将两个节点的GPU 和日志隔离开 +``` +export FD_LOG_DIR="log_prefill" +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-300B-A47B-Paddle \ + --port 8180 --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --cache-queue-port 8183 \ + --tensor-parallel-size 4 \ + --quantization wint4 \ + --splitwise-role "prefill" +``` +``` +export FD_LOG_DIR="log_decode" +export CUDA_VISIBLE_DEVICES=4,5,6,7 +# 注意innode-prefill-ports指定为Prefill服务的engine-worker-queue-port +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-300B-A47B-Paddle\ + --port 8184 --metrics-port 8185 \ + --engine-worker-queue-port 8186 \ + --cache-queue-port 8187 \ + --tensor-parallel-size 4 \ + --quantization wint4 \ + --innode-prefill-ports 8182 \ + --splitwise-role "decode" +``` + +## 三、常见问题FAQ +如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。 diff --git a/docs/zh/optimal_deployment/FAQ.md 
b/docs/zh/optimal_deployment/FAQ.md new file mode 100644 index 000000000..6cf65552c --- /dev/null +++ b/docs/zh/optimal_deployment/FAQ.md @@ -0,0 +1,37 @@ +# 常见问题FAQ +## 1.显存不足 +1. 启动服务时显存不足: +- 核对模型和量化方式对应的部署最小卡数,如果不满足则需要增加部署卡数 +- 如果开启了CUDAGraph,尝试通过降低 `gpu_memory_utilization`来为CUDAGraph留存更多的显存,或通过减少 `max_num_seqs`,设置`cudagraph_capture_sizes`来减少CUDAGraph的显存占用。 + +2. 服务运行期间显存不足: +- 检查log中是否有类似如下信息,如有,通常是输出block不足导致,需要减小`kv-cache-ratio` +``` +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 133, encoder block len: 24 +recover seq_id: 2, free_list_len: 144, used_list_len: 134 +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 144, encoder_block_len: 24 +``` + +建议启用服务管理全局 Block功能,在启动服务前,加入环境变量 +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +## 2.模型性能差 +1. 首先检查输出长度是否符合预期,是否是解码过长导致。 +如果场景输出本身较长,请检查log中是否有类似如下信息,如有,通常是输出block不足导致,需要减小`kv-cache-ratio` +``` +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 133, encoder block len: 24 +recover seq_id: 2, free_list_len: 144, used_list_len: 134 +need_block_len: 1, free_list_len: 0 +step max_id: 2, max_num: 144, encoder_block_len: 24 +``` +同样建议启用服务管理全局 Block功能,在启动服务前,加入环境变量 +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +``` + +2. 检查自动profile分配的KVCache block是否符合预期,如果自动profile中受到显存波动影响可能导致分配偏少,可以通过手工设置`num_gpu_blocks_override`参数扩大KVCache block。
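+
+下面给出一个手工指定 KVCache block 数量的示意写法(仅为示例,`40000` 为占位值,请结合模型规模、量化方式与实际显存情况调整),在原有启动命令的参数中追加即可:
+```
+--num-gpu-blocks-override 40000
+```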