[docs] update best practice docs (#3969)

* update best practice docs

* add version and v1 loader info
Authored by JYChen on 2025-09-08 17:39:38 +08:00, committed by GitHub
parent 319a4bf75f
commit 1f056a7469
6 changed files with 85 additions and 67 deletions


@@ -19,22 +19,23 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
### 1.2 Install fastdeploy
- Installation: For details, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
- Model Download: For details, please refer to [Supported Models](../supported_models.md). **Please note that models with the Paddle suffix must be used with FastDeploy**
- Model Download: For details, please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: indicates the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
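Once the service is up, it exposes an OpenAI-compatible HTTP API. The request below is a minimal smoke-test sketch; it assumes the command above was additionally started with `--port 8180` and that the `model` field simply echoes the deployed model name, so adjust host, port, and payload to your deployment:
```bash
# Minimal chat-completion request to the OpenAI-compatible endpoint
# (assumes the server was launched with --port 8180 on this machine).
curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-0.3B-Paddle",
        "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
        "max_tokens": 64
      }'
```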
@@ -42,17 +43,14 @@ For more parameter meanings and default settings, see [FastDeploy Parameter Docu
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits prefill-stage requests into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-intensive (decode) operations, improves GPU resource utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Add the following lines to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, enable it manually by adding the following to the startup parameters:
```
--enable-chunked-prefill
```
@@ -79,7 +80,7 @@ Notes:
- Usually no additional parameters need to be set, but CUDAGraph incurs some extra memory overhead, which may need tuning in memory-constrained scenarios. For detailed parameter adjustments, please refer to the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
#### 2.2.6 Rejection Sampling
#### 2.2.5 Rejection Sampling
**Idea:**
Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting to increase sampling speed, which brings a noticeable improvement for small models.
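**How to enable (sketch):** judging from the `FD_SAMPLING_CLASS` environment variable referenced in the MTP notes of these best-practice docs, rejection sampling is switched on by setting the variable before launching the service:
```bash
export FD_SAMPLING_CLASS=rejection
```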


@@ -19,22 +19,23 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
### 1.2 Install fastdeploy and prepare the model
- Installation: For details, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
- Model Download: For details, please refer to [Supported Models](../supported_models.md). **Please note that models with the Paddle suffix must be used with FastDeploy**
- Model Download: For details, please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: indicates the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
@@ -42,17 +43,14 @@ For more parameter meanings and default settings, see [FastDeploy Parameter Docu
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits prefill-stage requests into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-intensive (decode) operations, improves GPU resource utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Add the following lines to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, enable it manually by adding the following to the startup parameters:
```
--enable-chunked-prefill
```
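Conversely, on version 2.2 and later, where Chunked Prefill is on by default, it can be switched off with the environment variable mentioned in the MTP notes below; set it before launching the service (a sketch, verify against your version):
```bash
export FD_DISABLE_CHUNKED_PREFILL=1
```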
@@ -77,7 +78,9 @@ Add the following lines to the startup parameters
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, and CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
- Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
- When setting `speculative-config`, Prefix Caching will be automatically disabled.
2. MTP currently does not support service management global blocks. When `speculative-config` is set, service management global blocks will be disabled automatically.
3. MTP currently does not support rejection sampling, i.e. do not run with `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 CUDAGraph
@@ -110,7 +113,6 @@ export FD_SAMPLING_CLASS=rejection
# prefill
export CUDA_VISIBLE_DEVICES=0,1,2,3
export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
@@ -130,7 +132,6 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
# decode
export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
quant_type=block_wise_fp8


@@ -16,22 +16,23 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the follo
### 1.2 Install fastdeploy
- Installation: For details, please refer to [Fastdeploy Installation](../get_started/installation/README.md).
- Model Download: For details, please refer to [Supported Models](../supported_models.md). **Please note that models with the Paddle suffix must be used with FastDeploy**
- Model Download: For details, please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
- `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies result in different performance and accuracy. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: indicates the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
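Since this deployment uses `--tensor-parallel-size 8`, eight GPUs must be visible to the process. As a sketch (borrowing the `CUDA_VISIBLE_DEVICES` usage shown in the disaggregated-deployment section of these docs), the GPUs can be pinned explicitly before launching; match the IDs to your actual topology:
```bash
# Expose exactly 8 GPUs to the service before launching it (adjust IDs as needed).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```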
@@ -39,17 +40,14 @@ For more parameter meanings and default settings, see [FastDeploy Parameter Docu
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -58,7 +56,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits prefill-stage requests into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-intensive (decode) operations, improves GPU resource utilization, and reduces the computation and memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Add the following lines to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, enable it manually by adding the following to the startup parameters:
```
--enable-chunked-prefill
```
@@ -74,7 +75,9 @@ Add the following lines to the startup parameters
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, and CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
- Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
- When setting `speculative-config`, Prefix Caching will be automatically disabled.
2. MTP currently does not support service management global blocks. When `speculative-config` is set, service management global blocks will be disabled automatically.
3. MTP currently does not support rejection sampling, i.e. do not run with `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 W4A8C8 Quantization
@@ -87,6 +90,9 @@ Just specify the corresponding model name in the startup command, `baidu/ERNIE-4
--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle
```
Note:
- W4A8C8 quantized models cannot be loaded via `--load_choices "default_v1"`.
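Given that constraint, a launch sketch for the W4A8C8 checkpoint simply omits `--load_choices "default_v1"` and falls back to the default loader. The `--tensor-parallel-size 4` below is an assumption read off the `TP4` suffix in the checkpoint name; match it to your hardware:
```bash
# Sketch: W4A8C8 checkpoint with the default loader (no --load_choices "default_v1").
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 128
```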
#### 2.2.6 Rejection Sampling
**Idea:**
Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting to increase sampling speed, which brings a noticeable improvement for small models.


@@ -19,23 +19,24 @@ The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-0.3B on the following hardware
### 1.2 Install fastdeploy
- Installation: please refer to [Fastdeploy Installation](../get_started/installation/README.md) to complete the installation.
- Model download: please refer to [Supported Models](../supported_models.md). **Please note that deploying with FastDeploy requires models with the Paddle suffix**
- Model download: please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
Where:
- `--quantization`: the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. Options include `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
@@ -43,16 +44,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length for your application scenario. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set `max-model-len` to 32768
- Set `max-model-len` according to the maximum context length
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following two lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually. `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +60,10 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
#### 2.2.3 Chunked Prefill
**Idea:** A chunking strategy splits prefill-stage requests into small sub-tasks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-bound (decode) operations, improves GPU resource utilization, and reduces the computation and GPU memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Just add the following to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually:
```
--enable-chunked-prefill
```


@@ -19,23 +19,24 @@ The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-21B-A3B on the following hardware
### 1.2 Install fastdeploy
- Installation: please refer to [Fastdeploy Installation](../get_started/installation/README.md) to complete the installation.
- Model download: please refer to [Supported Models](../supported_models.md). **Please note that deploying with FastDeploy requires models with the Paddle suffix**
- Model download: please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
Where:
- `--quantization`: the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. Options include `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
@@ -43,16 +44,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length for your application scenario. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set `max-model-len` to 32768
- Set `max-model-len` according to the maximum context length
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following two lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually. `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +60,10 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
#### 2.2.3 Chunked Prefill
**Idea:** A chunking strategy splits prefill-stage requests into small sub-tasks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-bound (decode) operations, improves GPU resource utilization, and reduces the computation and GPU memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Just add the following to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually:
```
--enable-chunked-prefill
```
@@ -78,7 +80,9 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, or CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not enable `export ENABLE_V1_KVCACHE_SCHEDULER=1`
- Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
- Setting `speculative-config` automatically disables Prefix Caching.
2. MTP currently does not support service management global blocks. Setting `speculative-config` automatically disables the global block scheduler.
3. MTP currently does not support rejection sampling, i.e. do not enable `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 CUDAGraph
@@ -111,7 +115,6 @@ export FD_SAMPLING_CLASS=rejection
# prefill
export CUDA_VISIBLE_DEVICES=0,1,2,3
export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
@@ -131,7 +134,6 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
# decode
export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
quant_type=block_wise_fp8


@@ -16,23 +16,24 @@ The minimum number of GPUs required to deploy each quantization precision of ERNIE-4.5-300B-A47B on the following hardware
### 1.2 Install fastdeploy
- Installation: please refer to [Fastdeploy Installation](../get_started/installation/README.md) to complete the installation.
- Model download: please refer to [Supported Models](../supported_models.md). **Please note that deploying with FastDeploy requires models with the Paddle suffix**
- Model download: please refer to [Supported Models](../supported_models.md).
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
--max-num-seqs 128 \
--load_choices "default_v1"
```
Where:
- `--quantization`: the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. Options include `wint8` / `wint4` / `block_wise_fp8` (requires Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
- `--load_choices`: the loader version. `"default_v1"` enables the v1 loader, which loads weights faster and uses less memory.
For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
@@ -40,17 +41,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length for your application scenario
- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following two lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually. `--enable-prefix-caching` enables prefix caching and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted according to the actual machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -59,7 +57,10 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
#### 2.2.3 Chunked Prefill
**Idea:** A chunking strategy splits prefill-stage requests into small sub-tasks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-bound (decode) operations, improves GPU resource utilization, and reduces the computation and GPU memory footprint of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Just add the following to the startup parameters
**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
For versions 2.1 and earlier, you need to enable it manually:
```
--enable-chunked-prefill
```
@@ -75,7 +76,9 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, or CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not enable `export ENABLE_V1_KVCACHE_SCHEDULER=1`
- Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
- Setting `speculative-config` automatically disables Prefix Caching.
2. MTP currently does not support service management global blocks. Setting `speculative-config` automatically disables the global block scheduler.
3. MTP currently does not support rejection sampling, i.e. do not enable `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 W4A8C8 Quantization
@@ -88,6 +91,9 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle
```
Note:
- W4A8C8 quantized models cannot be loaded via `--load_choices "default_v1"`.
#### 2.2.6 Rejection Sampling
**Idea:**
Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting to increase sampling speed, which brings a noticeable improvement for small models.