[docs] Update best practice doc (#3539)

* fix some docs error

* [docs] x1 best-practice

* update docs

* fix docs
This commit is contained in:
JYChen
2025-08-27 15:45:30 +08:00
committed by GitHub
parent afb9f327ef
commit e645db348b
6 changed files with 29 additions and 17 deletions

View File

@@ -76,8 +76,8 @@ Add the following lines to the startup parameters
--use-cudagraph
```
Notes:
1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
- Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
#### 2.2.6 Rejection Sampling
**Idea:**

View File

@@ -52,7 +52,7 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine.
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -75,6 +75,10 @@ Add the following lines to the startup parameters
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, and CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
3. MTP currently does not support rejection sampling, i.e. do not run with `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 CUDAGraph
**Idea:**
@@ -86,8 +90,7 @@ Add the following lines to the startup parameters
--use-cudagraph
```
Notes:
1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
- Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
#### 2.2.6 Rejection Sampling
**Idea:**

View File

@@ -49,7 +49,7 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine.
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -72,6 +72,10 @@ Add the following lines to the startup parameters
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, and CUDAGraph.
2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
3. MTP currently does not support rejection sampling, i.e. do not run with `export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 W4A8C8 Quantization
**Idea:**
@@ -134,8 +138,7 @@ Add the following lines to the startup parameters
--use-cudagraph
```
Notes:
1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
- Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

View File

@@ -52,7 +52,7 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果KV Cache避免重复计算从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md)
**启用方式:**
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。建议取值为`(机器总内存 - 模型大小) * 20%`。如果因为其他程序占用内存等原因导致服务启动失败,可以尝试减小`--swap-space`的值。
```
--enable-prefix-caching
--swap-space 50
@@ -76,8 +76,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
--use-cudagraph
```
注:
1. 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
2. 开启CUDAGraph时暂时不支持`max-model-len > 32768`的场景。
- 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
#### 2.2.5 拒绝采样
**原理:**

View File

@@ -52,7 +52,7 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果KV Cache避免重复计算从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md)
**启用方式:**
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。建议取值为`(机器总内存 - 模型大小) * 20%`。如果因为其他程序占用内存等原因导致服务启动失败,可以尝试减小`--swap-space`的值。
```
--enable-prefix-caching
--swap-space 50
@@ -76,6 +76,11 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
注:
1. MTP当前暂不支持与Prefix Caching 、Chunked Prefill 、CUDAGraph同时使用。
2. MTP当前暂不支持服务管理全局 Block 即不要开启`export ENABLE_V1_KVCACHE_SCHEDULER=1`
3. MTP当前暂不支持和拒绝采样同时使用即不要开启`export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 CUDAGraph
**原理:**
CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操作序列捕获capture为图结构graph实现 GPU 任务的高效执行和优化。CUDAGraph 的核心思想是将一系列 GPU 计算和内存操作封装为一个可重复执行的图,从而减少 CPU-GPU 通信开销、降低内核启动延迟,并提升整体计算性能。
@@ -86,8 +91,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
--use-cudagraph
```
注:
1. 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
2. 开启CUDAGraph时暂时不支持`max-model-len > 32768`的场景。
- 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
#### 2.2.6 拒绝采样
**原理:**

View File

@@ -50,7 +50,7 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果KV Cache避免重复计算从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md)
**启用方式:**
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。
在启动参数下增加下列两行,其中`--enable-prefix-caching`表示启用前缀缓存,`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存大小为GB应根据机器实际情况调整。建议取值为`(机器总内存 - 模型大小) * 20%`。如果因为其他程序占用内存等原因导致服务启动失败,可以尝试减小`--swap-space`的值。
```
--enable-prefix-caching
--swap-space 50
@@ -73,6 +73,10 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
注:
1. MTP当前暂不支持与Prefix Caching 、Chunked Prefill 、CUDAGraph同时使用。
2. MTP当前暂不支持服务管理全局 Block 即不要开启`export ENABLE_V1_KVCACHE_SCHEDULER=1`
3. MTP当前暂不支持和拒绝采样同时使用即不要开启`export FD_SAMPLING_CLASS=rejection`
#### 2.2.5 W4A8C8量化
**原理:**
@@ -135,8 +139,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
--use-cudagraph
```
注:
1. 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
2. 开启CUDAGraph时暂时不支持`max-model-len > 32768`的场景。
- 通常情况下不需要额外设置其他参数但CUDAGraph会产生一些额外的显存开销在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
## 三、常见问题FAQ
如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。