diff --git a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
index 5acac0d8f..911057b0a 100644
--- a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -77,8 +77,7 @@ Add the following lines to the startup parameters
 ```
 Notes:
 1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
-2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time.
-3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
+2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
 
 #### 2.2.6 Rejection Sampling
 **Idea:**
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index 34d3caaa1..902d92fcf 100644
--- a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -87,8 +87,7 @@ Add the following lines to the startup parameters
 ```
 Notes:
 1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
-2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time.
-3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
+2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
 
 #### 2.2.6 Rejection Sampling
 **Idea:**
diff --git a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index a95fb1ed2..5eafc8ffa 100644
--- a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -132,12 +132,10 @@ CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It achi
 Add the following lines to the startup parameters
 ```
 --use-cudagraph
---enable-custom-all-reduce
 ```
 Notes:
 1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
-2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time.
-3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
+2. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
 
 ## FAQ
 If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).
diff --git a/docs/features/graph_optimization.md b/docs/features/graph_optimization.md
index fabb1c709..ff335b66b 100644
--- a/docs/features/graph_optimization.md
+++ b/docs/features/graph_optimization.md
@@ -18,7 +18,7 @@ FastDeploy's `GraphOptimizationBackend` design architecture is as follows, **som
 ## 1. GraphOptimizationBackend Current usage restrictions
 In the CUDAGraph multi-device inference task, you need to use the Custom all-reduce operator to perform multi-card all-reduce.
 
-Before version 2.2, neither the CUDAGraph nor the Custom all-reduce operators were enabled by default. You need to add `--enable-custom-all-reduce` to the startup command to manually enable it.
+Before version 2.2, CUDAGraph was not enabled by default, while the Custom all-reduce operator was enabled by default.
 
 ### 1.1 The multi-device scene needs to be enabled Custom all-reduce
 The `FLAGS_max_partition_size` environment variable controls the `gridDim` execution configuration of Kernel in CascadeAppend Attention, and dynamic execution configuration will cause CUDAGraph execution to fail.
diff --git a/docs/parameters.md b/docs/parameters.md
index 176d530bc..b54f4d829 100644
--- a/docs/parameters.md
+++ b/docs/parameters.md
@@ -37,7 +37,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
 | ```use_cudagraph``` | `bool` | Whether to use cuda graph, default False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before opening. Custom all-reduce needs to be enabled at the same time in multi-card scenarios. |
 | ```graph_optimization_config``` | `dict[str]` | Can configure parameters related to calculation graph optimization, the default value is'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',Detailed description reference [graph_optimization.md](./features/graph_optimization.md)|
-| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
+| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
 | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
 | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
 | ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` |
diff --git a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
index 761ec3a30..e4b514225 100644
--- a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -77,8 +77,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
 ```
 注:
 1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
-2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce`
-3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
+2. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
 
 #### 2.2.5 拒绝采样
 **原理:**
diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index efe6f3cba..7634f96bf 100644
--- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -87,8 +87,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
 ```
 注:
 1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
-2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce`
-3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
+2. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
 
 #### 2.2.6 拒绝采样
 **原理:**
diff --git a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index cbe4ae727..066f49736 100644
--- a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -133,12 +133,10 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操
 在启动命令中增加
 ```
 --use-cudagraph
---enable-custom-all-reduce
 ```
 注:
 1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
-2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce`
-3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
+2. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。
 
 ## 三、常见问题FAQ
 如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。
diff --git a/docs/zh/features/graph_optimization.md b/docs/zh/features/graph_optimization.md
index 0b6ca21d7..f25a2d302 100644
--- a/docs/zh/features/graph_optimization.md
+++ b/docs/zh/features/graph_optimization.md
@@ -19,7 +19,7 @@ FastDeploy 的 `GraphOptimizationBackend` 设计架构如下,**部分功能仍
 
 ### 1.1 多卡场景需要开启 Custom all-reduce
 在 CUDAGraph 多卡推理任务中需要使用 Custom all-reduce 算子进行多卡 all-reduce,
-在 2.2 版本之前,CUDAGraph 和 Custom all-reduce 算子都未默认开启,需要在启动命令中添加 `--enable-custom-all-reduce` 手动开启。
+在 2.2 版本之前,CUDAGraph 未默认开启,Custom all-reduce 算子默认开启。
 
 ### 1.2 FLAGS_max_partition_size 相关的 Kernel 的动态执行配置导致 CUDAGraph 执行失败
 `FLAGS_max_partition_size` 环境变量控制了 CascadeAppend Attention 中 Kernel 的`gridDim` 执行配置 , 而动态的执行配置会导致 CUDAGraph 执行失败。
diff --git a/docs/zh/parameters.md b/docs/zh/parameters.md
index f2b518779..035c6ed3e 100644
--- a/docs/zh/parameters.md
+++ b/docs/zh/parameters.md
@@ -35,7 +35,7 @@
 | ```reasoning_parser``` | `str` | 指定要使用的推理解析器,以便从模型输出中提取推理内容 |
 | ```use_cudagraph``` | `bool` | 是否使用cuda graph,默认False。开启前建议仔细阅读 [graph_optimization.md](./features/graph_optimization.md),在多卡场景需要同时开启 Custom all-reduce。 |
 | ```graph_optimization_config``` | `dict[str]` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',详细说明参考 [graph_optimization.md](./features/graph_optimization.md)|
-| ```enable_custom_all_reduce``` | `bool` | 开启Custom all-reduce,默认False |
+| ```disable_custom_all_reduce``` | `bool` | 关闭Custom all-reduce,默认False |
 | ```splitwise_role``` | `str` | 是否开启splitwise推理,默认值mixed, 支持参数为["mixed", "decode", "prefill"] |
 | ```innode_prefill_ports``` | `str` | prefill 实例内部引擎启动端口 (仅单机PD分离需要),默认值None |
 | ```guided_decoding_backend``` | `str` | 指定要使用的guided decoding后端,支持 `auto`、`xgrammar`、`off`, 默认为 `off` |
diff --git a/fastdeploy/config.py b/fastdeploy/config.py
index b6d79b129..fda77607b 100644
--- a/fastdeploy/config.py
+++ b/fastdeploy/config.py
@@ -278,7 +278,7 @@ class ParallelConfig:
         self.disable_any_whitespace: bool = True
         self.pod_ip: str = None
         # enable the custom all-reduce kernel and fall back to NCCL(dist.all_reduce).
-        self.enable_custom_all_reduce: bool = False
+        self.disable_custom_all_reduce: bool = False
         for key, value in args.items():
             if hasattr(self, key):
                 setattr(self, key, value)
diff --git a/fastdeploy/engine/args_utils.py b/fastdeploy/engine/args_utils.py
index 359ad6ba6..8bb5695e7 100644
--- a/fastdeploy/engine/args_utils.py
+++ b/fastdeploy/engine/args_utils.py
@@ -188,7 +188,7 @@ class EngineArgs:
     Flag to enable prefix caching.
     """
 
-    enable_custom_all_reduce: bool = False
+    disable_custom_all_reduce: bool = False
     """
     Flag to enable the custom all-reduce kernel.
     """
@@ -571,10 +571,10 @@ class EngineArgs:
             help="Degree of tensor parallelism.",
         )
         parallel_group.add_argument(
-            "--enable-custom-all-reduce",
+            "--disable-custom-all-reduce",
             action="store_true",
-            default=EngineArgs.enable_custom_all_reduce,
-            help="Flag to enable custom all-reduce.",
+            default=EngineArgs.disable_custom_all_reduce,
+            help="Flag to disable custom all-reduce.",
         )
         parallel_group.add_argument(
             "--max-num-seqs",
@@ -947,10 +947,6 @@ class EngineArgs:
         early_stop_cfg = self.create_early_stop_config()
         early_stop_cfg.update_enable_early_stop(self.enable_early_stop)
 
-        assert not (
-            self.tensor_parallel_size <= 1 and self.enable_custom_all_reduce
-        ), "enable_custom_all_reduce must be used with tensor_parallel_size>1"
-
         assert is_port_available(
             "0.0.0.0", self.engine_worker_queue_port
         ), f"The parameter `engine_worker_queue_port`:{self.engine_worker_queue_port} is already in use."
diff --git a/fastdeploy/engine/engine.py b/fastdeploy/engine/engine.py
index 49f776577..d09f02122 100644
--- a/fastdeploy/engine/engine.py
+++ b/fastdeploy/engine/engine.py
@@ -1118,7 +1118,7 @@ class LLMEngine:
             "do_profile": self.do_profile,
             "dynamic_load_weight": self.cfg.load_config.dynamic_load_weight,
             "disable_any_whitespace": self.cfg.disable_any_whitespace,
-            "enable_custom_all_reduce": self.cfg.parallel_config.enable_custom_all_reduce,
+            "disable_custom_all_reduce": self.cfg.parallel_config.disable_custom_all_reduce,
             "enable_logprob": self.cfg.model_config.enable_logprob,
         }
         for worker_flag, value in worker_append_flag.items():
diff --git a/fastdeploy/worker/gpu_worker.py b/fastdeploy/worker/gpu_worker.py
index bfdc92f1d..e7b1adb4b 100644
--- a/fastdeploy/worker/gpu_worker.py
+++ b/fastdeploy/worker/gpu_worker.py
@@ -69,7 +69,7 @@ class GpuWorker(WorkerBase):
             gc.collect()
             paddle.device.cuda.empty_cache()
             if (
-                self.parallel_config.enable_custom_all_reduce
+                not self.parallel_config.disable_custom_all_reduce
                 and self.parallel_config.tensor_parallel_size > 1
                 and paddle.is_compiled_with_cuda()
             ):
diff --git a/fastdeploy/worker/worker_process.py b/fastdeploy/worker/worker_process.py
index d6f332a7a..6ac885081 100644
--- a/fastdeploy/worker/worker_process.py
+++ b/fastdeploy/worker/worker_process.py
@@ -516,7 +516,7 @@ def parse_args():
         help="enable prefix cache",
     )
     parser.add_argument(
-        "--enable_custom_all_reduce",
+        "--disable_custom_all_reduce",
         action="store_true",
         help="enable custom all-reduce",
     )
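
The flag inversion in this diff changes the default behavior: custom all-reduce becomes opt-out instead of opt-in. The sketch below is a minimal, standalone illustration of that semantics and is not the FastDeploy implementation. `build_parser` and `use_custom_all_reduce` are hypothetical helpers, the `--tensor-parallel-size` spelling is assumed, and the `compiled_with_cuda` argument stands in for `paddle.is_compiled_with_cuda()` so that the example runs without Paddle installed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in parser exposing only the flags relevant here."""
    parser = argparse.ArgumentParser()
    # Assumed flag spelling for the tensor-parallel degree.
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    # Opt-out flag: absent means False, so custom all-reduce stays enabled.
    parser.add_argument(
        "--disable-custom-all-reduce",
        action="store_true",
        default=False,
        help="Flag to disable custom all-reduce.",
    )
    return parser


def use_custom_all_reduce(
    disable_custom_all_reduce: bool,
    tensor_parallel_size: int,
    compiled_with_cuda: bool,
) -> bool:
    """Mirror the three-part condition shown in the gpu_worker.py hunk."""
    return (
        not disable_custom_all_reduce
        and tensor_parallel_size > 1
        and compiled_with_cuda
    )


# Default launch with TP=2: the custom kernel is selected without extra flags.
default_args = build_parser().parse_args(["--tensor-parallel-size", "2"])
default_choice = use_custom_all_reduce(
    default_args.disable_custom_all_reduce,
    default_args.tensor_parallel_size,
    compiled_with_cuda=True,
)

# Explicit opt-out: the condition is False regardless of the TP degree.
opt_out_args = build_parser().parse_args(
    ["--tensor-parallel-size", "2", "--disable-custom-all-reduce"]
)
opt_out_choice = use_custom_all_reduce(
    opt_out_args.disable_custom_all_reduce,
    opt_out_args.tensor_parallel_size,
    compiled_with_cuda=True,
)
```

With a single device (`tensor_parallel_size == 1`) the condition is false whether or not the flag is passed, which fits the removal of the old `tensor_parallel_size>1` assertion from `args_utils.py`: an opt-out flag that defaults to enabled no longer needs to be rejected on single-GPU launches.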