Update docs for reasoning-parser
@@ -34,7 +34,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, the maximum number of long requests in concurrent partial prefill batches, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests whose token count exceeds this value are treated as long requests, default: max_model_len*0.04 |
 | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate the corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
-| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
+| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output; refer to [reasoning output](features/reasoning_output.md) for more details |
 | ```use_cudagraph``` | `bool` | Whether to use cuda graph, default False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. Custom all-reduce must be enabled at the same time in multi-card scenarios. |
 | ```graph_optimization_config``` | `dict[str]` | Parameters related to computation graph optimization; the default value is '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null}'. See [graph_optimization.md](./features/graph_optimization.md) for a detailed description |
 | ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
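The rows above document engine-level launch parameters. As a minimal sketch of how they might be passed through FastDeploy's offline `LLM` entry point (the model path, parser name, and output handling here are illustrative assumptions, not verified values):

```python
# Minimal sketch, not a verified invocation: passing the engine arguments
# documented above to FastDeploy's offline LLM entry point. Model path and
# reasoning parser name are illustrative placeholders.
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="baidu/ERNIE-4.5-0.3B-Paddle",  # placeholder model identifier
    max_model_len=8192,
    reasoning_parser="ernie-45-vl",       # parser name is an assumption
    graph_optimization_config={
        "use_cudagraph": False,           # the documented default
        "graph_opt_level": 0,
        "cudagraph_capture_sizes": None,
    },
)

for out in llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=128)):
    print(out)  # exact output structure depends on the FastDeploy version
```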
@@ -32,7 +32,7 @@
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, the maximum number of long requests allowed in a concurrent prefill batch, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests whose token count exceeds this value are treated as long requests, default: max_model_len*0.04 |
 | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate the corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
-| ```reasoning_parser``` | `str` | Specify the reasoning parser to use for extracting reasoning content from model output |
+| ```reasoning_parser``` | `str` | Specify the reasoning parser to use for extracting reasoning content from model output; see [reasoning output](features/reasoning_output.md) for details |
 | ```use_cudagraph``` | `bool` | Whether to use cuda graph, default False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it; Custom all-reduce must be enabled at the same time in multi-card scenarios. |
 | ```graph_optimization_config``` | `dict[str]` | Parameters related to computation graph optimization; the default value is '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null}'. See [graph_optimization.md](./features/graph_optimization.md) for a detailed description |
 | ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
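Once a `reasoning_parser` is configured, the linked reasoning output doc describes the extracted reasoning being returned separately from the final answer. A hedged sketch of reading it through the OpenAI-compatible endpoint, assuming the server exposes a `reasoning_content` field on the message (field name, port, and model name are assumptions):

```python
# Hedged sketch: reading separated reasoning content from FastDeploy's
# OpenAI-compatible server. The `reasoning_content` attribute is an assumed
# convention from the linked reasoning output doc; port/model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # placeholder model name
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))  # None if absent
print("answer:", msg.content)
```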
setup.py
@@ -181,7 +181,7 @@ def get_name():
 
 cmdclass_dict = {"bdist_wheel": CustomBdistWheel}
 cmdclass_dict["build_ext"] = CMakeBuild
-FASTDEPLOY_VERSION = os.environ.get("FASTDEPLOY_VERSION", "2.1.0")
+FASTDEPLOY_VERSION = os.environ.get("FASTDEPLOY_VERSION", "2.1.1")
 cmdclass_dict["build_optl"] = PostInstallCommand
 
 setup(
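The only change here is the fallback version string. Because the value is read from the environment first, a build can still override it without touching setup.py; a self-contained sketch of the pattern:

```python
# The resolution pattern from the hunk above: the FASTDEPLOY_VERSION
# environment variable wins; the literal "2.1.1" is only the fallback.
import os

version = os.environ.get("FASTDEPLOY_VERSION", "2.1.1")
print(version)  # prints "2.1.1" unless FASTDEPLOY_VERSION is set

# Example override at build time:
#   FASTDEPLOY_VERSION=2.2.0rc0 python setup.py bdist_wheel
```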