[Docs] Update parameters documentation with latest code defaults and new parameters (#5709)

* Initial plan

* Update parameters documentation with correct default values and new parameters

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
This commit is contained in:
Copilot
2025-12-23 17:31:44 +08:00
committed by GitHub
parent c1aa66df02
commit e9f5397bc9
2 changed files with 44 additions and 30 deletions

View File

@@ -9,11 +9,11 @@ When using FastDeploy to deploy models (including offline inference and service
| Parameter Name | Type | Description |
|:--------------|:----|:-----------|
| ```port``` | `int` | Only required for service deployment, HTTP service port number, default: 8000 |
- | ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: 8001 |
+ | ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: None (shares port with main service) |
| ```max_waiting_time``` | `int` | Only required for service deployment, maximum wait time for establishing a connection upon service request. Default: -1 (indicates no wait time limit).|
| ```max_concurrency``` | `int` | Only required for service deployment, the actual number of connections established by the service, default 512 |
- | ```engine_worker_queue_port``` | `int` | FastDeploy internal engine communication port, default: 8002 |
- | ```cache_queue_port``` | `int` | FastDeploy internal KVCache process communication port, default: 8003 |
+ | ```engine_worker_queue_port``` | `list[int]` | FastDeploy internal engine communication port list, auto-allocated based on data_parallel_size |
+ | ```cache_queue_port``` | `list[int]` | FastDeploy internal KVCache process communication port list, auto-allocated based on data_parallel_size |
| ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
| ```tensor_parallel_size``` | `int` | Default tensor parallelism degree for model, default: 1 |
| ```data_parallel_size``` | `int` | Default data parallelism degree for model, default: 1 |
@@ -21,15 +21,15 @@ When using FastDeploy to deploy models (including offline inference and service
| ```max_num_seqs``` | `int` | Maximum concurrent number in Decode phase, default: 8 |
| ```mm_processor_kwargs``` | `dict[str]` | Multimodal processor parameter configuration, e.g.: {"image_min_pixels": 3136, "video_fps": 2} |
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to model path |
- | ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, enabled by default when automatically calculating KV Cache |
+ | ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, default: 0 (disabled) |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
- | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
+ | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only); the model architecture automatically detects multimodal models, no manual setting needed |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |
| ```max_num_batched_tokens``` | `int` | Maximum batch token count in Prefill phase, default: None (same as max_model_len) |
| ```kv_cache_ratio``` | `float` | KVCache blocks are divided between Prefill phase and Decode phase according to kv_cache_ratio ratio, default: 0.75 |
- | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: False |
+ | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: True (on GPU/XPU/HPU platforms), False on other platforms |
| ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size for KVCache swapping, unit: GB, default: None |
| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default: False |
| ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrent number of partial prefill batches, default: 1 |
@@ -37,7 +37,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
| ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
- | ```use_cudagraph``` | `bool` | __[DEPRECATED]__ CUDAGraph is enabled by default since version 2.3. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. |
+ | ```use_cudagraph``` | `bool` | __[DEPRECATED since version 2.3]__ CUDAGraph is enabled by default and is now controlled via the `use_cudagraph` field in `graph_optimization_config`; see [graph_optimization.md](./features/graph_optimization.md) for details |
| ```graph_optimization_config``` | `dict[str]` | Configures computation-graph optimization parameters, default: '{"use_cudagraph":true, "graph_opt_level":0}'. For details, see [graph_optimization.md](./features/graph_optimization.md) |
| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
| ```use_internode_ll_two_stage``` | `bool` | Whether to use two-stage communication in DeepEP MoE, default: False |
@@ -47,20 +47,27 @@ When using FastDeploy to deploy models (including offline inference and service
| ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `guidance`, `off`, default: `off` |
| ```guided_decoding_disable_any_whitespace``` | `bool` | Whether to disable whitespace generation during guided decoding, default: False |
| ```speculative_config``` | `dict[str]` | Speculative decoding configuration, only supports standard format JSON string, default: None |
- | ```dynamic_load_weight``` | `int` | Whether to enable dynamic weight loading, default: 0 |
- | ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
- | ```enable_logprob``` | `bool` | Whether to enable return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message. If logprob is not used, this parameter can be omitted when starting |
- | ```logprobs_mode``` | `str` | Indicates the content returned in the logprobs. Supported mode: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Raw means the values before applying logit processors, like bad words. Processed means the values after applying such processors. |
+ | ```dynamic_load_weight``` | `bool` | Whether to enable dynamic weight loading, default: False |
+ | ```enable_expert_parallel``` | `bool` | Whether to enable expert parallelism, default: False |
+ | ```enable_logprob``` | `bool` | Whether to return log probabilities of the output tokens, default: False. If logprob is not used, this parameter can be omitted when starting |
+ | ```logprobs_mode``` | `str` | Specifies the content returned in logprobs, default: `raw_logprobs`. Supported modes: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Processed means values after applying logit processors (temperature, penalties, bad words) |
+ | ```max_logprobs``` | `int` | Maximum number of log probabilities to return, default: 20. -1 means vocab_size |
| ```served_model_name```| `str`| The model name used in the API. If not specified, the model name will be the same as the --model argument |
| ```revision``` | `str` | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
| ```chat_template``` | `str` | Specify the template used for model concatenation; supports both string and file path input, default: None. If not specified, the model's default template will be used |
| ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output. |
| ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository. The code format within these parsers must adhere to the format used in the code repository. |
- | ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used. |
- | ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
- | ```max_processor_cache``` | `int` | Maximum processor cache size in GiB (use 0 to disable). |
- | ```api_key``` |`dict[str]`| Validate API keys in the service request headers, supporting multiple key inputs |
+ | ```load_choices``` | `str` | Weight loader selection, default: "default_v1". Supports "default" and "default_v1"; the latter is required for loading Torch weights and for weight-loading acceleration |
+ | ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable), default: -1 (auto-calculated) |
+ | ```max_processor_cache``` | `float` | Maximum processor cache size in GiB (use 0 to disable), default: -1 (auto-calculated) |
+ | ```api_key``` |`list[str]`| Validate API keys in the service request headers, supporting multiple key inputs. Same effect as environment variable `FD_API_KEY`, with higher priority |
+ | ```enable_output_caching``` | `bool` | Whether to enable KV cache for output tokens, only valid with the V1 scheduler (ENABLE_V1_KVCACHE_SCHEDULER=1), default: True |
+ | ```workers``` | `int` | Only required for service deployment, number of API server worker processes, default: 1 |
+ | ```timeout``` | `int` | Only required for service deployment, worker silent timeout (seconds), set to 0 to disable the timeout, default: 0 |
+ | ```timeout_graceful_shutdown``` | `int` | Only required for service deployment, graceful shutdown timeout (seconds), set to 0 for an infinite timeout, default: 0 |
+ | ```router``` | `str` | Router server URL for request routing in splitwise deployment, e.g., `http://127.0.0.1:8000` |
+ | ```disable_chunked_mm_input``` | `bool` | Disable chunked processing for multimodal inputs, default: False |
+ | ```logits_processors``` | `list[str]` | List of fully qualified class names (FQCN) of logits processors supported by the service, e.g., `fastdeploy.model_executor.logits_processor:LogitBiasLogitsProcessor` |
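Several defaults in the table above are derived from other parameters rather than fixed. The following is an illustrative sketch (not FastDeploy source code; the block count of 1000 is an assumed example) of the derivations the table describes: `max_num_batched_tokens` falling back to `max_model_len`, `long_prefill_token_threshold` defaulting to `max_model_len*0.04`, and `kv_cache_ratio` splitting KVCache blocks between Prefill and Decode.

```python
def derived_defaults(max_model_len, max_num_batched_tokens=None,
                     kv_cache_ratio=0.75, total_blocks=1000):
    """Illustrative only: derived defaults as described in the parameter table."""
    # max_num_batched_tokens falls back to max_model_len when unset
    batched = max_num_batched_tokens if max_num_batched_tokens is not None else max_model_len
    # long_prefill_token_threshold defaults to max_model_len * 0.04
    long_threshold = int(max_model_len * 0.04)
    # kv_cache_ratio splits the KVCache blocks between Prefill and Decode
    prefill_blocks = int(total_blocks * kv_cache_ratio)
    decode_blocks = total_blocks - prefill_blocks
    return {
        "max_num_batched_tokens": batched,
        "long_prefill_token_threshold": long_threshold,
        "prefill_blocks": prefill_blocks,
        "decode_blocks": decode_blocks,
    }

print(derived_defaults(max_model_len=2048))
# → {'max_num_batched_tokens': 2048, 'long_prefill_token_threshold': 81,
#    'prefill_blocks': 750, 'decode_blocks': 250}
```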
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
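In rough terms, the KVCache can hold `num_gpu_blocks * block_size` tokens in total, and each request occupies `ceil(tokens / block_size)` blocks. A minimal sketch under assumed example values (512 blocks, block size 64; not FastDeploy's actual auto-calculation):

```python
import math

def cache_capacity_tokens(num_gpu_blocks, block_size):
    # Total tokens the preallocated KVCache can hold
    return num_gpu_blocks * block_size

def blocks_for_request(num_tokens, block_size):
    # Blocks one request occupies, rounded up to whole blocks
    return math.ceil(num_tokens / block_size)

# Example: num_gpu_blocks_override=512, block_size=64
print(cache_capacity_tokens(512, 64))  # → 32768
print(blocks_for_request(1000, 64))    # → 16
```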

View File

@@ -7,11 +7,11 @@
| Parameter Name | Type | Description |
|:-----------------------------------|:----------| :----- |
| ```port``` | `int` | Only required for service deployment, HTTP service port number, default: 8000 |
- | ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: 8001 |
+ | ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: None (shares port with main service) |
| ```max_waiting_time``` | `int` | Only required for service deployment, maximum wait time for establishing a connection upon service request, default: -1 (no wait time limit) |
| ```max_concurrency``` | `int` | Only required for service deployment, actual number of connections established by the service, default: 512 |
- | ```engine_worker_queue_port``` | `int` | FastDeploy internal engine communication port, default: 8002 |
- | ```cache_queue_port``` | `int` | FastDeploy internal KVCache process communication port, default: 8003 |
+ | ```engine_worker_queue_port``` | `list[int]` | FastDeploy internal engine communication port list, auto-allocated based on data_parallel_size |
+ | ```cache_queue_port``` | `list[int]` | FastDeploy internal KVCache process communication port list, auto-allocated based on data_parallel_size |
| ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
| ```tensor_parallel_size``` | `int` | Default tensor parallelism degree for the model, default: 1 |
| ```data_parallel_size``` | `int` | Default data parallelism degree for the model, default: 1 |
@@ -19,15 +19,15 @@
| ```max_num_seqs``` | `int` | Maximum concurrency in the Decode phase, default: 8 |
| ```mm_processor_kwargs``` | `dict[str]` | Multimodal processor parameter configuration, e.g.: {"image_min_pixels": 3136, "video_fps": 2} |
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to the model path |
- | ```use_warmup``` | `int` | Whether to perform warmup at startup, automatically generating maximum-length data for warmup; used by default when KV Cache is calculated automatically |
+ | ```use_warmup``` | `int` | Whether to perform warmup at startup, automatically generating maximum-length data for warmup, default: 0 (disabled) |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
- | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
+ | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only); the model architecture automatically detects multimodal models, no manual setting needed |
| ```quantization``` | `str` | Model quantization strategy; when loading a BF16 CKPT, specifying wint4 or wint8 enables lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks; FastDeploy can calculate this automatically based on available memory, no user configuration needed, default: None |
| ```max_num_batched_tokens``` | `int` | Maximum batch token count in the Prefill phase, default: None (same as max_model_len) |
| ```kv_cache_ratio``` | `float` | KVCache blocks are divided between the Prefill and Decode phases according to kv_cache_ratio, default: 0.75 |
- | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: False |
+ | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: True (on GPU/XPU/HPU platforms), False on other platforms |
| ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size for swapping KVCache, unit: GB, default: None |
| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default: False |
| ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrency in the Prefill phase, default: 1 |
@@ -35,7 +35,7 @@
| ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with a token count exceeding this value are considered long requests, default: max_model_len*0.04 |
| ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate the corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
- | ```use_cudagraph``` | `bool` | __[DEPRECATED]__ CUDAGraph is enabled by default since version 2.3, see [graph_optimization.md](./features/graph_optimization.md) for details |
+ | ```use_cudagraph``` | `bool` | __[DEPRECATED since version 2.3]__ CUDAGraph is enabled by default and is now controlled via the `use_cudagraph` field in `graph_optimization_config`; see [graph_optimization.md](./features/graph_optimization.md) for details |
| ```graph_optimization_config``` | `dict[str]` | Configures computation-graph optimization parameters, default: '{"use_cudagraph":true, "graph_opt_level":0}'. For details, see [graph_optimization.md](./features/graph_optimization.md) |
| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
| ```use_internode_ll_two_stage``` | `bool` | Whether to use two-stage communication in DeepEP MoE, default: False |
@@ -45,20 +45,27 @@
| ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `guidance`, `off`, default: `off` |
| ```guided_decoding_disable_any_whitespace``` | `bool` | Whether to disable whitespace generation during guided decoding, default: False |
| ```speculative_config``` | `dict[str]` | Speculative decoding configuration, only supports standard-format JSON strings, default: None |
- | ```dynamic_load_weight``` | `int` | Whether to load weights dynamically, default: 0 |
- | ```enable_expert_parallel``` | `bool` | Whether to enable expert parallelism |
- | ```enable_logprob``` | `bool` | Whether to return logprobs for output tokens. If logprob is not used, this parameter can be omitted at startup |
- | ```logprobs_mode``` | `str` | Specifies the content returned in logprobs. Supported modes: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Processed means logprobs computed after applying temperature, penalties, and bad-word processing to the logits |
+ | ```dynamic_load_weight``` | `bool` | Whether to load weights dynamically, default: False |
+ | ```enable_expert_parallel``` | `bool` | Whether to enable expert parallelism, default: False |
+ | ```enable_logprob``` | `bool` | Whether to return logprobs for output tokens, default: False. If logprob is not needed, this parameter can be omitted at startup |
+ | ```logprobs_mode``` | `str` | Specifies the content returned in logprobs, default: `raw_logprobs`. Supported modes: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Processed means logprobs computed after applying temperature, penalties, and bad-word processing to the logits |
+ | ```max_logprobs``` | `int` | Maximum number of logprobs the service can return, default: 20. -1 means vocab_size |
| ```served_model_name``` | `str` | The model name used in the API. If not specified, the model name will be the same as the --model argument |
| ```revision``` | `str` | When auto-downloading a model, specifies the model's Git version, branch name, or tag |
| ```chat_template``` | `str` | Specify the template used for model concatenation; supports both string and file path input, default: None. If not specified, the model's default template will be used |
| ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output |
| ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository; the code format within these parsers must adhere to the format used in the code repository |
- | ```load_choices``` | `str` | By default, the "default" loader is used for weight loading; loading Torch weights / weight acceleration requires "default_v1" |
- | ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable) |
- | ```max_processor_cache``` | `int` | Maximum processor cache size in GiB (use 0 to disable) |
- | ```api_key``` |`dict[str]`| Validate API keys in the service request headers, supporting multiple key inputs. Same effect as environment variable `FD_API_KEY`, with higher priority |
+ | ```load_choices``` | `str` | Weight loader selection, default: "default_v1". Supports "default" and "default_v1"; the latter is required for loading Torch weights and for weight-loading acceleration |
+ | ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable), default: -1 (auto-calculated) |
+ | ```max_processor_cache``` | `float` | Maximum processor cache size in GiB (use 0 to disable), default: -1 (auto-calculated) |
+ | ```api_key``` |`list[str]`| Validate API keys in the service request headers, supporting multiple key inputs. Same effect as environment variable `FD_API_KEY`, with higher priority |
+ | ```enable_output_caching``` | `bool` | Whether to enable KV cache for output tokens, only valid with the V1 scheduler (ENABLE_V1_KVCACHE_SCHEDULER=1), default: True |
+ | ```workers``` | `int` | Only required for service deployment, number of API server worker processes, default: 1 |
+ | ```timeout``` | `int` | Only required for service deployment, worker silent timeout (seconds), set to 0 to disable the timeout, default: 0 |
+ | ```timeout_graceful_shutdown``` | `int` | Only required for service deployment, graceful shutdown timeout (seconds), set to 0 for an infinite timeout, default: 0 |
+ | ```router``` | `str` | Router server URL for request routing in splitwise deployment, e.g., `http://127.0.0.1:8000` |
+ | ```disable_chunked_mm_input``` | `bool` | Disable chunked processing for multimodal inputs, default: False |
+ | ```logits_processors``` | `list[str]` | List of fully qualified class names (FQCN) of logits processors supported by the service, e.g., `fastdeploy.model_executor.logits_processor:LogitBiasLogitsProcessor` |
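The dict-typed parameters in the table (`mm_processor_kwargs`, `limit_mm_per_prompt`, `graph_optimization_config`, `speculative_config`) are passed as standard JSON strings. A minimal illustration of that format using plain `json.loads` (the actual parsing and validation happen inside FastDeploy):

```python
import json

# The same JSON strings shown as examples in the table above
graph_opt = json.loads('{"use_cudagraph": true, "graph_opt_level": 0}')
limits = json.loads('{"image": 10, "video": 3}')

print(graph_opt["use_cudagraph"], graph_opt["graph_opt_level"])  # → True 0
print(limits["image"])  # → 10
```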
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?