[Docs] release docs 2.3 (#4951)

* [Docs] release docs 2.3

* modify dockerfiles

* fix bug
This commit is contained in:
ming1753
2025-11-11 15:30:11 +08:00
committed by GitHub
parent cba7b2912f
commit 38ccf9b00b
25 changed files with 2322 additions and 134 deletions


@@ -40,6 +40,8 @@ When using FastDeploy to deploy models (including offline inference and service
| ```use_cudagraph``` | `bool` | __[DEPRECATED]__ CUDAGraph is enabled by default since version 2.3. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before changing this option. |
| ```graph_optimization_config``` | `dict[str]` | Parameters for computation graph optimization; default: `'{"use_cudagraph":true, "graph_opt_level":0}'`. See [graph_optimization.md](./features/graph_optimization.md) for details |
| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
| ```use_internode_ll_two_stage``` | `bool` | Use two-stage communication in DeepEP MoE, default: False |
| ```disable_sequence_parallel_moe``` | `bool` | Disable sequence-parallel MoE, default: False |
| ```splitwise_role``` | `str` | Role of this instance in splitwise (PD-disaggregated) inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
| ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
| ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` |
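As a sketch of how these options combine at service launch (the module path, model name, and exact flag spellings here are assumptions based on the parameter table above, not a verified command):

```shell
# Hypothetical launch sketch; adjust the model and flags for your deployment.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-0.3B-Paddle \
  --graph-optimization-config '{"use_cudagraph": true, "graph_opt_level": 0}' \
  --guided-decoding-backend auto
```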
@@ -49,6 +51,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
| ```enable_logprob``` | `bool` | Whether to return log probabilities of the output tokens. If true, the log probabilities of each output token are returned in the message content. If logprob is not used, this parameter can be omitted at startup |
| ```logprobs_mode``` | `str` | Indicates the content returned in the logprobs. Supported mode: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Raw means the values before applying logit processors, like bad words. Processed means the values after applying such processors. |
| ```max_logprobs``` | `int` | Maximum number of log probabilities to return, default: 20. -1 means vocab_size. |
| ```served_model_name```| `str`| The model name used in the API. If not specified, the model name will be the same as the --model argument |
| ```revision``` | `str` | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
| ```chat_template``` | `str` | Specify the chat template used for prompt concatenation. It supports both string input and file path input. The default value is None; if not specified, the model's default template will be used. |
@@ -57,6 +60,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used.|
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size (in GiB) of the processor cache (use 0 to disable). |
| ```api_key``` |`dict[str]`| Validate API keys in the service request headers; multiple keys are supported|
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
@@ -81,3 +85,41 @@ In actual inference, it's difficult for users to know how to properly configure
When `enable_chunked_prefill` is enabled, the service processes long input sequences in dynamic chunks, significantly improving GPU utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the number of tokens in a single prefill batch, so the `max_num_partial_prefills` parameter is introduced specifically to limit the number of partial prefill batches processed concurrently.
To prioritize short requests in scheduling, the `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in a single prefill batch; the latter defines the token threshold above which a request counts as long. The system prioritizes batch space for short requests, reducing short-request latency in mixed workloads while maintaining stable throughput.
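The admission logic described above can be sketched as follows (a hypothetical helper for illustration, not FastDeploy's actual scheduler): given the two limits and the threshold, decide which waiting requests join the current prefill batch.

```python
def admit_prefills(request_lens, max_num_partial_prefills,
                   max_long_partial_prefills, long_prefill_token_threshold):
    """Pick requests for one prefill batch, capping total and long-request counts.

    request_lens: prompt token counts of waiting requests, in arrival order.
    Returns indices of admitted requests; short requests are considered first.
    """
    # Consider short requests first, preserving arrival order within each group.
    order = sorted(range(len(request_lens)),
                   key=lambda i: request_lens[i] >= long_prefill_token_threshold)
    admitted, long_count = [], 0
    for i in order:
        if len(admitted) >= max_num_partial_prefills:
            break
        is_long = request_lens[i] >= long_prefill_token_threshold
        if is_long and long_count >= max_long_partial_prefills:
            continue  # long-request cap reached; keep remaining slots for others
        admitted.append(i)
        long_count += is_long
    return sorted(admitted)

# 4 waiting requests, threshold 100 tokens, at most 3 partial prefills,
# of which at most 1 may be long: the second long request (500 tokens) waits.
print(admit_prefills([50, 300, 40, 500], 3, 1, 100))  # → [0, 1, 2]
```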
## 4. ```api_key``` parameter description
Multiple keys can be configured via startup flags; these take precedence over the environment variable configuration:
```bash
--api-key "key1"
--api-key "key2"
```
Environment variable multi-value configuration method (use `,` separation):
```bash
export FD_API_KEY="key1,key2"
```
When making requests with curl, add the authorization header. Any configured `api_key` that matches will pass validation.
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer key1" \
-d '{
"messages": [
{"role": "user", "content": "Hello"}
],
"stream": false,
"return_token_ids": true,
"chat_template_kwargs": {"enable_thinking": true}
}'
```
The system will validate `key1` after parsing `Authorization: Bearer`.
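The key matching above can be sketched as follows (a simplified illustration, not FastDeploy's actual implementation): keys come either from repeated `--api-key` flags or from the comma-separated `FD_API_KEY` variable, and the `Authorization` header is checked against that set.

```python
import os

def load_api_keys(cli_keys=None):
    """CLI-provided keys take precedence; otherwise fall back to FD_API_KEY."""
    if cli_keys:
        return set(cli_keys)
    env = os.environ.get("FD_API_KEY", "")
    return {k for k in env.split(",") if k}

def is_authorized(auth_header, keys):
    """Accept 'Bearer <key>' where <key> matches any configured key."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    return auth_header[len(prefix):] in keys

keys = load_api_keys(["key1", "key2"])
print(is_authorized("Bearer key1", keys))   # → True
print(is_authorized("Bearer wrong", keys))  # → False
```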
When using the OpenAI SDK for requests, pass the `api_key` parameter:
```python
from openai import OpenAI

client = OpenAI(
api_key="your-api-key-here",
base_url="http://localhost:8000/v1"
)
```