# Reasoning Outputs
Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
## Supported Models
| Model Name | Parser Name | Enable Thinking by Default |
|-------------------------------------|-------------|----------------------------|
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
Reasoning models require a parser to be specified so that the reasoning content can be extracted. The reasoning mode can be disabled by passing `"enable_thinking": false` in `chat_template_kwargs`.
Interfaces that support toggling the reasoning mode:

1. `/v1/chat/completions` requests to the OpenAI-compatible server.
2. `/v1/chat/completions` requests via the OpenAI Python client.
3. `llm.chat` requests in the offline interface (see the sketch after this list).
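For the offline interface, the following is a minimal sketch of disabling the reasoning mode for a single request. The `fastdeploy.LLM` import path, the constructor arguments, and the `chat_template_kwargs` keyword of `llm.chat` are assumptions here (mirroring the server launch flags below); check the offline inference docs for the exact signatures.

```python
from fastdeploy import LLM, SamplingParams  # assumed import path for the offline interface

# Hypothetical model path and constructor arguments, mirroring the server launch flags below.
llm = LLM(model="/path/to/your/model", tensor_parallel_size=8, reasoning_parser="ernie-45-vl")

messages = [{"role": "user", "content": "Which era does the cultural relic in the picture belong to"}]

# Disable reasoning for this request via chat_template_kwargs (assumed keyword argument).
outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=512),
    chat_template_kwargs={"enable_thinking": False},
)

for output in outputs:
    # Exact output fields vary by version; reasoning models also expose the reasoning content.
    print(output)
```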
For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `"reasoning_max_tokens": 1024` to the request.
### Quick Start
When launching the model service, specify the parser name using the `--reasoning-parser` argument.
This parser will process the model's output and extract the `reasoning_content` field.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --enable-mm \
    --tensor-parallel-size 8 \
    --port 8192 \
    --quantization wint4 \
    --reasoning-parser ernie-45-vl
```
Next, make a request to the model; the reasoning content should be returned in the response.
```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
      {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
    ]}
  ],
  "chat_template_kwargs": {"enable_thinking": true},
  "reasoning_max_tokens": 1024
}'
```
The `reasoning_content` field contains the reasoning steps to reach the final conclusion, while the `content` field holds the conclusion itself.
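For non-streaming requests made with the OpenAI Python client, the same fields can be read off the returned message. This is a minimal sketch assuming the server launched above; it also assumes the client exposes FastDeploy's extra `reasoning_content` field as an attribute on the message object (hence the `getattr` fallback).

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8192/v1")

response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]},
    ],
    model="vl",
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "reasoning_max_tokens": 1024,
    },
)

message = response.choices[0].message
# reasoning_content is a server-side extension field; use getattr in case it is absent.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```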
### Streaming Chat Completions
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of the chat completion response chunks.
```python
from openai import OpenAI

# Set the API key and API base to use FastDeploy's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8192/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to"},
        ]},
    ],
    model="vl",
    stream=True,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "reasoning_max_tokens": 1024,
    },
)

for chunk in chat_response:
    if chunk.choices[0].delta is not None:
        print(chunk.choices[0].delta, end='')
        print("\n")
```
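To keep the reasoning steps separate from the final answer while streaming, the deltas can be accumulated into two buffers. The sketch below assumes a fresh `chat_response` stream from the request above and that `reasoning_content` is surfaced as an extra attribute on each delta.

```python
reasoning_parts, answer_parts = [], []

for chunk in chat_response:
    delta = chunk.choices[0].delta
    if delta is None:
        continue
    # reasoning_content is a server-side extension field; use getattr in case it is absent.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        reasoning_parts.append(reasoning)
    if delta.content:
        answer_parts.append(delta.content)

print("Reasoning:", "".join(reasoning_parts))
print("Answer:", "".join(answer_parts))
```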