[简体中文](../zh/features/reasoning_output.md)

# Reasoning Outputs

Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.

## Supported Models

| Model Name | Parser Name | Thinking Enabled by Default | Tool Calling | Thinking Switch Parameters |
|------------|-------------|-----------------------------|--------------|----------------------------|
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✅ | ❌ | `"chat_template_kwargs": {"enable_thinking": true/false}` |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✅ | ❌ | `"chat_template_kwargs": {"enable_thinking": true/false}` |
| baidu/ERNIE-4.5-21B-A3B-Thinking | ernie-x1 | ✅ (cannot be turned off) | ✅ | ❌ |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ernie-45-vl-thinking | ✅ (turning off is not recommended) | ✅ | `"chat_template_kwargs": {"options": {"thinking_mode": "open/close"}}` |

A reasoning model requires a specified parser to extract the reasoning content. A model's thinking mode can be turned off by passing its thinking switch parameters from the table above.

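Depending on the parser, the thinking switch takes one of two shapes, as listed in the table above. As a minimal sketch, the corresponding request fragments can be built like this (only `chat_template_kwargs` differs between the two families):

```python
# For ernie-45-vl models, thinking is switched with a boolean flag:
enable_thinking_payload = {
    "chat_template_kwargs": {"enable_thinking": False},  # True re-enables thinking
}

# For the ernie-45-vl-thinking parser, the switch is a string option instead:
thinking_mode_payload = {
    "chat_template_kwargs": {"options": {"thinking_mode": "close"}},  # or "open"
}
```

Either fragment is merged into the request body alongside `messages`, as shown in the examples below.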
Interfaces that support toggling the reasoning mode:

1. `/v1/chat/completions` requests to the OpenAI-compatible server.
2. `/v1/chat/completions` requests via the OpenAI Python client.
3. `llm.chat` requests in the offline inference interface.

For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `"reasoning_max_tokens": 1024` to the request.

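Since `reasoning_max_tokens` is just another top-level request field, a client can attach it to any payload. A sketch, where `with_reasoning_budget` is a hypothetical helper introduced here for illustration only:

```python
# Hypothetical helper (illustration only): attach a reasoning budget
# to an existing request payload without mutating the original.
def with_reasoning_budget(payload: dict, max_tokens: int = 1024) -> dict:
    return {**payload, "reasoning_max_tokens": max_tokens}

request = with_reasoning_budget({"messages": [{"role": "user", "content": "hi"}]})
# `request` now carries "reasoning_max_tokens": 1024 alongside the messages.
```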
### Quick Start

When launching the model service, specify the parser name using the `--reasoning-parser` argument.
This parser will process the model's output and extract the `reasoning_content` field.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --enable-mm \
    --tensor-parallel-size 8 \
    --port 8192 \
    --quantization wint4 \
    --reasoning-parser ernie-45-vl
```

Next, send a request to the model; the response should include the reasoning content.
Taking the baidu/ERNIE-4.5-VL-28B-A3B-Paddle model as an example:

```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "Which era does the cultural relic in the picture belong to?"}
      ]}
    ],
    "chat_template_kwargs": {"enable_thinking": true},
    "reasoning_max_tokens": 1024
  }'
```

The `reasoning_content` field contains the reasoning steps to reach the final conclusion, while the `content` field holds the conclusion itself.

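When reading a non-streaming response, the two fields can be pulled out of the message as follows. This is a sketch over a plain dict shaped like the server's JSON; `resp` below is illustrative sample data, not real model output:

```python
# Illustrative response shaped like the server's non-streaming JSON output.
resp = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "The image shows a bronze vessel, so ...",
            "content": "The relic dates to the Shang dynasty.",
        }
    }]
}

message = resp["choices"][0]["message"]
reasoning = message.get("reasoning_content")  # intermediate reasoning steps
answer = message.get("content")               # the final conclusion
```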
### Streaming chat completions

Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of chat completion response chunks.

```python
from openai import OpenAI

# Set the API key and base URL to point at FastDeploy's OpenAI-compatible server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8192/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to?"},
        ]}
    ],
    model="vl",
    stream=True,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "reasoning_max_tokens": 1024,
    },
)

for chunk in chat_response:
    if chunk.choices[0].delta is not None:
        print(chunk.choices[0].delta, end='')
        print("\n")
```
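Because `reasoning_content` and `content` arrive in separate delta fields, a streaming consumer typically accumulates them into two buffers. A sketch, using stand-in delta objects in place of real stream chunks:

```python
from types import SimpleNamespace

def accumulate(deltas):
    """Collect streamed reasoning and answer text into two separate buffers."""
    reasoning_parts, content_parts = [], []
    for delta in deltas:
        if getattr(delta, "reasoning_content", None):
            reasoning_parts.append(delta.reasoning_content)
        if getattr(delta, "content", None):
            content_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(content_parts)

# Stand-in deltas mimicking chunk.choices[0].delta from the stream above.
deltas = [
    SimpleNamespace(reasoning_content="The vessel looks ", content=None),
    SimpleNamespace(reasoning_content="Shang-era.", content=None),
    SimpleNamespace(reasoning_content=None, content="Shang dynasty."),
]
reasoning, answer = accumulate(deltas)
```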

## Tool Calling

The reasoning content is also available when both tool calling and the reasoning parser are enabled. Tool calling parses functions only from the `content` field, never from `reasoning_content`.

Model request example:

```bash
curl -X POST "http://0.0.0.0:8390/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Get the current weather in BeiJing"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "The city and state e.g. San Francisco, CA"},
              "unit": {"type": "string", "enum": ["c", "f"]}
            },
            "additionalProperties": false,
            "required": ["location", "unit"]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```

Model output example:

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "reasoning_content": "The user asks about ...",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-311b9bda34274722afc654c55c8ce6a0",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"BeiJing\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
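Since the function call lives in `tool_calls` while the reasoning stays in `reasoning_content`, a client can decode the call's arguments independently of the reasoning text. A sketch over the example response above, reproduced as a plain dict:

```python
import json

# The message portion of the example response above.
message = {
    "content": "",
    "reasoning_content": "The user asks about ...",
    "tool_calls": [{
        "id": "chatcmpl-tool-311b9bda34274722afc654c55c8ce6a0",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"BeiJing\", \"unit\": \"c\"}",
        },
    }],
}

call = message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
# call["name"] is "get_weather"; args["location"] is "BeiJing"
```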

For more on tool calling usage, see [Tool Calling](./tool_calling.md).