
[简体中文](../zh/get_started/ernie-4.5-vl-thinking.md)

# Deploy ERNIE-4.5-VL-28B-A3B-Thinking Multimodal Thinking Model

This document explains how to deploy the ERNIE-4.5-VL-28B-A3B-Thinking multimodal model, which supports multimodal input and tool calls (including tool calls that operate on multimodal data). Before deploying, make sure your environment meets the following requirements:

- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 1 × 80 GB A/H GPU

Refer to the [Installation Guide](./installation/README.md) for FastDeploy setup.

## Prepare the Model
Specify `--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking` at deployment time to download the model automatically from AIStudio; interrupted downloads resume where they left off. You can also download the model manually from other sources. Note that FastDeploy requires models in Paddle format. For more details, see [Supported Models](../supported_models.md).

## Launch the Service

Execute the following command to start the service. For parameter configurations, refer to [Parameter Guide](../parameters.md).

```shell
python -m fastdeploy.entrypoints.openai.api_server \
        --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
        --max-model-len 131072 \
        --max-num-seqs 32 \
        --port 8180 \
        --quantization wint8 \
        --reasoning-parser ernie-45-vl-thinking \
        --tool-call-parser ernie-45-vl-thinking \
        --mm-processor-kwargs '{"image_max_pixels": 12845056 }'
```

## Request the Service
After launching, the service is ready when the following logs appear:

```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [13909]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```

### Health Check

Verify service status (HTTP 200 indicates success):

```shell
curl -i http://0.0.0.0:8180/health
```
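The same check can be scripted so that a client waits for the service before sending requests. The following is a minimal sketch using only the Python standard library; the helper names `http_status` and `wait_until_healthy`, the retry count, and the delay are illustrative assumptions, not part of FastDeploy.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200. Return True on success, False otherwise."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage (assumes the service from this guide listens on port 8180):
#   wait_until_healthy("http://0.0.0.0:8180/health")
```

The `probe` and `sleep` parameters exist only so the polling logic can be exercised without a running server.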

### cURL Request
Send requests as follows:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thoughts' as a modern poem"}
  ]
}'
```
For image inputs:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
      {"type":"text", "text":"From which era does the artifact in the image originate?"}
    ]}
  ]
}'
```
Images can also be provided as a base64-encoded string:
```json
{"type":"image_url", "image_url": {"url":"data:image/jpg;base64,this/is/an/example"}}
```
or as an absolute path to a local file:
```json
{"type":"image_url", "image_url": {"url":"file:///this/is/an/example"}}
```
For video inputs:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": [
      {"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
      {"type":"text", "text":"How many apples are in the scene?"}
    ]}
  ]
}'
```
Videos can also be provided as a base64-encoded string:
```json
{"type":"video_url", "video_url": {"url":"data:video/mp4;base64,this/is/an/example"}}
```
or as an absolute path to a local file:
```json
{"type":"video_url", "video_url": {"url":"file:///this/is/an/example"}}
```
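The base64 form for images and videos can be produced with a few lines of Python. The sketch below builds `image_url` and `video_url` content parts in the shape used by the curl examples; the helper names (`to_data_url`, `media_part`, `local_file_part`) are hypothetical, and the MIME type is guessed from the file extension, which is an assumption about what the server accepts.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data, mime_type):
    """Encode raw bytes as a `data:<mime>;base64,<payload>` URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def media_part(kind, url):
    """Build an `image_url` or `video_url` content part for a chat message."""
    if kind not in ("image", "video"):
        raise ValueError("kind must be 'image' or 'video'")
    key = f"{kind}_url"
    return {"type": key, key: {"url": url}}


def local_file_part(kind, path):
    """Read a local file and embed it in the request as a base64 data URL."""
    file_path = Path(path)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    if mime_type is None:
        raise ValueError(f"cannot guess MIME type for {file_path.name}")
    return media_part(kind, to_data_url(file_path.read_bytes(), mime_type))


# Usage (the path is a placeholder):
#   part = local_file_part("image", "/this/is/an/example.jpg")
#   message = {"role": "user", "content": [part, {"type": "text", "text": "Describe the image."}]}
```

Embedding the file as a data URL is useful when the client and the server do not share a filesystem; the `file://` form only works if the path is readable on the serving machine.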
To let the model call tools, include the tool definitions in the request:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d $'{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "image_zoom_in_tool",
                "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "bbox_2d": {
                            "type": "array",
                            "items": {
                                "type": "number"
                            },
                            "minItems": 4,
                            "maxItems": 4,
                            "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
                        },
                        "label": {
                            "type": "string",
                            "description": "The name or label of the object in the specified bounding box (optional)."
                        }
                    },
                    "required": [
                        "bbox_2d"
                    ]
                },
                "strict": false
            }
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Is the old lady on the left side of the empty table behind older couple?"
                }
            ]
        }
    ],
    "stream": false
}'
```

For multi-turn requests whose history already contains tool results, use the command below:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d $'{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Get the current weather in Beijing"
                }
            ]
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": {
                            "location": "Beijing",
                            "unit": "c"
                        }
                    }
                }
            ],
            "content": ""
        },
        {
            "role": "tool",
            "content": [
                {
                    "type": "text",
                    "text": "location: Beijing, temperature: 23, weather: sunny, unit: c"
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Determine weather in my location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "c",
                                "f"
                            ]
                        }
                    },
                    "additionalProperties": false,
                    "required": [
                        "location",
                        "unit"
                    ]
                },
                "strict": true
            }
        }
    ],
    "stream": false
}'
```
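On the client side, the `tool` message in such a history is produced by running the requested function locally. The sketch below shows one way to turn an assistant message with `tool_calls` into `tool` messages shaped like the example above. `get_weather` is a hypothetical stub returning canned data, and `run_tool_calls` is an illustrative helper, not a FastDeploy API.

```python
import json


def get_weather(location, unit):
    """Hypothetical local tool; returns canned data instead of querying a weather API."""
    return f"location: {location}, temperature: 23, weather: sunny, unit: {unit}"


# Maps the tool names declared in the request's `tools` list to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message, tools):
    """Execute every tool call in an assistant message and return `tool` messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        # Responses carry the arguments as a JSON string; decode it when needed.
        if isinstance(arguments, str):
            arguments = json.loads(arguments)
        output = tools[function["name"]](**arguments)
        results.append({
            "role": "tool",
            "content": [{"type": "text", "text": str(output)}],
        })
    return results


# Usage: extend the history, then send it back with the same `tools` list.
#   history = [user_message, assistant_message] + run_tool_calls(assistant_message, TOOLS)
```

Looking tools up by name in a dictionary keeps the dispatch explicit, so a call to an undeclared tool fails with a `KeyError` instead of executing arbitrary code.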

### Python Client (OpenAI-compatible API)

FastDeploy's API is OpenAI-compatible. You can also use Python for streaming requests:

```python
import openai

host = "0.0.0.0"
port = "8180"
# `api_key` and `model` are placeholder values here.
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "From which era does the artifact in the image originate?"},
        ]},
    ],
    stream=True,
)
for chunk in response:
    # Skip chunks without choices or without text, so "None" is never printed.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
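When streaming from a thinking model, it can be useful to keep the reasoning separate from the final answer. The sketch below accumulates both from a stream of chunks. It assumes that streamed deltas expose a `reasoning_content` attribute in the same way the non-streaming message exposes a `reasoning_content` field; this is an assumption, so the code falls back gracefully when the attribute is absent. `collect_stream` and `fake_chunk` are hypothetical helpers, and `fake_chunk` exists only to try the logic without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into a (reasoning, answer) pair of strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        delta = chunk.choices[0].delta
        # Assumption: the delta may carry `reasoning_content`; getattr tolerates its absence.
        reasoning_piece = getattr(delta, "reasoning_content", None)
        content_piece = getattr(delta, "content", None)
        if reasoning_piece:
            reasoning.append(reasoning_piece)
        if content_piece:
            answer.append(content_piece)
    return "".join(reasoning), "".join(answer)


def fake_chunk(content=None, reasoning_content=None):
    """Build a stand-in for a streamed chunk (for offline experiments only)."""
    delta = SimpleNamespace(content=content, reasoning_content=reasoning_content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Usage with a real stream:
#   reasoning, answer = collect_stream(client.chat.completions.create(..., stream=True))
```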

## Model Output
The response returns the reasoning in `reasoning_content`, the final answer in `content`, and any tool calls in `tool_calls`.

Example of a non-streaming result without a tool call:
```json
{
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The artifact in the image ...",
                "multimodal_content": null,
                "reasoning_content": "The user asks about ...",
                "tool_calls": null
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1290,
        "total_tokens": 1681,
        "completion_tokens": 391,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "image_tokens": 1240,
            "video_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 217,
            "image_tokens": 0
        }
    }
}
```
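A client typically pulls the answer, the reasoning, and the token counts out of such a response. The sketch below does this for a response already parsed into a Python dictionary. `summarize_response` is a hypothetical helper, and the `answer_tokens` value assumes that `reasoning_tokens` is counted as part of `completion_tokens`.

```python
def summarize_response(response):
    """Extract the answer, the reasoning, and a token breakdown from a chat completion."""
    choice = response["choices"][0]
    message = choice["message"]
    usage = response["usage"]
    reasoning_tokens = usage["completion_tokens_details"]["reasoning_tokens"]
    return {
        "answer": message["content"],
        "reasoning": message["reasoning_content"],
        "finish_reason": choice["finish_reason"],
        "reasoning_tokens": reasoning_tokens,
        # Assumption: reasoning tokens are included in completion_tokens.
        "answer_tokens": usage["completion_tokens"] - reasoning_tokens,
    }


# The example response above, reduced to the fields used here.
sample = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "The artifact in the image ...",
            "reasoning_content": "The user asks about ...",
            "tool_calls": None,
        },
        "finish_reason": "stop",
    }],
    "usage": {
        "prompt_tokens": 1290,
        "total_tokens": 1681,
        "completion_tokens": 391,
        "completion_tokens_details": {"reasoning_tokens": 217, "image_tokens": 0},
    },
}

summary = summarize_response(sample)
print(summary["finish_reason"], summary["reasoning_tokens"], summary["answer_tokens"])
```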

Example of a non-streaming result with a tool call, where `content` is empty and `finish_reason` is `tool_calls`:

```json
{
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "multimodal_content": null,
                "reasoning_content": "What immediately stands out is that I need to determine the spatial relationship between the old lady, the empty table, and the older couple. The original image might not provide enough detail to make this determination clearly, so I should use the image_zoom_in_tool to focus on the relevant area where these elements are located.\n",
                "tool_calls": [
                    {
                        "id": "chatcmpl-tool-dd0ef62027cf409c8f013af65f88adc3",
                        "type": "function",
                        "function": {
                            "name": "image_zoom_in_tool",
                            "arguments": "{\"bbox_2d\": [285, 235, 999, 652]}"
                        }
                    }
                ]
            },
            "finish_reason": "tool_calls"
        }
    ],
    "usage": {
        "prompt_tokens": 280,
        "total_tokens": 397,
        "completion_tokens": 117,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "image_tokens": 0,
            "video_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 66,
            "image_tokens": 0
        }
    }
}
```
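To act on this tool call, the client has to decode the `arguments` string and map the bounding box back to pixels. According to the tool description in the request example, `bbox_2d` values are normalized to the range 0–1000 relative to the original image dimensions. The sketch below performs that conversion; `bbox_to_pixels` and `parse_zoom_call` are hypothetical helpers, the image size is a made-up example, and rounding to the nearest pixel is a design choice rather than documented behavior.

```python
import json


def bbox_to_pixels(bbox_2d, width, height):
    """Convert an [x1, y1, x2, y2] box normalized to 0-1000 into pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )


def parse_zoom_call(tool_call, width, height):
    """Decode an `image_zoom_in_tool` call into a pixel box and an optional label."""
    arguments = json.loads(tool_call["function"]["arguments"])
    box = bbox_to_pixels(arguments["bbox_2d"], width, height)
    return box, arguments.get("label")


# The tool call from the example response above.
tool_call = {
    "id": "chatcmpl-tool-dd0ef62027cf409c8f013af65f88adc3",
    "type": "function",
    "function": {
        "name": "image_zoom_in_tool",
        "arguments": "{\"bbox_2d\": [285, 235, 999, 652]}",
    },
}

# Assumed image size of 2000 x 1000 pixels, for illustration only.
box, label = parse_zoom_call(tool_call, width=2000, height=1000)
print(box, label)
```

The resulting box can then be passed to an image library of your choice to crop the region, and the cropped image can be returned to the model in a follow-up request.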