mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2025-12-24 13:28:13 +08:00
[Docs] release docs 2.3 (#4951)
Some checks failed
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
* [Docs] release docs 2.3 * modify dockerfiles * fix bug
This commit is contained in:
@@ -1,6 +1,6 @@
-FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
-ARG PADDLE_VERSION=3.2.0
-ARG FD_VERSION=2.2.1
+FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:tag-base
+ARG PADDLE_VERSION=3.2.1
+ARG FD_VERSION=2.3.0

 ENV DEBIAN_FRONTEND=noninteractive
122 docs/best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md Normal file
@@ -0,0 +1,122 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md)

# ERNIE-4.5-VL-28B-A3B-Thinking

## 1. Environment Preparation

### 1.1 Support Status

The minimum number of cards required for deployment on the following hardware is as follows:

| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------------:|:-----:|:-----:|:--------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |

### 1.2 Install FastDeploy

For the installation process, refer to [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md).

## 2. How to Use

### 2.1 Basic: Launching the Service

**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
    --reasoning-parser ernie-45-vl-thinking \
    --tool-call-parser ernie-45-vl-thinking \
    --mm-processor-kwargs '{"image_max_pixels": 12845056}' \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 16384 \
    --quantization wint4
```

**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --max-num-seqs 128 \
    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
    --reasoning-parser ernie-45-vl-thinking \
    --tool-call-parser ernie-45-vl-thinking \
    --mm-processor-kwargs '{"image_max_pixels": 12845056}' \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 16384 \
    --quantization wint4
```

Each example is a configuration that runs stably while delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
### 2.2 Advanced: How to Achieve Better Performance

#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly

> **Context length**

- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length that the model can process.
- **Recommendation:** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128K** (131,072).

⚠️ Note: Longer context lengths significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.

> **Maximum sequence count**

- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle, supporting values from 1 to 256.
- **Recommendation:** If you are unsure of the average number of sequences per request in your application scenario, we recommend setting it to **256**. If the average is significantly lower than 256, set it slightly above the average to further reduce GPU memory usage and optimize service performance.

> **Multi-image and multi-video input**

- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request, ensuring efficient resource utilization.
- **Recommendation:** We recommend allowing up to **100 images and 100 videos** per prompt to balance performance and memory usage.

> **Available GPU memory ratio during initialization**

- **Parameter:** `--gpu-memory-utilization`
- **Description:** Controls the GPU memory available for FastDeploy service initialization. The default value is 0.9, meaning 10% of memory is kept in reserve.
- **Recommendation:** Use the default value of 0.9. If an "out of memory" error occurs during stress testing, try reducing this value.

#### 2.2.2 Chunked Prefill

- **Parameter:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` can reduce peak GPU memory usage and improve service throughput. Since version 2.2 it is **enabled by default**; for earlier versions, enable it manually (refer to the best-practices documentation for 2.1).
- **Related configuration:**

  `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 384.

#### 2.2.3 **Quantization precision**

- **Parameter:** `--quantization`

- **Supported precision types:**
  - WINT4 (suitable for most users)
  - WINT8
  - BFLOAT16 (used by default when `--quantization` is not set)

- **Recommendation:**
  - Unless you have extremely stringent precision requirements, we strongly recommend WINT4 quantization. It significantly reduces memory consumption and increases throughput.
  - If slightly higher precision is required, try WINT8.
  - Consider BFLOAT16 only if your application demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**

> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`

- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
- **Recommendation:** This is a relatively aggressive optimization that can affect output results, and we are still validating its impact comprehensively. If you have high performance requirements and can accept potential compromises in results, consider enabling this strategy.
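In shell terms, enabling this is just an environment variable exported before launch (a minimal sketch; the launch command itself follows section 2.1):

```shell
# Opt in to rejection sampling; note this can change sampling results,
# so validate output quality for your workload first.
export FD_SAMPLING_CLASS=rejection
echo "FD_SAMPLING_CLASS=$FD_SAMPLING_CLASS"
```

Launch `fastdeploy.entrypoints.openai.api_server` from the same shell so the service process inherits the variable.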
## 3. FAQ

### 3.1 Out of Memory

If the service reports "Out of Memory" during startup, try the following:

1. Ensure no other processes are occupying GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce the context length and maximum sequence count as needed;
4. Deploy on more GPUs (e.g., 2 or 4 cards) by setting `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.

If the service starts normally but later reports insufficient memory, try:

1. Lowering the initial GPU memory utilization ratio via `--gpu-memory-utilization`;
2. Increasing the number of deployment GPUs (parameter adjustment as above).
98 docs/best_practices/GLM-4-MoE-Text.md Normal file
@@ -0,0 +1,98 @@
[简体中文](../zh/best_practices/GLM-4-MoE-Text.md)

# GLM-4.5/4.6 Text Model

## Environment Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `GLM-4.5/4.6` on the following hardware, for each quantization, is as follows:

| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
| H800 80GB | 4 | 4 | 4 |
| A800 80GB | 4 | 4 | / |

**Tips:**
1. To change the number of deployment GPUs, specify `--tensor-parallel-size 4` in the startup command.
2. For hardware not listed in the table, you can estimate deployability based on GPU memory.
3. FP8 quantization is recommended.

### 1.2 Install FastDeploy and prepare the model
- Installation: for details, refer to [FastDeploy Installation](../get_started/installation/README.md).
- Model download: for details, refer to [Supported Models](../supported_models.md).

## 2. How to Use
### 2.1 Basic: Launching the Service
Example 1: H100 4-GPU BF16 Deployment with 16K Context
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model zai-org/GLM-4.5-Air \
    --tensor-parallel-size 4 \
    --port 8185 \
    --max-model-len 16384
```

Example 2: H100 4-GPU FP8 Inference Deployment
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model zai-org/GLM-4.5-Air \
    --tensor-parallel-size 4 \
    --port 8185 \
    --quantization wfp8afp8
```
- `--quantization`: the quantization strategy used by the model. Different strategies trade off performance and accuracy. It can be one of `wint8` / `wint4` / `wfp8afp8` (Hopper is required).
- `--max-model-len`: the maximum number of tokens supported by the deployed service. Larger values support longer context lengths but occupy more GPU memory, which may reduce concurrency.

For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length:
- Set `max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the average output length is 30000, setting it to 32768 is recommended.
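The sizing rule above can be sketched as a quick calculation (assuming the 1000-token average input and 30000-token average output from the example):

```shell
# Round (average input + average output) up to the next power of two to get
# a max-model-len candidate; 1000 and 30000 are the example figures above.
max_model_len=$(python3 -c '
need = 1000 + 30000
n = 1
while n < need:
    n *= 2
print(n)')
echo "$max_model_len"
```

This prints 32768, matching the recommendation above.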
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of input sequences, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md).

**How to enable:**
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.

For versions 2.1 and earlier, enable it manually by adding the following startup parameters, where `--enable-prefix-caching` turns on prefix caching and `--swap-space` adds a CPU cache on top of the GPU cache. The size is in GB and should be adjusted to the actual machine; the recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs occupy memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
```
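As a worked instance of the `(total machine memory - model size) * 20%` rule of thumb above (the 1000 GB total RAM and 700 GB model footprint here are illustrative assumptions, not measurements):

```shell
# swap-space in GB = (total host memory - model size) * 20%
swap_gb=$(python3 -c 'print(int((1000 - 700) * 0.20))')
echo "--swap-space $swap_gb"
```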
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits prefill-stage requests into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (prefill) and memory-intensive (decode) operations, optimizes GPU resource utilization, and reduces the computation and memory usage of a single prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, refer to [Chunked Prefill](../features/chunked_prefill.md).

**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.

For versions 2.1 and earlier, enable it manually by adding
```
--enable-chunked-prefill
```

#### 2.2.4 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures sequences of CUDA operations into a graph structure, encapsulating a series of GPU computation and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.

**How to enable:**
Before version 2.3, it must be enabled with `--use-cudagraph`.
Since version 2.3, CUDAGraph is enabled by default in some scenarios; it is automatically disabled for features that are not compatible with it (speculative decoding, RL training, multimodal models).
Notes:
- Usually no additional parameters are needed, but CUDAGraph introduces some extra memory overhead, which may need tuning in memory-constrained scenarios. For detailed parameter adjustments, see the configuration descriptions in [GraphOptimizationBackend](../features/graph_optimization.md).

#### 2.2.5 Rejection Sampling
**Idea:**
Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting to increase sampling speed; this yields a significant improvement for small models.

**How to enable:**
Add the following environment variable before starting:
```
export FD_SAMPLING_CLASS=rejection
```

## FAQ
If you encounter any problems during use, refer to the [FAQ](./FAQ.md).
@@ -26,6 +26,7 @@ We recommend using mpirun for one-command startup without manually starting each

4. Ensure all nodes can resolve each other's hostnames

* Online inference startup example:

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -39,6 +40,7 @@ We recommend using mpirun for one-command startup without manually starting each
```

* Offline startup example:

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
```
@@ -5,12 +5,14 @@

Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.

## Supported Models
-| Model Name | Parser Name | Eable_thinking by Default |
-|----------------|----------------|---------------------------|
-| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
-| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
+| Model Name | Parser Name | Enable thinking by Default | Tool Calling | Thinking switch parameters |
+|---------------|-------------|---------|---------|----------------|
+| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✅ | ❌ | "chat_template_kwargs":{"enable_thinking": true/false} |
+| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✅ | ❌ | "chat_template_kwargs":{"enable_thinking": true/false} |
+| baidu/ERNIE-4.5-21B-A3B-Thinking | ernie-x1 | ✅ Not supported for turning off | ✅ | ❌ |
+| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ernie-45-vl-thinking | ✅ Not recommended to turn off | ✅ | "chat_template_kwargs": {"options": {"thinking_mode": "open/close"}} |

-The reasoning model requires a specified parser to extract reasoning content. The reasoning mode can be disabled by setting the `"enable_thinking": false` parameter.
+The reasoning model requires a specified parser to extract reasoning content. Refer to each model's `Thinking switch parameters` to turn off its thinking mode.

Interfaces that support toggling the reasoning mode:
1. `/v1/chat/completions` requests in OpenAI services.
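As a concrete sketch of the thinking switch column above (a hypothetical request body for an `ernie-45-vl` model; the port follows the serving example elsewhere in this document):

```shell
# Build and sanity-check a request body that turns thinking off via
# chat_template_kwargs, then POST it to a running service.
cat > disable_thinking.json <<'EOF'
{
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
EOF
python3 -m json.tool disable_thinking.json
```

Once the service is up, send it with `curl -X POST "http://0.0.0.0:8192/v1/chat/completions" -H "Content-Type: application/json" -d @disable_thinking.json`.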
@@ -34,6 +36,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
```

Next, make a request to the model, which should return the reasoning content in the response.
Taking the baidu/ERNIE-4.5-VL-28B-A3B-Paddle model as an example:

```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
@@ -81,3 +84,78 @@ for chunk in chat_response:
    print(chunk.choices[0].delta, end='')
    print("\n")
```
## Tool Calling
The reasoning content is also available when both tool calling and the reasoning parser are enabled. Note that tool calling only parses functions from the `content` field, not from `reasoning_content`.

Model request example:
```bash
curl -X POST "http://0.0.0.0:8390/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "Get the current weather in BeiJing"
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Determine weather in my location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state e.g. San Francisco, CA"
                            },
                            "unit": {
                                "type": "string",
                                "enum": ["c", "f"]
                            }
                        },
                        "additionalProperties": false,
                        "required": ["location", "unit"]
                    },
                    "strict": true
                }
            }
        ],
        "stream": false
    }'
```
Model output example:

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "reasoning_content": "The user asks about ...",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-311b9bda34274722afc654c55c8ce6a0",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"BeiJing\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
More reference documentation on tool calling usage: [Tool Calling](./tool_calling.md)
227 docs/features/tool_calling.md Normal file
@@ -0,0 +1,227 @@
# Tool Calling

This document describes how to configure the FastDeploy server to use a tool parser, and how to invoke tools from the client.

## Tool call parsers for ERNIE series models
| Model Name | Parser Name |
|---------------|-------------|
| baidu/ERNIE-4.5-21B-A3B-Thinking | ernie-x1 |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ernie-45-vl-thinking |

## Quickstart

### Starting FastDeploy with Tool Calling Enabled

Launch the server with tool calling enabled. This example uses ERNIE-4.5-21B-A3B and leverages the ernie-x1 reasoning parser and the ernie-x1 tool-call parser from the fastdeploy directory to extract the model's reasoning content, response content, and tool-calling information:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /models/ERNIE-4.5-21B-A3B \
    --port 8000 \
    --reasoning-parser ernie-x1 \
    --tool-call-parser ernie-x1
```
### Example of triggering tool calling
Make a request containing tools to trigger the model to use an available tool:
```bash
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "What'\''s the weather in Beijing?"
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather in a given location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "City name, for example: Beijing"
                            },
                            "unit": {
                                "type": "string",
                                "enum": ["c", "f"],
                                "description": "Temperature units: c = Celsius, f = Fahrenheit"
                            }
                        },
                        "required": ["location", "unit"],
                        "additionalProperties": false
                    },
                    "strict": true
                }
            }
        ],
        "stream": false
    }'
```
The example output is as follows. It shows that the model's thought process (`reasoning_content`) and tool call information (`tool_calls`) were successfully parsed; the response content (`content`) is empty and `finish_reason` is `tool_calls`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "User wants to ... ",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-bc90641c67e44dbfb981a79bc986fbe5",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"北京\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
## Parallel Tool Calls
If the model generates parallel tool calls, FastDeploy returns a list:
```
tool_calls=[
    {"id": "...", "function": {...}},
    {"id": "...", "function": {...}}
]
```
## Requests containing tools in the conversation history
If tool-call information exists in previous turns, you can construct the request as follows:
```bash
curl -X POST "http://0.0.0.0:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "Hello, what'\''s the weather in Beijing?"
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_1",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": {
                                "location": "Beijing",
                                "unit": "c"
                            }
                        }
                    }
                ],
                "thoughts": "Users need to check today'\''s weather in Beijing."
            },
            {
                "role": "tool",
                "tool_call_id": "call_1",
                "content": {
                    "type": "text",
                    "text": "{\"location\": \"北京\",\"temperature\": \"23\",\"weather\": \"晴\",\"unit\": \"c\"}"
                }
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Determine weather in my location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state e.g. San Francisco, CA"
                            },
                            "unit": {
                                "type": "string",
                                "enum": ["c", "f"]
                            }
                        },
                        "additionalProperties": false,
                        "required": ["location", "unit"]
                    },
                    "strict": true
                }
            }
        ],
        "stream": false
    }'
```
The parsed model output is as follows, containing the thought content `reasoning_content` and the response content `content`, with `finish_reason` set to `stop`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Today's weather in Beijing is sunny with a temperature of 23 degrees Celsius.",
        "reasoning_content": "User wants to ...",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ]
}
```
## Writing a Custom Tool Parser
FastDeploy supports custom tool parser plugins. You can use the parsers under `fastdeploy/entrypoints/openai/tool_parser` as references when creating a `tool parser`.

A custom parser should implement:
``` python
# import the required packages

# register the tool parser with ToolParserManager
@ToolParserManager.register_module("my-parser")
class ToolParser:
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # implement tool-call parsing for non-streaming calls
    def extract_tool_calls(self, model_output: str, request: ChatCompletionRequest) -> ExtractedToolCallInformation:
        return ExtractedToolCallInformation(tools_called=False, tool_calls=[], content=model_output)

    # implement tool-call parsing for streaming calls
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        return delta
```
Enable via:
``` bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model path> \
    --tool-parser-plugin <absolute path of the plugin file> \
    --tool-call-parser my-parser
```

---
331 docs/get_started/ernie-4.5-vl-thinking.md Normal file
@@ -0,0 +1,331 @@
[简体中文](../zh/get_started/ernie-4.5-vl-thinking.md)

# Deploy the ERNIE-4.5-VL-28B-A3B-Thinking Multimodal Thinking Model

This document explains how to deploy the ERNIE-4.5-VL multimodal model, which supports user interaction via multimodal data and tool calls (including on multimodal data). Ensure your hardware meets the requirements before deployment.

- GPU driver >= 535
- CUDA >= 12.3
- cuDNN >= 9.5
- Linux x86_64
- Python >= 3.10
- 1 x 80G A/H GPU

Refer to the [Installation Guide](./installation/README.md) for FastDeploy setup.

## Prepare the Model
Specify `--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking` during deployment to automatically download the model from AIStudio with resumable downloads. You can also download the model manually from other sources. Note that FastDeploy requires Paddle-format models. For more details, see [Supported Models](../supported_models.md).

## Launch the Service

Execute the following command to start the service. For parameter configurations, refer to the [Parameter Guide](../parameters.md).

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --port 8180 \
    --quantization wint8 \
    --reasoning-parser ernie-45-vl-thinking \
    --tool-call-parser ernie-45-vl-thinking \
    --mm-processor-kwargs '{"image_max_pixels": 12845056}'
```
## Request the Service
After launching, the service is ready when the following logs appear:

```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [13909]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```

### Health Check

Verify service status (HTTP 200 indicates success):

```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests as follows:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Rewrite Li Bai'\''s poem Quiet Night Thoughts as a modern poem"}
        ]
    }'
```

For image inputs:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
                {"type": "text", "text": "From which era does the artifact in the image originate?"}
            ]}
        ]
    }'
```

For video inputs:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": [
                {"type": "video_url", "video_url": {"url": "https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
                {"type": "text", "text": "How many apples are in the scene?"}
            ]}
        ]
    }'
```
If the input includes tool calls, send requests with the command below:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "image_zoom_in_tool",
          "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
          "parameters": {
            "type": "object",
            "properties": {
              "bbox_2d": {
                "type": "array",
                "items": {
                  "type": "number"
                },
                "minItems": 4,
                "maxItems": 4,
                "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
              },
              "label": {
                "type": "string",
                "description": "The name or label of the object in the specified bounding box (optional)."
              }
            },
            "required": [
              "bbox_2d"
            ]
          },
          "strict": false
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Is the old lady on the left side of the empty table behind older couple?"
          }
        ]
      }
    ],
    "stream": false
  }'
```

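Since `bbox_2d` values are normalized to the range 0–1000, a client that actually executes the zoom tool must map them back to pixel coordinates before cropping. A minimal sketch of that conversion (the helper name and the example image size are illustrative, not part of FastDeploy):

```python
def denormalize_bbox(bbox_2d, width, height):
    """Map a [x1, y1, x2, y2] bbox normalized to 0-1000 back to pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    return (int(x1 / 1000 * width), int(y1 / 1000 * height),
            int(x2 / 1000 * width), int(y2 / 1000 * height))

# e.g. for a hypothetical 1280x960 source image:
print(denormalize_bbox([285, 235, 999, 652], 1280, 960))  # (364, 225, 1278, 625)
```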
For multi-turn requests whose history already contains tool results, use the command below:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Get the current weather in Beijing"
          }
        ]
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "content": ""
      },
      {
        "role": "tool",
        "content": [
          {
            "type": "text",
            "text": "location: Beijing, temperature: 23, weather: sunny, unit: c"
          }
        ]
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ]
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```

### Python Client (OpenAI-compatible API)

FastDeploy's API is OpenAI-compatible. You can also use Python for streaming requests:

```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "From which era does the artifact in the image originate?"},
        ]},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```

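When the reasoning parser is enabled, streamed deltas carry `reasoning_content` separately from `content`, so a client usually accumulates the two fields independently. A minimal accumulator sketch (the delta dicts below are illustrative stand-ins for the SDK's delta objects):

```python
def collect_deltas(deltas):
    """Concatenate reasoning_content and content fields from streamed deltas."""
    reasoning, answer = [], []
    for d in deltas:
        if d.get("reasoning_content"):
            reasoning.append(d["reasoning_content"])
        if d.get("content"):
            answer.append(d["content"])
    return "".join(reasoning), "".join(answer)

print(collect_deltas([
    {"reasoning_content": "The user asks "},
    {"reasoning_content": "about the artifact."},
    {"content": "It dates "},
    {"content": "to the Northern Qi."},
]))  # ('The user asks about the artifact.', 'It dates to the Northern Qi.')
```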
## Model Output

Example output with reasoning (reasoning content in `reasoning_content`, response in `content`, tool calls in `tool_calls`):

Example of a non-streaming result without a tool call:

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The artifact in the image ...",
        "multimodal_content": null,
        "reasoning_content": "The user asks about ...",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1290,
    "total_tokens": 1681,
    "completion_tokens": 391,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "image_tokens": 1240,
      "video_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 217,
      "image_tokens": 0
    }
  }
}
```

Example of a non-streaming result with a tool call, where the `content` field is empty and `finish_reason` is `tool_calls`:

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "What immediately stands out is that I need to determine the spatial relationship between the old lady, the empty table, and the older couple. The original image might not provide enough detail to make this determination clearly, so I should use the image_zoom_in_tool to focus on the relevant area where these elements are located.\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-dd0ef62027cf409c8f013af65f88adc3",
            "type": "function",
            "function": {
              "name": "image_zoom_in_tool",
              "arguments": "{\"bbox_2d\": [285, 235, 999, 652]}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 280,
    "total_tokens": 397,
    "completion_tokens": 117,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "image_tokens": 0,
      "video_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 66,
      "image_tokens": 0
    }
  }
}
```

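To continue the conversation, the client runs the requested tool itself and sends the result back as a `role: "tool"` message, as in the multi-turn weather example earlier. A minimal sketch of assembling that follow-up history (the helper is illustrative, not a FastDeploy API):

```python
import json

def append_tool_result(messages, assistant_message, tool_result_text):
    """Append the assistant's tool call and the tool's result to the chat history."""
    history = list(messages)
    history.append({
        "role": "assistant",
        "content": assistant_message.get("content", ""),
        "tool_calls": assistant_message["tool_calls"],
    })
    history.append({
        "role": "tool",
        "content": [{"type": "text", "text": tool_result_text}],
    })
    return history

# The "arguments" field arrives as a JSON string, so parse it before running the tool:
arguments = json.loads("{\"bbox_2d\": [285, 235, 999, 652]}")
print(arguments["bbox_2d"])  # [285, 235, 999, 652]
```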
@@ -22,7 +22,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
 ### Start Container

 ```bash
-docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
+docker run -itd --name paddle_infer --network host -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
 docker exec -it paddle_infer bash
 ```

@@ -432,7 +432,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
 ### Start Container

 ```bash
-docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
+docker run -itd --name paddle_infer --network host -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
 docker exec -it paddle_infer bash
 ```

@@ -441,8 +441,8 @@ docker exec -it paddle_infer bash
 ### Install paddle

 ```bash
-pip3 install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
-pip3 install paddle-iluvatar-gpu==3.0.0.dev20250926 -i https://www.paddlepaddle.org.cn/packages/nightly/ixuca/
+pip3 install paddlepaddle==3.3.0.dev20251028 -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
+pip3 install paddle-iluvatar-gpu==3.0.0.dev20251029 -i https://www.paddlepaddle.org.cn/packages/nightly/ixuca/
 ```
 For the latest Paddle version on Iluvatar, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)

@@ -556,3 +556,80 @@ generated_text=

这件佛像具有典型的北齐风格,佛像结跏趺坐于莲花座上,身披通肩袈裟,面部圆润,神态安详,体现了北齐佛教艺术的独特魅力。
```

## Testing thinking models

### ERNIE-4.5-21B-A3B-Thinking
Refer to the [GPU doc](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md); the commands are below:

server:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Thinking \
    --port 8180 \
    --load-choices "default_v1" \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --quantization wint8 \
    --block-size 16 \
    --reasoning-parser ernie_x1 \
    --tool-call-parser ernie_x1 \
    --max-num-seqs 8
```

client:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write me a poem about large language models."}
    ]
  }'
```

### ERNIE-4.5-VL-28B-A3B
Refer to the [GPU doc](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/get_started/ernie-4.5-vl.md), set `"chat_template_kwargs": {"enable_thinking": true}`, and use the commands below:

server:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
    --port 8180 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --quantization wint8 \
    --block-size 16 \
    --limit-mm-per-prompt '{"image": 100, "video": 100}' \
    --reasoning-parser ernie-45-vl \
    --max-num-seqs 8
```

client:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "From which era does the artifact in the image originate?"}
      ]}
    ],
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```

@@ -15,7 +15,7 @@ The following installation methods are available when your environment meets the
 **Notice**: The pre-built image only supports SM80/90 GPUs (e.g. H800/A800); if you are deploying on SM86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.

 ```shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.3.0-rc0
 ```

 ## 2. Pre-built Pip Installation

@@ -23,7 +23,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 ```shell
 # Install stable release
-python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

 # Install latest Nightly build
 python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/

@@ -34,7 +34,7 @@ Then install fastdeploy. **Do not install from PyPI**. Use the following methods
 For SM80/90 architecture GPUs (e.g. A30/A100/H100):
 ```
 # Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+python -m pip install fastdeploy-gpu==2.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

 # Install latest Nightly build
 python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

@@ -43,7 +43,7 @@ python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages
 For SM86/89 architecture GPUs (e.g. A10/4090/L20/L40):
 ```
 # Install stable release
-python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+python -m pip install fastdeploy-gpu==2.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

 # Install latest Nightly build
 python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

@@ -64,7 +64,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .

 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 ```shell
-python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```

 Then clone the source code and build:

@@ -29,6 +29,7 @@
 |QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
 |DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
 |DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|GLM-4.5/4.6|BF16/FP8/WINT4|⛔|✅|✅|🚧|✅|128K|

 ```
 ✅ Supported 🚧 In Progress ⛔ No Plan

@@ -40,6 +40,8 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```use_cudagraph``` | `bool` | __[DEPRECATED]__ CUDAGraph is enabled by default since version 2.3. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. |
 | ```graph_optimization_config``` | `dict[str]` | Configures parameters related to computation graph optimization; the default value is `'{"use_cudagraph":true, "graph_opt_level":0}'`. For a detailed description, see [graph_optimization.md](./features/graph_optimization.md) |
 | ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
+| ```use_internode_ll_two_stage``` | `bool` | Use two-stage communication in DeepEP MoE, default: False |
+| ```disable_sequence_parallel_moe``` | `bool` | Disable sequence-parallel MoE, default: False |
 | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
 | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
 | ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` |

@@ -49,6 +51,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
 | ```enable_logprob``` | `bool` | Whether to return log probabilities of the output tokens. If true, the log probabilities of each output token are returned in the content of the message. If logprob is not used, this parameter can be omitted at startup |
 | ```logprobs_mode``` | `str` | Indicates the content returned in the logprobs. Supported modes: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. Raw means the values before applying logit processors, like bad words. Processed means the values after applying such processors. |
+| ```max_logprobs``` | `int` | Maximum number of log probabilities to return, default: 20. -1 means vocab_size. |
 | ```served_model_name```| `str`| The model name used in the API. If not specified, the model name will be the same as the --model argument |
 | ```revision``` | `str` | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version. |
 | ```chat_template``` | `str` | Specify the template used for model concatenation. It supports both string input and file path input. The default value is None. If not specified, the model's default template will be used. |

@@ -57,6 +60,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used. |
 | ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (use 0 to disable). |
 | ```max_processor_cache``` | `int` | Maximum size (in GiB) of the processor cache (use 0 to disable). |
+| ```api_key``` | `dict[str]` | Validate API keys in the service request headers, supporting multiple key inputs |

 ## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

@@ -81,3 +85,41 @@ In actual inference, it's difficult for users to know how to properly configure
When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the batch token count in the prefill phase (limiting the single prefill token count), so the `max_num_partial_prefills` parameter is introduced specifically to limit the number of concurrently processed partial batches.

To optimize scheduling priority for short requests, the new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in a single prefill batch, the latter defines the token threshold for long requests. The system prioritizes batch space for short requests, thereby reducing short-request latency in mixed workload scenarios while maintaining stable throughput.

## 4. ```api_key``` parameter description

Multiple keys can be configured at startup; this configuration takes precedence over the environment variable:
```bash
--api-key "key1"
--api-key "key2"
```
To configure multiple values through the environment variable, separate them with `,`:
```bash
export FD_API_KEY="key1,key2"
```

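Conceptually, the two chunked-prefill caps above gate which requests are admitted into a single prefill batch. An illustrative sketch of such a policy (not FastDeploy's actual scheduler; requests are represented by their prompt token counts):

```python
def admit_prefills(requests, max_num_partial_prefills,
                   max_long_partial_prefills, long_prefill_token_threshold):
    """Greedily admit requests into one prefill batch under the partial-prefill caps."""
    batch, long_count = [], 0
    for tokens in requests:
        if len(batch) == max_num_partial_prefills:
            break
        is_long = tokens >= long_prefill_token_threshold
        if is_long and long_count == max_long_partial_prefills:
            continue  # long-request quota used up; keep room for short requests
        batch.append(tokens)
        long_count += is_long
    return batch

print(admit_prefills([9000, 120, 8000, 50, 70], 3, 1, 4096))  # [9000, 120, 50]
```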
When making requests with curl, add the authentication header. Any matching `api_key` will pass.

```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The system validates `key1` after parsing the `Authorization: Bearer` header.
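Conceptually, the validation is a membership check on the token parsed out of the `Authorization: Bearer` header. An illustrative sketch (not FastDeploy's actual implementation):

```python
def is_authorized(auth_header, valid_keys):
    """Return True if the header is 'Bearer <key>' and <key> is a configured api_key."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return False
    return auth_header[len(prefix):] in valid_keys

print(is_authorized("Bearer key1", {"key1", "key2"}))  # True
print(is_authorized("Bearer key3", {"key1", "key2"}))  # False
```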

When using the OpenAI SDK for requests, pass the `api_key` parameter:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1"
)
```

@@ -3,3 +3,4 @@ mkdocs-get-deps
 mkdocs-material
 mkdocs-material-extensions
 mkdocs-multilang
+mkdocs-static-i18n

@@ -35,13 +35,14 @@ These models accept text input.

 |Models|DataType|Example HF Model|
 |-|-|-|
-|⭐ERNIE|BF16\WINT4\WINT8\W4A8C8\WINT2\FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Paddle<br> [quick start](./get_started/ernie-4.5.md)   [best practice](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);<br>baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Base-Paddle;<br>[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);<br>baidu/ERNIE-4.5-21B-A3B-Base-Paddle;<br>baidu/ERNIE-4.5-21B-A3B-Thinking;<br>baidu/ERNIE-4.5-0.3B-Paddle<br> [quick start](./get_started/quick_start.md)   [best practice](./best_practices/ERNIE-4.5-0.3B-Paddle.md);<br>baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.|
+|⭐ERNIE|BF16\WINT4\WINT8\W4A8C8\WINT2\FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Paddle<br> [quick start](./get_started/ernie-4.5.md)   [best practice](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);<br>baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Base-Paddle;<br>[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);<br>baidu/ERNIE-4.5-21B-A3B-Base-Paddle;<br>baidu/ERNIE-4.5-21B-A3B-Thinking;<br>[baidu/ERNIE-4.5-VL-28B-A3B-Thinking](./get_started/ernie-4.5-vl-thinking.md);<br>baidu/ERNIE-4.5-0.3B-Paddle<br> [quick start](./get_started/quick_start.md)   [best practice](./best_practices/ERNIE-4.5-0.3B-Paddle.md);<br>baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.|
 |⭐QWEN3-MOE|BF16/WINT4/WINT8/FP8|Qwen/Qwen3-235B-A22B;<br>Qwen/Qwen3-30B-A3B, etc.|
 |⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
 |⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
 |⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32B, etc.|
 |⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
 |⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
+|⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6<br> [best practice](./best_practices/GLM-4-MoE-Text.md) etc.|

 ## Multimodal Language Models

@@ -49,7 +50,7 @@ These models accept multi-modal inputs (e.g., images and text).

 |Models|DataType|Example HF Model|
 |-|-|-|
-| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [quick start](./get_started/ernie-4.5-vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [quick start](./get_started/quick_start_vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
+| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [quick start](./get_started/ernie-4.5-vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [quick start](./get_started/quick_start_vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Thinking<br> [quick start](./get_started/ernie-4.5-vl-thinking.md)   [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md) ;|
 | PaddleOCR-VL |BF16/WINT4/WINT8| PaddlePaddle/PaddleOCR-VL<br>  [best practice](./best_practices/PaddleOCR-VL-0.9B.md) ;|
 | QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|

@@ -19,9 +19,9 @@
 |ERNIE-4.5-0.3B|128K|WINT8|1 (Recommended)|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --load-choices "default"|2.3.0|
 |ERNIE-4.5-300B-A47B-W4A8C8-TP4|32K|W4A8|4|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "W4A8" \ <br> --gpu-memory-utilization 0.9 \ <br> --load-choices "default"|2.3.0|
 |ERNIE-4.5-VL-28B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 10 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --load-choices "default"|2.3.0|
-|ERNIE-4.5-VL-424B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-424B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 8 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --load-choices "default"|2.3.0|
+|ERNIE-4.5-VL-424B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-424B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 8 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --gpu-memory-utilization 0.7 \ <br> --load-choices "default"|2.3.0|
 |PaddleOCR-VL-0.9B|32K|BF16|1|export FD_ENABLE_MAX_PREFILL=1 <br>export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/PaddleOCR-VL \ <br> --port 8188 \ <br> --metrics-port 8181 \ <br> --engine-worker-queue-port 8182 \ <br> --max-model-len 16384 \ <br> --max-num-batched-tokens 16384 \ <br> --gpu-memory-utilization 0.8 \ <br> --max-num-seqs 256|2.3.0|
-|ERNIE-4.5-VL-28B-A3B-Thinking|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 131072 \ <br> --max-num-seqs 32 \ <br> --engine-worker-queue-port 8189 \ <br> --metrics-port 8190 \ <br> --cache-queue-port 8191 \ <br> --reasoning-parser ernie-45-vl-thinking \ <br> --tool-call-parser ernie-45-vl-thinking \ <br> --mm-processor-kwargs \ '{"image_max_pixels": 12845056 }' \ <br> --load-choices "default_v1"|2.3.0|
+|ERNIE-4.5-VL-28B-A3B-Thinking|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 131072 \ <br> --max-num-seqs 32 \ <br> --engine-worker-queue-port 8189 \ <br> --metrics-port 8190 \ <br> --cache-queue-port 8191 \ <br> --reasoning-parser ernie-45-vl-thinking \ <br> --tool-call-parser ernie-45-vl-thinking \ <br> --mm-processor-kwargs '{"image_max_pixels": 12845056 }' \ <br> --load-choices "default_v1"|2.3.0|

 ## Quick start

@@ -164,7 +164,7 @@ for chunk in response:
     if chunk.choices[0].delta is not None and chunk.choices[0].delta.role != 'assistant':
         reasoning_content = get_str(chunk.choices[0].delta.reasoning_content)
         content = get_str(chunk.choices[0].delta.content)
-        print(reasoning_content + content + is_reason, end='', flush=True)
+        print(reasoning_content + content, end='', flush=True)
 print('\n')
 ```

@@ -245,70 +245,185 @@ print('\n')
 Deploy the ERNIE-4.5-VL-28B-A3B-Thinking model with WINT8 precision and 128K context length on 1 XPU

 ```bash
-export XPU_VISIBLE_DEVICES="0"# Specify any card
+export XPU_VISIBLE_DEVICES="0" # Specify any card
 python -m fastdeploy.entrypoints.openai.api_server \
   --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \
   --port 8188 \
   --tensor-parallel-size 1 \
   --quantization "wint8" \
   --max-model-len 131072 \
   --max-num-seqs 32 \
   --engine-worker-queue-port 8189 \
   --metrics-port 8190 \
   --cache-queue-port 8191 \
   --reasoning-parser ernie-45-vl-thinking \
   --tool-call-parser ernie-45-vl-thinking \
   --mm-processor-kwargs '{"image_max_pixels": 12845056 }' \
   --load-choices "default_v1"
 ```

 #### Send requests

 Initiate a service request through the following command
 ```bash
 curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "messages": [
-      {"role": "user", "content": [
-        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg", "detail": "high"}},
-        {"type": "text", "text": "Please describe the content of the image"}
-      ]}
-    ],
-    "metadata": {"enable_thinking": true}
+      {"role": "user", "content": "Adapt the poem Silent Night Thoughts by Li Bai into a modern poem"}
+    ]
   }'
 ```
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
ip = "0.0.0.0"
|
||||
service_http_port = "8188"
|
||||
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="default",
|
||||
messages=[
|
||||
{"role": "user", "content": [
|
||||
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg", "detail": "high"}},
|
||||
{"type": "text", "text": "Please describe the content of the image"}
|
||||
When inputting images, initiate a request using the following command
|
||||
```

When inputting images, initiate a request using the following command

```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "Which era does the cultural relic in the picture belong to?"}
      ]}
    ]
  }'
```

When inputting a video, initiate a request using the following command

```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "video_url", "video_url": {"url": "https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
        {"type": "text", "text": "How many apples are there in the picture?"}
      ]}
    ]
  }'
```

When the input contains a tool call, initiate the request with the following command

```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "image_zoom_in_tool",
          "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
          "parameters": {
            "type": "object",
            "properties": {
              "bbox_2d": {
                "type": "array",
                "items": {
                  "type": "number"
                },
                "minItems": 4,
                "maxItems": 4,
                "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
              },
              "label": {
                "type": "string",
                "description": "The name or label of the object in the specified bounding box (optional)."
              }
            },
            "required": [
              "bbox_2d"
            ]
          },
          "strict": false
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Is the old lady on the left side of the empty table behind older couple?"
          }
        ]
      }
    ],
    "stream": false
  }'
```

When there are multiple requests and the tool returns results in the historical context, initiate the request with the following command

```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Get the current weather in Beijing"
          }
        ]
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "content": ""
      },
      {
        "role": "tool",
        "content": [
          {
            "type": "text",
            "text": "location: Beijing,temperature: 23,weather: sunny,unit: c"
          }
        ]
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ]
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
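A request like the one above can also be assembled in Python. The sketch below is not a FastDeploy API; `build_followup_messages` is a hypothetical helper that appends the assistant's tool call and the tool's result to the history before sending the follow-up request:

```python
import json


def build_followup_messages(history, tool_call, tool_result):
    """Append the assistant's tool call and the tool's result to the
    conversation history, producing the messages for the next request."""
    return history + [
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [tool_call],
        },
        {
            "role": "tool",
            "content": [{"type": "text", "text": tool_result}],
        },
    ]


# Example: the server returned a get_weather call; we ran the tool locally
history = [{"role": "user", "content": [{"type": "text", "text": "Get the current weather in Beijing"}]}]
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": json.dumps({"location": "Beijing", "unit": "c"})},
}
messages = build_followup_messages(history, tool_call, "location: Beijing,temperature: 23,weather: sunny,unit: c")
print(json.dumps({"messages": messages, "stream": False}, indent=2))
```

The resulting payload has the same three-message shape as the curl example above: user turn, assistant tool call, tool result.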

123
docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md
Normal file
@@ -0,0 +1,123 @@
[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md)

# ERNIE-4.5-VL-28B-A3B-Thinking

## 1. Environment Preparation
### 1.1 Support Status
The minimum number of cards required for deployment on the following hardware is as follows:

| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:|:------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |

### 1.2 Install FastDeploy

For installation, refer to [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).

## 2. How to Use
### 2.1 Basics: Launching the Service
**Example 1:** Deploy a 32K-context service on a single RTX 4090
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056}' \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 16384 \
  --quantization wint4
```
**Example 2:** Deploy a 128K-context service on two H800 GPUs
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 128 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056}' \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 16384 \
  --quantization wint4
```

These examples are configurations that run stably while also delivering good performance.
If you have further requirements on accuracy or performance, read on.
### 2.2 Advanced: How to Get Better Performance

#### 2.2.1 Evaluate your application scenario and set parameters correctly
> **Context length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can handle.
- **Recommendation:** Longer contexts reduce throughput, so set this according to your actual needs. `ERNIE-4.5-VL-28B-A3B-Thinking` supports contexts of up to **128K** (131072) tokens.

⚠️ Note: Longer contexts significantly increase GPU memory requirements; make sure your hardware has sufficient resources before increasing it.
> **Maximum sequence count**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle concurrently; supports values from 1 to 256.
- **Recommendation:** If you do not know the average number of concurrent sequences in your scenario, we recommend **256**. If it is clearly lower than 256, set a smaller value slightly above the average to further reduce memory usage and improve service performance.

> **Multi-image and multi-video input**
- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request and ensure efficient resource usage.
- **Recommendation:** We recommend setting both the image and video limits to 100 per prompt to balance performance and memory usage.

> **GPU memory fraction available at initialization**
- **Parameter:** `--gpu-memory-utilization`
- **Description:** Controls the GPU memory available to FastDeploy at initialization; defaults to 0.9, i.e. 10% of memory is held in reserve.
- **Recommendation:** Use the default 0.9. If out-of-memory errors occur under load testing, try lowering this value.

#### 2.2.2 Chunked Prefill
- **Parameter:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` lowers peak memory usage and improves service throughput. It is **enabled by default** since version 2.2; for earlier versions, enable it manually (see the 2.1 best-practices documentation).

- **Related configuration:**

  `--max-num-batched-tokens`: limits the maximum number of tokens per chunk. In multimodal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step will exceed this value. We recommend setting it to 384.
#### 2.2.3 **Quantization Precision**
- **Parameter:** `--quantization`

- **Supported precision types:**
  - WINT4 (suitable for most users)
  - WINT8
  - BFLOAT16 (used by default when `--quantization` is not set)

- **Recommendation:**
  - Unless you have extremely strict accuracy requirements, we recommend WINT4 quantization. It significantly reduces memory usage and improves throughput.
  - If you need slightly higher accuracy, try WINT8.
  - Only use BFLOAT16 when your scenario demands the highest accuracy, as it requires much more GPU memory.
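To see why WINT4 frees so much memory relative to BFLOAT16, here is a back-of-the-envelope estimate of weight storage for a roughly 28B-parameter model (the parameter count is an assumption based on the model name; activations and KV cache are excluded and packing overhead is ignored):

```python
PARAMS = 28e9  # approximate total parameter count for a 28B model (assumption)

BYTES_PER_PARAM = {
    "BFLOAT16": 2.0,  # 16-bit floats
    "WINT8": 1.0,     # 8-bit integer weights
    "WINT4": 0.5,     # 4-bit integer weights
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB of weights")  # BF16 ~52, WINT8 ~26, WINT4 ~13
```

This matches the support table in section 1.1: WINT4 weights fit on a single 24G card only with offloading headroom, while BFLOAT16 needs several cards.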

#### 2.2.4 **Tunable Environment Variables**
> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
- **Recommendation:** This is a relatively aggressive optimization that can affect output quality, and we are still validating its impact comprehensively. Try it if you have high performance requirements and can accept a potential impact on quality.
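A minimal sketch of enabling it before launch; the launch command itself follows the examples in section 2.1:

```shell
# Enable the rejection-sampling kernel before starting the service
export FD_SAMPLING_CLASS=rejection

# ...then launch as in section 2.1, e.g.:
# python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking ...
```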

## 3. FAQ

### 3.1 Out of Memory (OOM)
If the service fails to start due to insufficient GPU memory, try the following:
1. Make sure no other processes are using GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce the context length and the maximum sequence count as appropriate;
4. Deploy on more cards, e.g. 2 or 4 GPUs, by setting `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.

If the service starts normally but runs out of memory at runtime, try the following:
1. Lower the fraction of GPU memory available at initialization, i.e. the `--gpu-memory-utilization` value;
2. Increase the number of deployment cards, as above.
100
docs/zh/best_practices/GLM-4-MoE-Text.md
Normal file
@@ -0,0 +1,100 @@
[English](../../best_practices/GLM-4-MoE-Text.md)

# GLM-4.5/4.6 Text Models

## 1. Environment Preparation
### 1.1 Support Status
The minimum number of cards required to deploy each quantization precision of GLM-4.5/4.6 on the following hardware is as follows:

| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 4 | 4 | 4 |
|A800 80GB| 4 | 4 | / |

**Notes:**
1. Append `--tensor-parallel-size 4` to the launch command to change the number of deployment cards
2. For hardware not listed in the table, you can estimate deployability from the GPU memory size
3. FP8 is the recommended quantization precision.
### 1.2 Install FastDeploy

For installation, refer to [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).

## 2. How to Use
### 2.1 Basics: Launching the Service
**Example 1:** Deploy a BF16 model with 16K context on four H100 GPUs
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model zai-org/GLM-4.5-Air \
  --tensor-parallel-size 4 \
  --port 8185 \
  --max-model-len 16384
```

**Example 2:** Deploy an FP8 inference service on four H100 GPUs
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model zai-org/GLM-4.5-Air \
  --tensor-parallel-size 4 \
  --port 8185 \
  --quantization wfp8afp8
```
Where:
- `--quantization`: the quantization strategy used by the model. Different strategies yield different performance and accuracy. Options include `wint8` / `wint4` / `wfp8afp8` (requires Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the deployed service. Larger values allow longer contexts but consume more GPU memory, which may reduce concurrency.

For more parameter meanings and default settings, see the [FastDeploy parameter documentation](../parameters.md).

### 2.2 Advanced: How to Get Better Performance
#### 2.2.1 Evaluate your application scenario and set parameters correctly
Based on your application, estimate the average input length, average output length, and maximum context length. For example, if the average input length is 1000 and the output length is 30000, a setting of 32768 is recommended
- Set `max-model-len` according to the maximum context length

#### 2.2.2 Prefix Caching
**Principle:** The core idea of Prefix Caching is to cache the intermediate computation results (KV Cache) of input sequences to avoid recomputation, accelerating responses for multiple requests that share the same prefix. See [prefix-cache](../features/prefix_caching.md) for details.

**How to enable:**
Since version 2.2 (including the develop branch), Prefix Caching is enabled by default.

For version 2.1 and earlier, enable it manually. `--enable-prefix-caching` enables prefix caching, and `--swap-space` additionally enables a CPU cache on top of the GPU cache, sized in GB; adjust it to your machine. A recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing `--swap-space`.
```
--enable-prefix-caching
--swap-space 50
```
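The `(total machine memory - model size) * 20%` rule of thumb can be computed directly; the machine sizes below are illustrative assumptions, not measurements:

```python
def recommended_swap_space_gb(total_ram_gb: float, model_size_gb: float) -> int:
    """Suggested --swap-space (GB): 20% of the RAM left after loading the model."""
    return int((total_ram_gb - model_size_gb) * 0.2)


# e.g. a machine with 1 TB of RAM hosting a ~350 GB model
print(recommended_swap_space_gb(1024, 350))  # -> 134
```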

#### 2.2.3 Chunked Prefill
**Principle:** A chunking strategy splits prefill requests into small sub-tasks that are batched together with decode requests. This better balances compute-bound (prefill) and memory-bound (decode) operations, improves GPU utilization, and reduces the computation and memory footprint of a single prefill pass, lowering peak memory usage and avoiding out-of-memory issues. See [Chunked Prefill](../features/chunked_prefill.md) for details.

**How to enable:**
Since version 2.2 (including the develop branch), Chunked Prefill is enabled by default.

For version 2.1 and earlier, enable it manually.
```
--enable-chunked-prefill
```

#### 2.2.4 CUDAGraph
**Principle:**
CUDAGraph is a GPU computation acceleration technology provided by NVIDIA that captures a sequence of CUDA operations into a graph structure for efficient execution and optimization. The core idea is to encapsulate a series of GPU compute and memory operations into a replayable graph, reducing CPU-GPU communication overhead, lowering kernel-launch latency, and improving overall computational performance.

**How to enable:**
Before version 2.3, enable it with `--use-cudagraph`.

Since version 2.3, CUDAGraph is enabled by default in some scenarios; for features not yet compatible with CUDAGraph (speculative decoding, reinforcement-learning training, multimodal model inference) it is automatically disabled.
Note:
- Usually no extra parameters are needed, but CUDAGraph introduces some additional GPU memory overhead, which may need tuning in memory-constrained scenarios. For detailed parameter tuning, see the [GraphOptimizationBackend](../features/graph_optimization.md) configuration documentation

#### 2.2.5 Rejection Sampling
**Principle:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling; the gain is noticeable for small models.

**How to enable:**
Add the following environment variable before launching
```
export FD_SAMPLING_CLASS=rejection
```

## 3. FAQ
If you encounter problems, consult the [FAQ](./FAQ.md).
@@ -26,6 +26,7 @@
4. Make sure all nodes can resolve each other's hostnames

* Online inference launch example:

```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -39,6 +40,7 @@
```

* Offline launch example:

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
```
@@ -4,13 +4,15 @@

A thinking model returns a `reasoning_content` field in its output, containing the chain-of-thought steps that lead to the final conclusion.

## Models that currently support chain-of-thought
| Model name | Parser name | Thinking on by default |
|---------------|-------------|---------|
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
## Models that currently support chain-of-thought
| Model name | Parser name | Thinking on by default | Tool calling | Thinking toggle parameter |
|---------------|-------------|---------|---------|--------- |
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✅ | ❌ | "chat_template_kwargs":{"enable_thinking": true/false}|
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✅ | ❌ |"chat_template_kwargs":{"enable_thinking": true/false}|
| baidu/ERNIE-4.5-21B-A3B-Thinking | ernie-x1 | ✅ (thinking cannot be disabled) | ✅|❌|
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ernie-45-vl-thinking | ✅ (disabling not recommended) | ✅|"chat_template_kwargs": {"options": {"thinking_mode": "open/close"}}|

A thinking model requires a reasoning parser so the thinking content can be extracted. Thinking mode can be disabled with the `"enable_thinking": false` parameter.
A thinking model requires a reasoning parser so the thinking content can be extracted. Refer to each model's `thinking toggle parameter` to disable thinking mode.

Interfaces that support toggling thinking mode:
1. `/v1/chat/completions` requests in the OpenAI service.
@@ -33,7 +35,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
  --reasoning-parser ernie-45-vl
```

Next, send a `chat completion` request to the model
Next, send a `chat completion` request to the model, using the `baidu/ERNIE-4.5-VL-28B-A3B-Paddle` model as an example

```bash
curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
@@ -83,3 +85,76 @@ for chunk in chat_response:
print("\n")

```
## Tool Calling
If the model supports tool calling, chain-of-thought parsing of the model's reply (`reasoning_content`) and tool parsing (`tool-call-parser`) can be enabled at the same time. Tool content is parsed only from the reply `content` and does not affect the chain-of-thought content.
For example,
```bash
curl -X POST "http://0.0.0.0:8390/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Beijing today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ]
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }],
    "stream": false
  }'
```
An example of the returned result:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "reasoning_content": "The user is asking..",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-311b9bda34274722afc654c55c8ce6a0",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Beijing\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```
For more on tool calling, see [Tool Calling](./tool_calling.md)
236
docs/zh/features/tool_calling.md
Normal file
@@ -0,0 +1,236 @@
# Tool_Calling

This document describes how to configure the FastDeploy server to use a tool parser, and how to invoke tools from the client.

## Tool parsers for the Ernie model family
| Model name | Parser name |
|---------------|-------------|
| baidu/ERNIE-4.5-21B-A3B-Thinking | ernie-x1 |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ernie-45-vl-thinking |

## Quick Start

### Launch FastDeploy with parsers

Launch the server with both a reasoning parser and a tool parser. The example below uses ERNIE-4.5-21B-A3B with the ernie-x1 reasoning parser and the ernie-x1 tool-call parser from the fastdeploy directory, so that the model's thinking content, reply content, and tool-call information can all be parsed:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model /models/ERNIE-4.5-21B-A3B \
  --port 8000 \
  --reasoning-parser ernie-x1 \
  --tool-call-parser ernie-x1
```

### Example: triggering a tool call

Construct a request containing tools to trigger a tool call by the model:

```bash
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Beijing today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. Beijing."
              },
              "unit": {
                "type": "string",
                "enum": ["c", "f"],
                "description": "Temperature unit: c = Celsius, f = Fahrenheit"
              }
            },
            "required": ["location", "unit"],
            "additionalProperties": false
          },
          "strict": true
        }
      }
    ]
  }'
```

Example output is shown below. The model's thinking content `reasoning_content` and tool-call information `tool_calls` are parsed successfully, the current reply `content` is empty, and `finish_reason` is `tool_calls`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "User wants to ... ",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-bc90641c67e44dbfb981a79bc986fbe5",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Beijing\", \"unit\": \"c\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```

## Parallel Tool Calls

If the model can generate multiple parallel tool calls, FastDeploy returns a list:

```bash
tool_calls=[
    {"id": "...", "function": {...}},
    {"id": "...", "function": {...}}
]
```
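On the client side, such a list can be executed with a simple loop. The sketch below is not a FastDeploy API; the `dispatch` registry and the fake weather tool are illustrative assumptions:

```python
import json


def run_tool_calls(tool_calls, dispatch):
    """Execute each parallel tool call via a name->callable registry and
    collect (id, result) pairs for the follow-up `tool` messages."""
    results = []
    for call in tool_calls:
        fn = dispatch[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append((call["id"], fn(**args)))
    return results


# Example with a fake local weather tool
dispatch = {"get_weather": lambda location, unit: f"{location}: 23 {unit}"}
tool_calls = [
    {"id": "call_1", "function": {"name": "get_weather", "arguments": json.dumps({"location": "Beijing", "unit": "c"})}},
    {"id": "call_2", "function": {"name": "get_weather", "arguments": json.dumps({"location": "Shanghai", "unit": "c"})}},
]
results = run_tool_calls(tool_calls, dispatch)
print(results)  # -> [('call_1', 'Beijing: 23 c'), ('call_2', 'Shanghai: 23 c')]
```

Each `(id, result)` pair then becomes a `tool`-role message in the next request, as shown in the following section.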

## Tool-call results in the conversation history

If previous turns of the conversation contain tool calls, construct the request as follows:

```bash
curl -X POST "http://0.0.0.0:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Hello, what is the weather like in Beijing?"
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "thoughts": "The user wants to check today's weather in Beijing."
      },
      {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": {
          "type": "text",
          "text": "{\"location\": \"Beijing\",\"temperature\": \"23\",\"weather\": \"sunny\",\"unit\": \"c\"}"
        }
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a given location.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. Beijing"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ],
                "description": "Temperature unit: c = Celsius, f = Fahrenheit"
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }
    ]
  }'
```
The parsed model output is shown below, containing the thinking content `reasoning_content` and the reply `content`, with `finish_reason` set to `stop`:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in Beijing today is sunny, with a temperature of 23 degrees Celsius.",
        "reasoning_content": "The user wants...",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ]
}
```

## Writing a Custom Tool Parser
FastDeploy supports custom tool-parser plugins; refer to the existing `tool parser` implementations under `fastdeploy/entrypoints/openai/tool_parser` when creating one.

A custom parser needs to implement:

```python
# import the required packages
# register the tool parser to ToolParserManager
@ToolParserManager.register_module("my-parser")
class ToolParser:
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    # implement the tool parse for non-stream call
    def extract_tool_calls(self, model_output: str, request: ChatCompletionRequest) -> ExtractToolCallInformation:
        return ExtractedToolCallInformation(tools_called=False, tool_calls=[], content=model_output)

    # implement the tool call parse for stream call
    def extract_tool_calls_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
        request: ChatCompletionRequest,
    ) -> DeltaMessage | None:
        return delta
```

Enable the custom parser as follows:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model <model path> \
  --tool-parser-plugin <path to the custom tool parser> \
  --tool-call-parser my-parser
```

---
326
docs/zh/get_started/ernie-4.5-vl-thinking.md
Normal file
@@ -0,0 +1,326 @@
[English](../../get_started/ernie-4.5-vl-thinking.md)

# ERNIE-4.5-VL-28B-A3B-Thinking Multimodal Thinking Model

This document explains how to deploy the ERNIE-4.5-VL-28B-A3B-Thinking multimodal thinking model, which supports conversational interaction with multimodal data as well as tool calling (including with multimodal data). Before deployment, make sure your hardware environment meets the following requirements:

- GPU driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 1x 80G A/H GPU

For FastDeploy installation, refer to the [installation documentation](./installation/README.md).

## Prepare the Model
Specify ```--model baidu/ERNIE-4.5-VL-28B-A3B-Thinking``` at deployment time to automatically download the model from AIStudio, with resumable downloads supported. You can also download the model yourself from other sources; note that FastDeploy requires models in Paddle format. See the [supported model list](../supported_models.md) for details.

## Launch the Service

Run the following command to launch the service; for the launch options, refer to the [parameter documentation](../parameters.md)

```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --port 8180 \
  --quantization wint8 \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056}'
```

## Send Requests to the Service
After running the launch command, the service has started successfully when the terminal prints the following information.

```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [13909]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```

FastDeploy provides a health-check endpoint to determine the service's startup status; the service has started successfully when the following command returns ```HTTP/1.1 200 OK```.

```shell
curl -i http://0.0.0.0:8180/health
```

Initiate a service request with the following command

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Adapt Li Bai'\''s \"Silent Night Thoughts\" into a modern poem"}
    ]
  }'
```

When the input contains images, initiate a request with the following command

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "Which era does the cultural relic in the picture belong to?"}
      ]}
    ]
  }'
```

When the input contains a video, initiate a request with the following command

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "video_url", "video_url": {"url": "https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
        {"type": "text", "text": "How many apples are there in the picture?"}
      ]}
    ]
  }'
```

When the input contains a tool call, initiate a request with the following command

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "image_zoom_in_tool",
          "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
          "parameters": {
            "type": "object",
            "properties": {
              "bbox_2d": {
                "type": "array",
                "items": {
                  "type": "number"
                },
                "minItems": 4,
                "maxItems": 4,
                "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
              },
              "label": {
                "type": "string",
                "description": "The name or label of the object in the specified bounding box (optional)."
              }
            },
            "required": [
              "bbox_2d"
            ]
          },
          "strict": false
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Is the old lady on the left side of the empty table behind older couple?"
          }
        ]
      }
    ],
    "stream": false
  }'
```

For multi-turn requests whose history contains tool results, initiate a request with the following command
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Get the current weather in Beijing"
          }
        ]
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "content": ""
      },
      {
        "role": "tool",
        "content": [
          {
            "type": "text",
            "text": "location: Beijing,temperature: 23,weather: sunny,unit: c"
          }
        ]
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ]
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```

The FastDeploy service API is OpenAI-compatible; you can initiate a request with the following Python code, shown here with streaming enabled.

```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "Which era does the cultural relic in the picture belong to?"},
        ]},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```

## Model Output
In the generated content, the thinking content is in the `reasoning_content` field, the model's reply is in the `content` field, and tool calls are in the `tool_calls` field.

Example of a non-streaming result without a tool call:
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nThe cultural relic in the image is a **Northern Dynasties** Buddhist statue (circa 420-589 AD). \n\nJudging from the style, including the shape of the mandorla, the Buddha's facial features (the benevolent expression and facial contours), the clothing (a robe draped over both shoulders), the arrangement of the attendant bodhisattvas, and the overall carving technique (such as the treatment of the lotus petals on the pedestal and the decorative patterns in the mandorla), it matches the typical artistic characteristics of Northern Dynasties Buddhist statuary (covering the Northern Wei, Eastern Wei, Western Wei, Northern Qi, and Northern Zhou regimes). The Northern Dynasties were a period when Buddhist art flourished in northern China; stone statues of this kind blend the Buddhist artistic traditions of the Western Regions with Central Plains aesthetics, making them important physical material for studying the religion, art, and society of that period.",
        "multimodal_content": null,
        "reasoning_content": "The user needs the date of this Buddha statue. First look at the style: the mandorla and the statue's features. This should be a Buddhist statue of the Northern Dynasties (Northern Wei, Northern Zhou, etc.), in particular a stone statue of that period: the mandorla carries elaborate ornamentation, and the Buddha's face, clothing (robe over both shoulders), attendant bodhisattvas and other elements fit. Buddhist statuary flourished in the Northern Dynasties, especially after the Northern Wei moved its capital, when the style transitioned from the Western Regions toward the Central Plains; the shape of this statue's mandorla and its carving technique (such as the lotus-pedestal style and the arrangement of attendants) match the Northern Dynasties (circa 420-589). Confirm the typical features: is the mandorla boat-shaped or flame-shaped? Here it is roughly boat-shaped, decorated with lotus petals and flying apsaras; the Buddha sits cross-legged with hands in the abhaya mudra, flanked by attendant bodhisattvas, all common elements of Northern Dynasties stone statues. So it is judged to be the Northern Dynasties period (420-589 AD).\n",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1290,
    "total_tokens": 1681,
    "completion_tokens": 391,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "image_tokens": 1240,
      "video_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 217,
      "image_tokens": 0
    }
  }
}
```

Example of a non-streaming result with a tool call, where the `content` field is empty and `finish_reason` is `tool_calls`:

```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "multimodal_content": null,
        "reasoning_content": "What immediately stands out is that I need to determine the spatial relationship between the old lady, the empty table, and the older couple. The original image might not provide enough detail to make this determination clearly, so I should use the image_zoom_in_tool to focus on the relevant area where these elements are located.\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-dd0ef62027cf409c8f013af65f88adc3",
            "type": "function",
            "function": {
              "name": "image_zoom_in_tool",
              "arguments": "{\"bbox_2d\": [285, 235, 999, 652]}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 280,
    "total_tokens": 397,
    "completion_tokens": 117,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "image_tokens": 0,
      "video_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 66,
      "image_tokens": 0
    }
  }
}
```
|
||||
@@ -22,7 +22,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest

### Start the container

```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker run -itd --name paddle_infer --network host -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker exec -it paddle_infer bash
```

@@ -432,7 +432,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest

### Start the container

```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker run -itd --name paddle_infer --network host -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker exec -it paddle_infer bash
```

@@ -441,8 +441,8 @@ docker exec -it paddle_infer bash

### Install paddle

```bash
pip3 install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip3 install paddle-iluvatar-gpu==3.0.0.dev20250926 -i https://www.paddlepaddle.org.cn/packages/nightly/ixuca/
pip3 install paddlepaddle==3.3.0.dev20251028 -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip3 install paddle-iluvatar-gpu==3.0.0.dev20251029 -i https://www.paddlepaddle.org.cn/packages/nightly/ixuca/
```
Get the latest Paddle release from [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)

@@ -556,3 +556,80 @@ generated_text=

这件佛像具有典型的北齐风格,佛像结跏趺坐于莲花座上,身披通肩袈裟,面部圆润,神态安详,体现了北齐佛教艺术的独特魅力。
```

## Testing thinking models

### ERNIE-4.5-21B-A3B-Thinking
Following the [gpu doc](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md), the commands are as follows:

server:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-21B-A3B-Thinking \
  --port 8180 \
  --load-choices "default_v1" \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --quantization wint8 \
  --block-size 16 \
  --reasoning-parser ernie_x1 \
  --tool-call-parser ernie_x1 \
  --max-num-seqs 8
```

client:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write me a poem about large language model."}
    ]
  }'
```

### ERNIE-4.5-VL-28B-A3B
Following the [gpu doc](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/get_started/ernie-4.5-vl.md), set `"chat_template_kwargs":{"enable_thinking": true}`; the commands are as follows:

server:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --quantization wint8 \
  --block-size 16 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --max-num-seqs 8
```

client:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type": "text", "text": "From which era does the artifact in the image originate?"}
      ]}
    ],
    "chat_template_kwargs":{"enable_thinking": true}
  }'
```

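The same multimodal request can be assembled from Python before sending it to the server. A minimal sketch that builds the OpenAI-style payload used in the curl example above (the URL and prompt are taken from that example; the helper name is an illustration, not a FastDeploy API):

```python
import json

def build_chat_payload(text, image_url=None, enable_thinking=True):
    """Assemble an OpenAI-style chat payload with optional image input."""
    content = []
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    content.append({"type": "text", "text": text})
    return {
        "messages": [{"role": "user", "content": content}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_chat_payload(
    "From which era does the artifact in the image originate?",
    image_url="https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg",
)
# This JSON string is what the curl example posts to /v1/chat/completions.
print(json.dumps(payload, ensure_ascii=False))
```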
@@ -14,10 +14,10 @@

## 1. Pre-built Docker Installation (Recommended)

**Note**: The image below only supports GPUs with SM 80/90 architectures (A800/H800, etc.). If you are deploying on GPUs with SM 86/69 architectures such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 architectures as specified below.
**Note**: The image below only supports GPUs with SM 80/90 architectures (A800/H800, etc.). If you are deploying on GPUs with SM 86/89 architectures such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 architectures as specified below.

``` shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.3.0-rc0
```

## 2. Pre-built Pip Installation

@@ -26,7 +26,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12

``` shell
# Install stable release
python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# Install latest Nightly build
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
@@ -38,7 +38,7 @@ python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/

```
# Install the stable fastdeploy release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install fastdeploy-gpu==2.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install the latest Nightly build of fastdeploy
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@@ -48,7 +48,7 @@ python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages

```
# Install the stable fastdeploy release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install fastdeploy-gpu==2.3.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install the latest Nightly build of fastdeploy
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@@ -70,7 +70,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu; see [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/) for details

``` shell
python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then clone the source code and build from source

@@ -38,6 +38,8 @@
| ```use_cudagraph``` | `bool` | __[Deprecated]__ CUDAGraph is enabled by default since v2.3; see [graph_optimization.md](./features/graph_optimization.md) for details |
| ```graph_optimization_config``` | `dict[str]` | Parameters for computation-graph optimization; default '{"use_cudagraph":true, "graph_opt_level":0}'; see [graph_optimization.md](./features/graph_optimization.md) for details|
| ```disable_custom_all_reduce``` | `bool` | Disable custom all-reduce; default False |
| ```use_internode_ll_two_stage``` | `bool` | Whether to use two-stage communication in DeepEP MoE, default: False |
| ```disable_sequence_parallel_moe``` | `bool` | Disable the sequence-parallel optimization in TP+EP, default: False |
| ```splitwise_role``` | `str` | Whether to enable splitwise inference; default mixed; supported values ["mixed", "decode", "prefill"] |
| ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only needed for single-node PD disaggregation); default None |
| ```guided_decoding_backend``` | `str` | Guided decoding backend to use; supports `auto`, `xgrammar`, `off`; default `off` |
@@ -47,6 +49,7 @@
| ```enable_expert_parallel``` | `bool` | Whether to enable expert parallelism |
| ```enable_logprob``` | `bool` | Whether to return logprobs for output tokens. If logprobs are not used, this flag can be omitted at startup. |
| ```logprobs_mode``` | `str` | What to return in logprobs. Supported modes: `raw_logprobs`, `processed_logprobs`, `raw_logits`, `processed_logits`. "processed" means the logprobs are computed from logits after temperature, penalty, and bad-words processing.|
| ```max_logprobs``` | `int` | Maximum number of logprobs the service can return; default 20. -1 means the vocabulary size. |
| ```served_model_name``` | `str` | Model name used in the API; if unspecified, it is the same as the --model argument |
| ```revision``` | `str` | Git revision (branch name or tag) of the model to use when downloading it automatically |
| ```chat_template``` | `str` | Chat template used by the model; accepts a string or a file path; default None; if unspecified, the model's default template is used |
@@ -55,6 +58,7 @@
| ```load_choices``` | `str` | Weight loader; "default" is used by default; loading torch weights / accelerated loading requires "default_v1"|
| ```max_encoder_cache``` | `int` | Maximum number of tokens in the encoder cache (0 to disable). |
| ```max_processor_cache``` | `int` | Maximum size of the processor cache in GiB (0 to disable). |
| ```api_key``` |`dict[str]`| API keys checked against the request header; multiple keys may be passed; equivalent to the `FD_API_KEY` environment variable, and takes precedence over it|

## 1. How are KVCache allocation, ```num_gpu_blocks_override```, and ```block_size``` related?

@@ -78,3 +82,39 @@ During FastDeploy inference, GPU memory is occupied by ```model weights```, ```pre-allocated KVCache```
When `enable_chunked_prefill` is enabled, the service processes long input sequences in dynamic chunks, significantly improving GPU utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the number of tokens batched in the prefill phase (it limits the token count of a single prefill), so the `max_num_partial_prefills` parameter is introduced to cap the number of chunked batches processed concurrently.

To prioritize the scheduling of short requests, the `max_long_partial_prefills` and `long_prefill_token_threshold` parameters were added. The former limits the number of long requests in a single prefill batch; the latter defines the token threshold above which a request counts as long. The system reserves batch space for short requests first, reducing short-request latency under mixed workloads while keeping overall throughput stable.

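The interaction of these three knobs can be sketched as a simple admission filter. This is a simplified illustration of the scheduling idea, not FastDeploy's actual scheduler code, and the parameter values are arbitrary examples:

```python
def admit_prefills(waiting, max_num_partial_prefills, max_long_partial_prefills,
                   long_prefill_token_threshold):
    """Pick requests (given by their prompt token counts) for the next
    prefill batch, capping both the total batch size and the number of
    long requests it may contain."""
    batch, long_count = [], 0
    for req_tokens in waiting:
        if len(batch) >= max_num_partial_prefills:
            break
        if req_tokens >= long_prefill_token_threshold:
            if long_count >= max_long_partial_prefills:
                continue  # skip further long requests once the cap is hit
            long_count += 1
        batch.append(req_tokens)
    return batch

# Four waiting requests; at most 3 partial prefills, at most 1 long (>= 2048 tokens):
# the 8192-token request is deferred so the short 200-token request gets in.
print(admit_prefills([4096, 100, 8192, 200], 3, 1, 2048))  # [4096, 100, 200]
```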
## 4. Usage of the ```api_key``` parameter

Multiple values can be configured via startup arguments, which take precedence over the environment variable.
```bash
--api-key "key1"
--api-key "key2"
```
Multiple values can be configured via the environment variable, separated by commas
```bash
export FD_API_KEY="key1,key2"
```

When sending requests with curl, add the API key header so the request can be validated. Matching any configured ```api_key``` is sufficient.
```bash
curl -X POST "http://0.0.0.0:8265/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer key1" \
  -d '{
    "messages": [
      {"role": "user", "content":"你好"}
    ],
    "stream": false,
    "return_token_ids": true,
    "chat_template_kwargs": {"enable_thinking": true}
  }'
```
The `key1` following `Authorization: Bearer` is parsed and validated.

When sending requests with the openai SDK, pass the `api_key` argument.
```python
client = OpenAI(
    api_key="your-api-key-here",
    base_url="http://localhost:8000/v1"
)
```

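On the server side, the check amounts to comparing the bearer token against the configured key set, with CLI keys taking precedence over `FD_API_KEY`. A minimal sketch of that logic (an illustration of the documented behavior, not FastDeploy's actual implementation):

```python
import os

def load_api_keys(cli_keys=None):
    """CLI-provided keys take precedence over FD_API_KEY (comma-separated)."""
    if cli_keys:
        return set(cli_keys)
    env = os.environ.get("FD_API_KEY", "")
    return {k.strip() for k in env.split(",") if k.strip()}

def is_authorized(auth_header, valid_keys):
    """Accept the request if the bearer token matches any configured key."""
    if not valid_keys:
        return True  # no keys configured: authentication disabled
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header[len("Bearer "):] in valid_keys

keys = load_api_keys(["key1", "key2"])
print(is_authorized("Bearer key1", keys))   # True
print(is_authorized("Bearer wrong", keys))  # False
```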
@@ -33,13 +33,14 @@ python -m fastdeploy.entrypoints.openai.api_server \

|Model|DataType|Example Models|
|-|-|-|
|⭐ERNIE|BF16\WINT4\WINT8\W4A8C8\WINT2\FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Paddle<br> [Quick Deployment](./get_started/ernie-4.5.md)   [Best Practices](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);<br>baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Base-Paddle;<br>[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);<br>baidu/ERNIE-4.5-21B-A3B-Base-Paddle;<br>baidu/ERNIE-4.5-21B-A3B-Thinking;<br>baidu/ERNIE-4.5-0.3B-Paddle<br> [Quick Deployment](./get_started/quick_start.md)   [Best Practices](./best_practices/ERNIE-4.5-0.3B-Paddle.md);<br>baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.|
|⭐ERNIE|BF16\WINT4\WINT8\W4A8C8\WINT2\FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Paddle<br> [Quick Deployment](./get_started/ernie-4.5.md)   [Best Practices](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);<br>baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;<br>baidu/ERNIE-4.5-300B-A47B-Base-Paddle;<br>[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);<br>baidu/ERNIE-4.5-21B-A3B-Base-Paddle;<br>baidu/ERNIE-4.5-21B-A3B-Thinking;<br>[baidu/ERNIE-4.5-VL-28B-A3B-Thinking](./get_started/ernie-4.5-vl-thinking.md);<br>baidu/ERNIE-4.5-0.3B-Paddle<br> [Quick Deployment](./get_started/quick_start.md)   [Best Practices](./best_practices/ERNIE-4.5-0.3B-Paddle.md);<br>baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.|
|⭐QWEN3-MOE|BF16/WINT4/WINT8/FP8|Qwen/Qwen3-235B-A22B;<br>Qwen/Qwen3-30B-A3B, etc.|
|⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
|⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
|⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32B, etc.|
|⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
|⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
|⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6<br> [Best Practices](./best_practices/GLM-4-MoE-Text.md), etc.|

## Multimodal Language Models

@@ -47,7 +48,7 @@ python -m fastdeploy.entrypoints.openai.api_server \

|Model|DataType|Example Models|
|-|-|-|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [Quick Deployment](./get_started/ernie-4.5-vl.md)   [Best Practices](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [Quick Deployment](./get_started/quick_start_vl.md)   [Best Practices](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [Quick Deployment](./get_started/ernie-4.5-vl.md)   [Best Practices](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [Quick Deployment](./get_started/quick_start_vl.md)   [Best Practices](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Thinking<br> [Quick Deployment](./get_started/ernie-4.5-vl-thinking.md)  [Best Practices](./best_practices/ERNIE-4.5-VL-28B-A3B-Thinking.md) ;|
| PaddleOCR-VL |BF16/WINT4/WINT8| PaddlePaddle/PaddleOCR-VL<br>  [Best Practices](./best_practices/PaddleOCR-VL-0.9B.md) ;|
| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|

@@ -19,9 +19,9 @@
|ERNIE-4.5-0.3B|128K|WINT8|1 (recommended)|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --load-choices "default"|2.3.0|
|ERNIE-4.5-300B-A47B-W4A8C8-TP4|32K|W4A8|4|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "W4A8" \ <br> --gpu-memory-utilization 0.9 \ <br> --load-choices "default"|2.3.0|
|ERNIE-4.5-VL-28B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 10 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --load-choices "default"|2.3.0|
|ERNIE-4.5-VL-424B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-424B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 8 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --load-choices "default"|2.3.0|
|ERNIE-4.5-VL-424B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-424B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 8 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --gpu-memory-utilization 0.7 \ <br> --load-choices "default"|2.3.0|
|PaddleOCR-VL-0.9B|32K|BF16|1|export FD_ENABLE_MAX_PREFILL=1 <br>export XPU_VISIBLE_DEVICES="0" # pick any single card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/PaddleOCR-VL \ <br> --port 8188 \ <br> --metrics-port 8181 \ <br> --engine-worker-queue-port 8182 \ <br> --max-model-len 16384 \ <br> --max-num-batched-tokens 16384 \ <br> --gpu-memory-utilization 0.8 \ <br> --max-num-seqs 256|2.3.0|
|ERNIE-4.5-VL-28B-A3B-Thinking|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 131072 \ <br> --max-num-seqs 32 \ <br> --engine-worker-queue-port 8189 \ <br> --metrics-port 8190 \ <br> --cache-queue-port 8191 \ <br> --reasoning-parser ernie-45-vl-thinking \ <br> --tool-call-parser ernie-45-vl-thinking \ <br> --mm-processor-kwargs \ '{"image_max_pixels": 12845056 }' \ <br> --load-choices "default_v1"|2.3.0|
|ERNIE-4.5-VL-28B-A3B-Thinking|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 131072 \ <br> --max-num-seqs 32 \ <br> --engine-worker-queue-port 8189 \ <br> --metrics-port 8190 \ <br> --cache-queue-port 8191 \ <br> --reasoning-parser ernie-45-vl-thinking \ <br> --tool-call-parser ernie-45-vl-thinking \ <br> --mm-processor-kwargs '{"image_max_pixels": 12845056 }' \ <br> --load-choices "default_v1"|2.3.0|

## Quick Start

@@ -248,7 +248,7 @@ print('\n')
Deploy the ERNIE-4.5-VL-28B-A3B-Thinking model on a single-card P800 server with WINT8 precision and a 128K context

```bash
export XPU_VISIBLE_DEVICES="0"# pick any single card
export XPU_VISIBLE_DEVICES="0" # pick any single card
python -m fastdeploy.entrypoints.openai.api_server \
  --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \
  --port 8188 \
@@ -265,53 +265,166 @@ python -m fastdeploy.entrypoints.openai.api_server \
--load-choices "default_v1"
```

### Request the service
Send a request to the service with the following command
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "把李白的静夜思改写为现代诗"}
    ]
  }'
```
When the input contains an image, send the request as follows
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
        {"type":"text", "text":"图中的文物属于哪个年代?"}
      ]}
    ]
  }'
```
When the input contains a video, send the request as follows
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
        {"type":"text", "text":"画面中有几个苹果?"}
      ]}
    ]
  }'
```
When the input contains tool definitions, send the request as follows
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "image_zoom_in_tool",
          "description": "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label.",
          "parameters": {
            "type": "object",
            "properties": {
              "bbox_2d": {
                "type": "array",
                "items": {
                  "type": "number"
                },
                "minItems": 4,
                "maxItems": 4,
                "description": "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, and the values of x1, y1, x2, y2 are all normalized to the range 0–1000 based on the original image dimensions."
              },
              "label": {
                "type": "string",
                "description": "The name or label of the object in the specified bounding box (optional)."
              }
            },
            "required": [
              "bbox_2d"
            ]
          },
          "strict": false
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Is the old lady on the left side of the empty table behind older couple?"
          }
        ]
      }
    ],
    "stream": false
  }'
```
For a multi-turn request whose history contains tool results, send the request as follows
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d $'{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Get the current weather in Beijing"
          }
        ]
      },
      {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": {
                "location": "Beijing",
                "unit": "c"
              }
            }
          }
        ],
        "content": ""
      },
      {
        "role": "tool",
        "content": [
          {
            "type": "text",
            "text": "location: Beijing,temperature: 23,weather: sunny,unit: c"
          }
        ]
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Determine weather in my location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "c",
                  "f"
                ]
              }
            },
            "additionalProperties": false,
            "required": [
              "location",
              "unit"
            ]
          },
          "strict": true
        }
      }
    ],
    "stream": false
  }'
```
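The multi-turn flow above, appending the assistant's tool call and then the tool's result to the history before re-querying, can be sketched in Python as follows. This is a schematic of the message layout (names and values are taken from the curl example), not a FastDeploy API:

```python
import json

def append_tool_round(messages, tool_call, tool_result_text):
    """Record one tool round: the assistant's call, then the tool's result."""
    messages.append({
        "role": "assistant",
        "tool_calls": [tool_call],
        "content": "",
    })
    messages.append({
        "role": "tool",
        "content": [{"type": "text", "text": tool_result_text}],
    })
    return messages

history = [{"role": "user",
            "content": [{"type": "text", "text": "Get the current weather in Beijing"}]}]
call = {"id": "call_1", "type": "function",
        "function": {"name": "get_weather",
                     "arguments": {"location": "Beijing", "unit": "c"}}}
history = append_tool_round(history, call,
                            "location: Beijing,temperature: 23,weather: sunny,unit: c")
# This matches the "messages" array posted in the multi-turn curl example.
print(json.dumps({"messages": history, "stream": False}, ensure_ascii=False))
```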