Sync v2.0 version of code to github repo

This commit is contained in:
Jiang-Jia-Jun
2025-06-29 23:29:37 +00:00
parent d151496038
commit 92c2cfa2e7
597 changed files with 78776 additions and 22905 deletions

View File

@@ -0,0 +1,199 @@
# ERNIE-4.5-VL Multimodal Model
This document explains how to deploy the ERNIE-4.5-VL multimodal model, which lets users interact with the model using multimodal data (with reasoning support). Before starting deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 8x 80GB GPUs (A/H series)
Refer to the [Installation Guide](./installation/README.md) for FastDeploy setup.
>💡 **Note**: All ERNIE multimodal models support thinking mode, which can be enabled by setting ```enable_thinking``` when initiating a service request (see the example below).
## Prepare the Model
Specify ```--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle``` during deployment to automatically download the model from AIStudio with resumable downloads. You can also manually download the model from other sources. Note that FastDeploy requires Paddle-format models. For more details, see [Supported Models](../supported_models.md).
## Launch the Service
Execute the following command to start the service. For parameter configurations, refer to [Parameter Guide](../parameters.md).
>💡 **Note**: Since the model size is 424B-A47B, on a machine with 8x 80GB GPUs, specify ```--quantization wint4``` (wint8 is also supported).
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8182 --metrics-port 8183 \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl
```
## Request the Service
After launching, the service is ready when the following logs appear:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
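The same check can be scripted; a minimal sketch using the `requests` package (assumed to be installed):
```python
import requests

# Query the health endpoint; HTTP 200 means the service is ready.
resp = requests.get("http://0.0.0.0:8180/health", timeout=5)
print(resp.status_code)
```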
### cURL Request
Send requests as follows:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thoughts' as a modern poem"}
]
}'
```
For image inputs:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type":"text", "text":"From which era does the artifact in the image originate?"}
]}
]
}'
```
For video inputs:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
{"type":"text", "text":"How many apples are in the scene?"}
]}
]
}'
```
ERNIE-4.5-VL supports reasoning mode (enabled by default). Disable it as follows:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "From which era does the artifact in the image originate?"}
]}
],
"metadata": {"enable_thinking": false}
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for streaming requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "From which era does the artifact in the image originate?"},
]},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
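To disable thinking mode from the Python client as well, the same `metadata` field used in the cURL example above can be attached to the request. A minimal sketch; recent `openai` client versions accept `metadata=` directly, otherwise passing it via `extra_body` is an assumption to verify against your client version:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="null")
# Attach the same metadata field shown in the cURL example to turn thinking off.
# If your openai client version rejects `metadata=`, use
# extra_body={"metadata": {"enable_thinking": False}} instead.
response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Describe the artifact in one sentence."}],
    metadata={"enable_thinking": False},
)
print(response.choices[0].message.content)
```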
## Model Output
Example output with reasoning (reasoning content in `reasoning_content`, response in `content`):
```json
{
"id": "chatcmpl-c4772bea-1950-4bf4-b5f8-3d3c044aab06",
"object": "chat.completion",
"created": 1750236617,
"model": "default",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The artifact in the image ...",
"reasoning_content": "The user asks about ..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1260,
"total_tokens": 2042,
"completion_tokens": 782
}
}
```
Example output without reasoning:
```json
{
"id": "chatcmpl-4d508b96-0ea1-4430-98a6-ae569f74f25b",
"object": "chat.completion",
"created": 1750236495,
"model": "default",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The artifact is a ...",
"reasoning_content": null
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1265,
"total_tokens": 1407,
"completion_tokens": 142
}
}
```
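To read both fields programmatically, a minimal sketch that posts the same request directly and parses the JSON shown above (assuming the `requests` package is installed):
```python
import requests

payload = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "From which era does the artifact in the image originate?"},
        ]}
    ]
}
resp = requests.post("http://0.0.0.0:8180/v1/chat/completions", json=payload, timeout=300)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # None when thinking is disabled
print("answer:", message["content"])
```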

View File

@@ -0,0 +1,89 @@
# ERNIE-4.5 Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 4x 80GB GPUs (A/H series)
For FastDeploy installation, refer to the [Installation Guide](./installation/README.md).
## Prepare the Model
Specify `--model baidu/ERNIE-4.5-300B-A47B-Paddle` during deployment to automatically download the model from AIStudio with support for resumable transfers. Alternatively, you can download the model manually from other sources. Note that FastDeploy requires the model in Paddle format. For more details, see the [Supported Models List](../supported_models.md).
## Start the Service
>💡 **Note**: Since the model size is 300B-A47B, on a machine with 8x 80GB GPUs, specify `--quantization wint4` (wint8 is also supported; wint4 requires 4 GPUs and wint8 requires 8 GPUs).
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8182 --metrics-port 8183 \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32
```
## Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write me a poem about large language model."}
]
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Write me a poem about large language model."},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
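Besides `/v1/chat/completions`, the startup logs also show a `/v1/completions` endpoint. A minimal sketch of calling it with the same client (standard OpenAI completions fields):
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="null")
# Plain (non-chat) completion request against /v1/completions.
response = client.completions.create(
    model="null",
    prompt="Write me a poem about large language model.",
    stream=False,
)
print(response.choices[0].text)
```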

View File

@@ -0,0 +1,129 @@
# Running ERNIE-4.5-21B-A3B with FastDeploy
The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.
FastDeploy has deeply adapted and optimized the ernie-4_5-21b-a3b-bf16-paddle model for the Enflame S60, achieving a unified inference interface between GCU and GPU. This allows seamless migration of inference tasks without code modifications.
## 🚀 Quick Start 🚀
### 0. Machine Preparation
Before starting, prepare a machine equipped with Enflame S60 accelerator cards. Requirements:
| Chip Type | Driver Version | TopsRider Version |
| :---: | :---: | :---: |
| Enflame S60 | 1.5.0.5 | 3.4.623 |
**Note: To verify if your machine has Enflame S60 accelerator cards installed, run the following command in the host environment and check for output:**
```bash
lspci | grep S60
# Example output:
08:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
09:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
0e:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
11:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
32:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
38:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3b:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3c:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
```
### 1. Environment Setup (Estimated time: 5-10 minutes)
1. Pull the Docker image
```bash
# Note: This image only contains the Paddle development environment, not precompiled PaddlePaddle packages
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
```
2. Start the container
```bash
docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
```
3. Obtain the driver package<br/>
**Full software packages are preloaded in the Docker container. Copy them to an external directory, e.g., ```/home/workspace/deps/```**
```bash
mkdir -p /home/workspace/deps/ && cp /root/TopsRider_i3x_*/TopsRider_i3x_*_deb_amd64.run /home/workspace/deps/
```
4. Install drivers<br/>
**Execute this operation in the host environment**
```bash
cd /home/workspace/deps/
bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
```
After driver installation, **re-enter the Docker container**:
```bash
docker start paddle-gcu-llm
docker exec -it paddle-gcu-llm bash
```
5. Install PaddlePaddle
```bash
# PaddlePaddle Deep Learning Framework provides fundamental computing capabilities
python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```
6. Install PaddleCustomDevice<br/>
```bash
# PaddleCustomDevice implements custom hardware backend for PaddlePaddle, providing GCU operator implementations
python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
# For source compilation, refer to: https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
```
7. Install FastDeploy and dependencies
```bash
python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt install python3.10-distutils
```
### 2. Data Preparation (Estimated time: 2-5 minutes)
Use the trained model to run inference on the GSM8K dataset:
```bash
mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
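To confirm the download, a quick sanity check of the dataset (assuming the standard GSM8K jsonl layout with `question`/`answer` fields):
```python
import json

# Read the first GSM8K test example and print its fields.
with open("/home/workspace/benchmark/test.jsonl") as f:
    first = json.loads(f.readline())
print(first["question"])
print(first["answer"])
```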
Place model weights in a directory, e.g., ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
### 3. Inference (Estimated time: 2-5 minutes)
Start the inference service:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
--port 8188 \
--metrics-port 8200 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--num-gpu-blocks-override 1024
```
Query the model service:
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "The largest ocean is"}
]
}'
```
Successful execution returns inference results, e.g.:
```json
{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
```
### 4. Accuracy Testing (Estimated time: 60-180 minutes)
Place the accuracy script ```bench_gsm8k.py``` in ```/home/workspace/benchmark/``` and modify sampling parameters, e.g.:
```python
data = {
"messages": [
{
"role": "user",
"content": prompt,
}
],
"temperature": 0.6,
"max_tokens": 2047,
"top_p": 0.95,
"do_sample": True,
}
```
Run accuracy tests:
```bash
cd /home/workspace/benchmark/
python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
```
Upon completion, accuracy results are saved in ```result.jsonl```, e.g.:
```json
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
```
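The result file can also be inspected programmatically; a small sketch assuming one JSON object per line as shown above:
```python
import json

# Print the accuracy recorded by the benchmark script.
with open("/home/workspace/benchmark/result.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["task"], "accuracy:", record["accuracy"])
```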

View File

@@ -1 +1,8 @@
# FastDeploy Installation
FastDeploy currently supports installation on the following hardware platforms:
- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Kunlun XPU Installation](kunlunxin_xpu.md)
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)

View File

@@ -0,0 +1,101 @@
# Run the ERNIE-4.5-300B-A47B model on an Iluvatar machine
The current software version only serves as a demonstration of Iluvatar CoreX combined with the FastDeploy inference framework for large models. Issues may occur when running the latest ERNIE 4.5 model; fixes and performance optimizations will follow, and subsequent releases will provide customers with a more stable version.
## Machine Preparation
First, you need to prepare a machine with the following configurations:
| CPU | Memory | Card | Hard Disk|
|-----|------|-----|-----|
| x86 | 1TB| 8xBI150| 1TB|
Currently, the entire model needs to be loaded into host memory, which requires more than 600GB; this will be optimized in subsequent versions.
## Image Preparation
Pull the Docker image
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
```
## Container Preparation
1. Start Container
```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker exec -it paddle_infer bash
```
/home/paddle contains the model files, *.whl packages, and scripts.
2. Install packages
```bash
pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
pip3 install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
pip3 install aistudio-sdk==0.2.6
```
## Prepare the inference demo script
The required scripts are listed below.
`run_demo.sh`:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export USE_WORKER_V1=1
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_DEBUG=1
python3 run_demo.py
```
`run_demo.py`:
```python
from fastdeploy import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
# sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
# load the model
llm = LLM(model="/home/paddle/ernie-4_5-300b-a47b-bf16-paddle", tensor_parallel_size=16, max_model_len=8192)
# Perform batch inference
outputs = llm.generate(prompts, sampling_params)
# Note: Replace /home/paddle/ernie-4_5-300b-a47b-bf16-paddle above with the path to the ERNIE model you have downloaded.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs.text
print(prompt, generated_text)
```
## Run the demo
```bash
./run_demo.sh
```
The following logs will be printed. Loading the model takes approximately 470 seconds and running the demo approximately 90 seconds.
```
/usr/local/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:715: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
/usr/local/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
[2025-06-27 16:35:10,856] [ INFO] - Loading configuration file /home/paddle/ernie-45t/generation_config.json
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:250: UserWarning: using greedy search strategy. However, `temperature` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:255: UserWarning: using greedy search strategy. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
INFO 2025-06-27 16:35:12,205 2717757 engine.py[line:134] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.13it/s]
Loading Layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 199.50it/s]
[2025-06-27 16:35:24,030] [ WARNING] - import EventHandle and deep_ep Failed!
[2025-06-27 16:35:24,032] [ WARNING] - import EventHandle and deep_ep Failed!
INFO 2025-06-27 16:43:02,392 2717757 engine.py[line:700] Stop profile, num_gpu_blocks: 1820
INFO 2025-06-27 16:43:02,393 2717757 engine.py[line:175] Worker processes are launched with 471.5467264652252 seconds.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:29<00:00, 89.98s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Hello, my name is Hello! It's nice to meet you. I'm here to help with questions, have conversations, or assist with whatever you need. What would you like to talk about today? 😊
```

View File

@@ -0,0 +1,221 @@
# Kunlunxin XPU
## Requirements
- OS: Linux
- Python: 3.10
- XPU Model: P800
- XPU Driver Version: ≥ 5.0.21.10
- XPU Firmware Version: ≥ 1.31
Verified platform:
- CPU: INTEL(R) XEON(R) PLATINUM 8563C
- Memory: 2T
- Disk: 4T
- OS: CentOS release 7.6 (Final)
- Python: 3.10
- XPU Model: P800 (OAM Edition)
- XPU Driver Version: 5.0.21.10
- XPU Firmware Version: 1.31
**Note:** Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.
## 1. Set up using Docker (Recommended)
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
```
## 2. Set up using pre-built wheels
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
### Install FastDeploy (**Do NOT install via PyPI source**)
```bash
python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of FastDeploy (Not recommended)
```bash
python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
## 3. Build wheel from source
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
### Download Kunlunxin Toolkit (XTDK) and XVLLM library, then set their paths.
```bash
# XTDK
wget https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/3.2.40.1/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
tar -xvf xtdk-llvm15-ubuntu2004_x86_64.tar.gz && mv xtdk-llvm15-ubuntu2004_x86_64 xtdk
export CLANG_PATH=$(pwd)/xtdk
# XVLLM
wget https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/20250624/output.tar.gz
tar -xvf output.tar.gz && mv output xvllm
export XVLLM_PATH=$(pwd)/xvllm
```
Alternatively, you can download the latest versions of XTDK and XVLLM (Not recommended)
```bash
XTDK: https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/latest/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
XVLLM: https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/latest/output.tar.gz
```
### Download FastDeploy source code, checkout the stable branch/TAG, then compile and install.
```bash
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
The compiled outputs will be located in the ```FastDeploy/dist``` directory.
## Installation verification
```python
import paddle
from paddle.jit.marker import unified
paddle.utils.run_check()
from fastdeploy.model_executor.ops.xpu import block_attn
```
If all the above steps execute successfully, FastDeploy is installed correctly.
## Quick start
Currently, only the following models have been validated for deployment on the P800:
- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card)
- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)
### Offline inference
After installing FastDeploy, you can perform offline text generation with user-provided prompts using the following code:
```python
from fastdeploy import LLM, SamplingParams
prompts = [
"Where is the capital of China?",
]
sampling_params = SamplingParams(top_p=0.95)
llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs.text
print(f"Prompt: {prompt}")
print(f"Generated text: {generated_text}")
```
Refer to [Parameters](../../parameters.md) for more configuration options.
## Online serving (OpenAI API-Compatible server)
Deploy an OpenAI API-compatible server using FastDeploy with the following commands:
### Start service
**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card) (Recommended)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```
**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```
Refer to [Parameters](../../parameters.md) for more options.
### Send requests
Send requests using either curl or Python
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Where is the capital of China?"}
]
}'
```
```python
import openai
host = "0.0.0.0"
port = "8188"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.completions.create(
model="null",
prompt="Where is the capital of China?",
stream=True,
)
for chunk in response:
print(chunk.choices[0].text, end='')
print('\n')
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Where is the capital of China?"},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
For detailed OpenAI protocol specifications, see the [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [Deployment](../../serving/README.md).

View File

@@ -0,0 +1,89 @@
# NVIDIA CUDA GPU Installation
The following installation methods are available when your environment meets these requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Python >= 3.10
- Linux X86_64
## 1. Pre-built Docker Installation (Recommended)
```shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:${fastdeploy_latest_version}
```
Where ```${fastdeploy_latest_version}``` is the FastDeploy release version number. [Check latest release here](https://github.com/PaddlePaddle/FastDeploy/releases). For example:
```shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:2.0.0
```
## 2. Pre-built Pip Installation
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
For SM80/90 architecture GPUs (e.g., A100/H100):
```shell
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For SM86/89 architecture GPUs (e.g., 4090/L20/L40):
```shell
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
## 3. Build from Source Using Docker
- Note: ```dockerfiles/Dockerfile.gpu``` by default supports SM 80/90 architectures. To support other architectures, modify ```bash build.sh 1 python false [80,90]``` in the Dockerfile. It's recommended to specify no more than 2 architectures.
```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
```
## 4. Build Wheel from Source
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:
```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
# Argument 1: Whether to build wheel package (1 for yes, 0 for compile only)
# Argument 2: Python interpreter path
# Argument 3: Whether to compile CPU inference operators
# Argument 4: Target GPU architectures
bash build.sh 1 python false [80,90]
```
The built packages will be in the ```FastDeploy/dist``` directory.
## Environment Verification
After installation, verify the environment with this Python code:
```python
import paddle
from paddle.jit.marker import unified
# Verify GPU availability
paddle.utils.run_check()
# Verify FastDeploy custom operators compilation
from fastdeploy.model_executor.ops.gpu import beam_search_softmax
```
If the above code executes successfully, the environment is ready.

View File

@@ -0,0 +1,93 @@
# Deploy ERNIE-4.5-0.3B-Base-Paddle in 10 Minutes
Before deployment, ensure your environment meets the following requirements:
- GPU Driver ≥ 535
- CUDA ≥ 12.3
- cuDNN ≥ 9.5
- Linux X86_64
- Python ≥ 3.10
This guide uses the lightweight ERNIE-4.5-0.3B-Base-Paddle model for demonstration, which can be deployed on most hardware configurations. Docker deployment is recommended.
For more information about how to install FastDeploy, refer to the [installation document](../installation/README.md).
## 1. Launch Service
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Base-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32
```
> 💡 Note: If the path specified by ```--model``` does not exist locally, FastDeploy checks whether AIStudio provides a preset model with that name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
```--max-model-len``` indicates the maximum number of tokens supported by the currently deployed service.
```--max-num-seqs``` indicates the maximum number of concurrent sequences the deployed service can process.
**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)
## 2. Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write me a poem about large language model."}
]
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Write me a poem about large language model."},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```

View File

@@ -0,0 +1,106 @@
# Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
Before deployment, please ensure your environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- Hardware configuration meets minimum requirements (refer to [Supported Models](../supported_models.md))
For quick deployment across different hardware, this guide uses the ERNIE-4.5-VL-28B-A3B-Paddle multimodal model as an example, which can run on most hardware configurations.
For more information about how to install FastDeploy, refer to the [installation document](./installation/README.md).
>💡 **Note**: All ERNIE multimodal models support reasoning capability. Enable/disable it by setting ```enable_thinking``` in requests (see example below).
## 1. Launch Service
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--reasoning-parser ernie-45-vl \
--enable-mm
```
> 💡 Note: If the path specified by ```--model``` does not exist locally, FastDeploy checks whether AIStudio provides a preset model with that name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
```--max-model-len``` indicates the maximum number of tokens supported by the currently deployed service.
```--max-num-seqs``` indicates the maximum number of concurrent sequences the deployed service can process.
```--reasoning-parser``` specifies the thinking content parser.
```--enable-mm``` indicates whether to enable multi-modal support.
**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)
## 2. Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "What era does this artifact belong to?"}
]}
],
"metadata": {"enable_thinking": false}
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "What era does this artifact belong to?"},
]},
],
metadata={"enable_thinking": false},
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```