Sync v2.0 version of code to github repo

2025-12-24 13:28:13 +08:00 · 2025-06-29 23:29:37 +00:00
parent d151496038
commit 92c2cfa2e7
597 changed files with 78776 additions and 22905 deletions
--- a/docs/zh/get_started/ernie-4.5-vl.md
+++ b/docs/zh/get_started/ernie-4.5-vl.md
@@ -0,0 +1,201 @@
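+# ERNIE-4.5-VL Multimodal Model
+
+This document explains how to deploy the ERNIE-4.5-VL multimodal model, which lets users chat with the model using multimodal data (including thinking, i.e. reasoning). Before deploying, make sure your hardware environment meets the following requirements:
+
+- GPU driver >= 535
+- CUDA >= 12.3
+- CUDNN >= 9.5
+- Linux X86_64
+- Python >= 3.10
+- 8 x 80G A/H GPUs
+
+To install FastDeploy, see the [installation guide](./installation/README.md).
+
+>💡 **Tip**: All ERNIE multimodal models support thinking mode, which is controlled by setting ```enable_thinking``` when sending a request (see the examples below).
+
+## Prepare the Model
+Specifying ```--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle``` at deployment time automatically downloads the model from AIStudio, with support for resuming interrupted downloads. You can also download the model yourself from other sources. Note that FastDeploy requires models in Paddle format; for more details, see the [list of supported models](../supported_models.md).
+
+## Start the Service
+
+Run the following command to start the service. For how to configure the launch command, see the [parameter reference](../parameters.md).
+
+**Note**: Because the model size is 424B-A47B, on a machine with 8 x 80G GPUs you must specify ```--quantization wint4``` (wint8 can also be deployed).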
+
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+       --port 8180 --metrics-port 8181 \
+       --engine-worker-queue-port 8182 --cache-queue-port 8183 \
+       --tensor-parallel-size 8 \
+       --quantization wint4 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --enable-mm \
+       --mm-processor-kwargs '{"video_max_frames": 30}' \
+       --limit-mm-per-prompt '{"image": 10, "video": 3}' \
+       --reasoning-parser ernie-45-vl
+```
+
+## Sending Requests
+Once the launch command has run, the service has started successfully when the terminal prints the following:
+
+```shell
+api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
+api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
+api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
+INFO:     Started server process [13909]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
+```
+
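+FastDeploy provides a health-check endpoint for determining whether the service is up. If the following command returns ```HTTP/1.1 200 OK```, the service has started successfully.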
+
+```shell
+curl -i http://0.0.0.0:8180/health
+```
+
+Send a request to the service with the following command:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "把李白的静夜思改写为现代诗"}
+  ]
+}'
+```
+
+When the input contains an image, send the request as follows:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": [
+      {"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+      {"type":"text", "text":"图中的文物属于哪个年代?"}
+    ]}
+  ]
+}'
+```
+
+When the input contains a video, send the request as follows:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": [
+      {"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
+      {"type":"text", "text":"画面中有几个苹果?"}
+    ]}
+  ]
+}'
+```
+
+The ERNIE-4.5-VL model currently supports thinking mode and enables it by default. Use the following command to disable it:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": [
+      {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+      {"type": "text", "text": "图中的文物属于哪个年代"}
+    ]}
+  ],
+  "metadata": {"enable_thinking": false}
+}'
+```
+
+The FastDeploy service API is compatible with the OpenAI protocol, so you can send requests with the following Python code. The example below uses streaming.
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8180"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "user", "content": [
+            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+            {"type": "text", "text": "图中的文物属于哪个年代?"},
+        ]},
+    ],
+    stream=True,
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
+
+## Model Output
+An example output with thinking is shown below. The reasoning is in the `reasoning_content` field and the model's reply is in the `content` field.
+
+```json
+{
+    "id": "chatcmpl-c4772bea-1950-4bf4-b5f8-3d3c044aab06",
+    "object": "chat.completion",
+    "created": 1750236617,
+    "model": "default",
+    "choices": [
+        {
+            "index": 0,
+            "message": {
+                "role": "assistant",
+                "content": "图中的文物是**唐代（7-8世纪）的佛陀坐像**，现藏于东京国立博物馆。其年代判断依据如下：\n\n1. **造型特征**：\n   - 佛陀结跏趺坐，双手结禅定印，身披通肩袈裟，衣纹呈阶梯状排列，线条厚重且富有层次感，体现了唐代佛像的典型衣饰风格。\n   - 面部圆润丰腴，双目微闭，嘴角含笑，展现了唐代佛像的慈悲祥和之态，与北魏时期的清瘦造型形成鲜明对比。\n\n2. **背光设计**：\n   - 背光呈舟形，内层雕刻密集的千佛（小佛像），外层装饰火焰纹，这种繁复的背光设计在唐代尤为盛行，象征佛法无边。\n\n3. **工艺与材质**：\n   - 石像表面有风化痕迹，符合唐代石雕历经千年的自然侵蚀特征。唐代多采用汉白玉、砂岩等材质雕刻佛像，注重细节刻画与整体气势。\n\n4. **历史背景**：\n   - 唐代是中国佛教发展的鼎盛时期，统治者推崇佛教，各地开窟造像之风盛行。此像的庄严法相与盛唐时期“丰腴为美”的审美取向高度契合。\n\n综上，此像从艺术风格到工艺特征均符合唐代佛教造像的典型特点，是研究唐代佛教艺术的重要实物资料。",
+                "reasoning_content": "用户问的是图中的文物属于哪个年代。首先，我需要确定这张图片中的文物是什么。看起来像是一尊佛像，可能是中国的佛教造像。佛像的造型和装饰风格可能能帮助判断年代。\n\n首先，观察佛像的衣纹和姿势。这尊佛像结跏趺坐，双手放在腿上，可能是在禅定印，这是比较常见的姿势。佛像的衣纹比较厚重，有层次感，可能是北魏或者隋唐时期的风格。北魏时期的佛像通常比较清瘦，衣纹线条硬朗，而隋唐时期的佛像则更丰腴，衣纹流畅。\n\n接下来看背光部分。背光上有许多小佛像，排列成同心圆，这种设计在隋唐时期比较常见，尤其是唐代。北魏时期的背光可能更简单，或者有飞天等装饰，但这种密集的小佛像排列可能更晚一些。\n\n另外，佛像的头部有螺发，肉髻较高，面部圆润，这些都是唐代佛像的特点。北魏的佛像面部通常较为清瘦，鼻梁高挺，而唐代的佛像面部更丰满，表情慈祥。\n\n综合这些特征，这尊佛像可能属于唐代，大约7到8世纪。不过，也有可能属于北魏晚期到隋代之间的过渡时期，但结合衣纹和背光的设计，唐代的可能性更大。需要进一步确认是否有其他特征，比如底座的样式、铭文等，但图片中没有显示这些细节。\n\n可能还需要考虑材质，如果是石雕，唐代常用汉白玉或砂岩，而北魏可能更多使用石灰岩。但图片中的材质看起来像是青铜或铁质，不过也有可能是石雕经过风化后的颜色。不过，佛像的金属质感可能更接近唐代，尤其是如果表面有鎏金的话，但这里看起来有些氧化，可能为铜质。\n\n总之，结合造型、衣纹、背光和面部特征，这尊佛像很可能属于中国唐代的佛教造像，大约7至8世纪。"
+            },
+            "finish_reason": "stop"
+        }
+    ],
+    "usage": {
+        "prompt_tokens": 1260,
+        "total_tokens": 2042,
+        "completion_tokens": 782,
+        "prompt_tokens_details": {
+            "cached_tokens": 0
+        }
+    }
+}
+```
+
+An example output without thinking is shown below. The model's reply is in the `content` field.
+
+```json
+{
+    "id": "chatcmpl-4d508b96-0ea1-4430-98a6-ae569f74f25b",
+    "object": "chat.completion",
+    "created": 1750236495,
+    "model": "default",
+    "choices": [
+        {
+            "index": 0,
+            "message": {
+                "role": "assistant",
+                "content": "图中的文物是**北魏太和元年（477年）释迦牟尼佛像**，现收藏于故宫博物院。这尊佛像具有显著的北魏佛像艺术特征，其年代明确，题记中记载了“太和元年”的纪年，即北魏孝文帝元宏的年号。北魏时期（386-534年）是佛教艺术在中国发展的重要阶段，佛像造型逐渐从外来风格转向本土化，此像正是这一转变的典型代表。其衣纹流畅、面相慈祥，背光雕刻精美，展现了北魏中晚期佛像艺术的成熟与独特力。",
+                "reasoning_content": null
+            },
+            "finish_reason": "stop"
+        }
+    ],
+    "usage": {
+        "prompt_tokens": 1265,
+        "total_tokens": 1407,
+        "completion_tokens": 142,
+        "prompt_tokens_details": {
+            "cached_tokens": 0
+        }
+    }
+}
+```
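
The two response samples above differ only in whether `reasoning_content` is populated. As a minimal sketch, a client could separate the reasoning from the final answer as follows. The helper name `split_reply` is hypothetical (it is not part of FastDeploy), and the code assumes the response body has already been decoded from JSON into a dict with the shape shown in the samples.

```python
from typing import Optional, Tuple


def split_reply(response: dict) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a chat.completion response dict.

    Hypothetical helper: assumes the response shape shown in the samples,
    where `reasoning_content` is null (None) when thinking is disabled.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content")  # None when thinking is off
    answer = message.get("content") or ""
    return reasoning, answer
```

With thinking disabled, the first element of the returned tuple is `None`, so callers can branch on it to decide whether to display a reasoning trace.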
--- a/docs/zh/get_started/ernie-4.5.md
+++ b/docs/zh/get_started/ernie-4.5.md
@@ -0,0 +1,85 @@
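+# ERNIE-4.5 Model
+
+This document explains how to deploy the ERNIE-4.5 model. Before deploying, make sure your hardware environment meets the following requirements:
+
+- GPU driver >= 535
+- CUDA >= 12.3
+- CUDNN >= 9.5
+- Linux X86_64
+- Python >= 3.10
+- 4 x 80G A/H GPUs
+
+To install FastDeploy, see the [installation guide](./installation/README.md).
+
+## Prepare the Model
+Specifying ```--model baidu/ERNIE-4.5-300B-A47B-Paddle``` at deployment time automatically downloads the model from AIStudio, with support for resuming interrupted downloads. You can also download the model yourself from other sources. Note that FastDeploy requires models in Paddle format; for more details, see the [list of supported models](../supported_models.md).
+
+## Start the Service
+
+>💡 **Note**: Because the model size is 300B-A47B, on a machine with 8 x 80G GPUs you must specify ```--quantization wint4``` (wint8 can also be deployed; wint4 needs only 4 GPUs, whereas wint8 needs 8).
+
+Run the following command to start the service. For how to configure the launch command, see the [parameter reference](../parameters.md).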
+
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+       --port 8180 --metrics-port 8181 \
+       --engine-worker-queue-port 8182 --cache-queue-port 8183 \
+       --tensor-parallel-size 8 \
+       --quantization wint4 \
+       --max-model-len 32768 \
+       --max-num-seqs 32
+```
+
+## Sending Requests
+Once the launch command has run, the service has started successfully when the terminal prints the following:
+
+```shell
+api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
+api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
+api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
+INFO:     Started server process [13909]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
+```
+
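+FastDeploy provides a health-check endpoint for determining whether the service is up. If the following command returns ```HTTP/1.1 200 OK```, the service has started successfully.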
+
+```shell
+curl -i http://0.0.0.0:8180/health
+```
+
+Send a request to the service with the following command:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "把李白的静夜思改写为现代诗"}
+  ]
+}'
+```
+
+The FastDeploy service API is compatible with the OpenAI protocol, so you can send requests with the following Python code.
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8180"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+        {"role": "user", "content": "把李白的静夜思改写为现代诗"},
+    ],
+    stream=True,
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
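
Loading a model of this size can take a while, so a client script may want to wait for the health-check endpoint before sending its first request. The following is a minimal sketch using only the standard library. The function name `wait_until_healthy` and its default timeout and interval are assumptions chosen for illustration, not FastDeploy APIs.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout` seconds elapse.

    Hypothetical helper: returns True once the endpoint is healthy, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (connection refused, timeout, non-2xx status)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the service started above, a call such as `wait_until_healthy("http://0.0.0.0:8180/health")` would block until the server reports healthy or the timeout expires.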
--- a/docs/zh/get_started/installation/Enflame_gcu.md
+++ b/docs/zh/get_started/installation/Enflame_gcu.md
@@ -0,0 +1,128 @@
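+# Running the ERNIE-4.5-21B-A3B Model on Enflame S60 with FastDeploy
+
+Enflame S60 ([about Enflame](https://www.enflame-tech.com/)) is a new-generation AI inference accelerator card for large-scale data center deployment. It serves large language models, search/advertising/recommendation workloads, and traditional models, and offers broad model coverage, ease of use, and easy migration and deployment. It can be used widely in mainstream inference scenarios such as image and text generation, search and recommendation, and text, image, and speech recognition.
+
+FastDeploy has been deeply adapted and optimized for the ernie-4_5-21b-a3b-bf16-paddle model on Enflame S60. The GCU inference entry point is unified with the GPU one, so inference tasks can be migrated without modification.
+
+## 🚀 Quick Start 🚀
+
+### 0. Machine preparation. Before you start, prepare a machine with Enflame S60 accelerator cards installed that meets the following requirements:
+
+| Chip type | Driver version | TopsRider version |
+| :---: | :---: | :---: |
+| Enflame S60 | 1.5.0.5 | 3.4.623 |
+
+**Note: To check whether your machine has Enflame S60 accelerator cards installed, run the following command on the host and see whether it produces any output:**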
+```bash
+lspci | grep S60
+
+# Example output of: lspci | grep S60
+08:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+09:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+0e:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+11:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+32:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+38:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+3b:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+3c:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
+```
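+### 1. Environment preparation (takes about 5-10 minutes)
+1. Pull the image
+```bash
+# Note: this image is only a Paddle development environment; it does not include a prebuilt PaddlePaddle package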
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
+```
+2. Start the container with a command like the following
+```bash
+docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
+```
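+3. Obtain the driver package<br/>
+**The full software package is preinstalled inside the docker container and must be copied to a directory outside the container, e.g. ```/home/workspace/deps/```**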
+```bash
+mkdir -p /home/workspace/deps/ && cp /root/TopsRider_i3x_*/TopsRider_i3x_*_deb_amd64.run /home/workspace/deps/
+```
+4. Install the driver<br/>
+**Run this step on the host**
+```bash
+cd /home/workspace/deps/
+bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
+```
+After the driver is installed, **re-enter the docker container** with the following commands
+```bash
+docker start paddle-gcu-llm
+docker exec -it paddle-gcu-llm bash
+```
+5. Install PaddlePaddle<br/>
+```bash
+# The PaddlePaddle deep learning framework, which provides the basic compute capabilities
+python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+```
+6. Install PaddleCustomDevice<br/>
+```bash
+# PaddleCustomDevice is the custom hardware integration layer of the PaddlePaddle deep learning framework; it provides the GCU operator implementations
+python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
+# To build and install from source, see https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
+```
+7. Install FastDeploy and its dependencies<br/>
+```bash
+python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+apt install python3.10-distutils
+```
+### 2. Data preparation (takes about 2-5 minutes)
+Run inference on GSM8K using a trained model
+```bash
+mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
+wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
+```
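+Prepare the model and weights and place them in the environment directory, e.g. ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
+### 3. Inference (takes about 2-5 minutes)
+Run the following command to start the inference service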
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
+    --port 8188 \
+    --metrics-port 8200 \
+    --tensor-parallel-size 4 \
+    --max-model-len 8192 \
+    --num-gpu-blocks-override 1024
+```
+Send a request to the model service with the following command
+```bash
+curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "The largest ocean is"}
+  ]
+}'
+```
+After a successful run, you can see the generated inference result, for example:
+```json
+{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
+```
+### 4. Accuracy test (takes about 60-180 minutes)
+Place the accuracy script ```bench_gsm8k.py``` in ```/home/workspace/benchmark/``` and modify the sampling parameters, for example:
+```python
+data = {
+    "messages": [
+        {
+            "role": "user",
+            "content": prompt,
+        }
+    ],
+    "temperature": 0.6,
+    "max_tokens": 2047,
+    "top_p": 0.95,
+    "do_sample": True,
+}
+```
+Run the following commands to start the accuracy test
+```bash
+cd /home/workspace/benchmark/
+python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
+```
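+After a successful run, the accuracy result is written to ```result.jsonl``` in the current directory. A sample is shown below (partial dataset, for illustration only)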
+```json
+{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
+```
+
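
Each line of `result.jsonl` is a standalone JSON record, so results from several runs can be compared with a few lines of Python. The sketch below is illustrative only: the helper name `summarize_results` is hypothetical, and it assumes the `latency` field is the total wall-clock time in seconds for all requests, which the sample record does not state explicitly.

```python
import json
from typing import Iterable, List


def summarize_results(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines records and derive the average latency per request."""
    summaries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        summaries.append(
            {
                "task": record["task"],
                "accuracy": record["accuracy"],
                "num_requests": record["num_requests"],
                # Assumption: `latency` is the total time in seconds for the whole run.
                "avg_latency_s": record["latency"] / record["num_requests"],
            }
        )
    return summaries
```

A typical use would be `summarize_results(open("result.jsonl", encoding="utf-8"))`, run from the benchmark directory.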
--- a/docs/zh/get_started/installation/README.md
+++ b/docs/zh/get_started/installation/README.md
@@ -0,0 +1,8 @@
+# FastDeploy Installation Guide
+
+FastDeploy currently supports installation on the following hardware platforms:
+
+- [NVIDIA GPU Installation](nvidia_gpu.md) 
+- [Kunlunxin XPU Installation](kunlunxin_xpu.md)
+- [Enflame S60 GCU Installation](Enflame_gcu.md)
+- [Iluvatar GPU Installation](iluvatar_gpu.md)
--- a/docs/zh/get_started/installation/iluvatar_gpu.md
+++ b/docs/zh/get_started/installation/iluvatar_gpu.md
@@ -0,0 +1,102 @@
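+# Running ERNIE-4.5-300B-A47B-BF16 on Iluvatar Machines
+The current software version is only a demo of large-model inference with Iluvatar chips and FastDeploy. Running the latest ERNIE4.5 models may hit issues; fixes and performance optimizations will follow to give customers a more stable version.
+
+## Prepare the Machine
+First, prepare a machine with the following configuration:
+
+| CPU | Memory | Iluvatar | Disk |
+|-----|------|-----|-----|
+| x86 | 1TB| 8xBI150| 1TB|
+
+Currently the full model must be loaded into host memory, which requires more than 600GB of host memory. This will be optimized in later versions.
+
+## Image
+Pull the official image: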
+
+```bash
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
+```
+
+## Prepare the Container
+1. Start the container
+```bash
+docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
+docker exec -it paddle_infer bash
+```
+/home/paddle is the directory that holds the model files, whl packages, and scripts
+
+2. Install the whl packages
+
+```bash
+pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
+pip3 install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+pip3 install aistudio-sdk==0.2.6
+```
+
+## Prepare the Inference Demo Scripts
+Inference demo path: /home/paddle/scripts
+The script contents are as follows
+
+`run_demo.sh`:
+```bash
+#!/bin/bash
+export PADDLE_XCCL_BACKEND=iluvatar_gpu
+export USE_WORKER_V1=1
+export INFERENCE_MSG_QUEUE_ID=232132
+export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
+export FD_DEBUG=1
+python3 run_demo.py
+```
+
+`run_demo.py`:
+```python
+from fastdeploy import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+]
+
+# Sampling parameters
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
+
+# Load the model
+# Note: replace `/home/paddle/ernie-4_5-300b-a47b-bf16-paddle` with the path of the ERNIE model you downloaded.
+llm = LLM(model="/home/paddle/ernie-4_5-300b-a47b-bf16-paddle", tensor_parallel_size=16, max_model_len=8192)
+
+# Run batched inference (the LLM queues requests and inserts them dynamically based on available resources)
+outputs = llm.generate(prompts, sampling_params)
+# Print the results
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs.text
+    print(prompt, generated_text)
+```
+
+## Run the Demo
+Run
+```bash
+./run_demo.sh
+```
+The following log is printed. Loading the model takes about 470s and the demo runs for about 90s.
+```
+/usr/local/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:715: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
+  warnings.warn(warning_message)
+/usr/local/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
+  warnings.warn(
+[2025-06-27 16:35:10,856] [    INFO] - Loading configuration file /home/paddle/ernie-45t/generation_config.json
+/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:250: UserWarning: using greedy search strategy. However, `temperature` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
+  warnings.warn(
+/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:255: UserWarning: using greedy search strategy. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
+  warnings.warn(
+INFO     2025-06-27 16:35:12,205 2717757 engine.py[line:134] Waitting worker processes ready...
+Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.13it/s]
+Loading Layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 199.50it/s]
+[2025-06-27 16:35:24,030] [ WARNING] - import EventHandle and deep_ep Failed!
+[2025-06-27 16:35:24,032] [ WARNING] - import EventHandle and deep_ep Failed!
+INFO     2025-06-27 16:43:02,392 2717757 engine.py[line:700] Stop profile, num_gpu_blocks:  1820
+INFO     2025-06-27 16:43:02,393 2717757 engine.py[line:175] Worker processes are launched with 471.5467264652252 seconds.
+Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:29<00:00, 89.98s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
+Hello, my name is Hello! It's nice to meet you. I'm here to help with questions, have conversations, or assist with whatever you need. What would you like to talk about today? 😊
+```
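
The model load time quoted above comes from the `Worker processes are launched with ... seconds` line in the log. As a small sketch, that figure can be pulled out of a captured log with a regular expression. The helper name `worker_launch_seconds` is hypothetical, and the pattern assumes the log wording stays exactly as in the sample above.

```python
import re
from typing import Optional

# Matches the engine log line that reports how long worker startup took.
_LAUNCH_RE = re.compile(r"Worker processes are launched with ([0-9.]+) seconds")


def worker_launch_seconds(log_text: str) -> Optional[float]:
    """Return the reported worker launch time in seconds, or None if absent."""
    match = _LAUNCH_RE.search(log_text)
    return float(match.group(1)) if match else None
```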
--- a/docs/zh/get_started/installation/kunlunxin_xpu.md
+++ b/docs/zh/get_started/installation/kunlunxin_xpu.md
@@ -0,0 +1,226 @@
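+# Kunlunxin XPU
+
+## Requirements
+
+- OS: Linux
+- Python: 3.10
+- XPU model: P800
+- XPU driver version: ≥ 5.0.21.10
+- XPU firmware version: ≥ 1.31
+
+Verified platform:
+- CPU: INTEL(R) XEON(R) PLATINUM 8563C
+- Memory: 2T
+- Disk: 4T
+- OS: CentOS release 7.6 (Final)
+- Python: 3.10
+- XPU model: P800 (OAM version)
+- XPU driver version: 5.0.21.10
+- XPU firmware version: 1.31
+
+**Note:** So far only servers with INTEL or Hygon CPUs and the OAM version of P800 have been verified. Other CPUs and PCIe-version P800 servers have not yet been verified.
+
+## 1. Install with Docker (recommended)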
+
+```bash
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+```
+
+## 2. Install with Pip
+
+### Install PaddlePaddle
+
+```bash
+python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+```
+
+Alternatively, you can install the latest PaddlePaddle (not recommended)
+
+```bash
+python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+```
+
+### Install FastDeploy (**do not install it from the PyPI index**)
+
+```bash
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/ 
+```
+
+Alternatively, you can install the latest FastDeploy (not recommended)
+
+```bash
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+```
+
+## 3. Build and Install from Source
+
+### Install PaddlePaddle
+
+```bash
+python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+```
+
+Alternatively, you can install the latest PaddlePaddle (not recommended)
+
+```bash
+python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+```
+
+### Download the Kunlunxin compiler toolkit XTDK and the prebuilt XVLLM operator library, and set their paths
+
+```bash
+# XTDK
+wget https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/3.2.40.1/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
+tar -xvf xtdk-llvm15-ubuntu2004_x86_64.tar.gz && mv xtdk-llvm15-ubuntu2004_x86_64 xtdk
+export CLANG_PATH=$(pwd)/xtdk
+
+# XVLLM
+wget https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/20250624/output.tar.gz
+tar -xvf output.tar.gz && mv output xvllm
+export XVLLM_PATH=$(pwd)/xvllm
+```
+
+Alternatively, you can download the latest XTDK and XVLLM (not recommended)
+
+```
+XTDK: https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/latest/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
+XVLLM: https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/latest/output.tar.gz
+```
+
+### Download the FastDeploy source, switch to a stable branch or tag, then build and install:
+
+```bash
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+git checkout <tag or branch>
+bash build.sh
+```
+The build artifacts are in the ```FastDeploy/dist``` directory.
+
+## Verify the Installation
+
+```python
+import paddle
+from paddle.jit.marker import unified
+paddle.utils.run_check()
+from fastdeploy.model_executor.ops.xpu import block_attn
+```
+
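+If all of the steps above succeed, FastDeploy has been installed successfully.
+
+## Quick Start
+
+On P800, deployment has so far been verified only for the following models:
+- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards)
+- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)
+
+### Offline Inference
+
+After installing FastDeploy, you can use the following code to run offline inference and generate text from user-provided input.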
+
+```python
+from fastdeploy import LLM, SamplingParams
+
+prompts = [
+    "Where is the capital of China?",
+]
+
+# Sampling parameters
+sampling_params = SamplingParams(top_p=0.95)
+
+# Load the model
+llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
+
+# Run batched inference (the LLM queues requests and inserts them dynamically based on available resources)
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the results
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs.text
+
+    print(f"Prompt: {prompt}")
+    print(f"Generated text: {generated_text}")
+```
+
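+For more parameters, see the [parameter reference](../../parameters.md).
+
+### OpenAI-Compatible Server
+
+You can also use the following commands to deploy a server based on FastDeploy that is compatible with the OpenAI API protocol.
+
+#### Start the Service
+
+**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards) (recommended)**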
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8188 \
+    --tensor-parallel-size 8 \
+    --max-model-len 32768 \
+    --max-num-seqs 64 \
+    --quantization "wint4" \
+    --gpu-memory-utilization 0.9
+```
+
+**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)**
+
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8188 \
+    --tensor-parallel-size 8 \
+    --max-model-len 131072 \
+    --max-num-seqs 64 \
+    --quantization "wint4" \
+    --gpu-memory-utilization 0.9
+```
+
+For more parameters, see the [parameter reference](../../parameters.md).
+
+#### Send Requests
+
+You can send requests to the service over the OpenAI protocol using either curl or Python.
+
+```bash
+curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "Where is the capital of China?"}
+  ]
+}'
+```
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8188"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.completions.create(
+    model="null",
+    prompt="Where is the capital of China?",
+    stream=True,
+)
+for chunk in response:
+    print(chunk.choices[0].text, end='')
+print('\n')
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+        {"role": "user", "content": "Where is the capital of China?"},
+    ],
+    stream=True,
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
+
+For more on the OpenAI protocol, see the [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create) documentation. For the differences from the OpenAI protocol, see [Serving deployment](../../serving/README.md).
+
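
When `stream=True` is set, OpenAI-style servers send the reply as server-sent events: one `data: {json}` line per chunk, ending with `data: [DONE]`. The sketch below shows how those lines could be reassembled without the `openai` package. It assumes FastDeploy follows this OpenAI wire format for streaming (the docs above state protocol compatibility but do not show the raw stream), and the function name `iter_stream_content` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # The first chunk often carries only the role, so content may be missing.
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

Joining the yielded fragments with `"".join(...)` gives the full reply. Skipping chunks whose `content` is empty also avoids printing `None` for role-only chunks.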
--- a/docs/zh/get_started/installation/nvidia_gpu.md
+++ b/docs/zh/get_started/installation/nvidia_gpu.md
@@ -0,0 +1,87 @@
+# NVIDIA CUDA GPU Installation
+
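+Provided your environment meets the following requirements
+
+- GPU driver >= 535
+- CUDA >= 12.3
+- CUDNN >= 9.5
+- Python >= 3.10
+- Linux X86_64
+
+you can install FastDeploy in any of the following 4 ways
+
+## 1. Prebuilt Docker Image (recommended)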
+``` shell
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
+```
+
+## 2. Prebuilt Pip Package
+
+First install paddlepaddle-gpu. For detailed installation instructions, see [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
+``` shell
+python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+```
+
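+Then install fastdeploy. **Do not install it from the PyPI index**; install it as follows instead
+
+If your GPU uses the SM80/90 architecture (A100/H100, etc.), install as follows
+```
+# Install the stable release of fastdeploy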
+python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Install the latest Nightly Build of fastdeploy
+python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+```
+
+If your GPU uses the SM86/89 architecture (4090/L20/L40, etc.), install as follows
+```
+# Install the stable release of fastdeploy
+python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+
+# Install the latest Nightly Build of fastdeploy
+python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+```
+
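+## 3. Build the Image Yourself
+
+> Note: ```dockerfiles/Dockerfile.gpu``` builds for SM 80/90 by default. To support other architectures, modify ```bash build.sh 1 python false [80,90]``` in the Dockerfile yourself; no more than 2 architectures is recommended.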
+
+```
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+
+docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
+```
+
+## 4. Build the Wheel from Source
+
+First install paddlepaddle-gpu. For detailed installation instructions, see [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
+``` shell
+python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+```
+
+Next, clone the source code, then build and install
+``` shell
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+
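+# Argument 1: whether to build a wheel package; 1 builds the wheel, 0 only compiles
+# Argument 2: path to the Python interpreter
+# Argument 3: whether to build the CPU inference operators
+# Argument 4: GPU architectures to build for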
+bash build.sh 1 python false [80,90]
+```
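+The build artifacts are in the ```FastDeploy/dist``` directory.
+
+## Environment Check
+
+After installing FastDeploy, check that the environment is usable with the following Python code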
+``` python
+import paddle
+from paddle.jit.marker import unified
+# Check that the GPUs are available
+paddle.utils.run_check()
+# Check whether the FastDeploy custom operators were compiled successfully
+from fastdeploy.model_executor.ops.gpu import beam_search_softmax
+```
+If the code above runs successfully, the environment is usable.
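
The prebuilt pip packages in section 2 are published under one index per architecture group and release channel. The mapping can be sketched as a small lookup, using only the four index URLs listed in that section. The helper name `fastdeploy_index_url` is hypothetical, and architectures outside SM 80/86/89/90 are rejected because no prebuilt index is documented for them.

```python
# Architecture groups taken from the pip install commands in section 2.
_ARCH_GROUPS = {80: "80_90", 90: "80_90", 86: "86_89", 89: "86_89"}


def fastdeploy_index_url(sm_arch: int, nightly: bool = False) -> str:
    """Return the pip index URL for the given SM architecture and channel."""
    if sm_arch not in _ARCH_GROUPS:
        raise ValueError(f"no prebuilt fastdeploy-gpu index documented for SM{sm_arch}")
    channel = "nightly" if nightly else "stable"
    group = _ARCH_GROUPS[sm_arch]
    return f"https://www.paddlepaddle.org.cn/packages/{channel}/fastdeploy-gpu-{group}/"
```

The returned value is what goes after `-i` in the `python -m pip install fastdeploy-gpu` command. For any other architecture, building from source (section 3 or 4) is the documented route.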
--- a/docs/zh/get_started/quick_start.md
+++ b/docs/zh/get_started/quick_start.md
@@ -0,0 +1,85 @@
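+# Deploy the ERNIE-4.5-0.3B-Base-Paddle Model in 10 Minutes
+
+This document explains how to deploy the ERNIE-4.5-0.3B-Base-Paddle model. Before deploying, make sure your hardware environment meets the following requirements:
+
+- GPU driver >= 535
+- CUDA >= 12.3
+- CUDNN >= 9.5
+- Linux X86_64
+- Python >= 3.10
+- Hardware that meets the model's minimum configuration requirements; see the [list of supported models](../supported_models.md)
+
+To make deployment quick across a variety of hardware, this document uses the ```ERNIE-4.5-0.3B-Base-Paddle``` model as its example, which can be deployed on most hardware.
+
+To install FastDeploy, see the [installation guide](./installation/README.md).
+
+## 1. Start the Service
+After installing FastDeploy, run the following command in a terminal to start the service. For how to configure the launch command, see the [parameter reference](../parameters.md)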
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-0.3B-Base-Paddle \
+       --port 8180 \
+       --metrics-port 8181 \
+       --engine-worker-queue-port 8182 \
+       --max-model-len 32768 \
+       --max-num-seqs 32
+```
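+>💡 Note: For the path given to ```--model```, if no matching subdirectory exists under the current directory, FastDeploy looks up the given model name (e.g. ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) on AIStudio and, if a preset model exists, starts downloading it automatically. The default download path is ```~/xx```. For details and configuration of automatic model download, see [Model Download](../supported_models.md).  
+```--max-model-len``` is the maximum number of tokens supported by the deployed service.  
+```--max-num-seqs``` is the maximum number of requests the deployed service processes concurrently.  
+
+**Related documents**
+
+- [Service deployment configuration](../online_serving/README.md)
+- [Service monitoring metrics](../online_serving/metrics.md)
+
+## 2. Sending Requests
+
+Once the launch command has run, the service has started successfully when the terminal prints the following: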
+```
+api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
+api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
+api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
+INFO:     Started server process [13909]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
+```
+
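+FastDeploy provides a health-check endpoint for determining whether the service is up. If the following command returns ```HTTP/1.1 200 OK```, the service has started successfully.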
+```shell
+curl -i http://0.0.0.0:8180/health
+```
+
+Send a request to the service with the following command:
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "把李白的静夜思改写为现代诗"}
+  ]
+}'
+```
+
+The FastDeploy service API is compatible with the OpenAI protocol, so you can send requests with the following Python code.
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8180"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+        {"role": "user", "content": "把李白的静夜思改写为现代诗"},
+    ],
+    stream=True,
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
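
The launch command in section 1 gives the API server, the metrics endpoint, and the engine worker queue a port each, and two listeners cannot bind the same port on one host. Before launching, a configuration can be checked for accidental duplicates with a sketch like the following. The helper name `find_port_conflicts` is hypothetical; the flag names are the ones used in the launch command.

```python
from typing import Dict, List


def find_port_conflicts(ports: Dict[str, int]) -> Dict[int, List[str]]:
    """Return {port: [flags...]} for every port assigned to more than one flag."""
    by_port: Dict[int, List[str]] = {}
    for flag, port in ports.items():
        by_port.setdefault(port, []).append(flag)
    return {port: flags for port, flags in by_port.items() if len(flags) > 1}
```

An empty result means every flag has its own port. Any entry in the result names the flags that would collide.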
--- a/docs/zh/get_started/quick_start_vl.md
+++ b/docs/zh/get_started/quick_start_vl.md
@@ -0,0 +1,103 @@
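+# Deploy the ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
+
+This document explains how to deploy the ERNIE-4.5-VL-28B-A3B-Paddle model. Before deploying, make sure your hardware environment meets the following requirements:
+
+- GPU driver >= 535
+- CUDA >= 12.3
+- CUDNN >= 9.5
+- Linux X86_64
+- Python >= 3.10
+- Hardware that meets the model's minimum configuration requirements; see the [list of supported models](../supported_models.md)
+
+To make deployment quick across a variety of hardware, this document uses the ```ERNIE-4.5-VL-28B-A3B-Paddle``` multimodal model as its example, which can be deployed on most hardware.
+
+To install FastDeploy, see the [installation guide](./installation/README.md).
+
+>💡 **Tip**: All ERNIE multimodal models support thinking mode, which is controlled by setting ```enable_thinking``` when sending a request (see the examples below).
+
+## 1. Start the Service
+After installing FastDeploy, run the following command in a terminal to start the service. For how to configure the launch command, see the [parameter reference](../parameters.md)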
+
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+       --port 8180 \
+       --metrics-port 8181 \
+       --engine-worker-queue-port 8182 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --reasoning-parser ernie-45-vl \
+       --enable-mm
+```
+
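+>💡 Note: For the path given to ```--model```, if no matching subdirectory exists under the current directory, FastDeploy looks up the given model name (e.g. ```baidu/ERNIE-4.5-VL-28B-A3B-Paddle```) on AIStudio and, if a preset model exists, starts downloading it automatically. The default download path is ```~/xx```. For details and configuration of automatic model download, see [Model Download](../supported_models.md).  
+```--max-model-len``` is the maximum number of tokens supported by the deployed service.  
+```--max-num-seqs``` is the maximum number of requests the deployed service processes concurrently.  
+```--reasoning-parser``` specifies the parser for the thinking content.  
+```--enable-mm``` turns on multimodal support.  
+
+**Related documents**
+
+- [Service deployment configuration](../online_serving/README.md)
+- [Service monitoring metrics](../online_serving/metrics.md)
+
+## 2. Sending Requests
+
+Once the launch command has run, the service has started successfully when the terminal prints the following: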
+
+```
+api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
+api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
+api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
+INFO:     Started server process [13909]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
+```
+
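+FastDeploy provides a health-check endpoint for determining whether the service is up. If the following command returns ```HTTP/1.1 200 OK```, the service has started successfully.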
+
+```shell
+curl -i http://0.0.0.0:8180/health
+```
+
+Send a request to the service with the following command:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": [
+      {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+      {"type": "text", "text": "图中的文物属于哪个年代"}
+    ]}
+  ],
+  "metadata": {"enable_thinking": false}
+}'
+```
+
+The FastDeploy service API is compatible with the OpenAI protocol, so you can send requests with the following Python code.
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8180"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "user", "content": [
+            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+            {"type": "text", "text": "图中的文物属于哪个年代?"},
+        ]},
+    ],
+    metadata={"enable_thinking": False},
+    stream=True,
+)
+for chunk in response:
+    if chunk.choices[0].delta:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
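
The curl and Python requests above carry the same JSON body: a user message whose `content` is a list of typed parts, plus a `metadata` object holding `enable_thinking`. As a minimal sketch, that body can be built once and reused by either client. The helper name `build_vl_request` is hypothetical; the field names are the ones used in the examples above.

```python
import json


def build_vl_request(image_url: str, question: str, enable_thinking: bool = True) -> dict:
    """Build a chat-completions body with one image part and one text part."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "metadata": {"enable_thinking": enable_thinking},
    }


def to_json(body: dict) -> str:
    """Serialize the body; Python's False/True become JSON false/true."""
    return json.dumps(body, ensure_ascii=False)
```

The string returned by `to_json` can be passed to `curl -d`. Note that the boolean is written `False` in Python and only becomes `false` once serialized.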