Sync v2.0 version of code to github repo

This commit is contained in:
Jiang-Jia-Jun
2025-06-29 23:29:37 +00:00
parent d151496038
commit 92c2cfa2e7
597 changed files with 78776 additions and 22905 deletions

View File

@@ -0,0 +1,199 @@
# ERNIE-4.5-VL Multimodal Model
This document explains how to deploy the ERNIE-4.5-VL multimodal model, which lets users interact with the model using multimodal data (with reasoning support). Before starting deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 8x 80GB GPUs (A/H series)
Refer to the [Installation Guide](./installation/README.md) for FastDeploy setup.
>💡 **Note**: All ERNIE multimodal models support thinking mode, which can be enabled by setting ```enable_thinking``` when initiating a service request (see the example below).
## Prepare the Model
Specify ```--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle``` during deployment to automatically download the model from AIStudio with resumable downloads. You can also manually download the model from other sources. Note that FastDeploy requires Paddle-format models. For more details, see [Supported Models](../supported_models.md).
## Launch the Service
Execute the following command to start the service. For parameter configurations, refer to [Parameter Guide](../parameters.md).
>💡 **Note**: Since the model size is 424B-A47B, on a machine with 8x 80GB GPUs, specify ```--quantization wint4``` (wint8 is also supported).
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8182 --metrics-port 8183 \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl
```
## Request the Service
After launching, the service is ready when the following logs appear:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
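The same check can be scripted; a minimal sketch using the `requests` package (assumed to be installed):
```python
import requests

# Query the health endpoint; HTTP 200 means the service is ready.
resp = requests.get("http://0.0.0.0:8180/health", timeout=5)
print(resp.status_code)
```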
### cURL Request
Send requests as follows:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Rewrite Li Bai's 'Quiet Night Thoughts' as a modern poem"}
]
}'
```
For image inputs:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type":"image_url", "image_url": {"url":"https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type":"text", "text":"From which era does the artifact in the image originate?"}
]}
]
}'
```
For video inputs:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type":"video_url", "video_url": {"url":"https://bj.bcebos.com/v1/paddlenlp/datasets/paddlemix/demo_video/example_video.mp4"}},
{"type":"text", "text":"How many apples are in the scene?"}
]}
]
}'
```
ERNIE-4.5-VL supports reasoning mode (enabled by default). Disable it as follows:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "From which era does the artifact in the image originate?"}
]}
],
"metadata": {"enable_thinking": false}
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for streaming requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "From which era does the artifact in the image originate?"},
]},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
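To disable thinking mode from the Python client as well, the same `metadata` field used in the cURL example above can be attached to the request. A minimal sketch; recent `openai` client versions accept `metadata=` directly, otherwise passing it via `extra_body` is an assumption to verify against your client version:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="null")
# Attach the same metadata field shown in the cURL example to turn thinking off.
# If your openai client version rejects `metadata=`, use
# extra_body={"metadata": {"enable_thinking": False}} instead.
response = client.chat.completions.create(
    model="null",
    messages=[{"role": "user", "content": "Describe the artifact in one sentence."}],
    metadata={"enable_thinking": False},
)
print(response.choices[0].message.content)
```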
## Model Output
Example output with reasoning (reasoning content in `reasoning_content`, response in `content`):
```json
{
"id": "chatcmpl-c4772bea-1950-4bf4-b5f8-3d3c044aab06",
"object": "chat.completion",
"created": 1750236617,
"model": "default",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The artifact in the image ...",
"reasoning_content": "The user asks about ..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1260,
"total_tokens": 2042,
"completion_tokens": 782
}
}
```
Example output without reasoning:
```json
{
"id": "chatcmpl-4d508b96-0ea1-4430-98a6-ae569f74f25b",
"object": "chat.completion",
"created": 1750236495,
"model": "default",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The artifact is a ...",
"reasoning_content": null
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1265,
"total_tokens": 1407,
"completion_tokens": 142
}
}
```
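To read both fields programmatically, a minimal sketch that posts the same request directly and parses the JSON shown above (assuming the `requests` package is installed):
```python
import requests

payload = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
            {"type": "text", "text": "From which era does the artifact in the image originate?"},
        ]}
    ]
}
resp = requests.post("http://0.0.0.0:8180/v1/chat/completions", json=payload, timeout=300)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # None when thinking is disabled
print("answer:", message["content"])
```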

View File

@@ -0,0 +1,89 @@
# ERNIE-4.5 Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- 4x 80GB GPUs (A/H series)
For FastDeploy installation, refer to the [Installation Guide](./installation/README.md).
## Prepare the Model
Specify `--model baidu/ERNIE-4.5-300B-A47B-Paddle` during deployment to automatically download the model from AIStudio with support for resumable transfers. Alternatively, you can download the model manually from other sources. Note that FastDeploy requires the model in Paddle format. For more details, see the [Supported Models List](../supported_models.md).
## Start the Service
>💡 **Note**: Since the model size is 300B-A47B, on a machine with 8x 80GB GPUs, specify `--quantization wint4` (wint8 is also supported; wint4 requires 4 GPUs and wint8 requires 8 GPUs).
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
--cache-queue-port 8182 --metrics-port 8183 \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32
```
## Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write me a poem about large language model."}
]
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Write me a poem about large language model."},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
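Besides `/v1/chat/completions`, the startup logs also show a `/v1/completions` endpoint. A minimal sketch of calling it with the same client (standard OpenAI completions fields):
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="null")
# Plain (non-chat) completion request against /v1/completions.
response = client.completions.create(
    model="null",
    prompt="Write me a poem about large language model.",
    stream=False,
)
print(response.choices[0].text)
```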

View File

@@ -0,0 +1,129 @@
# Running ERNIE-4.5-21B-A3B with FastDeploy
The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.
FastDeploy has deeply adapted and optimized the ernie-4_5-21b-a3b-bf16-paddle model for the Enflame S60, achieving a unified inference interface between GCU and GPU. This allows seamless migration of inference tasks without code modifications.
## 🚀 Quick Start 🚀
### 0. Machine Preparation
Before starting, prepare a machine equipped with Enflame S60 accelerator cards. Requirements:
| Chip Type | Driver Version | TopsRider Version |
| :---: | :---: | :---: |
| Enflame S60 | 1.5.0.5 | 3.4.623 |
**Note: To verify if your machine has Enflame S60 accelerator cards installed, run the following command in the host environment and check for output:**
```bash
lspci | grep S60
# Example output:
08:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
09:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
0e:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
11:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
32:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
38:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3b:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3c:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
```
### 1. Environment Setup (Estimated time: 5-10 minutes)
1. Pull the Docker image
```bash
# Note: This image only contains the Paddle development environment, not precompiled PaddlePaddle packages
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
```
2. Start the container
```bash
docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
```
3. Obtain the driver package<br/>
**Full software packages are preloaded in the Docker container. Copy them to an external directory, e.g., ```/home/workspace/deps/```**
```bash
mkdir -p /home/workspace/deps/ && cp /root/TopsRider_i3x_*/TopsRider_i3x_*_deb_amd64.run /home/workspace/deps/
```
4. Install drivers<br/>
**Execute this operation in the host environment**
```bash
cd /home/workspace/deps/
bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
```
After driver installation, **re-enter the Docker container**:
```bash
docker start paddle-gcu-llm
docker exec -it paddle-gcu-llm bash
```
5. Install PaddlePaddle
```bash
# PaddlePaddle Deep Learning Framework provides fundamental computing capabilities
python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```
6. Install PaddleCustomDevice<br/>
```bash
# PaddleCustomDevice implements custom hardware backend for PaddlePaddle, providing GCU operator implementations
python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
# For source compilation, refer to: https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
```
7. Install FastDeploy and dependencies
```bash
python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt install python3.10-distutils
```
### 2. Data Preparation (Estimated time: 2-5 minutes)
Use the trained model to run inference on the GSM8K dataset:
```bash
mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
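To confirm the download, a quick sanity check of the dataset (assuming the standard GSM8K jsonl layout with `question`/`answer` fields):
```python
import json

# Read the first GSM8K test example and print its fields.
with open("/home/workspace/benchmark/test.jsonl") as f:
    first = json.loads(f.readline())
print(first["question"])
print(first["answer"])
```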
Place model weights in a directory, e.g., ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
### 3. Inference (Estimated time: 2-5 minutes)
Start the inference service:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
--port 8188 \
--metrics-port 8200 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--num-gpu-blocks-override 1024
```
Query the model service:
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "The largest ocean is"}
]
}'
```
Successful execution returns inference results, e.g.:
```json
{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
```
### 4. Accuracy Testing (Estimated time: 60-180 minutes)
Place the accuracy script ```bench_gsm8k.py``` in ```/home/workspace/benchmark/``` and modify sampling parameters, e.g.:
```python
data = {
"messages": [
{
"role": "user",
"content": prompt,
}
],
"temperature": 0.6,
"max_tokens": 2047,
"top_p": 0.95,
"do_sample": True,
}
```
Run accuracy tests:
```bash
cd /home/workspace/benchmark/
python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
```
Upon completion, accuracy results are saved in ```result.jsonl```, e.g.:
```json
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
```
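The result file can also be inspected programmatically; a small sketch assuming one JSON object per line as shown above:
```python
import json

# Print the accuracy recorded by the benchmark script.
with open("/home/workspace/benchmark/result.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["task"], "accuracy:", record["accuracy"])
```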

View File

@@ -1 +1,8 @@
# FastDeploy Installation
FastDeploy currently supports installation on the following hardware platforms:
- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Kunlun XPU Installation](kunlunxin_xpu.md)
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)

View File

@@ -0,0 +1,101 @@
# Run the ERNIE-4.5-300B-A47B model on an Iluvatar machine
The current software version only serves as a demonstration of Iluvatar CoreX combined with the FastDeploy inference framework for large models. Issues may occur when running the latest ERNIE 4.5 model; fixes and performance optimizations will follow, and subsequent releases will provide customers with a more stable version.
## Machine Preparation
First, you need to prepare a machine with the following configurations:
| CPU | Memory | Card | Hard Disk|
|-----|------|-----|-----|
| x86 | 1TB| 8xBI150| 1TB|
Currently, the entire model needs to be loaded into host memory, which requires more than 600GB; this will be optimized in subsequent versions.
## Image Preparation
Pull the Docker image
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
```
## Container Preparation
1. Start Container
```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker exec -it paddle_infer bash
```
/home/paddle contains the model files, *.whl packages, and scripts.
2. Install packages
```bash
pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
pip3 install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
pip3 install aistudio-sdk==0.2.6
```
## Prepare the inference demo script
The required scripts are listed below.
`run_demo.sh`:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export USE_WORKER_V1=1
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_DEBUG=1
python3 run_demo.py
```
`run_demo.py`:
```python
from fastdeploy import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
# sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
# load the model
llm = LLM(model="/home/paddle/ernie-4_5-300b-a47b-bf16-paddle", tensor_parallel_size=16, max_model_len=8192)
# Perform batch inference
outputs = llm.generate(prompts, sampling_params)
# Note: Replace /home/paddle/ernie-4_5-300b-a47b-bf16-paddle above with the path to the ERNIE model you have downloaded.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs.text
print(prompt, generated_text)
```
## Run the demo
```bash
./run_demo.sh
```
The following logs will be printed. Loading the model takes approximately 470 seconds and running the demo approximately 90 seconds.
```
/usr/local/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:715: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
/usr/local/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
[2025-06-27 16:35:10,856] [ INFO] - Loading configuration file /home/paddle/ernie-45t/generation_config.json
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:250: UserWarning: using greedy search strategy. However, `temperature` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:255: UserWarning: using greedy search strategy. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
INFO 2025-06-27 16:35:12,205 2717757 engine.py[line:134] Waitting worker processes ready...
Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.13it/s]
Loading Layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 199.50it/s]
[2025-06-27 16:35:24,030] [ WARNING] - import EventHandle and deep_ep Failed!
[2025-06-27 16:35:24,032] [ WARNING] - import EventHandle and deep_ep Failed!
INFO 2025-06-27 16:43:02,392 2717757 engine.py[line:700] Stop profile, num_gpu_blocks: 1820
INFO 2025-06-27 16:43:02,393 2717757 engine.py[line:175] Worker processes are launched with 471.5467264652252 seconds.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:29<00:00, 89.98s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Hello, my name is Hello! It's nice to meet you. I'm here to help with questions, have conversations, or assist with whatever you need. What would you like to talk about today? 😊
```

View File

@@ -0,0 +1,221 @@
# Kunlunxin XPU
## Requirements
- OS: Linux
- Python: 3.10
- XPU Model: P800
- XPU Driver Version: ≥ 5.0.21.10
- XPU Firmware Version: ≥ 1.31
Verified platform:
- CPU: INTEL(R) XEON(R) PLATINUM 8563C
- Memory: 2T
- Disk: 4T
- OS: CentOS release 7.6 (Final)
- Python: 3.10
- XPU Model: P800 (OAM Edition)
- XPU Driver Version: 5.0.21.10
- XPU Firmware Version: 1.31
**Note:** Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.
## 1. Set up using Docker (Recommended)
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
```
## 2. Set up using pre-built wheels
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
### Install FastDeploy (**Do NOT install via PyPI source**)
```bash
python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of FastDeploy (Not recommended)
```bash
python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
## 3. Build wheel from source
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
### Download Kunlunxin Toolkit (XTDK) and XVLLM library, then set their paths.
```bash
# XTDK
wget https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/3.2.40.1/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
tar -xvf xtdk-llvm15-ubuntu2004_x86_64.tar.gz && mv xtdk-llvm15-ubuntu2004_x86_64 xtdk
export CLANG_PATH=$(pwd)/xtdk
# XVLLM
wget https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/20250624/output.tar.gz
tar -xvf output.tar.gz && mv output xvllm
export XVLLM_PATH=$(pwd)/xvllm
```
Alternatively, you can download the latest versions of XTDK and XVLLM (Not recommended)
```bash
XTDK: https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/latest/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
XVLLM: https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/latest/output.tar.gz
```
### Download FastDeploy source code, checkout the stable branch/TAG, then compile and install.
```bash
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
The compiled outputs will be located in the ```FastDeploy/dist``` directory.
## Installation verification
```python
import paddle
from paddle.jit.marker import unified
paddle.utils.run_check()
from fastdeploy.model_executor.ops.xpu import block_attn
```
If all the above steps execute successfully, FastDeploy is installed correctly.
## Quick start
Currently, only the following models have been validated for deployment on the P800:
- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card)
- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)
### Offline inference
After installing FastDeploy, you can perform offline text generation with user-provided prompts using the following code:
```python
from fastdeploy import LLM, SamplingParams
prompts = [
"Where is the capital of China?",
]
sampling_params = SamplingParams(top_p=0.95)
llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs.text
print(f"Prompt: {prompt}")
print(f"Generated text: {generated_text}")
```
Refer to [Parameters](../../parameters.md) for more configuration options.
## Online serving (OpenAI API-Compatible server)
Deploy an OpenAI API-compatible server using FastDeploy with the following commands:
### Start service
**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card) (Recommended)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```
**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```
Refer to [Parameters](../../parameters.md) for more options.
### Send requests
Send requests using either curl or Python
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Where is the capital of China?"}
]
}'
```
```python
import openai
host = "0.0.0.0"
port = "8188"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.completions.create(
model="null",
prompt="Where is the capital of China?",
stream=True,
)
for chunk in response:
print(chunk.choices[0].text, end='')
print('\n')
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Where is the capital of China?"},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
For detailed OpenAI protocol specifications, see the [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [Deployment](../../serving/README.md).

View File

@@ -0,0 +1,89 @@
# NVIDIA CUDA GPU Installation
The following installation methods are available when your environment meets these requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Python >= 3.10
- Linux X86_64
## 1. Pre-built Docker Installation (Recommended)
```shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:${fastdeploy_latest_version}
```
Where ```${fastdeploy_latest_version}``` is the FastDeploy release version number. [Check latest release here](https://github.com/PaddlePaddle/FastDeploy/releases). For example:
```shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:2.0.0
```
## 2. Pre-built Pip Installation
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
For SM80/90 architecture GPUs (e.g., A100/H100):
```shell
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
For SM86/89 architecture GPUs (e.g., 4090/L20/L40):
```shell
# Install stable release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Install latest Nightly build
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
## 3. Build from Source Using Docker
- Note: ```dockerfiles/Dockerfile.gpu``` by default supports SM 80/90 architectures. To support other architectures, modify ```bash build.sh 1 python false [80,90]``` in the Dockerfile. It's recommended to specify no more than 2 architectures.
```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
```
## 4. Build Wheel from Source
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:
```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
# Argument 1: Whether to build wheel package (1 for yes, 0 for compile only)
# Argument 2: Python interpreter path
# Argument 3: Whether to compile CPU inference operators
# Argument 4: Target GPU architectures
bash build.sh 1 python false [80,90]
```
The built packages will be in the ```FastDeploy/dist``` directory.
## Environment Verification
After installation, verify the environment with this Python code:
```python
import paddle
from paddle.jit.marker import unified
# Verify GPU availability
paddle.utils.run_check()
# Verify FastDeploy custom operators compilation
from fastdeploy.model_executor.ops.gpu import beam_search_softmax
```
If the above code executes successfully, the environment is ready.

View File

@@ -0,0 +1,93 @@
# Deploy ERNIE-4.5-0.3B-Base-Paddle in 10 Minutes
Before deployment, ensure your environment meets the following requirements:
- GPU Driver ≥ 535
- CUDA ≥ 12.3
- cuDNN ≥ 9.5
- Linux X86_64
- Python ≥ 3.10
This guide uses the lightweight ERNIE-4.5-0.3B-Base-Paddle model for demonstration, which can be deployed on most hardware configurations. Docker deployment is recommended.
For more information about how to install FastDeploy, refer to the [installation document](../installation/README.md).
## 1. Launch Service
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Base-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32
```
> 💡 Note: If the path specified by ```--model``` does not exist locally, FastDeploy checks whether AIStudio provides a preset model with that name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
```--max-model-len``` indicates the maximum number of tokens supported by the currently deployed service.
```--max-num-seqs``` indicates the maximum number of concurrent sequences the deployed service can process.
**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)
## 2. Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write me a poem about large language model."}
]
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "system", "content": "I'm a helpful AI assistant."},
{"role": "user", "content": "Write me a poem about large language model."},
],
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```

View File

@@ -0,0 +1,106 @@
# Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
Before deployment, please ensure your environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
- Hardware configuration meets minimum requirements (refer to [Supported Models](../supported_models.md))
For quick deployment across different hardware, this guide uses the ERNIE-4.5-VL-28B-A3B-Paddle multimodal model as an example, which can run on most hardware configurations.
For more information about how to install FastDeploy, refer to the [installation document](./installation/README.md).
>💡 **Note**: All ERNIE multimodal models support reasoning capability. Enable/disable it by setting ```enable_thinking``` in requests (see example below).
## 1. Launch Service
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--reasoning-parser ernie-45-vl \
--enable-mm
```
> 💡 Note: If the path specified by ```--model``` does not exist locally, FastDeploy checks whether AIStudio provides a preset model with that name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
```--max-model-len``` indicates the maximum number of tokens supported by the currently deployed service.
```--max-num-seqs``` indicates the maximum number of concurrent sequences the deployed service can process.
```--reasoning-parser``` specifies the thinking content parser.
```--enable-mm``` indicates whether to enable multi-modal support.
**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)
## 2. Request the Service
After starting the service, the following output indicates successful initialization:
```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check
Verify service status (HTTP 200 indicates success):
```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request
Send requests to the service with the following command:
```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "What era does this artifact belong to?"}
]}
],
"metadata": {"enable_thinking": false}
}'
```
### Python Client (OpenAI-compatible API)
FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
model="null",
messages=[
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
{"type": "text", "text": "What era does this artifact belong to?"},
]},
],
metadata={"enable_thinking": false},
stream=True,
)
for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```