Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2025-12-24 13:28:13 +08:00

Commit: Sync v2.0 version of code to github repo

**docs/zh/get_started/installation/Enflame_gcu.md** (new file, 128 lines)
# Running the ERNIE-4.5-21B-A3B Model on the Enflame S60 with FastDeploy

The Enflame S60 ([about Enflame](https://www.enflame-tech.com/)) is a new-generation AI inference accelerator card designed for large-scale data-center deployment. It serves large language models, search/ads/recommendation, and traditional models, and is characterized by broad model coverage, ease of use, and easy migration and deployment. It applies to mainstream inference scenarios such as image and text generation, search and recommendation, and text, image, and speech recognition.

FastDeploy has been deeply adapted and optimized for the ernie-4_5-21b-a3b-bf16-paddle model on the Enflame S60, unifying the GCU inference entry point with the GPU one, so inference workloads can be migrated without code changes.

## 🚀 Quick Start 🚀

### 0. Machine preparation

Before getting started, you need a machine with Enflame S60 accelerator cards installed, meeting the following requirements:

| Chip type | Driver version | TopsRider version |
| :---: | :---: | :---: |
| Enflame S60 | 1.5.0.5 | 3.4.623 |

**Note: to verify that your machine has Enflame S60 cards installed, run the following command on the host and check that it produces output:**
```bash
lspci | grep S60

# Example: lspci | grep S60 produces output such as:
08:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
09:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
0e:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
11:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
32:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
38:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3b:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
3c:00.0 Processing accelerators: Shanghai Enflame Technology Co. Ltd S60 [Enflame] (rev 01)
```
### 1. Environment setup (takes about 5–10 minutes)

1. Pull the image

```bash
# Note: this image is only a Paddle development environment; it does not contain a precompiled PaddlePaddle package
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84
```

2. Start the container, for example:

```bash
docker run --name paddle-gcu-llm -v /home:/home -v /work:/work --network=host --ipc=host -it --privileged ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.4.623-ubuntu20-x86_64-gcc84 /bin/bash
```

3. Obtain the driver<br/>
**The full software package ships inside the container; copy it to a directory outside the container, e.g. ```/home/workspace/deps/```**

```bash
mkdir -p /home/workspace/deps/ && cp /root/TopsRider_i3x_*/TopsRider_i3x_*_deb_amd64.run /home/workspace/deps/
```

4. Install the driver<br/>
**This step must be performed in the host environment**

```bash
cd /home/workspace/deps/
bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
```

After the driver is installed, **re-enter the container**, for example:

```bash
docker start paddle-gcu-llm
docker exec -it paddle-gcu-llm bash
```

5. Install PaddlePaddle<br/>

```bash
# The PaddlePaddle deep learning framework provides the basic compute capabilities
python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```

6. Install PaddleCustomDevice<br/>

```bash
# PaddleCustomDevice is the custom-hardware backend of the PaddlePaddle framework; it provides the GCU kernel implementations
python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
# To build and install from source instead, see https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
```

7. Install FastDeploy and its dependencies<br/>

```bash
python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
apt install python3.10-distutils
```
### 2. Data preparation (takes about 2–5 minutes)

Use the trained model to run inference on GSM8K:

```bash
mkdir -p /home/workspace/benchmark/ && cd /home/workspace/benchmark/
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```

Prepare the model and weights and place them in a local directory, e.g. ```/work/models/ernie-4_5-21b-a3b-bf16-paddle/```
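As a quick sanity check on the downloaded data, a few lines of Python can load `test.jsonl` and count the questions (the file layout assumed here is the standard GSM8K JSON-lines format, one `{"question": ..., "answer": ...}` object per line):

```python
import json

def load_gsm8k(path):
    """Read a GSM8K-style .jsonl file into a list of dicts with 'question'/'answer' keys."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples

# Example (path as downloaded above):
# data = load_gsm8k("/home/workspace/benchmark/test.jsonl")
# print(len(data), data[0]["question"][:60])
```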
### 3. Inference (takes about 2–5 minutes)

Start the inference service with the following command:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model "/work/models/ernie-4_5-21b-a3b-bf16-paddle/" \
--port 8188 \
--metrics-port 8200 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--num-gpu-blocks-override 1024
```

Send a request to the model service with:

```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "The largest ocean is"}
  ]
}'
```

On success, you will see the generated inference result, for example:

```json
{"id":"chatcmpl-5cd96f3b-eff3-4dc0-8aa2-8b5d7b7b86f2","object":"chat.completion","created":1751167862,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"3. **Pacific Ocean**: The Pacific Ocean is the largest and deepest of the world's oceans. It covers an area of approximately 181,344,000 square kilometers, which is more than 30% of the Earth's surface. It is located between the Americas to the west and east, and Asia and Australia to the north and south. The Pacific Ocean is known for its vastness, diverse marine life, and numerous islands.\n\nIn summary, the largest ocean in the world is the Pacific Ocean.","reasoning_content":null,"tool_calls":null},"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":127,"completion_tokens":116,"prompt_tokens_details":{"cached_tokens":0}}}
```
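The same request can be issued from Python with only the standard library; this is a minimal sketch (host, port, and payload shape as in the curl command above; `build_chat_request` and `chat` are illustrative helper names, not part of FastDeploy):

```python
import json
import urllib.request

def build_chat_request(content, role="user"):
    """Assemble the JSON body expected by the /v1/chat/completions endpoint."""
    return {"messages": [{"role": role, "content": content}]}

def chat(content, host="0.0.0.0", port=8188):
    """POST a single chat request to the running api_server and return the parsed response."""
    body = json.dumps(build_chat_request(content)).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the service started as above:
# result = chat("The largest ocean is")
# print(result["choices"][0]["message"]["content"])
```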
### 4. Accuracy test (takes about 60–180 minutes)

Place the accuracy script ```bench_gsm8k.py``` in ```/home/workspace/benchmark/``` and modify the sampling parameters, for example:

```python
data = {
    "messages": [
        {
            "role": "user",
            "content": prompt,
        }
    ],
    "temperature": 0.6,
    "max_tokens": 2047,
    "top_p": 0.95,
    "do_sample": True,
}
```

Start the accuracy test with the following commands:

```bash
cd /home/workspace/benchmark/
python -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 2
```

After a successful run, the accuracy results appear in ```result.jsonl``` in the current directory, for example (partial dataset, for illustration only):

```json
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 365.548, "accuracy": 0.967, "num_requests": 30, "other": {"num_questions": 30, "parallel": 2}}
```
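Each line of ```result.jsonl``` is a JSON record like the sample above. If you accumulate several runs in one file, a small helper can aggregate them into a single accuracy figure (this sketch assumes only the `accuracy` and `num_requests` fields shown above, and weights each run by its request count):

```python
import json

def summarize(lines):
    """Average the 'accuracy' field over result.jsonl lines, weighted by 'num_requests'."""
    total_requests = 0
    total_correct = 0.0
    for line in lines:
        rec = json.loads(line)
        n = rec["num_requests"]
        total_requests += n
        total_correct += rec["accuracy"] * n
    return total_correct / total_requests if total_requests else 0.0

# Example:
# with open("/home/workspace/benchmark/result.jsonl") as f:
#     print(f"overall accuracy: {summarize(f):.3f}")
```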

**docs/zh/get_started/installation/README.md** (new file, 8 lines)

# FastDeploy Installation Guide

FastDeploy currently supports installation on the following hardware platforms:

- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Kunlunxin XPU Installation](kunlunxin_xpu.md)
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)

**docs/zh/get_started/installation/iluvatar_gpu.md** (new file, 102 lines)

# Running ERNIE-4.5-300B-A47B-BF16 on Iluvatar Machines

The current software version is only a demo of large-model inference with Iluvatar chips plus FastDeploy. Running the latest ERNIE 4.5 model may still have issues; fixes and performance optimization will follow to provide customers with a more stable version.

## Machine preparation

First you need a machine with the following configuration:

| CPU | Memory | Iluvatar GPUs | Disk |
|-----|------|-----|-----|
| x86 | 1TB | 8x BI150 | 1TB |

Currently the complete model must be loaded into host memory, which requires more than 600GB of host memory; this will be optimized in a later version.

## Image

Pull the image from the official registry:
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
```
## Container preparation

1. Start the container

```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
docker exec -it paddle_infer bash
```

/home/paddle is the directory holding the model files, whl packages, and scripts.

2. Install the whl packages

```bash
pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
pip3 install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip3 install aistudio-sdk==0.2.6
```

## Preparing the inference demo scripts

The demo scripts live in /home/paddle/scripts, with the following contents.

`run_demo.sh`:
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export USE_WORKER_V1=1
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_DEBUG=1
python3 run_demo.py
```

`run_demo.py`:
```python
from fastdeploy import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]

# Sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# Load the model; replace `/home/paddle/ernie-4_5-300b-a47b-bf16-paddle` with the path
# to the ERNIE model you downloaded.
llm = LLM(model="/home/paddle/ernie-4_5-300b-a47b-bf16-paddle", tensor_parallel_size=16, max_model_len=8192)

# Batch inference (the LLM queues requests internally and inserts them dynamically based on available resources)
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print(prompt, generated_text)
```
## Running the demo

Execute:

```bash
./run_demo.sh
```

It prints logs like the following; loading the model takes about 470s and the demo run about 90s.

```
/usr/local/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:715: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
/usr/local/lib/python3.10/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
[2025-06-27 16:35:10,856] [    INFO] - Loading configuration file /home/paddle/ernie-45t/generation_config.json
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:250: UserWarning: using greedy search strategy. However, `temperature` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/paddlenlp/generation/configuration_utils.py:255: UserWarning: using greedy search strategy. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `decode_strategy="greedy_search" ` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
INFO 2025-06-27 16:35:12,205 2717757 engine.py[line:134] Waitting worker processes ready...
Loading Weights: 100%|██████████| 100/100 [00:05<00:00, 18.13it/s]
Loading Layers: 100%|██████████| 100/100 [00:00<00:00, 199.50it/s]
[2025-06-27 16:35:24,030] [ WARNING] - import EventHandle and deep_ep Failed!
[2025-06-27 16:35:24,032] [ WARNING] - import EventHandle and deep_ep Failed!
INFO 2025-06-27 16:43:02,392 2717757 engine.py[line:700] Stop profile, num_gpu_blocks: 1820
INFO 2025-06-27 16:43:02,393 2717757 engine.py[line:175] Worker processes are launched with 471.5467264652252 seconds.
Processed prompts: 100%|██████████| 1/1 [01:29<00:00, 89.98s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Hello, my name is Hello! It's nice to meet you. I'm here to help with questions, have conversations, or assist with whatever you need. What would you like to talk about today? 😊
```

**docs/zh/get_started/installation/kunlunxin_xpu.md** (new file, 226 lines)

# Kunlunxin XPU

## Requirements

- OS: Linux
- Python: 3.10
- XPU model: P800
- XPU driver version: ≥ 5.0.21.10
- XPU firmware version: ≥ 1.31

Verified platform:

- CPU: INTEL(R) XEON(R) PLATINUM 8563C
- Memory: 2T
- Disk: 4T
- OS: CentOS release 7.6 (Final)
- Python: 3.10
- XPU model: P800 (OAM edition)
- XPU driver version: 5.0.21.10
- XPU firmware version: 1.31

**Note:** So far only OAM-edition P800 servers with INTEL or Hygon CPUs have been verified; other CPUs and PCIe-edition P800 servers have not been verified yet.
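The driver and firmware minimums above are plain dotted version strings, so they can be checked with a simple tuple comparison (the version values passed in below are placeholders; substitute whatever your platform tooling reports):

```python
def version_tuple(v):
    """Turn a dotted version string like '5.0.21.10' into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def meets_requirements(driver, firmware,
                       min_driver="5.0.21.10", min_firmware="1.31"):
    """Check the driver/firmware minimums listed in the Requirements section."""
    return (version_tuple(driver) >= version_tuple(min_driver)
            and version_tuple(firmware) >= version_tuple(min_firmware))

print(meets_requirements("5.0.21.10", "1.31"))  # the verified platform above
print(meets_requirements("5.0.21.9", "1.31"))   # one patch release too old
```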
## 1. Installing with Docker (recommended)
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
```

## 2. Installing with pip

### Install PaddlePaddle

```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```

Alternatively, you can install the latest nightly build of PaddlePaddle (not recommended):

```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
### Install FastDeploy (**do not install from the PyPI source**)

```bash
python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```

Alternatively, you can install the latest nightly build of FastDeploy (not recommended):

```bash
python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```
## 3. Building and installing from source

### Install PaddlePaddle

```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```

Alternatively, you can install the latest nightly build of PaddlePaddle (not recommended):

```bash
python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
```

### Download the Kunlunxin XTDK toolchain and the XVLLM precompiled kernel library, and set their paths
```bash
# XTDK
wget https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/3.2.40.1/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
tar -xvf xtdk-llvm15-ubuntu2004_x86_64.tar.gz && mv xtdk-llvm15-ubuntu2004_x86_64 xtdk
export CLANG_PATH=$(pwd)/xtdk

# XVLLM
wget https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/20250624/output.tar.gz
tar -xvf output.tar.gz && mv output xvllm
export XVLLM_PATH=$(pwd)/xvllm
```

Alternatively, you can download the latest versions of XTDK and XVLLM (not recommended):

```bash
XTDK: https://klx-sdk-release-public.su.bcebos.com/xtdk_15fusion/dev/latest/xtdk-llvm15-ubuntu2004_x86_64.tar.gz
XVLLM: https://klx-sdk-release-public.su.bcebos.com/xinfer/daily/eb/latest/output.tar.gz
```
### Download the FastDeploy source, switch to a stable branch or tag, then build and install:

```bash
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
git checkout <tag or branch>
bash build.sh
```

The build artifacts are placed in the ```FastDeploy/dist``` directory.
## Verifying the installation

```python
import paddle
from paddle.jit.marker import unified
paddle.utils.run_check()
from fastdeploy.model_executor.ops.xpu import block_attn
```

If all the steps above succeed, FastDeploy has been installed successfully.

## Quick start
Deployment on the P800 has so far been verified only for the following models:

- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards)
- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)

### Offline inference

With FastDeploy installed, you can run offline text generation on user-provided input with the following code:
```python
from fastdeploy import LLM, SamplingParams

prompts = [
    "Where is the capital of China?",
]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95)

# Load the model
llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')

# Batch inference (the LLM queues requests internally and inserts them dynamically based on available resources)
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text

    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
```

For more parameters, see the [parameter documentation](../../parameters.md).
### OpenAI-compatible server

You can also use FastDeploy to deploy a server compatible with the OpenAI API protocol, as follows.

#### Starting the service

**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8 cards) (recommended)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```

**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8 cards)**
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9
```
For more parameters, see the [parameter documentation](../../parameters.md).

#### Sending requests

Based on the OpenAI protocol, you can query the service either with curl or from Python.
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Where is the capital of China?"}
  ]
}'
```
```python
import openai

host = "0.0.0.0"
port = "8188"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.completions.create(
    model="null",
    prompt="Where is the capital of China?",
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].text, end='')
print('\n')

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Where is the capital of China?"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```

For more on the OpenAI protocol, see the [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create); the differences from the OpenAI protocol are described in the [serving deployment guide](../../serving/README.md).

**docs/zh/get_started/installation/nvidia_gpu.md** (new file, 87 lines)

# NVIDIA CUDA GPU Installation

Provided your environment meets the following requirements:

- GPU driver >= 535
- CUDA >= 12.3
- cuDNN >= 9.5
- Python >= 3.10
- Linux x86_64

FastDeploy can be installed in any of the following four ways.

## 1. Prebuilt Docker image (recommended)
``` shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
```

## 2. Prebuilt pip packages

First install paddlepaddle-gpu; see the [PaddlePaddle installation guide](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html) for details.
``` shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then install fastdeploy. **Do not install it from the PyPI source**; install it as follows instead.

If your GPU is of SM80/90 architecture (A100/H100, etc.), install like this:
``` shell
# Install the stable fastdeploy release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install the latest nightly build of fastdeploy
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

If your GPU is of SM86/89 architecture (4090/L20/L40, etc.), install like this:
``` shell
# Install the stable fastdeploy release
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Install the latest nightly build of fastdeploy
python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/fastdeploy-gpu-86_89/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
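The choice between the two package indexes above depends only on the GPU's SM (compute capability) number. A small helper makes the mapping explicit (the index URLs are copied from the commands above; `fastdeploy_index` is an illustrative name, and SM numbers outside the two documented groups deliberately raise an error rather than guess):

```python
# Stable package index template, as used in the pip commands above.
STABLE_INDEX = "https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-{}/"

def fastdeploy_index(sm):
    """Return the stable fastdeploy-gpu index URL for an SM number, e.g. 90 for H100."""
    if sm in (80, 90):      # A100 / H100
        return STABLE_INDEX.format("80_90")
    if sm in (86, 89):      # 4090 / L20 / L40
        return STABLE_INDEX.format("86_89")
    raise ValueError(f"SM{sm} has no documented fastdeploy-gpu wheel index")

print(fastdeploy_index(90))
```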
## 3. Building your own image

> Note: ```dockerfiles/Dockerfile.gpu``` builds for SM 80/90 by default. To target other architectures, edit the ```bash build.sh 1 python false [80,90]``` line in the Dockerfile; no more than 2 architectures are recommended.
``` shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy

docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
```

## 4. Building the wheel from source

First install paddlepaddle-gpu; see the [PaddlePaddle installation guide](https://www.paddlepaddle.org.cn/) for details.
``` shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then clone the source code and build and install:
``` shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy

# 1st argument: whether to build a wheel package; 1 builds the wheel, 0 only compiles
# 2nd argument: path to the Python interpreter
# 3rd argument: whether to compile the CPU inference kernels
# 4th argument: GPU architectures to compile for
bash build.sh 1 python false [80,90]
```

The build artifacts are placed in the ```FastDeploy/dist``` directory.

## Environment check

After installing FastDeploy, check that the environment works with the following Python code:
``` python
import paddle
from paddle.jit.marker import unified
# Check GPU availability
paddle.utils.run_check()
# Check that the FastDeploy custom ops compiled successfully
from fastdeploy.model_executor.ops.gpu import beam_search_softmax
```

If the code above runs successfully, the environment is usable.