From 92428a5ae4682867593f22013918f40e14559d14 Mon Sep 17 00:00:00 2001
From: hong19860320 <9973393+hong19860320@users.noreply.github.com>
Date: Tue, 1 Jul 2025 12:28:49 +0800
Subject: [PATCH] Update kunlunxin_xpu.md (#2657)

---
 .../get_started/installation/kunlunxin_xpu.md | 73 ++++++++----------
 .../get_started/installation/kunlunxin_xpu.md | 75 ++++++++-----------
 2 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/docs/get_started/installation/kunlunxin_xpu.md b/docs/get_started/installation/kunlunxin_xpu.md
index 9c7606714..e425d5ad7 100644
--- a/docs/get_started/installation/kunlunxin_xpu.md
+++ b/docs/get_started/installation/kunlunxin_xpu.md
@@ -23,7 +23,13 @@ Verified platform:
 ## 1. Set up using Docker (Recommended)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. Set up using pre-built wheels
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**Do NOT install via PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (Not recommended)
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. Build wheel from source
@@ -99,47 +105,21 @@
 The compiled outputs will be located in the ```FastDeploy/dist``` directory.
 
 ## Installation verification
 
-```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+```bash
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 If all the above steps execute successfully, FastDeploy is installed correctly.
 
 ## Quick start
 
-Currently, P800 has only validated deployment of the following models:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card)
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)
-
-### Offline inference
-
-After installing FastDeploy, you can perform offline text generation with user-provided prompts using the following code,
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-sampling_params = SamplingParams(top_p=0.95)
-
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-outputs = llm.generate(prompts, sampling_params)
-
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-Refer to [Parameters](../../parameters.md) for more configuration options.
+The P800 supports deploying the `ERNIE-4.5-300B-A47B-Paddle` model in the following configurations (note: performance may vary between configurations).
+- 32K WINT4 with 8 XPUs (Recommended)
+- 128K WINT4 with 8 XPUs
+- 32K WINT4 with 4 XPUs
 
 ### Online serving (OpenAI API-Compatible server)
@@ -147,7 +127,7 @@ Deploy an OpenAI API-compatible server using FastDeploy with the following comma
 #### Start service
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4 (8-card) (Recommended)**
+**Deploy ERNIE-4.5-300B-A47B-Paddle with 32K WINT4 on 8 XPUs (Recommended)**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
     --tensor-parallel-size 8 \
     --max-model-len 32768 \
     --max-num-seqs 64 \
     --quantization "wint4" \
     --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4 (8-card)**
+**Deploy ERNIE-4.5-300B-A47B-Paddle with 128K WINT4 on 8 XPUs**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
     --tensor-parallel-size 8 \
     --max-model-len 131072 \
     --max-num-seqs 64 \
     --quantization "wint4" \
     --gpu-memory-utilization 0.9
 ```
+**Deploy ERNIE-4.5-300B-A47B-Paddle with 32K WINT4 on 4 XPUs**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8188 \
+    --tensor-parallel-size 4 \
+    --max-model-len 32768 \
+    --max-num-seqs 64 \
+    --quantization "wint4" \
+    --gpu-memory-utilization 0.9
+```
+
 Refer to [Parameters](../../parameters.md) for more options.
 
 #### Send requests
@@ -207,7 +201,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,
diff --git a/docs/zh/get_started/installation/kunlunxin_xpu.md b/docs/zh/get_started/installation/kunlunxin_xpu.md
index 5dcd43df1..479077797 100644
--- a/docs/zh/get_started/installation/kunlunxin_xpu.md
+++ b/docs/zh/get_started/installation/kunlunxin_xpu.md
@@ -23,7 +23,13 @@
 ## 1. 使用 Docker 安装(推荐)
 
 ```bash
+mkdir Work
+cd Work
 docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
+docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
+    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
+    /bin/bash
+docker exec -it fastdeploy-xpu /bin/bash
 ```
 
 ## 2. 使用 Pip 安装
@@ -43,13 +49,13 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### 安装 FastDeploy（**注意不要通过 pypi 源安装**）
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 或者你也可以安装最新版 FastDeploy（不推荐）
 
 ```bash
-python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu-p800/
+python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 ## 3. 从源码编译安装
@@ -101,50 +107,20 @@ bash build.sh
 ## 验证是否安装成功
 
-```python
-import paddle
-from paddle.jit.marker import unified
-paddle.utils.run_check()
-from fastdeploy.model_executor.ops.xpu import block_attn
+```bash
+python -c "import paddle; paddle.version.show()"
+python -c "import paddle; paddle.utils.run_check()"
+python -c "from paddle.jit.marker import unified"
+python -c "from fastdeploy.model_executor.ops.xpu import block_attn"
 ```
 
 如果上述步骤均执行成功，代表 FastDeploy 已安装成功。
 
 ## 快速开始
 
-目前 P800 暂时仅验证了以下模型的部署:
-- ERNIE-4.5-300B-A47B-Paddle 32K WINT4（8卡）
-- ERNIE-4.5-300B-A47B-Paddle 128K WINT4（8卡）
-
-### 离线推理
-
-安装 FastDeploy 后，您可以通过如下代码，基于用户给定的输入完成离线推理生成文本。
-
-```python
-from fastdeploy import LLM, SamplingParams
-
-prompts = [
-    "Where is the capital of China?",
-]
-
-# 采样参数
-sampling_params = SamplingParams(top_p=0.95)
-
-# 加载模型
-llm = LLM(model="baidu/ERNIE-4.5-300B-A47B-Paddle", tensor_parallel_size=8, max_model_len=8192, quantization='wint4')
-
-# 批量进行推理（llm内部基于资源情况进行请求排队、动态插入处理）
-outputs = llm.generate(prompts, sampling_params)
-
-# 输出结果
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs.text
-
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-```
-
-更多参数可以参考文档 [参数说明](../../parameters.md)。
+P800 支持 `ERNIE-4.5-300B-A47B-Paddle` 模型采用以下配置部署（注意：不同配置在效果、性能上可能存在差异）。
+- 32K WINT4 8 卡（推荐）
+- 128K WINT4 8 卡
+- 32K WINT4 4 卡
 
 ### OpenAI 兼容服务器
@@ -152,7 +128,7 @@
 #### 启动服务
 
-**ERNIE-4.5-300B-A47B-Paddle 32K WINT4（8卡）（推荐）**
+**ERNIE-4.5-300B-A47B-Paddle 模型采用 32K WINT4 8 卡配置部署（推荐）**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
     --tensor-parallel-size 8 \
     --max-model-len 32768 \
     --max-num-seqs 64 \
     --quantization "wint4" \
     --gpu-memory-utilization 0.9
 ```
 
-**ERNIE-4.5-300B-A47B-Paddle 128K WINT4（8卡）**
+**ERNIE-4.5-300B-A47B-Paddle 模型采用 128K WINT4 8 卡配置部署**
 
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-300B-A47B-Paddle \
     --port 8188 \
     --tensor-parallel-size 8 \
     --max-model-len 131072 \
     --max-num-seqs 64 \
     --quantization "wint4" \
     --gpu-memory-utilization 0.9
 ```
 
+**ERNIE-4.5-300B-A47B-Paddle 模型采用 32K WINT4 4 卡配置部署**
+
+```bash
+export XPU_VISIBLE_DEVICES="0,1,2,3"
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
+    --port 8188 \
+    --tensor-parallel-size 4 \
+    --max-model-len 32768 \
+    --max-num-seqs 64 \
+    --quantization "wint4" \
+    --gpu-memory-utilization 0.9
+```
+
 更多参数可以参考 [参数说明](../../parameters.md)。
 
 #### 请求服务
@@ -212,7 +202,6 @@ print('\n')
 response = client.chat.completions.create(
     model="null",
     messages=[
-        {"role": "system", "content": "I'm a helpful AI assistant."},
         {"role": "user", "content": "Where is the capital of China?"},
     ],
     stream=True,
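
---

Reviewer note: for smoke-testing a server started with the commands this patch adds, the chat request body can be assembled without any SDK. The sketch below is illustrative and not part of the patch: `build_chat_payload` is a hypothetical helper, and the `"null"` model name simply mirrors the client snippets in the docs above.

```python
import json

def build_chat_payload(prompt: str, stream: bool = False) -> dict:
    # Minimal OpenAI-style chat.completions request body, matching the
    # fields used by the client examples in these docs: a placeholder
    # model name ("null"), a single user message, and a stream flag.
    return {
        "model": "null",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_payload("Where is the capital of China?")
print(json.dumps(payload, ensure_ascii=False))
```

Once the server is up, this JSON can be POSTed to `http://localhost:8188/v1/chat/completions` (port 8188 as in the deployment commands above) with any HTTP client.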