[Docs] add data parallel (#3883)

* [Docs] add data parallel

* [Docs] add data parallel
Author: ltd0924
Date: 2025-09-04 20:33:50 +08:00 (committed by GitHub)
Parent: e0e7d68435
Commit: 7643e6e6b2
9 changed files with 347 additions and 5 deletions

View File

@@ -0,0 +1,151 @@
# Data Parallelism
For MoE models, Expert Parallelism (EP) can be enabled together with Data Parallelism (DP): EP spreads the expert workload across devices, while DP processes multiple requests in parallel.
## Data Distribution Strategy
FastDeploy uses the splitwise scheduler to monitor the load of each DP rank and distribute incoming requests accordingly. The scheduler relies on Redis to store the load status of every DP rank.
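A minimal sketch of bringing up the Redis instance the scheduler depends on (the disaggregated deployment guide in this repo requires Redis 6.2.0 or higher; `conda` is just one installation option):
```bash
conda install -y redis                            # or any other installation method
nohup redis-server --port 6379 > redis.log 2>&1 &
redis-cli -p 6379 ping                            # should answer PONG before starting FastDeploy
```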
### Expert Parallelism + Hybrid Deployment
In this mode the splitwise scheduler handles the distribution automatically. The scheduling flow is shown below: a request can be sent to any instance's IP and port, the load status is looked up in Redis, and the request is dispatched to a lightly loaded DP rank for inference.
![Scheduling Architecture](./images/scheduler_img.png)
#### Offline Inference
```python
from fastdeploy import LLM, SamplingParams  # offline inference entry points

prompts = [
    "Hello, my name is",
    "你好,请问今天是星期",
    "请写6个以数字开头的成语",
    "写一个300字的小说大纲内容是李白穿越到现代最后成为公司文职人员的故事",
    "我要采访一位科幻作家创建一个包含5个问题的列表"
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(
    model="ERNIE-4_5-300B-A47B-FP8-Paddle",
    tensor_parallel_size=1,
    data_parallel_size=8,
    max_model_len=8192,
    num_gpu_blocks_override=1024,
    engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
    enable_expert_parallel=True,
    scheduler_name="splitwise",
    scheduler_host="127.0.0.1",
    scheduler_topic="test",
    scheduler_port=6379,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
--data-parallel-size 8 --tensor-parallel-size 1 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
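Once the service is up, requests go through the standard OpenAI-compatible interface. A minimal sketch against the port configured above (the request body follows the OpenAI chat completions schema; some deployments may also expect a `model` field):
```bash
curl -s http://127.0.0.1:8184/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello, my name is"}],
        "max_tokens": 128
      }'
```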
### User-Managed Scheduling
FastDeploy also provides multi_api_server, which launches multiple API servers so that users can choose which DP rank serves each request. In this mode you can put your own load-balancing layer in front of the servers. (Currently this mode only supports online inference.)
#### Online Inference
![Scheduling Architecture](./images/no_scheduler_img.png)
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--num-servers 8 \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel
```
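Because no scheduler is involved here, the caller decides which API server (and therefore which DP rank) handles each request. A minimal round-robin sketch over the ports launched above, assuming the standard OpenAI-compatible chat completions route; a production setup would typically place a reverse proxy or load balancer in front instead:
```bash
PORTS=(1811 1822 1833 1844 1855 1866 1877 1888)   # one API server per DP rank
i=0
for prompt in "Hello, my name is" "请写6个以数字开头的成语"; do
  port=${PORTS[$((i % ${#PORTS[@]}))]}            # pick the next server in turn
  curl -s "http://127.0.0.1:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}], \"max_tokens\": 64}"
  i=$((i + 1))
done
```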
### Parameter Description
- `num-servers`: number of API servers to launch
- `ports`: listening ports for the API servers (comma-separated, one per server)
- `args`: arguments forwarded to every API server
### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment)
#### Online Inference
For multi-machine deployment, make sure the NICs support RDMA and that all nodes in the cluster can reach each other.
**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
* The repository provides a script that detects RDMA NICs automatically: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu` (see the quick checks below).
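Before launching, it can help to confirm that the machine actually exposes RDMA devices and to see what the detection script would export. A quick sketch, assuming `rdma-core` (which provides `ibv_devices`) is installed:
```bash
ibv_devices                         # list RDMA-capable devices on this machine
bash scripts/get_rdma_nics.sh gpu   # prints the environment assignment (KVCACHE_RDMA_NICS=...) exported below
```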
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8183 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
**Decode Instance**
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8187 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```
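After both instances are up, a quick sanity check can save debugging time. A minimal sketch, assuming the api_server exposes a `/health` route on its service port and Prometheus-style metrics on the metrics port:
```bash
curl -s http://127.0.0.1:8180/health           # prefill instance
curl -s http://127.0.0.1:8184/health           # decode instance
curl -s http://127.0.0.1:8181/metrics | head   # prefill metrics endpoint
```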

View File

@@ -72,6 +72,11 @@ Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/dem
### Multi-machine Disaggregated Deployment
#### Prerequisite: Redis
> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> Versions below this may not support the required commands.
>
* Installation via `conda`
```bash
@@ -103,14 +108,17 @@ sudo systemctl start redis
For multi-machine deployment, confirm that the NICs support RDMA and that all nodes in the cluster have network connectivity.
**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
* The repository provides a script that detects RDMA NICs automatically: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
@@ -133,7 +141,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \

New image file added (118 KiB)

New image file added (105 KiB)

View File

@@ -0,0 +1,166 @@
# Data Parallelism
For MoE models, Expert Parallelism (EP) can be enabled together with Data Parallelism (DP): EP spreads the expert workload across devices, while DP processes multiple requests in parallel.
## Data Distribution Strategy
FastDeploy uses the splitwise scheduler to monitor the load of each DP rank and distribute incoming requests accordingly. The scheduler relies on Redis to store the load status of every DP rank.
### Expert Parallelism + Hybrid Deployment
In this mode the splitwise scheduler handles the distribution automatically. The scheduling flow is shown below: a request can be sent to any instance's IP and port, the load status is looked up in Redis, and the request is dispatched to a lightly loaded DP rank for inference.
![Scheduling Architecture](./images/scheduler_img.png)
#### Offline Inference
```python
from fastdeploy import LLM, SamplingParams  # offline inference entry points

prompts = [
    "Hello, my name is",
    "你好,请问今天是星期",
    "请写6个以数字开头的成语",
    "写一个300字的小说大纲内容是李白穿越到现代最后成为公司文职人员的故事",
    "我要采访一位科幻作家创建一个包含5个问题的列表"
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(
    model="ERNIE-4_5-300B-A47B-FP8-Paddle",
    tensor_parallel_size=1,
    data_parallel_size=8,
    max_model_len=8192,
    num_gpu_blocks_override=1024,
    engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
    enable_expert_parallel=True,
    scheduler_name="splitwise",
    scheduler_host="127.0.0.1",
    scheduler_topic="test",
    scheduler_port=6379,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
--data-parallel-size 8 --tensor-parallel-size 1 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
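Once the service is up, requests go through the standard OpenAI-compatible interface; a minimal streaming sketch against the port configured above (field names follow the OpenAI chat completions schema):
```bash
curl -s http://127.0.0.1:8184/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "请写6个以数字开头的成语"}],
        "stream": true,
        "max_tokens": 128
      }'
```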
### User-Managed Scheduling
FastDeploy also provides multi_api_server, which launches multiple API servers so that users can choose which DP rank serves each request. In this mode you can put your own load-balancing layer in front of the servers. (Currently this mode only supports online inference.)
#### Online Inference
![Scheduling Architecture](./images/no_scheduler_img.png)
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--num-servers 8 \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel
```
### Parameter Description
- `num-servers`: number of API servers to launch
- `ports`: listening ports for the API servers (comma-separated, one per server)
- `args`: arguments forwarded to every API server
### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#多机分离式部署)
#### Online Inference
For multi-machine deployment, make sure the NICs support RDMA and that all nodes in the cluster can reach each other.
**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
* The repository provides a script that detects RDMA NICs automatically: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8183 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
**Decode Instance**
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8187 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```

View File

@@ -75,6 +75,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### Prerequisite: Redis
* Installation via `conda`
> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> Versions below this may not support the required commands.
```bash
# Install
conda install redis
@@ -106,13 +110,17 @@ sudo systemctl start redis
**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
* The repository provides a script that detects RDMA NICs automatically: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
@@ -127,6 +135,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
@@ -135,7 +144,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
@@ -150,6 +161,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```
@@ -168,5 +180,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
* --scheduler-host: Redis address to connect to
* --scheduler-port: Redis port to connect to
* --scheduler-ttl: TTL for Redis entries, in seconds
* --scheduler-topic: Redis topic to use
* --pd-comm-port: port used for prefill/decode (PD) communication
* --rdma-comm-ports: ports used for RDMA communication, comma-separated; the count must equal the number of GPUs

New image file added (118 KiB)

New image file added (105 KiB)

View File

@@ -83,6 +83,7 @@ plugins:
Sampling: 采样策略
MultiNode Deployment: 多机部署
Graph Optimization: 图优化
Data Parallelism: 数据并行
Supported Models: 支持模型列表
Benchmark: 基准测试
Usage: 用法
@@ -132,6 +133,7 @@ nav:
- 'Sampling': features/sampling.md
- 'MultiNode Deployment': features/multi-node_deployment.md
- 'Graph Optimization': features/graph_optimization.md
- 'Data Parallelism': features/data_parallel_service.md
- 'Supported Models': supported_models.md
- Benchmark: benchmark.md
- Usage: