Mirror of https://github.com/PaddlePaddle/FastDeploy.git
[Docs] add data parallel (#3883)
* [Docs] add data parallel
docs/features/data_parallel_service.md (new file, +151 lines)
# Data Parallelism

For MoE models, Expert Parallelism (EP) can be enabled together with Data Parallelism (DP): EP distributes the expert workload across devices, while DP enables parallel request processing.

## Data Distribution Strategy

FastDeploy uses the splitwise scheduler to monitor the load status of each DP node and distribute incoming requests accordingly.

The splitwise scheduler relies on Redis to store the DP load status and to dispatch the received requests.
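Since the scheduler depends on Redis, make sure a Redis instance is reachable before launching the service. Below is a minimal sketch, assuming `redis-server` and `redis-cli` are installed locally and the scheduler uses the default address `127.0.0.1:6379` (the disaggregated deployment guide requires Redis 6.2.0 or higher):

```bash
# Start a local Redis instance in the background (skip if one is already running).
redis-server --daemonize yes --port 6379

# Check the installed version (6.2.0 or higher is required).
redis-server --version

# Verify the scheduler will be able to reach it; the expected reply is "PONG".
redis-cli -h 127.0.0.1 -p 6379 ping
```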
### Expert Parallelism + Hybrid Deployment

FastDeploy provides the splitwise scheduler, which monitors the load status of each DP node and schedules incoming requests.

The scheduling flow is shown below: users send a request to a randomly chosen IP and port, the scheduler obtains the load status via Redis, and the request is dispatched to a less-loaded DP node for inference.



#### Offline Inference

```python
# Note: LLM and SamplingParams are assumed to be imported from the fastdeploy package.
from fastdeploy import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "你好,请问今天是星期",
    "请写6个以数字开头的成语",
    "写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
    "我要采访一位科幻作家,创建一个包含5个问题的列表"
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(
    model="ERNIE-4_5-300B-A47B-FP8-Paddle",
    tensor_parallel_size=1,
    data_parallel_size=8,
    max_model_len=8192,
    num_gpu_blocks_override=1024,
    engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
    enable_expert_parallel=True,
    scheduler_name="splitwise",
    scheduler_host="127.0.0.1",
    scheduler_topic="test",
    scheduler_port=6379
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8184 --metrics-port 8185 \
    --engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
    --data-parallel-size 8 --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --scheduler-name "splitwise" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-topic "test" \
    --scheduler-ttl 9000
```
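Once the server is up, you can send a quick test request. The sketch below assumes the service exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint on the port configured above; adjust the payload as needed:

```bash
# Hypothetical smoke test against the OpenAI-compatible endpoint on port 8184.
curl -s http://127.0.0.1:8184/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ERNIE-4_5-300B-A47B-FP8-Paddle",
        "messages": [{"role": "user", "content": "Hello, my name is"}],
        "max_tokens": 64
      }'
```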
### User-Managed Scheduling

FastDeploy provides multi_api_server, which launches multiple API servers so that users can choose which DP node serves each request. In this case, users can add their own load-balancing layer for scheduling; a minimal round-robin sketch follows the launch command below. (Currently this mode only supports online inference.)

#### Online Inference



```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports "1811,1822,1833,1844,1855,1866,1877,1888" \
    --num-servers 8 \
    --metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
    --args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --max-model-len 12288 \
    --max-num-seqs 64 \
    --num-gpu-blocks-override 256 \
    --enable-expert-parallel
```
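As a minimal illustration of user-managed scheduling, the sketch below dispatches requests round-robin across the eight servers launched above. It assumes each server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on its port; in practice you would put a real load balancer (or a policy driven by the per-server metrics ports) in front instead:

```bash
# Hypothetical client-side round-robin over the API servers launched above.
PORTS=(1811 1822 1833 1844 1855 1866 1877 1888)
i=0
for prompt in "Hello, my name is" "请写6个以数字开头的成语"; do
  port=${PORTS[$((i % ${#PORTS[@]}))]}
  i=$((i + 1))
  curl -s "http://127.0.0.1:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"ERNIE-4_5-300B-A47B-FP8-Paddle\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}], \"max_tokens\": 64}"
done
```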
### Parameter Description

- num-servers: Number of API servers to launch
- ports: Ports for the API servers
- args: Arguments passed to each API server

### Data Parallelism + Disaggregated Deployment

Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment).
#### Online Inference

For multi-machine deployment, ensure that the network cards support RDMA and that all nodes in the cluster are interconnected.

**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA network cards of the current machine; separate multiple cards with commas.
* The repository provides an automatic RDMA network card detection script, `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu` (an example run is sketched below).
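The prefill and decode scripts below consume the detection script's output directly via `export $(...)`, so it is expected to print a single `KVCACHE_RDMA_NICS=...` assignment. A hypothetical run might look like this (the NIC names depend on the machine):

```bash
bash scripts/get_rdma_nics.sh gpu
# Hypothetical output, consumed by `export $(...)` in the scripts below:
# KVCACHE_RDMA_NICS=mlx5_2,mlx5_3,mlx5_4,mlx5_5
```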
**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8180 --metrics-port 8181 \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --data-parallel-size 4 \
    --enable-expert-parallel \
    --cache-transfer-protocol "rdma,ipc" \
    --rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
    --pd-comm-port "2334" \
    --splitwise-role "prefill" \
    --scheduler-name "splitwise" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-topic "test" \
    --scheduler-ttl 9000
```
**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8184 --metrics-port 8185 \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --cache-queue-port 8187 \
    --tensor-parallel-size 1 \
    --data-parallel-size 4 \
    --enable-expert-parallel \
    --scheduler-name "splitwise" \
    --cache-transfer-protocol "rdma,ipc" \
    --rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
    --pd-comm-port "2334" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-ttl 9000 \
    --scheduler-topic "test" \
    --splitwise-role "decode"
```
docs/features/disaggregated.md
@@ -72,6 +72,11 @@ Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/dem
 ### Multi-machine Disaggregated Deployment

 #### Prerequisite: Redis

+> **⚠️ NOTE**
+> **Redis requirement: version 6.2.0 or higher**
+> Versions below this may not support the required commands.
+>
 * Installation via `conda`

 ```bash
@@ -103,14 +108,17 @@ sudo systemctl start redis
 For multi-machine deployment, confirm that the NIC supports RDMA and that all nodes in the cluster have network connectivity.

 **Note**:
-* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine, with multiple NICs separated by commas.
+* `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine, multiple cards should be separated by commas.
+* The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where <device> can be `cpu` or `gpu`.

 **Prefill Instance**

 ```bash
 export FD_LOG_DIR="log_prefill"
 export CUDA_VISIBLE_DEVICES=0,1,2,3
-export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
+echo "set RDMA NICS"
+export $(bash scripts/get_rdma_nics.sh gpu)
+echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
 python -m fastdeploy.entrypoints.openai.api_server \
 --model ERNIE-4.5-300B-A47B-BF16 \
 --port 8180 --metrics-port 8181 \
@@ -133,7 +141,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```bash
 export FD_LOG_DIR="log_decode"
 export CUDA_VISIBLE_DEVICES=4,5,6,7
-export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
+echo "set RDMA NICS"
+export $(bash scripts/get_rdma_nics.sh gpu)
+echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
 python -m fastdeploy.entrypoints.openai.api_server \
 --model ERNIE-4.5-300B-A47B-BF16 \
 --port 8184 --metrics-port 8185 \
docs/features/images/no_scheduler_img.png (new binary file, 118 KiB)
docs/features/images/scheduler_img.png (new binary file, 105 KiB)
docs/zh/features/data_parallel_service.md (new file, +166 lines)
# Data Parallelism

For MoE models, Expert Parallelism (EP) is enabled together with Data Parallelism (DP): EP distributes the expert workload, while DP enables parallel request processing.

## Data Distribution Strategy

FastDeploy uses the splitwise scheduler to monitor the load status of each DP node and distribute the received requests.

The splitwise scheduler relies on Redis to store the load status of each DP node and to dispatch the received requests.

### Expert Parallelism + Hybrid Deployment

FastDeploy provides the splitwise scheduler, which monitors the load status of each DP node and schedules the received requests.

The scheduling flow is shown below: users send a request to a randomly chosen IP and port, the scheduler obtains the load status via Redis, and the request is dispatched to a less-loaded DP node for inference.


#### Offline Inference

```python
# Note: LLM and SamplingParams are assumed to be imported from the fastdeploy package.
from fastdeploy import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "你好,请问今天是星期",
    "请写6个以数字开头的成语",
    "写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
    "我要采访一位科幻作家,创建一个包含5个问题的列表"
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(
    model="ERNIE-4_5-300B-A47B-FP8-Paddle",
    tensor_parallel_size=1,
    data_parallel_size=8,
    max_model_len=8192,
    num_gpu_blocks_override=1024,
    engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
    enable_expert_parallel=True,
    scheduler_name="splitwise",
    scheduler_host="127.0.0.1",
    scheduler_topic="test",
    scheduler_port=6379
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8184 --metrics-port 8185 \
    --engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
    --data-parallel-size 8 --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --scheduler-name "splitwise" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-topic "test" \
    --scheduler-ttl 9000
```
### User-Managed Scheduling

FastDeploy provides multi_api_server, which launches multiple API servers; users choose the DP node to send each request to and can add their own load-balancing layer for scheduling. (Currently this mode only supports online inference.)

#### Online Inference



```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports "1811,1822,1833,1844,1855,1866,1877,1888" \
    --num-servers 8 \
    --metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
    --args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --max-model-len 12288 \
    --max-num-seqs 64 \
    --num-gpu-blocks-override 256 \
    --enable-expert-parallel
```
### Parameter Description

- num-servers: Number of API servers to launch
- ports: Ports for the API servers
- args: Arguments passed to each API server

### Data Parallelism + Disaggregated Deployment

For details, refer to [Disaggregated Deployment](disaggregated.md#多机分离式部署).

#### Online Inference

For multi-machine deployment, confirm that the network cards support RDMA and that all nodes in the cluster are interconnected.

**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA network cards of the current machine; separate multiple cards with commas.
* The repository provides an automatic RDMA network card detection script, `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"

python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8180 --metrics-port 8181 \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --data-parallel-size 4 \
    --enable-expert-parallel \
    --cache-transfer-protocol "rdma,ipc" \
    --rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
    --pd-comm-port "2334" \
    --splitwise-role "prefill" \
    --scheduler-name "splitwise" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-topic "test" \
    --scheduler-ttl 9000
```
**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
    --model ERNIE-4_5-300B-A47B-FP8-Paddle \
    --port 8184 --metrics-port 8185 \
    --engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
    --cache-queue-port 8187 \
    --tensor-parallel-size 1 \
    --data-parallel-size 4 \
    --enable-expert-parallel \
    --scheduler-name "splitwise" \
    --cache-transfer-protocol "rdma,ipc" \
    --rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
    --pd-comm-port "2334" \
    --scheduler-host "127.0.0.1" \
    --scheduler-port 6379 \
    --scheduler-ttl 9000 \
    --scheduler-topic "test" \
    --splitwise-role "decode"
```
docs/zh/features/disaggregated.md
@@ -75,6 +75,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
 #### Prerequisite: Redis
 * Install via `conda`

+> **⚠️ Note**
+> **Redis version requirement: 6.2.0 or higher**
+> Versions below this may not support the required commands.

 ```bash
 # Install
 conda install redis
@@ -106,13 +110,17 @@ sudo systemctl start redis

 **Note**:
 * `KVCACHE_RDMA_NICS` specifies the RDMA network cards of the current machine; separate multiple cards with commas.
+* The repository provides an automatic RDMA network card detection script, `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.

 **Prefill Instance**

 ```bash

 export FD_LOG_DIR="log_prefill"
 export CUDA_VISIBLE_DEVICES=0,1,2,3
-export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
+echo "set RDMA NICS"
+export $(bash scripts/get_rdma_nics.sh gpu)
+echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
 python -m fastdeploy.entrypoints.openai.api_server \
 --model ERNIE-4.5-300B-A47B-BF16 \
 --port 8180 --metrics-port 8181 \
@@ -127,6 +135,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --scheduler-name "splitwise" \
 --scheduler-host "127.0.0.1" \
 --scheduler-port 6379 \
+--scheduler-topic "test" \
 --scheduler-ttl 9000
 ```
@@ -135,7 +144,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```bash
 export FD_LOG_DIR="log_decode"
 export CUDA_VISIBLE_DEVICES=4,5,6,7
-export KVCACHE_RDMA_NICS="mlx5_2,mlx5_3,mlx5_4,mlx5_5"
+echo "set RDMA NICS"
+export $(bash scripts/get_rdma_nics.sh gpu)
+echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
 python -m fastdeploy.entrypoints.openai.api_server \
 --model ERNIE-4.5-300B-A47B-BF16 \
 --port 8184 --metrics-port 8185 \
@@ -150,6 +161,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --scheduler-host "127.0.0.1" \
 --scheduler-port 6379 \
 --scheduler-ttl 9000
+--scheduler-topic "test" \
 --splitwise-role "decode"
 ```
@@ -168,5 +180,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
 * --scheduler-host: address of the Redis server to connect to
 * --scheduler-port: port of the Redis server to connect to
 * --scheduler-ttl: TTL of the Redis entries, in seconds
+* --scheduler-topic: Redis topic to use
 * --pd-comm-port: port used for PD (prefill/decode) communication
 * --rdma-comm-ports: ports used for RDMA communication, comma-separated, one per GPU
docs/zh/features/images/no_scheduler_img.png (new binary file, 118 KiB)
docs/zh/features/images/scheduler_img.png (new binary file, 105 KiB)
mkdocs.yml
@@ -83,6 +83,7 @@ plugins:
 Sampling: 采样策略
 MultiNode Deployment: 多机部署
 Graph Optimization: 图优化
+Data Parallelism: 数据并行
 Supported Models: 支持模型列表
 Benchmark: 基准测试
 Usage: 用法
@@ -132,6 +133,7 @@ nav:
 - 'Sampling': features/sampling.md
 - 'MultiNode Deployment': features/multi-node_deployment.md
 - 'Graph Optimization': features/graph_optimization.md
+- 'Data Parallelism': features/data_parallel_service.md
 - 'Supported Models': supported_models.md
 - Benchmark: benchmark.md
 - Usage: