mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2025-12-24 13:28:13 +08:00
[PD Disaggregation] remove splitwise deployment on single node and refine the code (#4891)
* remove splitwise deployment on single node and refine the code
* up
* up
* up
* add test
* up
@@ -29,48 +29,6 @@ In multi-instance scenarios, each incoming request needs to be assigned to diffe

## Usage Instructions

### Single-machine Disaggregated Deployment

#### Online Inference Service

Use the following commands for service deployment:

**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```

**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports should specify the engine-worker-queue-port of the Prefill service
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```
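
Since each instance writes logs under its own `FD_LOG_DIR`, the two roles can be monitored side by side. A minimal sketch (the `*.log` glob is an assumption about the file names FastDeploy writes into these directories):

```bash
tail -f log_prefill/*.log   # Prefill-side logs
tail -f log_decode/*.log    # Decode-side logs
```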

Note: When requesting the single-machine PD disaggregated service, **users should send requests to the Decode service's port**.
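
For example, a minimal request sketch, assuming the server exposes the standard OpenAI-compatible `/v1/chat/completions` route (the model name in the body is illustrative):

```bash
# Port 8184 is the Decode instance; the Prefill port (8180) is not the intended entry point.
curl http://127.0.0.1:8184/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ERNIE-4.5-300B-A47B-BF16",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```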
#### Offline Inference Service

Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/demo` directory for offline inference service deployment.
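
As a rough sketch of what such an offline splitwise setup can look like (the keyword arguments below are assumptions that mirror the CLI flags above; `offline_disaggregated_demo.py` is the authoritative reference):

```python
# Illustrative sketch only -- see fastdeploy/demo/offline_disaggregated_demo.py
# for the real API. Keyword arguments are assumed to mirror the CLI flags.
from fastdeploy import LLM, SamplingParams

prefill = LLM(
    model="ERNIE-4.5-300B-A47B-BF16",
    tensor_parallel_size=4,
    quantization="wint4",
    splitwise_role="prefill",       # assumed kwarg, mirrors --splitwise-role
    engine_worker_queue_port=8182,  # assumed kwarg, mirrors --engine-worker-queue-port
)
decode = LLM(
    model="ERNIE-4.5-300B-A47B-BF16",
    tensor_parallel_size=4,
    quantization="wint4",
    splitwise_role="decode",        # assumed kwarg
    innode_prefill_ports=[8182],    # assumed kwarg, mirrors --innode-prefill-ports
)

# As with the online service, requests go to the Decode side.
outputs = decode.generate(["Hello!"], SamplingParams(max_tokens=32))
for output in outputs:
    print(output)
```
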
### Multi-machine Disaggregated Deployment

#### Prerequisite: Redis

@@ -118,12 +76,14 @@ For multi-machine deployment, confirm that the NIC supports RDMA and that all no

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
```

@@ -143,12 +103,14 @@ python -m fastdeploy.entrypoints.openai.api_server \

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 \
--metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
```

@@ -29,49 +29,6 @@

## Usage Instructions

### Single-machine Disaggregated Deployment

#### Online Inference Service

Use the following commands for service deployment:

**Prefill Instance**

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```

**Decode Instance**

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: innode-prefill-ports should be set to the engine-worker-queue-port of the Prefill service
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```

Note: When requesting the single-machine PD disaggregated service, **users should send requests to the Decode service's port**.

#### Offline Inference Service

Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/demo` directory for offline inference service deployment.

### Multi-machine Disaggregated Deployment

#### Prerequisite: Redis

@@ -120,6 +77,7 @@ sudo systemctl start redis

```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
```

@@ -146,6 +104,7 @@ python -m fastdeploy.entrypoints.openai.api_server \

```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
export ENABLE_V1_KVCACHE_SCHEDULER=0
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
```