# Multi-Node Deployment

## Overview
Multi-node deployment addresses scenarios where a single machine's GPU memory is insufficient to deploy a large model, by enabling tensor parallelism across multiple machines.
## Environment Preparation

### Network Requirements
- All nodes must be within the same local network
- Ensure bidirectional connectivity between all nodes (test with `ping` and `nc -zv`, as sketched below)
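A quick connectivity check might look like the following (a minimal sketch; 8182 is the engine worker queue port from the deployment example below, and `nc -zv` assumes a netcat build that supports those flags):

```bash
# Run from each node against every other node.
ping -c 3 192.168.1.102        # basic reachability

# Verify that the ports used by the deployment are reachable,
# e.g. the engine worker queue port from the example below.
nc -zv 192.168.1.102 8182
```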
### Software Requirements
- Install the same version of FastDeploy on all nodes (see the version check below)
- [Recommended] Install and configure MPI (OpenMPI or MPICH)
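One quick way to compare installed versions across nodes (a minimal sketch; it assumes FastDeploy was installed via pip):

```bash
# Run on every node and compare the reported versions; they must match.
pip list | grep -i fastdeploy
```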
## Tensor Parallel Deployment

### Recommended Launch Method
We recommend using `mpirun` for one-command startup, without manually starting each node.
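A possible `mpirun` invocation is sketched below, assuming OpenMPI with a hostfile listing one launcher process per node; the exact flags, and how they interact with FastDeploy's launcher, are assumptions rather than documented behavior:

```bash
# Hypothetical hostfile: one launcher process per node.
cat > hostfile <<'EOF'
192.168.1.101 slots=1
192.168.1.102 slots=1
EOF

# Launch the same api_server command (full flags shown in the
# online example below) on both nodes with a single mpirun call.
mpirun -np 2 --hostfile hostfile \
    python -m fastdeploy.entrypoints.openai.api_server \
        --model baidu/ERNIE-4.5-300B-A47B-Paddle \
        --tensor-parallel-size 16 \
        --ips 192.168.1.101,192.168.1.102
```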
### Usage Instructions

- Execute the same command on all machines
- The order of IPs in the `ips` parameter determines the node startup sequence; the first IP is designated as the master node
- Ensure all nodes can resolve each other's hostnames

Online inference startup example:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-300B-A47B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --tensor-parallel-size 16 \
    --graph-optimization-config '{"use_cudagraph":false}' \
    --no-enable-prefix-caching \
    --disable-custom-all-reduce \
    --ips 192.168.1.101,192.168.1.102
```
> 💡 Multi-node tensor parallel deployment currently does not support CUDA Graphs, Prefix Caching, or Custom AllReduce; these features must be explicitly disabled in the deployment command, as in the example above.
Offline startup example:

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")

# Only the master node accepts generation requests; worker nodes just serve shards.
if llm._check_master():
    output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
    print(output)
```
Notes:

- Only the master node can receive completion requests
- Always send requests to the master node (the first IP in the `ips` list); see the request example below
- The master node distributes workloads across all nodes
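For example, with the online deployment above, a request to the master node might look like this (a sketch assuming the server exposes the standard OpenAI-compatible `/v1/chat/completions` route on port 8180):

```bash
# Send requests to the master node (the first IP in --ips) only.
curl http://192.168.1.101:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```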
## Parameter Description

### `ips` Parameter

- Type: `string`
- Format: comma-separated IPv4 addresses
- Description: specifies the IP addresses of all nodes in the deployment group
- Required: only for multi-node deployments
- Example: `"192.168.1.101,192.168.1.102,192.168.1.103"`
### `tensor_parallel_size` Parameter

- Type: `integer`
- Description: total number of GPUs across all nodes
- Required: yes
- Example: for 2 nodes with 8 GPUs each, set to 16 (see the sanity check below)
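As a sanity check on these two parameters, a tiny hypothetical helper (not part of FastDeploy) could verify the arithmetic before launch:

```python
# Hypothetical pre-launch check: tensor_parallel_size should equal
# the number of nodes in `ips` times the GPUs available per node.
ips = "192.168.1.101,192.168.1.102"
tensor_parallel_size = 16
gpus_per_node = 8  # assumption: homogeneous nodes with 8 GPUs each

num_nodes = len(ips.split(","))
assert tensor_parallel_size == num_nodes * gpus_per_node, (
    f"tensor_parallel_size ({tensor_parallel_size}) != "
    f"{num_nodes} nodes x {gpus_per_node} GPUs per node"
)
print(f"OK: {num_nodes} nodes x {gpus_per_node} GPUs = {tensor_parallel_size}")
```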