Mirror of https://github.com/PaddlePaddle/FastDeploy.git (synced 2025-10-25 09:31:38 +08:00)
Global Scheduler: Multi-Instance Load Balancing
Design Overview
Cluster nodes autonomously pull tasks from other nodes during idle periods based on real-time workload, then push execution results back to the originating node.
What Problem Does the Global Scheduler Solve?
Standard load balancing distributes requests using a round-robin strategy, so each inference instance receives the same number of requests.
In LLM scenarios, request processing time varies significantly based on input/output token counts. Even with equal request distribution, inference completion times differ substantially across instances.
The global scheduler corrects this imbalance at the cluster level: idle instances take over queued requests from busy ones.
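To make the imbalance concrete, here is a minimal sketch of round-robin assignment with uneven request costs. The per-request processing times are made-up illustrative values, and `round_robin_busy_time` is a hypothetical helper, not part of FastDeploy.

```python
from itertools import cycle


def round_robin_busy_time(request_costs, num_instances):
    """Assign requests in round-robin order and return total busy time per instance."""
    busy = [0.0] * num_instances
    for cost, idx in zip(request_costs, cycle(range(num_instances))):
        busy[idx] += cost
    return busy


# Hypothetical processing times in seconds (long outputs cost more).
costs = [1, 8, 2, 1, 9, 2, 1, 7, 3]
busy = round_robin_busy_time(costs, num_instances=3)
print(busy)  # [3.0, 24.0, 7.0]
```

Each instance receives exactly three requests, yet the second instance stays busy eight times longer than the first.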
How the Global Scheduler Works
As shown:
- Nodes 1, 2, and n each receive 3 requests
- At time T:
  - Node 1 has completed all of its tasks
  - Nodes 2 and n are processing their second requests (1 request remaining each)
- Node 1 steals a task from Node 2, processes it, and pushes the result to Node 2’s response queue
This secondary load-balancing strategy:
- ✅ Maximizes cluster resource utilization
- ✅ Reduces Time-To-First-Token (TTFT)
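The stealing behavior described above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not FastDeploy's implementation: the real scheduler coordinates through Redis, while here `simulate`, the victim-selection rule (the node with the most queued requests), and the request costs are all hypothetical.

```python
import heapq
from collections import deque


def simulate(queues, steal=True):
    """Return the time at which each node finishes its last task."""
    pending = [deque(q) for q in queues]
    finish = [0.0] * len(queues)
    # Min-heap of (time the node becomes free, node id).
    free_at = [(0.0, i) for i in range(len(queues))]
    heapq.heapify(free_at)
    while free_at:
        now, node = heapq.heappop(free_at)
        if pending[node]:
            cost = pending[node].popleft()  # next task from the node's own queue
        elif steal:
            # Assumed policy: steal from the node with the most queued tasks.
            victim = max(range(len(pending)), key=lambda i: len(pending[i]))
            if not pending[victim]:
                continue  # nothing left anywhere, so this node retires
            cost = pending[victim].pop()  # take the task at the tail of the queue
        else:
            continue  # no stealing: the node idles once its queue is empty
        finish[node] = now + cost
        heapq.heappush(free_at, (now + cost, node))
    return finish


# Three nodes with three requests each; costs are hypothetical seconds.
queues = [[1, 1, 1], [5, 5, 5], [4, 4, 4]]
print(max(simulate(queues, steal=False)))  # 15.0
print(max(simulate(queues, steal=True)))   # 12.0
```

With stealing enabled, the first node picks up queued work from the other two once it goes idle, so the cluster finishes sooner. Pushing results back to the originating node's response queue is not modeled here.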
How to Use the Global Scheduler
Prerequisite: Redis Setup
conda installation
# Install
conda install redis
# Launch
nohup redis-server > redis.log 2>&1 &
apt installation (Debian/Ubuntu)
# Install
sudo apt install redis-server -y
# Launch
sudo systemctl start redis-server
yum installation (RHEL/CentOS)
# Install
sudo yum install redis -y
# Launch
sudo systemctl start redis
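Before launching FastDeploy, it can help to confirm that Redis is actually answering. The sketch below uses only the Python standard library and sends Redis's inline `PING` command over a raw socket. `redis_is_reachable` is a hypothetical helper, and it assumes a server without a password (as in the launch command in this guide); a password-protected server replies with an error instead of `+PONG`, so the function returns `False`.

```python
import socket


def redis_is_reachable(host="127.0.0.1", port=6379, timeout=2.0):
    """Send an inline PING and report whether the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    print("Redis reachable:", redis_is_reachable())
```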
Launching FastDeploy
python -m fastdeploy.entrypoints.openai.api_server \
       --port 8801 \
       --metrics-port 8802 \
       --engine-worker-queue-port 8803 \
       --scheduler-name global \
       --scheduler-ttl 900 \
       --scheduler-host "127.0.0.1" \
       --scheduler-port 6379 \
       --scheduler-db 0 \
       --scheduler-password "" \
       --scheduler-topic "default" \
       --scheduler-min-load_score 3 \
       --scheduler-load-shards-num 1
Deployment notes:
- Run this command on multiple machines to create inference instances
- For single-machine multi-instance deployments, give each instance its own --port, --metrics-port, and --engine-worker-queue-port values
- Use Nginx for external load balancing in front of the instances, alongside the global scheduler’s internal balancing
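For the single-machine case, one way to keep ports unique is to derive them from an instance index. The sketch below only builds command strings and uses only flags that appear in the launch command above. `launch_command`, `BASE_PORT`, and the three-ports-per-instance layout are assumptions for illustration, and the remaining scheduler flags are omitted for brevity and should be appended as needed.

```python
import shlex

BASE_PORT = 8801        # first instance uses 8801-8803, matching the command above
PORTS_PER_INSTANCE = 3  # API port, metrics port, engine worker queue port


def launch_command(instance_index, redis_host="127.0.0.1", redis_port=6379):
    """Build a launch command whose ports do not collide with other instances."""
    first = BASE_PORT + instance_index * PORTS_PER_INSTANCE
    args = [
        "python", "-m", "fastdeploy.entrypoints.openai.api_server",
        "--port", str(first),
        "--metrics-port", str(first + 1),
        "--engine-worker-queue-port", str(first + 2),
        "--scheduler-name", "global",
        "--scheduler-host", redis_host,
        "--scheduler-port", str(redis_port),
        "--scheduler-topic", "default",
    ]
    return shlex.join(args)


commands = [launch_command(i) for i in range(2)]
for cmd in commands:
    print(cmd)
```

The second instance gets ports 8804 to 8806, while both instances share the same Redis host, port, and topic.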