[BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks (#5710)

Yonghua Li
2025-12-23 17:51:35 +08:00
committed by GitHub
parent ceafd757f0
commit 9ff99d2b03


@@ -281,10 +281,14 @@ class DynamicWeightManager:
logger.info("infer engine stopped! start to load new checkpoint...")
model_runner.clear_requests()
model_runner.update_parameters(pid)
while model_weights_status.value[0] != ModelWeightsStatus.NORMAL:
time.sleep(0.01)
logger.info("finished loading new checkpoint")
elif model_weights_status.value[0] == ModelWeightsStatus.CLEARING:
logger.info("infer engine stopped! start to clear checkpoint...")
model_runner.clear_requests()
model_runner.clear_parameters(pid)
while model_weights_status.value[0] != ModelWeightsStatus.CLEARED:
time.sleep(0.01)
logger.info("finished clearing checkpoint")
time.sleep(0.01)
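
The hunk above appears to make each rank block until the shared status reaches its terminal value (NORMAL after an update, CLEARED after a clear) before it continues, so a rank that finishes early does not race ahead of a slower rank 0. A minimal, self-contained sketch of that polling pattern follows. It is an illustration under assumptions, not the project's implementation: the enum values, the `wait_for_status` helper, the optional timeout, and the use of a plain list plus a thread in place of the shared `model_weights_status.value` array are all hypothetical.

```python
import threading
import time
from enum import IntEnum


class ModelWeightsStatus(IntEnum):
    # Hypothetical numeric values; the real enum lives in the project.
    NORMAL = 0
    CLEARING = 1
    CLEARED = 2


def wait_for_status(status, target, poll_interval=0.01, timeout=None):
    """Poll status[0] until it equals target.

    Returns True once the target is observed, or False if the optional
    timeout (in seconds) expires first. The timeout is an addition for
    this sketch; the loops in the diff wait indefinitely.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while status[0] != target:
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def demo():
    # A plain list stands in for the shared status array.
    status = [ModelWeightsStatus.CLEARING]

    def slow_rank0():
        # Simulates the slower rank publishing the final state late.
        time.sleep(0.05)
        status[0] = ModelWeightsStatus.CLEARED

    worker = threading.Thread(target=slow_rank0)
    worker.start()
    ok = wait_for_status(status, ModelWeightsStatus.CLEARED, timeout=2.0)
    worker.join()
    return ok, status[0]


if __name__ == "__main__":
    print(demo())
```

Polling with a short sleep keeps the waiting rank simple and matches the 0.01 s interval already used in the loop; a bounded timeout such as the one sketched here would be one way to avoid hanging forever if the status is never published, at the cost of needing an error path.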