From 90a5b18742747b8a04fa79068bf9be0f61699b94 Mon Sep 17 00:00:00 2001
From: qingqing01
Date: Mon, 30 Jun 2025 11:57:12 +0800
Subject: [PATCH] Update disaggregated.md

---
 docs/features/disaggregated.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/features/disaggregated.md b/docs/features/disaggregated.md
index 93d03ec08..4fddfc84a 100644
--- a/docs/features/disaggregated.md
+++ b/docs/features/disaggregated.md
@@ -1,6 +1,6 @@
 # Disaggregated Deployment
 
-Large model inference consists of two phases: Prefill and Decode, which are compute-intensive (Prefill) and Memory access-intensive(Decode) respectively. Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.
+Large model inference consists of two phases: Prefill and Decode, which are compute-intensive and memory access-intensive respectively. Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.
 
 * Prefill phase: Processes all input Tokens (such as user prompts), completes the model's forward propagation, and generates the first token.
 * Decode phase: Starting from the first generated token, it generates one token at a time autoregressively until reaching the stop token. For N output tokens, the Decode phase requires (N-1) forward propagations that must be executed serially. During generation, the number of tokens to attend to increases, and computational requirements gradually grow.