[Doc] add repetition early stopping doc (#3078)

* add repetition early stop doc
* add the early_stop.md
2025-12-24 13:28:13 +08:00 · 2025-07-30 13:01:57 +08:00
parent 99a70fc722
commit 4dc130c5a9
3 changed files with 143 additions and 3 deletions
--- a/docs/zh/features/early_stop.md
+++ b/docs/zh/features/early_stop.md
@@ -0,0 +1,70 @@
+
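+# Early Stopping
+
+Early stopping ends token generation ahead of time. It applies a strategy to check whether the token sequence generated so far meets the early-stopping condition and, if it does, ends token generation early. FastDeploy currently supports only the repetition strategy.
+
+1. Repetition strategy
+   * The repetition strategy decides whether to trigger early stopping by counting how many times high-probability tokens are generated.
+   * Specifically, when the probability of the tokens generated for a batch exceeds the user-set probability threshold for the user-specified number of consecutive times, token generation for that batch ends early.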
+
+## Usage
+
+Add the early-stopping option when starting the service.
+
+* Online inference launch examples:
+  * With default hyperparameters: --enable-early-stop
+    ```shell
+    python -m fastdeploy.entrypoints.openai.api_server \
+        --model baidu/ERNIE-4.5-0.3B-Paddle \
+        --port 8180 \
+        --metrics-port 8181 \
+        --engine-worker-queue-port 8182 \
+        --max-model-len 32768 \
+        --max-num-seqs 32 \
+        --enable-early-stop
+    ```
+  * With custom hyperparameters: --early-stop-config
+    ```shell
+    python -m fastdeploy.entrypoints.openai.api_server \
+          --model baidu/ERNIE-4.5-0.3B-Paddle \
+          --port 8180 \
+          --metrics-port 8181 \
+          --engine-worker-queue-port 8182 \
+          --max-model-len 32768 \
+          --max-num-seqs 32 \
+          --early-stop-config '{"enable_early_stop":true, "window_size": 1000, "threshold": 0.9}'
+    ```
+* Offline inference examples:
+  * With default hyperparameters: enable_early_stop
+    ```python
+    from fastdeploy.engine.sampling_params import SamplingParams
+    from fastdeploy.entrypoints.llm import LLM
+
+    model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
+
+    sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
+    llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
+    output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
+
+    print(output)
+    ```
+  * With custom hyperparameters: early_stop_config
+    ```python
+    from fastdeploy.engine.sampling_params import SamplingParams
+    from fastdeploy.entrypoints.llm import LLM
+
+    model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
+    early_stop_config = {"enable_early_stop":True, "window_size":1000, "threshold":0.9}
+    sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
+    llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
+    output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
+
+    print(output)
+    ```
+
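+## Parameters
+
+* `enable_early_stop`: (bool) Whether to enable early stopping. Defaults to False.
+* `strategy`: (str) The early-stopping strategy to use. Only the repetition strategy is currently supported. Defaults to "repetition".
+* `window_size`: (int) The upper limit on the number of consecutive high-probability tokens in the repetition strategy; exceeding this count triggers early stopping. Defaults to 3000.
+* `threshold`: (float) The high-probability threshold in the repetition strategy. Defaults to 0.99.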
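As a rough illustration of the repetition strategy and the `window_size` and `threshold` parameters documented in early_stop.md above, here is a minimal sketch. It is not FastDeploy's implementation: the class name `RepetitionEarlyStopper` and its `update` method are hypothetical, it tracks a single sequence with one probability per generated token, and it assumes the stop fires once the streak reaches `window_size` (the doc describes the trigger both as reaching and as exceeding that count).

```python
class RepetitionEarlyStopper:
    """Hypothetical sketch of a repetition-based early-stop check."""

    def __init__(self, window_size: int = 3000, threshold: float = 0.99):
        # Defaults mirror the values listed in the parameter section.
        self.window_size = window_size
        self.threshold = threshold
        self.streak = 0  # consecutive high-probability tokens seen so far

    def update(self, token_prob: float) -> bool:
        """Record one generated token's probability; return True to stop."""
        if token_prob > self.threshold:
            self.streak += 1
        else:
            # Any token at or below the threshold breaks the streak.
            self.streak = 0
        return self.streak >= self.window_size


if __name__ == "__main__":
    stopper = RepetitionEarlyStopper(window_size=3, threshold=0.9)
    for prob in [0.95, 0.95, 0.5, 0.95, 0.95, 0.95]:
        print(prob, stopper.update(prob))
```

Under these assumptions, a lower `threshold` or smaller `window_size` makes the stop trigger sooner, which is the effect of the custom `early_stop_config` values (`window_size` 1000, `threshold` 0.9) compared with the defaults.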
--- a/docs/zh/features/sampling.md
+++ b/docs/zh/features/sampling.md
@@ -6,12 +6,10 @@

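  * Top-p sampling truncates the cumulative probability distribution, considering only the set of most likely tokens whose cumulative probability reaches the specified threshold p.
  * The number of tokens considered is chosen dynamically, which preserves diversity in the results while avoiding unlikely tokens.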
-
 2. Top-k_top-p sampling

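  * Top-k sampling is applied first, the top-k results are renormalized, and top-p sampling is then applied.
  * Restricting the initial candidate set (top-k) and then selecting by cumulative probability within it (top-p) improves the quality and coherence of the generated text.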
-
 3. Min-p sampling

  * Min-p sampling first computes pivot=max_prob * min_p, then keeps only the tokens whose probability is greater than pivot (the rest are set to 0) for subsequent sampling.
@@ -19,7 +17,7 @@

 ## Usage

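-At deployment time, you can select the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. The available values are`base`, `base_non_truncated`, `air` or `rejection`.
+At deployment time, you can select the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. The available values are `base`, `base_non_truncated`, `air` or `rejection`.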

 **Algorithms that support only Top-p sampling**