[Doc] add repetition early stopping doc (#3078)

* add repetition early stop doc

* add the early_stop.md
Zero Rains
2025-07-30 13:01:57 +08:00
committed by GitHub
parent 99a70fc722
commit 4dc130c5a9
3 changed files with 143 additions and 3 deletions


@@ -0,0 +1,72 @@
# Early Stopping
Early stopping is used to terminate the model's token generation prematurely. Specifically, it applies different strategies to determine whether the currently generated token sequence meets the early-stopping criteria; if so, token generation ends early. FastDeploy currently only supports the repetition strategy.
1. Repetition Strategy
* The repetition strategy decides whether to trigger early stopping by checking how many consecutive high-probability tokens have been generated.
* Specifically, if the probability of the token generated for a batch exceeds a user-set threshold for a user-specified number of consecutive steps, token generation for that batch is terminated early.
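The consecutive-count check described above can be sketched as follows; the class and method names are illustrative, not FastDeploy's actual implementation:

```python
class RepetitionEarlyStopper:
    """Minimal sketch of the repetition strategy, assuming the per-step
    probability of each sampled token is available (illustrative only)."""

    def __init__(self, threshold=0.99, window_size=3000):
        self.threshold = threshold      # high-probability threshold
        self.window_size = window_size  # consecutive-count limit
        self.consecutive = 0

    def step(self, token_prob):
        # Count consecutive tokens whose probability exceeds the threshold;
        # any low-probability token resets the counter.
        if token_prob > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0
        # Trigger early stopping once the run reaches window_size.
        return self.consecutive >= self.window_size
```

With `threshold=0.9` and `window_size=3`, three consecutive tokens sampled with probability above 0.9 would stop generation for that batch.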
## Usage Instructions
When starting the service, add the early-stopping startup option.
* Online inference startup example:
* Using default hyperparameters: `--enable-early-stop`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-early-stop
```
* Using custom hyperparameters: `--early-stop-config`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--early-stop-config '{"enable_early_stop":true, "window_size": 1000, "threshold": 0.9}'
```
* Offline inference example
* Using default hyperparameters: `enable_early_stop`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* Using custom hyperparameters: `early_stop_config`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
early_stop_config = {"enable_early_stop":True, "window_size":1000, "threshold":0.9}
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
## Parameter Description
* `enable_early_stop`: (bool) Whether to enable early stopping. Defaults to False.
* `strategy`: (str) The strategy used for early stopping. Currently, only the repetition strategy is supported. Defaults to "repetition".
* `window_size`: (int) The maximum number of consecutive high-probability tokens in the repetition strategy; exceeding this limit triggers early stopping. Defaults to 3000.
* `threshold`: (float) The high-probability threshold in the repetition strategy. Defaults to 0.99.
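As an illustration of how the defaults above combine with a user-supplied `early_stop_config`, here is a minimal sketch; the merge helper is hypothetical, not FastDeploy's API:

```python
# Documented defaults, taken from the parameter list above.
EARLY_STOP_DEFAULTS = {
    "enable_early_stop": False,
    "strategy": "repetition",
    "window_size": 3000,
    "threshold": 0.99,
}

def resolve_early_stop_config(user_config=None):
    """Hypothetical helper: fill unspecified fields with the defaults."""
    config = dict(EARLY_STOP_DEFAULTS)
    config.update(user_config or {})
    return config
```

For example, passing `{"enable_early_stop": True, "window_size": 1000, "threshold": 0.9}` overrides three fields while `strategy` keeps its default.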


@@ -0,0 +1,70 @@
# Early Stopping
Early stopping is used to terminate the model's token generation prematurely. Specifically, it applies different strategies to determine whether the currently generated token sequence meets the early-stopping criteria; if so, token generation ends early. FastDeploy currently only supports the repetition strategy.
1. Repetition Strategy
* The repetition strategy decides whether to trigger early stopping by checking how many consecutive high-probability tokens have been generated.
* Specifically, if the probability of the token generated for a batch exceeds a user-set threshold for a user-specified number of consecutive steps, token generation for that batch is terminated early.
## Usage Instructions
When starting the service, add the early-stopping startup option.
* Online inference startup example:
* Using default hyperparameters: `--enable-early-stop`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-early-stop
```
* Using custom hyperparameters: `--early-stop-config`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--early-stop-config '{"enable_early_stop":true, "window_size": 1000, "threshold": 0.9}'
```
* Offline inference example
* Using default hyperparameters: `enable_early_stop`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* Using custom hyperparameters: `early_stop_config`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
early_stop_config = {"enable_early_stop":True, "window_size":1000, "threshold":0.9}
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
## Parameter Description
* `enable_early_stop`: (bool) Whether to enable early stopping. Defaults to False.
* `strategy`: (str) The strategy used for early stopping. Currently, only the repetition strategy is supported. Defaults to "repetition".
* `window_size`: (int) The maximum number of consecutive high-probability tokens in the repetition strategy; exceeding this limit triggers early stopping. Defaults to 3000.
* `threshold`: (float) The high-probability threshold in the repetition strategy. Defaults to 0.99.


@@ -6,12 +6,10 @@
* Top-p sampling truncates based on the cumulative probability distribution, considering only the most likely set of tokens whose cumulative probability reaches the specified threshold p.
* It dynamically chooses how many tokens to consider, ensuring diversity in the results while avoiding unlikely tokens.
2. Top-k_top-p sampling
* First perform top-k sampling, then normalize the probabilities over the top-k results, and finally apply top-p sampling.
* By restricting the initial candidate set (top-k) and performing cumulative-probability selection within it (top-p), this improves the quality and coherence of the generated text.
3. Min-p sampling
* Min-p sampling first computes pivot = max_prob * min_p, then keeps only tokens whose probability is greater than the pivot (setting the rest to 0) for subsequent sampling.
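The min-p rule described above can be sketched as follows (an illustrative NumPy version, not FastDeploy's implementation):

```python
import numpy as np

def min_p_filter(probs, min_p):
    """Keep only tokens with probability above pivot = max_prob * min_p,
    zero out the rest, and renormalize before sampling (illustrative)."""
    pivot = probs.max() * min_p
    kept = np.where(probs > pivot, probs, 0.0)  # zero out low-probability tokens
    return kept / kept.sum()                    # renormalize the survivors
```

For instance, with probabilities `[0.5, 0.3, 0.1, 0.1]` and `min_p=0.5`, the pivot is 0.25, so only the first two tokens survive and are renormalized.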
@@ -19,7 +17,7 @@
## Usage Instructions
When deploying, you can select the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. The available values are `base`, `base_non_truncated`, `air`, and `rejection`.
**Algorithms that support only Top-p sampling**