# Logits Processors
## Overview
A **Logits Processor (LP)** sits between *model output logits* and the *sampler* (top-k/top-p/temperature…). It applies pluggable transformations to logits **before** sampling (e.g., weighting, masking, penalties, biases).
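Conceptually, each decoding step looks like the following pseudocode sketch (illustrative only; `model`, `sampler`, and `share_inputs` stand in for engine internals, while `update_state()` and `apply()` are the actual processor interface described below):
```python
# Pseudocode sketch of where logits processors run; only update_state() and
# apply() are the real interface, the rest is illustrative.
logits = model.forward(inputs)          # raw next-token logits for the batch
for lp in logits_processors:            # declaration order == execution order
    lp.update_state(share_inputs)       # refresh per-request state
    logits = lp.apply(logits)           # transform the logits tensor
next_tokens = sampler.sample(logits)    # top-k / top-p / temperature, etc.
```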
## Key Features
- **Server-level registration**: declare available processors at startup via `--logits-processors`. The declaration order is the execution order.
- **Per-request control**: enable and configure processors via the `logits_processors_args` field in the request body.
- **Built-in processors**: commonly used processors are provided out of the box, e.g., `LogitBiasLogitsProcessor`; built-ins can be loaded directly by class name.
- **Extensible interface**: a standard `LogitsProcessor` interface is provided for user-defined processors, which can be loaded by FQCN `module.path:ClassName`.
## Usage
### Online Service
#### 1. Start the service (register logits processors)
Register processors with `--logits-processors` when starting the service. For a built-in processor like `LogitBiasLogitsProcessor`, pass the class name directly:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors LogitBiasLogitsProcessor
```
#### 2. Send a request (enable and configure as needed)
Use the `logits_processors_args` field in the REST request body to enable and configure processors. The example below uses `LogitBiasLogitsProcessor`, which adds a bias to the logits of specified tokens; it accepts a `logit_bias` dictionary mapping *token ID* to *bias value* (in JSON, token IDs appear as string keys):
```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
    "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
    "logits_processors_args": {
        "logit_bias": {"128": 5.0, "50256": -10.0}
    }
}'
```
When using the OpenAI Python SDK, pass `logits_processors_args` through `extra_body`:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "logit_bias": {"128": 5.0, "50256": -10.0}
        }
    },
)
```
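To make the effect of `logit_bias` concrete, here is a small self-contained sketch (toy numbers over a three-token vocabulary, not FastDeploy code) of how adding a bias to logits shifts the sampling distribution:
```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy 3-token vocabulary; bias maps token ID -> bias value, as in logit_bias.
logits = [1.0, 2.0, 0.5]
bias = {0: 5.0, 2: -10.0}
biased = [x + bias.get(i, 0.0) for i, x in enumerate(logits)]

print(softmax(logits))  # ~[0.23, 0.63, 0.14]
print(softmax(biased))  # ~[0.98, 0.02, 0.00] -- token 0 dominates, token 2 is effectively banned
```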
### Offline Inference
For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance. When generating text via the offline `chat()` or `generate()` APIs, enable and configure the corresponding processors through the `logits_processors_args` field of `sampling_params`.
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=["LogitBiasLogitsProcessor"],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"logit_bias": {128: 5.0, 50256: -10.0}},
)

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```
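The `generate()` API takes the same `sampling_params`; a minimal sketch, assuming it accepts prompt strings (check the offline inference docs for the exact signature):
```python
# Sketch only: the exact generate() signature and output shape may differ.
prompts = ["Who is Lu Xun?"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs.text)
```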
## Custom Logits Processor
### 1. Define your own LogitsProcessor class
Subclass `fastdeploy.openai.logits_processor.LogitsProcessor` and implement the `update_state()` and `apply()` methods.
- **`update_state()` updates the logits processor's state.** It receives the inference backend's runtime state `share_inputs` and returns nothing. Extract whatever information you need from the runtime state to update the processor's internal state.
  - In the example below, we retrieve the current batch's `logits_processors_args` from `share_inputs` and bulk-update the processor's enablement status for the batch.
  - When writing your class, predefine the request-parameter names your processor uses, e.g., a request parameter `enable_your_logits_processor` that controls whether the processor is enabled for a request.
- **`apply()` actually modifies the logits tensor.** Before `apply()` runs, the model calls `update_state()` to refresh the processor's state, so make sure `update_state()` correctly updates the state variables that `apply()` reads.
  - In the example below, we use `self.enabled` to determine, for each request in the current batch, whether the processor is enabled, and adjust the logits tensor accordingly.
```python
from paddle import Tensor

from fastdeploy.config import FDConfig
from fastdeploy.openai.logits_processor import LogitsProcessor


class YourLogitsProcessor(LogitsProcessor):
    def __init__(self, fd_config: FDConfig) -> None:
        # Initialize your state variables here, e.g., obtain dtype, device, etc.
        # from fd_config. You can freely define the state variables you need and
        # update them during each step of inference in update_state().
        self.enabled = None

    def update_state(self, share_inputs: dict) -> None:
        """Called when there are new output tokens, prior to each forward pass.

        Each field in the `share_inputs` dict typically stores information for all
        request slots. Its `stop_flags` array indicates whether a slot currently has
        a running request (`False` means the slot is active), so it is recommended
        to filter entries by `stop_flags` to keep only data for the current batch.
        """
        stop_flags = share_inputs["stop_flags"]
        logits_processors_args = share_inputs["logits_processors_args"]
        logits_processors_args = [a for a, f in zip(logits_processors_args, stop_flags) if not f]
        # Update your state variables here so that your logits processor's behavior
        # can be adjusted dynamically at each step of inference. The latest state
        # should be read and used in the apply() method.
        self.enabled = [a.enable_your_logits_processor for a in logits_processors_args]

    def apply(self, logits: Tensor) -> Tensor:
        """Apply the logits processor to the batch logits tensor.

        The updated tensor must be returned but may be modified in-place.
        """
        for i, e in enumerate(self.enabled):
            # Implement your core logits transformation here and
            # return the modified logits tensor.
            logits[i] = ...
        return logits
```
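As a concrete (hypothetical) illustration, a processor that suppresses a fixed set of token IDs for enabled requests might implement `apply()` as below; `self.banned_token_ids` is an illustrative state variable you would populate in `__init__()` or `update_state()`:
```python
def apply(self, logits: Tensor) -> Tensor:
    """Hypothetical sketch: ban a fixed token set for enabled requests."""
    for i, enabled in enumerate(self.enabled):
        if not enabled:
            continue
        for t in self.banned_token_ids:   # illustrative state variable
            logits[i, t] = float("-inf")  # token t can no longer be sampled
    return logits
```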
### 2. Use your logits processor via the online service
#### 2.1. Start the service (register your logits processor)
When registering a custom processor, pass its **FQCN** (`module.path:ClassName`) to `--logits-processors`:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors your.dotted.path.to.module:YourLogitsProcessor
```
#### 2.2. Send a request (enable and configure as needed)
Enable your processor per request via `logits_processors_args`:
```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
    "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
    "logits_processors_args": {
        "enable_your_logits_processor": true
    }
}'
```
Using the OpenAI Python SDK:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "enable_your_logits_processor": True
        }
    },
)
```
### 3. Use your logits processor via offline inference
For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance; to load your custom logits processor, pass its FQCN (`module.path:ClassName`). When generating text via the offline `chat()` or `generate()` APIs, enable and configure the corresponding processors through the `logits_processors_args` field of `sampling_params`.
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=["your.dotted.path.to.module:YourLogitsProcessor"],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"enable_your_logits_processor": True},
)

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```