# Logits Processors

## Overview

A **Logits Processor (LP)** sits between the *model output logits* and the *sampler* (top-k/top-p/temperature, ...). It applies pluggable transformations to logits **before** sampling (e.g., weighting, masking, penalties, biases).

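In other words, registered processors see the raw logits at each decoding step and can rewrite them before any sampling strategy runs. A minimal sketch of the control flow (illustrative only; the function name and call order here are assumptions based on the interface described later on this page, not FastDeploy's actual internals):

```python
import paddle

def sample_step(logits: paddle.Tensor, processors: list, share_inputs: dict) -> paddle.Tensor:
    # Registered processors run in declaration order; each sees the output
    # of the previous one, before top-k/top-p/temperature are applied.
    for lp in processors:
        lp.update_state(share_inputs)  # refresh per-batch state from runtime inputs
        logits = lp.apply(logits)      # transform the [batch, vocab] logits tensor
    # ... the (possibly modified) logits are then handed to the sampler
    return logits
```
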
## Key Features

- **Server-level registration**: declare the available processors at startup via `--logits-processors`. The declaration order is the execution order (see the example after this list).
- **Per-request control**: enable and configure processors via the `logits_processors_args` field in the request body.
- **Built-in processors**: commonly used processors are provided out of the box, e.g., `LogitBiasLogitsProcessor`, which can be loaded directly by class name.
- **Extensible interface**: a standard `LogitsProcessor` interface is provided for user-defined processors, which can be loaded by FQCN (`module.path:ClassName`).

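For example, registering a built-in processor followed by a custom one would make the bias run first. The space-separated multi-value syntax below is an assumption (the sections later on this page only register a single processor), so check the CLI help for your FastDeploy version:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --logits-processors LogitBiasLogitsProcessor your.dotted.path.to.module:YourLogitsProcessor
```
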
## Usage

### Online Service

#### 1. Start the service (register logits processors)

Register processors with `--logits-processors` when starting the service. For a built-in processor like `LogitBiasLogitsProcessor`, pass the class name directly:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors LogitBiasLogitsProcessor
```

#### 2. Send a request (enable and configure as needed)

Use the `logits_processors_args` field in the REST request body to enable and configure processors. For example, `LogitBiasLogitsProcessor` adds a bias to the specified tokens; it accepts a `logit_bias` dictionary mapping *token_id* → *bias value*:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
  "logits_processors_args": {
    "logit_bias": {"128": 5.0, "50256": -10.0}
  }
}'
```

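Conceptually, the bias is a simple additive adjustment to the chosen tokens' logits before sampling. The sketch below illustrates the effect (not FastDeploy's actual implementation; the vocabulary size is arbitrary):

```python
import paddle

logit_bias = {128: 5.0, 50256: -10.0}  # token_id -> bias

logits = paddle.zeros([1, 32000])  # [batch, vocab_size]
for token_id, bias in logit_bias.items():
    # A positive bias makes the token more likely; a strongly negative
    # bias effectively bans it from being sampled.
    logits[0, token_id] = logits[0, token_id] + bias
```
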
When using the OpenAI Python SDK, pass `logits_processors_args` through `extra_body`:

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "logit_bias": {"128": 5.0, "50256": -10.0}
        }
    },
)
```

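The keys of `logit_bias` are token IDs in the *serving model's* vocabulary, so look them up with that model's own tokenizer. For example, with a Hugging Face-compatible tokenizer (an assumption for illustration; use whichever tokenizer ships with your model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/model")
# A word may map to multiple token IDs; bias each one as needed.
print(tokenizer.encode("Beijing", add_special_tokens=False))
```
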
### Offline Inference

For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance. When generating text via the offline `chat()` or `generate()` APIs, provide the per-request parameters through `sampling_params.logits_processors_args` to enable and configure the corresponding processors.

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['LogitBiasLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"logit_bias": {128: 5.0, 50256: -10.0}},
)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```

## Custom Logits Processor

### 1. Define your own LogitsProcessor class

Inherit from the `fastdeploy.openai.logits_processor.LogitsProcessor` class and implement the `update_state()` and `apply()` methods.

- **`update_state()` updates the logits processor's state.** Its input is the inference backend's runtime state `share_inputs`, and it returns nothing. Extract the information you need from the runtime state to update the processor's internal state.
    - In the example below, we retrieve the current batch's `logits_processors_args` from `share_inputs`, then bulk-update the processor's enablement status for the current batch.
    - When writing your class, predefine the parameter names your processor recognizes, e.g., a request parameter `enable_your_logits_processor` that controls whether your processor is enabled for a request.
- **`apply()` actually modifies the logits tensor.** Before `apply()` runs, the model calls `update_state()` to refresh the processor's state, so make sure `update_state()` correctly updates every state variable that `apply()` reads.
    - In the example below, we use `self.enabled` to determine, for each request in the current batch, whether your processor is enabled, and adjust the logits tensor accordingly.

```python
from paddle import Tensor

from fastdeploy.config import FDConfig
from fastdeploy.openai.logits_processor import LogitsProcessor


class YourLogitsProcessor(LogitsProcessor):

    def __init__(self, fd_config: FDConfig) -> None:
        # Initialize your state variables here, for example by obtaining dtype,
        # device, etc. from fd_config. You can freely define the state variables
        # you need and update them in update_state() at each step of inference.
        self.enabled = None

    def update_state(self, share_inputs: dict) -> None:
        """Called when there are new output tokens, prior to each forward pass.

        Each field in the `share_inputs` dict typically stores information for all request
        slots. It has a `stop_flags` array that indicates whether a slot currently has a
        running request (`False` means the slot is active). Therefore, it is recommended to
        filter entries by `stop_flags` to keep only data for the current batch.
        """
        stop_flags = share_inputs["stop_flags"]
        logits_processors_args = share_inputs["logits_processors_args"]
        logits_processors_args = [a for a, f in zip(logits_processors_args, stop_flags) if not f]
        # Update your state variables here so that the processor's behavior can be
        # adjusted dynamically at each step of inference; apply() should read the
        # latest state.
        self.enabled = [a.enable_your_logits_processor for a in logits_processors_args]

    def apply(self, logits: Tensor) -> Tensor:
        """Apply the LogitsProcessor to the batch logits tensor.

        The updated tensor must be returned, but it may be modified in-place.
        """
        for i, enabled in enumerate(self.enabled):
            if not enabled:
                continue
            # Implement your core logits transformation here and return the
            # modified logits tensor.
            logits[i] = ...
        return logits
```

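Before wiring the class into a server, you can smoke-test `update_state()` locally with a hand-built `share_inputs` dict (a hypothetical stand-in; the real runtime state carries many more fields, and since the template `__init__` above ignores `fd_config`, `None` is passed for brevity):

```python
from types import SimpleNamespace

lp = YourLogitsProcessor(fd_config=None)

# Two request slots: slot 0 holds a running request, slot 1 is idle.
fake_share_inputs = {
    "stop_flags": [False, True],
    "logits_processors_args": [
        SimpleNamespace(enable_your_logits_processor=True),
        SimpleNamespace(enable_your_logits_processor=False),
    ],
}
lp.update_state(fake_share_inputs)
assert lp.enabled == [True]  # the idle slot is filtered out by stop_flags
```
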
### 2. Use your logits processor via online service

#### 2.1. Start the service (register your logits processor)

When registering a custom processor, pass its **FQCN** (`module.path:ClassName`) to `--logits-processors`. The module must be importable in the server's Python environment (e.g., installed or on `PYTHONPATH`):

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors your.dotted.path.to.module:YourLogitsProcessor
```

#### 2.2. Send a request (enable and configure as needed)

Enable your processor per request via `logits_processors_args`:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
  "logits_processors_args": {
    "enable_your_logits_processor": true
  }
}'
```

Using the OpenAI Python SDK:

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "enable_your_logits_processor": True
        }
    },
)
```

### 3. Use your logits processor via offline inference

For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance, using your custom processor's FQCN (`module.path:ClassName`). When generating text via the offline `chat()` or `generate()` APIs, provide the per-request parameters through `sampling_params.logits_processors_args` to enable and configure the corresponding processors.

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['your.dotted.path.to.module:YourLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"enable_your_logits_processor": True},
)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```