# Logits Processors

## Overview

A **Logits Processor (LP)** sits between the *model output logits* and the *sampler* (top-k/top-p/temperature, ...). It applies pluggable transformations to logits **before** sampling (e.g., weighting, masking, penalties, biases).

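In other words, registered processors see the raw logits at each decoding step and can rewrite them before any sampling strategy runs. A minimal sketch of the control flow (illustrative only; the function name and call order here are assumptions based on the interface described later on this page, not FastDeploy's actual internals):

```python
import paddle

def sample_step(logits: paddle.Tensor, processors: list, share_inputs: dict) -> paddle.Tensor:
    # Registered processors run in declaration order; each sees the output
    # of the previous one, before top-k/top-p/temperature are applied.
    for lp in processors:
        lp.update_state(share_inputs)  # refresh per-batch state from runtime inputs
        logits = lp.apply(logits)      # transform the [batch, vocab] logits tensor
    # ... the (possibly modified) logits are then handed to the sampler
    return logits
```
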
## Key Features

- **Server-level registration**: declare the available processors at startup via `--logits-processors`. The declaration order is the execution order (see the example after this list).
- **Per-request control**: enable and configure processors via the `logits_processors_args` field in the request body.
- **Built-in processors**: commonly used processors are provided out of the box, e.g., `LogitBiasLogitsProcessor`, which can be loaded directly by class name.
- **Extensible interface**: a standard `LogitsProcessor` interface is provided for user-defined processors, which can be loaded by FQCN (`module.path:ClassName`).

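For example, registering a built-in processor followed by a custom one would make the bias run first. The space-separated multi-value syntax below is an assumption (the sections later on this page only register a single processor), so check the CLI help for your FastDeploy version:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --logits-processors LogitBiasLogitsProcessor your.dotted.path.to.module:YourLogitsProcessor
```
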
## Usage

### Online Service

#### 1. Start the service (register logits processors)

Register processors with `--logits-processors` when starting the service. For a built-in processor like `LogitBiasLogitsProcessor`, pass the class name directly:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors LogitBiasLogitsProcessor
```

#### 2. Send a request (enable and configure as needed)

Use the `logits_processors_args` field in the REST request body to enable and configure processors. For example, `LogitBiasLogitsProcessor` adds a bias to the specified tokens; it accepts a `logit_bias` dictionary mapping *token_id* → *bias value*:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
  "logits_processors_args": {
    "logit_bias": {"128": 5.0, "50256": -10.0}
  }
}'
```

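Conceptually, the bias is a simple additive adjustment to the chosen tokens' logits before sampling. The sketch below illustrates the effect (not FastDeploy's actual implementation; the vocabulary size is arbitrary):

```python
import paddle

logit_bias = {128: 5.0, 50256: -10.0}  # token_id -> bias

logits = paddle.zeros([1, 32000])  # [batch, vocab_size]
for token_id, bias in logit_bias.items():
    # A positive bias makes the token more likely; a strongly negative
    # bias effectively bans it from being sampled.
    logits[0, token_id] = logits[0, token_id] + bias
```
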
When using the OpenAI Python SDK, pass `logits_processors_args` through `extra_body`:

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "logit_bias": {"128": 5.0, "50256": -10.0}
        }
    },
)
```

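The keys of `logit_bias` are token IDs in the *serving model's* vocabulary, so look them up with that model's own tokenizer. For example, with a Hugging Face-compatible tokenizer (an assumption for illustration; use whichever tokenizer ships with your model):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/model")
# A word may map to multiple token IDs; bias each one as needed.
print(tokenizer.encode("Beijing", add_special_tokens=False))
```
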
### Offline Inference

For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance. When generating text via the offline `chat()` or `generate()` APIs, provide the per-request parameters through `sampling_params.logits_processors_args` to enable and configure the corresponding processors.

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['LogitBiasLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"logit_bias": {128: 5.0, 50256: -10.0}},
)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```

## Custom Logits Processor

### 1. Define your own LogitsProcessor class

Inherit from the `fastdeploy.openai.logits_processor.LogitsProcessor` class and implement the `update_state()` and `apply()` methods.

- **`update_state()` updates the logits processor's state.** Its input is the inference backend's runtime state `share_inputs`, and it returns nothing. Extract the information you need from the runtime state to update the processor's internal state.
    - In the example below, we retrieve the current batch's `logits_processors_args` from `share_inputs`, then bulk-update the processor's enablement status for the current batch.
    - When writing your class, predefine the parameter names your processor recognizes, e.g., a request parameter `enable_your_logits_processor` that controls whether your processor is enabled for a request.
- **`apply()` actually modifies the logits tensor.** Before `apply()` runs, the model calls `update_state()` to refresh the processor's state, so make sure `update_state()` correctly updates every state variable that `apply()` reads.
    - In the example below, we use `self.enabled` to determine, for each request in the current batch, whether your processor is enabled, and adjust the logits tensor accordingly.

```python
from paddle import Tensor

from fastdeploy.config import FDConfig
from fastdeploy.openai.logits_processor import LogitsProcessor


class YourLogitsProcessor(LogitsProcessor):

    def __init__(self, fd_config: FDConfig) -> None:
        # Initialize your state variables here, for example by obtaining dtype,
        # device, etc. from fd_config. You can freely define the state variables
        # you need and update them in update_state() at each step of inference.
        self.enabled = None

    def update_state(self, share_inputs: dict) -> None:
        """Called when there are new output tokens, prior to each forward pass.

        Each field in the `share_inputs` dict typically stores information for all request
        slots. It has a `stop_flags` array that indicates whether a slot currently has a
        running request (`False` means the slot is active). Therefore, it is recommended to
        filter entries by `stop_flags` to keep only data for the current batch.
        """
        stop_flags = share_inputs["stop_flags"]
        logits_processors_args = share_inputs["logits_processors_args"]
        logits_processors_args = [a for a, f in zip(logits_processors_args, stop_flags) if not f]
        # Update your state variables here so that the processor's behavior can be
        # adjusted dynamically at each step of inference; apply() should read the
        # latest state.
        self.enabled = [a.enable_your_logits_processor for a in logits_processors_args]

    def apply(self, logits: Tensor) -> Tensor:
        """Apply the LogitsProcessor to the batch logits tensor.

        The updated tensor must be returned, but it may be modified in-place.
        """
        for i, enabled in enumerate(self.enabled):
            if not enabled:
                continue
            # Implement your core logits transformation here and return the
            # modified logits tensor.
            logits[i] = ...
        return logits
```

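Before wiring the class into a server, you can smoke-test `update_state()` locally with a hand-built `share_inputs` dict (a hypothetical stand-in; the real runtime state carries many more fields, and since the template `__init__` above ignores `fd_config`, `None` is passed for brevity):

```python
from types import SimpleNamespace

lp = YourLogitsProcessor(fd_config=None)

# Two request slots: slot 0 holds a running request, slot 1 is idle.
fake_share_inputs = {
    "stop_flags": [False, True],
    "logits_processors_args": [
        SimpleNamespace(enable_your_logits_processor=True),
        SimpleNamespace(enable_your_logits_processor=False),
    ],
}
lp.update_state(fake_share_inputs)
assert lp.enabled == [True]  # the idle slot is filtered out by stop_flags
```
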
### 2. Use your logits processor via online service

#### 2.1. Start the service (register your logits processor)

When registering a custom processor, pass its **FQCN** (`module.path:ClassName`) to `--logits-processors`. The module must be importable in the server's Python environment (e.g., installed or on `PYTHONPATH`):

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors your.dotted.path.to.module:YourLogitsProcessor
```

#### 2.2. Send a request (enable and configure as needed)

Enable your processor per request via `logits_processors_args`:

```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Who is Lu Xun?"}],
  "logits_processors_args": {
    "enable_your_logits_processor": true
  }
}'
```

Using the OpenAI Python SDK:

```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "enable_your_logits_processor": True
        }
    },
)
```

### 3. Use your logits processor via offline inference

For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance, using your custom processor's FQCN (`module.path:ClassName`). When generating text via the offline `chat()` or `generate()` APIs, provide the per-request parameters through `sampling_params.logits_processors_args` to enable and configure the corresponding processors.

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['your.dotted.path.to.module:YourLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"enable_your_logits_processor": True},
)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```