# Logits Processors

## Overview
A Logits Processor (LP) sits between model output logits and the sampler (top-k/top-p/temperature…). It applies pluggable transformations to logits before sampling (e.g., weighting, masking, penalties, biases).
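To make the flow concrete, the following pseudocode sketches one decoding step. It is illustrative only; names like `model.forward` and `sample` are stand-ins, not FastDeploy internals:

```python
# Conceptual sketch of one decoding step; names are illustrative,
# not FastDeploy's actual internals.
def decode_step(model, logits_processors, share_inputs):
    for lp in logits_processors:          # declaration order = execution order
        lp.update_state(share_inputs)     # refresh per-request state
    logits = model.forward(share_inputs)  # raw logits for the whole batch
    for lp in logits_processors:
        logits = lp.apply(logits)         # transform logits before sampling
    return sample(logits)                 # top-k/top-p/temperature sampling
```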
## Key Features
- Server-level registration: declare the available processors at startup via `--logits-processors`. The declaration order is the execution order.
- Per-request control: enable and configure processors via the `logits_processors_args` field in the request body.
- Built-in processors: commonly used processors are provided, e.g., `LogitBiasLogitsProcessor`, which can be loaded directly by class name.
- Extensible interface: a standard `LogitsProcessor` interface is provided for user-defined processors, which can be loaded by FQCN (`module.path:ClassName`). An example of registering both kinds together follows this list.
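For example, listing several processors runs them in exactly that order at each step. A minimal sketch using the offline API (the custom FQCN is a placeholder):

```python
from fastdeploy import LLM

# Processors execute in declaration order: the built-in bias processor
# runs before the (hypothetical) custom one.
llm = LLM(
    model="path/to/model",
    logits_processors=[
        "LogitBiasLogitsProcessor",                        # built-in, by class name
        "your.dotted.path.to.module:YourLogitsProcessor",  # custom, by FQCN
    ],
)
```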
## Usage

### Online Service

#### 1. Start the service (register logits processors)
Register processors with `--logits-processors` when starting the service. For a built-in processor such as `LogitBiasLogitsProcessor`, pass the class name directly:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors LogitBiasLogitsProcessor
```
#### 2. Send a request (enable and configure as needed)
Use the `logits_processors_args` field in the REST request body to enable and configure processors. The example below uses `LogitBiasLogitsProcessor`, which adds a bias to specified tokens; it accepts a `logit_bias` dictionary mapping `token_id` → bias value:
```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
    "messages": [{"role":"user", "content":"Who is Lu Xun?"}],
    "logits_processors_args": {
        "logit_bias": {"128": 5.0, "50256": -10.0}
    }
}'
```
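Note that token ids are tokenizer-specific: the ids 128 and 50256 above are placeholders, and real ids should be looked up with the target model's tokenizer.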
When using the OpenAI Python SDK, pass `logits_processors_args` through `extra_body`:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "logit_bias": {"128": 5.0, "50256": -10.0}
        }
    },
)
```
### Offline Inference
For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance. When generating text via the offline `chat()` or `generate()` APIs, provide the logits-processor parameters through `sampling_params.logits_processors_args` to enable and pass arguments to the corresponding processors.
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['LogitBiasLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"logit_bias": {128: 5.0, 50256: -10.0}},
)

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```
## Custom Logits Processor

### 1. Define your own LogitsProcessor class
Inherit from the `fastdeploy.openai.logits_processor.LogitsProcessor` class and implement the `update_state()` and `apply()` methods.
- `update_state()` is used to update the logits processor state. Its input is the inference backend's runtime state `share_inputs`, and it returns nothing. Extract the useful information from the runtime state to update the logits processor's internal state.
    - For example, the code below retrieves the current batch's `logits_processors_args` from `share_inputs`, then bulk-updates the enablement status of the logits processor for the current batch.
    - When writing your class, predefine the parameter names for your logits processor, such as a request parameter `enable_your_logits_processor` that controls whether your processor is enabled for a request.
- `apply()` is used to actually modify the logits tensor. Before `apply()` runs, the model calls `update_state()` to refresh the logits processor state, so make sure your `update_state()` correctly updates the state variables that `apply()` reads.
    - In the example below, `self.enabled` determines whether each request in the current batch has enabled the processor, and the logits tensor is adjusted accordingly.
```python
from paddle import Tensor

from fastdeploy.config import FDConfig
from fastdeploy.openai.logits_processor import LogitsProcessor


class YourLogitsProcessor(LogitsProcessor):
    def __init__(self, fd_config: FDConfig) -> None:
        # Initialize your state variables here, e.g., obtain dtype, device, etc.
        # from fd_config. You can freely define the state variables you need and
        # update them in update_state() during each step of inference.
        self.enabled = None

    def update_state(self, share_inputs: dict) -> None:
        """Called when there are new output tokens, prior to each forward pass.

        Each field in the `share_inputs` dict typically stores information for all
        request slots. It has a `stop_flags` array that indicates whether a slot
        currently has a running request (`False` means the slot is active).
        Therefore, it is recommended to filter entries by `stop_flags` to keep only
        data for the current batch.
        """
        stop_flags = share_inputs["stop_flags"]
        logits_processors_args = share_inputs["logits_processors_args"]
        logits_processors_args = [a for a, f in zip(logits_processors_args, stop_flags) if not f]
        # Update your state variables here so the processor's behavior can be
        # adjusted dynamically at each step of inference. The latest state is
        # read and used in the apply() method.
        self.enabled = [a.enable_your_logits_processor for a in logits_processors_args]

    def apply(self, logits: Tensor) -> Tensor:
        """Apply the logits processor to the batch logits tensor.

        The updated tensor must be returned but may be modified in place.
        """
        for i, e in enumerate(self.enabled):
            # Implement your core logits transformation here and return the
            # modified logits tensor.
            logits[i] = ...
        return logits
```
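As a concrete (and purely hypothetical) illustration, the template above could be completed as a processor that down-weights one fixed token for every request that enables it; the token id and bias value are arbitrary:

```python
from paddle import Tensor

from fastdeploy.config import FDConfig
from fastdeploy.openai.logits_processor import LogitsProcessor

SUPPRESSED_TOKEN_ID = 50256  # arbitrary token id, for illustration only


class SuppressTokenLogitsProcessor(LogitsProcessor):
    """Hypothetical processor: down-weights one fixed token when enabled."""

    def __init__(self, fd_config: FDConfig) -> None:
        self.enabled = []

    def update_state(self, share_inputs: dict) -> None:
        # Keep only the slots that hold running requests, then record
        # which of those requests enabled this processor.
        stop_flags = share_inputs["stop_flags"]
        args = share_inputs["logits_processors_args"]
        args = [a for a, f in zip(args, stop_flags) if not f]
        self.enabled = [getattr(a, "enable_your_logits_processor", False) for a in args]

    def apply(self, logits: Tensor) -> Tensor:
        for i, enabled in enumerate(self.enabled):
            if enabled:
                logits[i, SUPPRESSED_TOKEN_ID] -= 10.0  # arbitrary negative bias
        return logits
```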
### 2. Use your logits processor via the online service

#### 2.1. Start the service (register your logits processor)
When registering a custom processor, pass its FQCN (`module.path:ClassName`) to `--logits-processors`:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --cache-queue-port 8183 \
    --logits-processors your.dotted.path.to.module:YourLogitsProcessor
```
#### 2.2. Send a request (enable and configure as needed)
Enable your processor per request via `logits_processors_args`:
```bash
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" -H "Content-Type: application/json" -d '{
    "messages": [{"role":"user", "content":"Who is Lu Xun?"}],
    "logits_processors_args": {
        "enable_your_logits_processor": true
    }
}'
```
Using the OpenAI Python SDK:
```python
import openai

client = openai.Client(base_url="http://0.0.0.0:8180/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Who is Lu Xun?"}],
    extra_body={
        "logits_processors_args": {
            "enable_your_logits_processor": True
        }
    },
)
```
### 3. Use your logits processor via offline inference
For offline inference, pass the `logits_processors` argument (type `list[str]`) when initializing the `LLM` instance. To specify your custom logits processor, pass its FQCN (`module.path:ClassName`). When generating text via the offline `chat()` or `generate()` APIs, provide the logits-processor parameters through `sampling_params.logits_processors_args` to enable and pass arguments to the corresponding processors.
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="path/to/model",
    engine_worker_queue_port=8282,
    cache_queue_port=8383,
    logits_processors=['your.dotted.path.to.module:YourLogitsProcessor'],
)

messages = [{"role": "user", "content": "Who is Lu Xun?"}]
sampling_params = SamplingParams(
    top_p=0.95,
    max_tokens=128,
    logits_processors_args={"enable_your_logits_processor": True},
)

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs.text)
```