FastDeploy/docs/features/pooling_models.md
lizexu123 c563eca791 [Feature] support reward model (#5301), 2025-12-02

# Pooling Models

FastDeploy also supports pooling models, such as embedding and reward models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface. These models use a `Pooler` to extract the final hidden states of the input before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases, as FastDeploy can automatically detect the appropriate model runner via `--runner auto`.

### Model Conversion

FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the `FdModelForPooling` interface, FastDeploy will attempt to automatically convert the model according to the architecture names shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|--------------|-------------|-------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`, you can override some of its attributes via the `--pooler-config` option.
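For example, an override might look like the launch command below. Note that the JSON value passed to `--pooler-config` here is an assumption for illustration; consult the FastDeploy CLI reference for the exact attribute names and syntax the option accepts.

```shell
# Hypothetical example: override a predefined model's Pooler attributes.
# The JSON schema ("pooling_type", "normalize") is assumed, not confirmed.
python -m fastdeploy.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Embedding-0.6B \
    --runner pooling \
    --pooler-config '{"pooling_type": "MEAN", "normalize": true}'
```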

#### Converted models

If the model has been converted via `--convert` (see above), the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|---------|--------------|---------------|---------|
| `embed` | `LAST` | | |

When loading Sentence Transformers models, the Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults. The default pooling type can also be specified during model network construction via the `@default_pooling_type("LAST")` decorator.

## Pooling Type

1. `LastPool` (`PoolingType.LAST`): extracts the hidden state of the last token in each sequence.

2. `AllPool` (`PoolingType.ALL`): returns the hidden states of all tokens in each sequence.

3. `CLSPool` (`PoolingType.CLS`): returns the hidden state of the first token in each sequence (the CLS token).

4. `MeanPool` (`PoolingType.MEAN`): computes the average of all token hidden states in each sequence.
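The four pooling types can be sketched as follows. This is a minimal, framework-free illustration of what each type computes over per-token hidden states, not FastDeploy's actual `Pooler` implementation.

```python
# Each function takes the per-token hidden states of one sequence as a
# list of vectors (one vector per token, length = hidden size).

def last_pool(hidden):  # PoolingType.LAST
    return hidden[-1]

def all_pool(hidden):   # PoolingType.ALL
    return hidden

def cls_pool(hidden):   # PoolingType.CLS
    return hidden[0]

def mean_pool(hidden):  # PoolingType.MEAN
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

# A sequence of 3 tokens with hidden size 2:
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(last_pool(states))  # [5.0, 6.0]
print(mean_pool(states))  # [3.0, 4.0]
```

Embedding models typically apply `LAST` (per the defaults table above) followed by normalization, while `ALL` preserves one vector per token for downstream scoring.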

## Online Serving

FastDeploy's OpenAI-compatible server provides the standard API endpoints as well as a custom reward interface:

- Embeddings API: supports text and multi-modal inputs.
- Reward API: scores specific content.

### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 9412 --engine-worker-queue-port 7142 \
    --metrics-port 7211 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --runner pooling
```

Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```
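The same request body can be built programmatically. The helper below is a hypothetical sketch that mirrors the fields of the curl example; the model name and `user` value are placeholders taken from that example, not fixed values.

```python
import json

# Hypothetical helper mirroring the EmbeddingCompletionRequest body above.
def build_embedding_request(texts, model="text-embedding-chat-model", user=None):
    body = {"model": model, "input": list(texts)}
    if user is not None:
        body["user"] = user
    return json.dumps(body)

payload = build_embedding_request(
    ["This is a sentence for pooling embedding.", "Another input text."],
    user="test_client",
)
print(payload)
```

The resulting JSON string can then be POSTed to the `/v1/embeddings` endpoint with any HTTP client.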

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```

### Pooling Model with Reward Scoring

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --port 13351 \
    --engine-worker-queue-port 7562 \
    --metrics-port 7531 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --runner pooling \
    --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
--header 'Content-Type: application/json' \
--data '{
  "model": "",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://xxx/a.png"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "How many people are in the picture?"
        }
      ]
    }
  ],
  "user": "user-123",
  "chat_template": null,
  "chat_template_kwargs": {
    "custom_var": "value"
  },
  "mm_processor_kwargs": {
    "image_size": 224
  }
}'
```