FastDeploy/docs/features/pooling_models.md
lizexu123 c563eca791 [Feature] support reward model (#5301) 2025-12-02 14:55:31 +08:00

[简体中文](../zh/features/pooling_models.md)
# Pooling Models
FastDeploy also supports pooling models, such as embedding models.
In FastDeploy, pooling models implement the `FdModelForPooling` interface.
These models use a `Pooler` to extract the final hidden states of the input
before returning them.
## Configuration
### Model Runner
Run a model in pooling mode via the option `--runner pooling`.
!!! tip
    There is no need to set this option in the vast majority of cases, as FastDeploy can automatically
    detect the appropriate model runner via `--runner auto`.
### Model Conversion
FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.
If `--runner pooling` has been set (manually or automatically) but the model does not implement the
`FdModelForPooling` interface,
FastDeploy will attempt to automatically convert the model according to the architecture names
shown in the table below.
| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |
!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.
### Pooler Configuration
#### Predefined models
If the `Pooler` defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.
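For example, a pooler override might look like the sketch below. The JSON keys `pooling_type` and `normalize` are illustrative assumptions; consult the `PoolerConfig` definition in your FastDeploy version for the exact field names.

```shell
# Hypothetical example: override attributes of the model's predefined pooler.
# The JSON schema shown here is an assumption, not the confirmed format.
python -m fastdeploy.entrypoints.openai.api_server \
    --model Qwen/Qwen3-Embedding-0.6B \
    --runner pooling \
    --pooler-config '{"pooling_type": "MEAN", "normalize": true}'
```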
#### Converted models
If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ | ❌ |
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
the model's Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
The default pooling type can also be specified during model network construction via the `@default_pooling_type("LAST")` decorator.
##### Pooling Type
1. `LastPool` (`PoolingType.LAST`): extracts the hidden state of the last token in each sequence.
2. `AllPool` (`PoolingType.ALL`): returns the hidden states of all tokens in each sequence.
3. `CLSPool` (`PoolingType.CLS`): returns the hidden state of the first token in each sequence (the CLS token).
4. `MeanPool` (`PoolingType.MEAN`): computes the average of all token hidden states in each sequence.
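The four pooling strategies above can be sketched in plain NumPy. These functions are illustrative only and are not FastDeploy's actual `Pooler` classes; each takes a `(seq_len, hidden_size)` matrix of token hidden states.

```python
import numpy as np

def last_pool(hidden_states: np.ndarray) -> np.ndarray:
    """PoolingType.LAST: hidden state of the final token."""
    return hidden_states[-1]

def all_pool(hidden_states: np.ndarray) -> np.ndarray:
    """PoolingType.ALL: hidden states of every token, unchanged."""
    return hidden_states

def cls_pool(hidden_states: np.ndarray) -> np.ndarray:
    """PoolingType.CLS: hidden state of the first (CLS) token."""
    return hidden_states[0]

def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """PoolingType.MEAN: average over all token hidden states."""
    return hidden_states.mean(axis=0)

# Example: 3 tokens, hidden size 4
h = np.arange(12, dtype=np.float32).reshape(3, 4)
print(last_pool(h))  # hidden state of token index 2
print(mean_pool(h))  # element-wise mean over the 3 tokens
```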
## Online Serving
FastDeploy's OpenAI-compatible server provides the following API endpoints, including a custom reward interface:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specific content
### Embedding Model
```bash
model_path=Qwen/Qwen3-Embedding-0.6B
python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
--max-num-seqs 256 --max-model-len 32768 \
--port 9412 --engine-worker-queue-port 7142 \
--metrics-port 7211 --tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--runner pooling
```
Request Methods:
A. EmbeddingCompletionRequest Example (Standard Text Input)
```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"input": [
"This is a sentence for pooling embedding.",
"Another input text."
],
"user": "test_client"
}'
```
B. EmbeddingChatRequest Example (Message Sequence Input)
```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"messages": [
{"role": "user", "content": "Generate embedding for user query."}
]
}'
```
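The same request can be issued from Python. This is a minimal sketch: the payload builder mirrors the standard text-input example above, and the commented-out `requests` call assumes the server launched earlier is listening on port 9412 (adjust host and port to your deployment).

```python
def build_embedding_request(texts, model="text-embedding-chat-model", user=None):
    """Build the JSON body for a POST to /v1/embeddings (standard text input)."""
    body = {"model": model, "input": list(texts)}
    if user is not None:
        body["user"] = user
    return body

payload = build_embedding_request(
    ["This is a sentence for pooling embedding.", "Another input text."],
    user="test_client",
)

# Uncomment to send against a running server:
# import requests
# resp = requests.post("http://localhost:9412/v1/embeddings", json=payload)
# vectors = [item["embedding"] for item in resp.json()["data"]]
```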
### Reward Model
```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
--model ${model_path} \
--max-num-seqs 256 \
--max-model-len 8192 \
--port 13351 \
--engine-worker-queue-port 7562 \
--metrics-port 7531 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--runner pooling \
--convert embed
```
Request Method: ChatRewardRequest
```bash
curl --location 'http://xxxx/v1/chat/reward' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://xxx/a.png"
}
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "How many people are in the picture?"
}
]
}
],
"user": "user-123",
"chat_template": null,
"chat_template_kwargs": {
"custom_var": "value"
},
"mm_processor_kwargs": {
"image_size": 224
}
}'
```