[简体中文](../zh/features/pooling_models.md)
# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface.
These models use a `Pooler` to extract the final hidden states of the input
before returning them.
## Configuration

### Model Runner
Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases, as FastDeploy
    can automatically detect the appropriate model runner via `--runner auto`.
### Model Conversion
FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not
implement the `FdModelForPooling` interface, FastDeploy will attempt to automatically
convert the model according to the architecture names shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|--------------|-------------|-------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.
### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.
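For illustration, an override might look like the sketch below. The exact JSON fields accepted by `--pooler-config` depend on your FastDeploy version, so treat the keys shown here (`pooling_type`, `normalize`) as assumptions to verify against your installation:

```shell
# Hypothetical sketch: override the pooler of a predefined model.
# The JSON keys are illustrative assumptions, not a confirmed schema.
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --runner pooling \
    --pooler-config '{"pooling_type": "MEAN", "normalize": true}'
```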
#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax |
|---------|--------|----|----|
| `embed` | `LAST` | ✅︎ | ❌ |
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
the Sentence Transformers configuration file (`modules.json`) takes priority over the
model's defaults. The default pooling type can also be specified during model network
construction via the `@default_pooling_type("LAST")` decorator.
##### Pooling Type

1. `LastPool` (`PoolingType.LAST`): extracts the hidden state of the last token in each sequence.
2. `AllPool` (`PoolingType.ALL`): returns the hidden states of all tokens in each sequence.
3. `CLSPool` (`PoolingType.CLS`): returns the hidden state of the first token (the CLS token) in each sequence.
4. `MeanPool` (`PoolingType.MEAN`): computes the average of all token hidden states in each sequence.
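The four pooling types above can be sketched over a single sequence's hidden states with NumPy. This is a minimal illustration of the semantics, not FastDeploy's actual implementation:

```python
import numpy as np

# Hidden states for one sequence: shape (seq_len, hidden_size).
hidden = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])

last_pool = hidden[-1]           # PoolingType.LAST: final token's state
cls_pool = hidden[0]             # PoolingType.CLS: first (CLS) token's state
mean_pool = hidden.mean(axis=0)  # PoolingType.MEAN: average over all tokens
all_pool = hidden                # PoolingType.ALL: every token's state

# The default `embed` pooler also L2-normalizes its output (see the
# Normalization column in the table above).
normalized = last_pool / np.linalg.norm(last_pool)
```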
## Online Serving

FastDeploy's OpenAI-compatible server provides the following pooling endpoints:

- [Embeddings API]: supports text and multi-modal inputs
- [Reward API]: a custom endpoint that scores specific content
### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 9412 --engine-worker-queue-port 7142 \
    --metrics-port 7211 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --runner pooling
```
Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)
```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```
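The same request can be issued from Python using only the standard library. This is a minimal client sketch; the base URL and port must match your deployment (the launch command above uses port 9412), and the response is assumed to follow the OpenAI embeddings format:

```python
import json
from urllib import request


def build_embedding_payload(texts, model="text-embedding-chat-model", user=None):
    """Build an EmbeddingCompletionRequest body for /v1/embeddings."""
    payload = {"model": model, "input": list(texts)}
    if user is not None:
        payload["user"] = user
    return payload


def fetch_embeddings(base_url, texts):
    """POST the payload and return the parsed JSON response.

    Requires a running FastDeploy server at ``base_url``.
    """
    req = request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(build_embedding_payload(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the server launched above to be running):
# result = fetch_embeddings("http://localhost:9412", ["hello world"])
# print(result["data"][0]["embedding"][:4])
```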
B. EmbeddingChatRequest Example (Message Sequence Input)
```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```
### Reward Model

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --port 13351 \
    --engine-worker-queue-port 7562 \
    --metrics-port 7531 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --runner pooling \
    --convert embed
```
Request Method: ChatRewardRequest
```bash
curl --location 'http://xxxx/v1/chat/reward' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://xxx/a.png"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "How many people are in the picture?"
          }
        ]
      }
    ],
    "user": "user-123",
    "chat_template": null,
    "chat_template_kwargs": {
      "custom_var": "value"
    },
    "mm_processor_kwargs": {
      "image_size": 224
    }
  }'
```
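For programmatic use, the ChatRewardRequest body above can be assembled in Python. This sketch builds the same multi-modal message structure; the `/v1/chat/reward` path comes from the curl example, while the host, port, and response format are deployment-specific assumptions:

```python
import json
from urllib import request


def build_reward_payload(image_url, text, model=""):
    """Assemble a ChatRewardRequest body with one image turn and one text turn,
    mirroring the curl example above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": text},
                ],
            },
        ],
    }


def post_reward(base_url, payload):
    """POST to /v1/chat/reward; requires a running reward server."""
    req = request.Request(
        base_url.rstrip("/") + "/v1/chat/reward",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the reward server launched above, here assumed local):
# scores = post_reward(
#     "http://localhost:13351",
#     build_reward_payload("https://xxx/a.png",
#                          "How many people are in the picture?"),
# )
```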