Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2025-10-22 08:09:28 +08:00
[Feature] remove dependency on enable_mm and refine multimodal's code (#3014)
* remove dependency on enable_mm
* fix codestyle check error
* fix codestyle check error
* update docs
* resolve conflicts on model config
* fix unit test error
* fix code style check error

Co-authored-by: shige <1021937542@qq.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
@@ -31,7 +31,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --quantization wint4 \
   --max-model-len 32768 \
   --max-num-seqs 32 \
-  --enable-mm \
   --mm-processor-kwargs '{"video_max_frames": 30}' \
   --limit-mm-per-prompt '{"image": 10, "video": 3}' \
   --reasoning-parser ernie-45-vl
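With the server launched this way, multimodal inputs go through the OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request sketch; the API port (8180) and the image URL are assumptions for illustration, not values fixed by this diff:

```python
# Minimal sketch of a multimodal chat request against the deployed server.
# Port 8180 and the image URL are assumed for illustration; use the --port
# value actually passed to api_server.
import requests

payload = {
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
}
resp = requests.post("http://127.0.0.1:8180/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```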
@@ -26,8 +26,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --engine-worker-queue-port 8182 \
   --max-model-len 32768 \
   --max-num-seqs 32 \
-  --reasoning-parser ernie-45-vl \
-  --enable-mm
+  --reasoning-parser ernie-45-vl
 ```
 
 > 💡 Note: For the path specified by ```--model```, if no matching subdirectory exists in the current directory, FastDeploy checks whether AIStudio provides a preset model under the given name (e.g. ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration of automatic model download, see [Model Download](../supported_models.md).
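Client code can also use the official `openai` Python package against this server. A sketch assuming the API port is 8180 (not shown in the hunk above) and that the server accepts any placeholder API key:

```python
# Sketch using the openai client against the FastDeploy server.
# The base_url port is an assumption; api_key is unused but the client requires one.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8180/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```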
@@ -39,7 +39,7 @@ Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and output struc
 ```python
 from fastdeploy.entrypoints.llm import LLM
 # Load the model
-llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
 
 outputs = llm.chat(
     messages=[
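Filled in, the updated offline example might read as below. This is a sketch: the `image_url` message format follows the OpenAI-style pattern these docs use elsewhere, the image URL is a placeholder, and the output attribute access assumes the output structure referenced above:

```python
from fastdeploy.entrypoints.llm import LLM

# Load the model; multimodal support is now inferred from the model itself,
# so enable_mm is no longer passed.
llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1,
          max_model_len=32768, limit_mm_per_prompt={"image": 100},
          reasoning_parser="ernie-45-vl")

outputs = llm.chat(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},  # placeholder URL
        {"type": "text", "text": "What is in the picture?"},
    ],
}])
for output in outputs:
    print(output.outputs.text)  # assumed attribute path, per the output-structure docs
```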
@@ -127,7 +127,7 @@ for message in messages:
     })
 
 sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
 outputs = llm.generate(prompts={
     "prompt": prompt,
     "multimodal_data": {
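For context, a self-contained sketch of the `llm.generate` path with `multimodal_data`. The `SamplingParams` import path, the prompt string, and the shape of the `multimodal_data` dict are assumptions; the full document builds `prompt` by applying a chat template to `messages`:

```python
from fastdeploy.entrypoints.llm import LLM
from fastdeploy.engine.sampling_params import SamplingParams  # assumed import path

PATH = "baidu/ERNIE-4.5-VL-28B-A3B-Paddle"
prompt = "Describe the image."  # placeholder; the full doc renders this from messages

sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768,
          limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
    "prompt": prompt,
    "multimodal_data": {
        "image": ["https://example.com/demo.jpg"],  # placeholder image reference
    },
}, sampling_params=sampling_params)
```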
@@ -19,7 +19,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```tokenizer``` | `str` | Tokenizer name or path, defaults to the model path |
 | ```use_warmup``` | `int` | Whether to perform warmup at startup; automatically generates maximum-length data for warmup, enabled by default when the KV Cache is calculated automatically |
 | ```limit_mm_per_prompt``` | `dict[str]` | Limit on the amount of multimodal data per prompt, e.g. {"image": 10, "video": 3}; default: 1 for each type |
-| ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
+| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
 | ```quantization``` | `str` | Model quantization strategy; when loading a BF16 checkpoint, specifying wint4 or wint8 enables lossless online 4-bit/8-bit quantization |
 | ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
 | ```num_gpu_blocks_override``` | `int` | Number of preallocated KV Cache blocks; FastDeploy can calculate this automatically based on available memory, so users normally need not set it, default: None |
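The substance of the change is that multimodal capability is a property of the checkpoint rather than a user flag, so it can be inferred when the model loads. A hypothetical sketch of that idea; the architecture name and the registry are illustrative, not FastDeploy's actual internals:

```python
# Hypothetical sketch: decide multimodal support from config.json instead of
# an enable_mm flag. The registry contents are assumed for illustration.
import json
from pathlib import Path

MULTIMODAL_ARCHS = {"Ernie4_5_VLMoeForConditionalGeneration"}  # assumed name

def is_multimodal_model(model_dir: str) -> bool:
    """Read the checkpoint's config.json and check its architecture list."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return any(arch in MULTIMODAL_ARCHS for arch in config.get("architectures", []))
```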
The same changes are applied to the Chinese documentation:

@@ -31,7 +31,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --quantization wint4 \
   --max-model-len 32768 \
   --max-num-seqs 32 \
-  --enable-mm \
   --mm-processor-kwargs '{"video_max_frames": 30}' \
   --limit-mm-per-prompt '{"image": 10, "video": 3}' \
   --reasoning-parser ernie-45-vl
@@ -26,8 +26,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
   --engine-worker-queue-port 8182 \
   --max-model-len 32768 \
   --max-num-seqs 32 \
-  --reasoning-parser ernie-45-vl \
-  --enable-mm
+  --reasoning-parser ernie-45-vl
 ```
 
 > 💡 Note: For the path specified by ```--model```, if no matching subdirectory exists in the current directory, FastDeploy checks whether AIStudio provides a preset model under the given name (e.g. ```baidu/ERNIE-4.5-0.3B-Base-Paddle```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions and configuration of automatic model download, see [Model Download](../supported_models.md).
@@ -39,7 +39,7 @@ for output in outputs:
 ```python
 from fastdeploy.entrypoints.llm import LLM
 # Load the model
-llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
 
 outputs = llm.chat(
     messages=[
@@ -127,7 +127,7 @@ for message in messages:
     })
 
 sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
 outputs = llm.generate(prompts={
     "prompt": prompt,
     "multimodal_data": {
@@ -17,7 +17,7 @@
 | ```tokenizer``` | `str` | Tokenizer name or path, defaults to the model path |
 | ```use_warmup``` | `int` | Whether to perform warmup at startup; automatically generates maximum-length data for warmup, used by default when the KV Cache is calculated automatically |
 | ```limit_mm_per_prompt``` | `dict[str]` | Limit on the amount of multimodal data per prompt, e.g. {"image": 10, "video": 3}; default: 1 for each type |
-| ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
+| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
 | ```quantization``` | `str` | Model quantization strategy; when loading a BF16 checkpoint, specifying wint4 or wint8 enables lossless online 4-bit/8-bit quantization |
 | ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
 | ```num_gpu_blocks_override``` | `int` | Number of preallocated KV Cache blocks; FastDeploy can calculate this automatically based on available memory, so users normally need not set it, default: None |