[Feature] remove dependency on enable_mm and refine multimodal's code (#3014)

* remove dependency on enable_mm

* fix codestyle check error

* fix codestyle check error

* update docs

* resolve conflicts on model config

* fix unit test error

* fix code style check error

---------

Co-authored-by: shige <1021937542@qq.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
ApplEOFDiscord authored on 2025-08-01 20:01:18 +08:00, committed by GitHub
parent 243394044d
commit b71cbb466d
24 changed files with 118 additions and 29 deletions


@@ -31,7 +31,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
-  --enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl


@@ -26,8 +26,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
-  --reasoning-parser ernie-45-vl \
-  --enable-mm
+  --reasoning-parser ernie-45-vl
```
> 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
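
Once a service like the one above is running (now launched without `--enable-mm`), multimodal requests go through the standard OpenAI-compatible chat endpoint. The snippet below is a minimal client sketch, not part of this commit: the server port `8180`, the `api_key` placeholder, and the example image URL are assumptions.

```python
# Minimal client sketch; the base_url port, api_key and image URL are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```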


@@ -39,7 +39,7 @@ Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and output struc
```python
from fastdeploy.entrypoints.llm import LLM
# Load the model
-  llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+  llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.chat(
messages=[
@@ -127,7 +127,7 @@ for message in messages:
})
sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-  llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+  llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
"prompt": prompt,
"multimodal_data": {

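For the offline path, the `messages` payload for `llm.chat` can be filled in as in the self-contained sketch below, assuming OpenAI-style content parts; the image URL and question text are placeholders and are not taken from this commit.

```python
from fastdeploy.entrypoints.llm import LLM

# Same construction as in the updated example above: no enable_mm argument.
llm = LLM(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    tensor_parallel_size=1,
    max_model_len=32768,
    limit_mm_per_prompt={"image": 100},
    reasoning_parser="ernie-45-vl",
)

# Multimodal chat payload, assuming OpenAI-style content parts;
# the image URL and question are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

outputs = llm.chat(messages=messages)
```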

@@ -19,7 +19,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to model path |
| ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, enabled by default when automatically calculating KV Cache |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
-  | ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
+  | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |
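
Read together with the examples above, the table entries map onto `LLM` constructor keyword arguments. The sketch below is illustrative: apart from `gpu_memory_utilization`, which is assumed to be accepted as a keyword mirroring the table row, every argument appears verbatim in this commit's examples, and `enable_mm` is deliberately omitted because multimodal support no longer depends on it.

```python
from fastdeploy.entrypoints.llm import LLM

# Sketch only: gpu_memory_utilization as a keyword is an assumption;
# the remaining keywords come from the diff examples above.
llm = LLM(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    tensor_parallel_size=1,
    max_model_len=32768,
    limit_mm_per_prompt={"image": 10, "video": 3},
    gpu_memory_utilization=0.9,
    reasoning_parser="ernie-45-vl",
)
```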


@@ -31,7 +31,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
-  --enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl


@@ -26,8 +26,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
-  --reasoning-parser ernie-45-vl \
-  --enable-mm
+  --reasoning-parser ernie-45-vl
```
> 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).


@@ -39,7 +39,7 @@ for output in outputs:
```python
from fastdeploy.entrypoints.llm import LLM
# Load the model
-  llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+  llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.chat(
messages=[
@@ -127,7 +127,7 @@ for message in messages:
})
sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-  llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+  llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
"prompt": prompt,
"multimodal_data": {


@@ -17,7 +17,7 @@
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to model path |
| ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, enabled by default when automatically calculating KV Cache |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
-  | ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
+  | ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |