English | 简体中文
Installation | Quick Start | Supported Models
# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

## News
- [2025-08] 🔥 Released FastDeploy v2.1: introduces a brand-new KV Cache scheduling strategy, extends PD Disaggregation and CUDA Graph support to more models, adds enhanced support for hardware platforms such as Kunlun and Hygon, and brings comprehensive optimizations to both service and inference-engine performance.
- [2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live! Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌Sign up here 📌Event details
- [2025-06] 🔥 Released FastDeploy v2.0: supports inference and deployment for ERNIE 4.5, and open-sources an industrial-grade PD disaggregation solution with context caching and dynamic role switching for efficient resource utilization, further enhancing inference performance for MoE models.
## About
FastDeploy is an inference and deployment toolkit for large language models and vision-language models, built on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
- 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
- 🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility (see the client sketch after this list).
- 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
- 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more.
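As a quick illustration of the vLLM-compatible serving interface mentioned above, the sketch below queries a locally launched FastDeploy API server with the standard `openai` Python client. The port (8180) and the served model name are illustrative assumptions; match them to however you launched your server.

```python
# Minimal sketch: query a FastDeploy OpenAI-compatible server with the
# standard `openai` client. Assumes a server is already running locally
# on port 8180 (port and model name here are illustrative assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8180/v1",  # assumed local FastDeploy endpoint
    api_key="EMPTY",                      # local deployments typically ignore the key
)

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-0.3B-Paddle",  # assumed served model name
    messages=[{"role": "user", "content": "Introduce PaddlePaddle briefly."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, existing vLLM or OpenAI client code should work against a FastDeploy endpoint by changing only the `base_url`.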
## Requirements
- OS: Linux
- Python: 3.10 ~ 3.12
## Installation
FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs, and other hardware. For detailed installation instructions, refer to the platform-specific guides in the documentation.
Note: We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
## Get Started
Learn how to use FastDeploy through our documentation:
- 10-Minutes Quick Deployment
- ERNIE-4.5 Large Language Model Deployment
- ERNIE-4.5-VL Multimodal Model Deployment
- Offline Inference Development (a minimal sketch follows this list)
- Online Service Deployment
- Full Supported Models List
- Best Practices
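To give a concrete flavor of offline inference development, here is a minimal sketch of FastDeploy's vLLM-style offline API. The model name, `max_model_len`, and sampling values are illustrative assumptions; check the exact class and attribute names against the Offline Inference Development guide.

```python
# Minimal offline-inference sketch (vLLM-style API). Model name and all
# parameter values are illustrative assumptions; consult the Offline
# Inference Development guide for the authoritative interface.
from fastdeploy import LLM, SamplingParams

prompts = ["Where is the capital of China?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load an ERNIE-4.5 checkpoint; max_model_len bounds prompt + generation length.
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs.text)  # generated text for each prompt
```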
## Supported Models
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|---|---|---|---|---|---|---|---|
| ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅ | ✅ | ✅ | ✅ | ✅ | 128K |
| ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |
| ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP | 128K |
| ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP | 128K |
| ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅ | 128K |
| ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |
| ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |
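Since the table lists several weight-quantization formats, the hedged sketch below shows how a quantized variant might be selected at load time. The `quantization` keyword and the value `"wint4"` are assumptions modeled on the format names in the table; verify them against the quantization documentation before relying on them.

```python
# Hedged sketch: selecting a quantization format at load time.
# The `quantization` argument and the "wint4" value are assumptions based
# on the format names in the table above; check the quantization docs.
from fastdeploy import LLM

llm = LLM(
    model="baidu/ERNIE-4.5-21B-A3B-Paddle",  # assumed model identifier
    quantization="wint4",                    # 4-bit weight-only, per the table's WINT4
    max_model_len=32768,
)
```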
## Advanced Usage

## Acknowledgement
FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.