mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-12-24 13:28:13 +08:00

Go to file

Jiang-Jia-Jun 4b3e41c665 [Optim] Improve task-checking performance in engine-worker-queue (#5376 )

* [Optim] Optimize costtime in checking tasks in engine-worker-queue

* Update fastdeploy/engine/common_engine.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/inter_communicator/engine_worker_queue.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [Docs] Add docstring to set_exist_tasks method (#5382)

* Initial plan

* Add docstring to set_exist_tasks method

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* [Docs] Add docstring documentation to exist_tasks() method (#5381)

* Initial plan

* Add comprehensive docstring to exist_tasks() method

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* [Optimization] Conditionally initialize shared memory for single-node deployments only (#5383)

* Initial plan

* Conditionally initialize exist_tasks_intra_signal for single-node deployments

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Use is_single_node flag for consistent deployment type checking

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Remove redundant None checks in exist_tasks methods

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* format code

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

2025-12-11 10:33:32 +08:00

.github

[BugFix] fix mm cudagraph (#5266 )

2025-12-09 11:51:00 +08:00

benchmarks

[Feature] [Benchmark]: add ZMQ-based FMQ implementation and benchmark tools (#5418 )

2025-12-08 22:04:49 +08:00

custom_ops

[Metax] refactor cutlass moe and optimize flash attention (#5361 )

2025-12-10 17:15:17 +08:00

dockerfiles

uodate docx (#4938 )

2025-11-11 10:28:46 +08:00

docs

add detoken switch (#5463 )

2025-12-10 21:44:02 +08:00

examples

[PD Disaggregation] support DP via v1 router and decouple DP and EP (#5197 )

2025-12-04 15:38:43 +08:00

fastdeploy

[Optim] Improve task-checking performance in engine-worker-queue (#5376 )

2025-12-11 10:33:32 +08:00

scripts

modify approve (#5443 )

2025-12-09 16:52:10 +08:00

tests

add detoken switch (#5463 )

2025-12-10 21:44:02 +08:00

tools

[CI] Update redis download source (#5198 )

2025-11-24 21:14:59 +08:00

.clang-format

c++ code format (#4527 )

2025-10-22 17:59:50 +08:00

.flake8

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

.gitignore

【Hackathon 9th No.109】[CppExtension] Support build Custom OP in setuptools 80+ (#4977 )

2025-11-17 11:46:27 +08:00

.gitmodules

add ignore=all for deepgemm (#4118 )

2025-09-15 21:52:00 +08:00

.pre-commit-config.yaml

c++ code format (#4527 )

2025-10-22 17:59:50 +08:00

build.sh

[Metax] optimize mla attention (#5258 )

2025-12-09 11:18:19 +08:00

LICENSE

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

mkdocs.yml

[Docs] add qwen25-vl docs (#5243 )

2025-11-27 15:05:57 +08:00

pyproject.toml

Fix target_version (#3159 )

2025-08-28 14:17:54 +08:00

README_CN.md

Modify README

2025-11-12 11:03:23 +08:00

README_EN.md

Modify README

2025-11-12 11:03:23 +08:00

README.md

[Doc] Update docs for v2.3.0rc0 (#4828 )

2025-11-05 19:45:53 +08:00

requirements_dcu.txt

remove fastsafetensors (#5371 )

2025-12-04 19:22:04 +08:00

requirements_guided_decoding.txt

[Feature] Guided Decoding add LLguidance backend (#5124 )

2025-12-03 20:23:57 +08:00

requirements_iluvatar.txt

remove fastsafetensors (#5371 )

2025-12-04 19:22:04 +08:00

requirements_metaxgpu.txt

remove fastsafetensors (#5371 )

2025-12-04 19:22:04 +08:00

requirements.txt

remove fastsafetensors (#5371 )

2025-12-04 19:22:04 +08:00

setup.py

Update FASTDEPLOY_VERSION to 2.4.0-dev

2025-12-08 11:21:46 +08:00

README_EN.md

English | 简体中文

Installation | Quick Start | Supported Models

FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2025-11] FastDeploy v2.3 is newly released! It adds deployment support for two major models, ERNIE-4.5-VL-28B-A3B-Thinking and PaddleOCR-VL-0.9B, across multiple hardware platforms. It further optimizes comprehensive inference performance and brings more deployment features and usability enhancements. For all the upgrade details, refer to the v2.3 Release Note.

[2025-09] FastDeploy v2.2: It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for baidu/ERNIE-21B-A3B-Thinking!

About

FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.

Requirements

OS: Linux
Python: 3.10 ~ 3.12

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:

Get Started

Learn how to use FastDeploy through our documentation:

Supported Models

Learn how to download models, enable using the torch format, and more:

Full Supported Models List

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.

Description

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.

android graphcore intel jetson kunlun object-detection onnx onnxruntime openvino picodet rockchip serving stable-diffusion tensorrt uie yolov5 yolov8

Readme Apache-2.0 410 MiB

Languages

Python 54.3%

C++ 24.1%

Cuda 20.6%

Shell 0.8%

C 0.1%