mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-12-24 13:28:13 +08:00

Go to file

李泳桦 6265f4385f [feat] support prefix cache clearing when /clear_load_weight is called (#4008 )

* [feat] support clearing prefix cache (cherry-picked from release/2.1)

* [fix] fix ipc suffix, use port instead

* [fix] fix prefix caching not enabled

* [fix] fix key/value_cache_scales indent

* [fix] fix ep group all-reduce

* [fix] fix clear/update lock not working when workers > 1

* [chore] add preemption triggered info log

* [fix] fix code style

* [fix] fix max_num_seqs config

* [fix] do not force enable_prefix_caching=False in dynamic loading

* [fix] fix ci

* Revert "[fix] fix ci"

This reverts commit 0bc6d55cc8.

* [fix] initialize available_gpu_block_num with max_gpu_block_num

* [fix] fix config splitwise_role

* [fix] fix clearing caches synchronization and add more logs

* [chore] print cache_ready_signal in log

* [fix] fix scheduler_config.splitwise_role

* [fix] fix cache_messager cache_ready_signal create=True

* [fix] stop cache messager from launching in mixed deployment

2025-09-28 19:42:53 +08:00

.github/workflows

increase ccache size (#4255 )

2025-09-26 17:40:07 +08:00

benchmarks

add glm benchmark yaml (#4289 )

2025-09-26 14:23:29 +08:00

custom_ops

[feat] support prefix cache clearing when /clear_load_weight is called (#4008 )

2025-09-28 19:42:53 +08:00

dockerfiles

更新文档 (#3998 )

2025-09-09 10:44:15 +08:00

docs

[Docs]When XPU starts the service, the model loader uses the default version (#4292 )

2025-09-26 15:58:12 +08:00

fastdeploy

[feat] support prefix cache clearing when /clear_load_weight is called (#4008 )

2025-09-28 19:42:53 +08:00

scripts

Set approve checking for config.py, worker, model and cudagraph (#4276 )

2025-09-26 14:50:54 +08:00

tests

Add cli run batch (#4237 )

2025-09-26 14:27:25 +08:00

tools

deepgemm pre-compile tool support mixed parallel (#4282 )

2025-09-26 18:43:39 +08:00

.clang-format

Sync v2.0 version of code to github repo

2025-06-29 23:29:37 +00:00

.flake8

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

.gitignore

[Optimize]support machete weight only gemm (#3561 )

2025-08-28 09:49:58 +08:00

.gitmodules

add ignore=all for deepgemm (#4118 )

2025-09-15 21:52:00 +08:00

.pre-commit-config.yaml

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

build.sh

[Intel HPU] Support intel hpu platform (#4161 )

2025-09-24 12:27:50 +08:00

LICENSE

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

mkdocs.yml

fix typos (#4176 )

2025-09-22 14:27:17 +08:00

pyproject.toml

Fix target_version (#3159 )

2025-08-28 14:17:54 +08:00

README_CN.md

[Intel HPU] Support intel hpu platform (#4161 )

2025-09-24 12:27:50 +08:00

README.md

[Intel HPU] Support intel hpu platform (#4161 )

2025-09-24 12:27:50 +08:00

requirements_dcu.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements_iluvatar.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements_metaxgpu.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements.txt

[Feature] support pool (#3827 )

2025-09-22 14:09:09 +08:00

setup.py

[Feature] add cli command serve (#4226 )

2025-09-24 14:50:45 +08:00

README.md

English | 简体中文

Installation | Quick Start | Supported Models

FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2025-09] 🔥 FastDeploy v2.2 is newly released! It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for baidu/ERNIE-21B-A3B-Thinking!

[2025-08] 🔥 Released FastDeploy v2.1: A brand-new KV Cache scheduling strategy has been introduced, and expanded support for PD separation and CUDA Graph across more models. Enhanced hardware support has been added for platforms like Kunlun and Hygon, along with comprehensive optimizations to improve the performance of both the service and inference engine.

[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live! Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌Sign up here 📌Event details

[2025-06] 🔥 Released FastDeploy v2.0: Supports inference and deployment for ERNIE 4.5. Furthermore, we open-source an industrial-grade PD disaggregation with context caching, dynamic role switching for effective resource utilization to further enhance inference performance for MoE models.

About

FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.

Requirements

OS: Linux
Python: 3.10 ~ 3.12

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:

Note: We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!

Get Started

Learn how to use FastDeploy through our documentation:

Supported Models

Learn how to download models, enable using the torch format, and more:

Full Supported Models List

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.

Description

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.

android graphcore intel jetson kunlun object-detection onnx onnxruntime openvino picodet rockchip serving stable-diffusion tensorrt uie yolov5 yolov8

Readme Apache-2.0 410 MiB

Languages

Python 54.3%

C++ 24.1%

Cuda 20.6%

Shell 0.8%

C 0.1%