mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-12-24 13:28:13 +08:00

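zhuzixuan c95b3395e9 [BugFix] Support echo in the completion interface (#3245 )

* wenxin-tools-511: fix v1/completion being unable to echo the prompt.

* Support echo with multiple prompts

* Support streaming echo with multiple prompts

* Add unit tests for echo support in the completion interface

* pre-commit

* Remove redundant test files

* Fix the unit-test method for echo support in the completion interface

* Add unit test files

* Add unit tests

* unittest

* Add unit tests

* Fix unit tests

* Remove unnecessary assert.

* Resubmit

* Update test method

* ut

* Unit test to verify whether this is the right approach

* Unit test to verify whether this is the right approach

* Unit test to verify whether this is the right approach 3

* Optimize unit-test code; narrow the test scope in a targeted way.

* Optimize unit-test code 2; narrow the test scope in a targeted way.

* Optimize unit-test code 3; narrow the test scope in a targeted way.

* support 'echo' in chat/completion.

* update

* update

* update

* update

* update

* update

* Add unit tests for tokenid

* update

* Fix index error

* Fix index error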

2025-08-19 10:41:51 +08:00

.github/workflows

add publish workflow (#3063 )

2025-08-18 16:42:36 +08:00

benchmarks

add error info (#3040 )

2025-07-28 15:10:28 +08:00

custom_ops

[Code Simplification] remove cum_offsets (#3410 )

2025-08-18 20:21:25 +08:00

dockerfiles

Update README (#3426 )

2025-08-15 18:46:28 +08:00

docs

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

fastdeploy

[BugFix] Support echo in the completion interface (#3245 )

2025-08-19 10:41:51 +08:00

scripts

[Excutor] Increase buffer size to prevent address corruption; add forward metadata debug tool (#3404 )

2025-08-18 16:14:09 +08:00

test

[BugFix] Support echo in the completion interface (#3245 )

2025-08-19 10:41:51 +08:00

tools

add publish workflow (#3063 )

2025-08-18 16:42:36 +08:00

.clang-format

Sync v2.0 version of code to github repo

2025-06-29 23:29:37 +00:00

.flake8

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

.gitignore

[New Feature] Support W4Afp8 MoE GroupGemm (#3171 )

2025-08-06 10:34:05 +08:00

.pre-commit-config.yaml

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

build.sh

[OPs] Universal optimization and Fix early_stop cuda 700 (#3375 )

2025-08-14 22:40:44 +08:00

LICENSE

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

mkdocs.yml

[Docs] Update mkdocs.yml (#3444 )

2025-08-15 21:57:40 +08:00

pyproject.toml

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

README_CN.md

Update README (#3426 )

2025-08-15 18:46:28 +08:00

README.md

Update README (#3426 )

2025-08-15 18:46:28 +08:00

requirements_dcu.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements_iluvatar.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements_metaxgpu.txt

[Feature] add custom chat template (#3251 )

2025-08-18 16:34:08 +08:00

requirements.txt

add Tool Parser (#3272 )

2025-08-13 01:06:55 +08:00

setup.py

[MetaxGPU] Support FastDeploy on metax gpu (#3241 )

2025-08-13 11:11:54 +08:00

README.md

English | 简体中文

Installation | Quick Start | Supported Models

FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2025-08] 🔥 Released FastDeploy v2.1: Introduces a brand-new KV Cache scheduling strategy and extends PD disaggregation and CUDA Graph support to more models. Adds enhanced hardware support for platforms such as Kunlun and Hygon, along with comprehensive optimizations that improve the performance of both the service and the inference engine.

[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live! Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌Sign up here 📌Event details

[2025-06] 🔥 Released FastDeploy v2.0: Supports inference and deployment for ERNIE 4.5. We also open-sourced an industrial-grade PD disaggregation solution with context caching and dynamic role switching, which uses resources more effectively and further improves inference performance for MoE models.

About

FastDeploy is an inference and deployment toolkit for large language models (LLMs) and vision-language models (VLMs) based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions built on the following core acceleration technologies:

🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, etc.
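
Because the server is described as OpenAI API compatible, a client can in principle talk to it with a plain HTTP request. The following is a minimal sketch, not taken from the FastDeploy documentation: it assumes a server is already running, and the address `http://localhost:8000`, the model name `default`, and the helper names `build_completion_request` and `send_completion` are all hypothetical placeholders. The route `/v1/completions` and the `prompt`, `max_tokens`, and `echo` fields follow the OpenAI completions convention; the latest commit in the listing above indicates that `echo` is supported on the completion interface, including with multiple prompts.

```python
import json
import urllib.request

# Assumption: a FastDeploy server is already running at this address.
# The host and port are placeholders, not values from the README.
BASE_URL = "http://localhost:8000"


def build_completion_request(prompt, model="default", max_tokens=32, echo=False):
    """Build (url, headers, body) for an OpenAI-style /v1/completions call.

    `prompt` may be a single string or a list of strings.
    `model` is a placeholder name; use whatever the server was started with.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        # OpenAI-style flag asking the server to return the prompt
        # together with the generated text.
        "echo": echo,
    }
    url = f"{BASE_URL}/v1/completions"
    headers = {"Content-Type": "application/json"}
    return url, headers, json.dumps(payload).encode("utf-8")


def send_completion(prompt, **kwargs):
    """Send the request and return the decoded JSON response.

    Requires a reachable server; raises urllib.error.URLError otherwise.
    """
    url, headers, body = build_completion_request(prompt, **kwargs)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send_completion(["Hello", "Hi"], echo=True)` would send both prompts in one request. Keeping the payload construction separate from the network call makes the request shape easy to inspect or test without a live server.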

Requirements

OS: Linux
Python: 3.10 ~ 3.12
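
The two requirements above can be checked before installing. This is a small illustrative sketch using only the Python standard library; the function name `check_requirements` is hypothetical, and it reads "3.10 ~ 3.12" as an inclusive range of minor versions, which is an assumption about the intended meaning.

```python
import platform
import sys


def check_requirements(system=None, version=None):
    """Return a list of problems; an empty list means the host looks suitable.

    `system` and `version` default to the running interpreter's values and
    can be overridden for testing.
    """
    system = platform.system() if system is None else system
    version = sys.version_info[:2] if version is None else tuple(version[:2])

    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux required)")
    # Assumption: the stated range is inclusive at both ends.
    if not ((3, 10) <= version <= (3, 12)):
        problems.append(
            f"unsupported Python {version[0]}.{version[1]} (3.10 to 3.12 required)"
        )
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```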

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:

Note: We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!

Get Started

Learn how to use FastDeploy through our documentation:

Supported Models

Model	Data Type	PD Disaggregation	Chunked Prefill	Prefix Caching	MTP	CUDA Graph	Maximum Context Length
ERNIE-4.5-300B-A47B	BF16/WINT4/WINT8/W4A8C8/WINT2/FP8	✅	✅	✅	✅	✅	128K
ERNIE-4.5-300B-A47B-Base	BF16/WINT4/WINT8	✅	✅	✅	❌	✅	128K
ERNIE-4.5-VL-424B-A47B	BF16/WINT4/WINT8	WIP	✅	WIP	❌	WIP	128K
ERNIE-4.5-VL-28B-A3B	BF16/WINT4/WINT8	❌	✅	WIP	❌	WIP	128K
ERNIE-4.5-21B-A3B	BF16/WINT4/WINT8/FP8	❌	✅	✅	✅	✅	128K
ERNIE-4.5-21B-A3B-Base	BF16/WINT4/WINT8/FP8	✅	✅	✅	❌	✅	128K
ERNIE-4.5-0.3B	BF16/WINT8/FP8	✅	✅	✅	❌	✅	128K

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.

Description

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud, 📱Mobile, and 📹Edge. Covers 20+ mainstream Image, Video, Text, and Audio scenarios and 150+ SOTA models with end-to-end optimization and multi-platform, multi-framework support.

android graphcore intel jetson kunlun object-detection onnx onnxruntime openvino picodet rockchip serving stable-diffusion tensorrt uie yolov5 yolov8

Readme Apache-2.0 410 MiB

Languages

Python 54.3%

C++ 24.1%

Cuda 20.6%

Shell 0.8%

C 0.1%