zhuzixuan c95b3395e9 [BugFix] Support echo in the completion API (#3245)
* wenxin-tools-511: fix the issue where v1/completion could not echo the prompt.

* Support echo for multiple prompts

* Support streaming echo when multiple prompts are given

* Added unit tests for echo support in the completion API

* pre-commit

* Removed redundant test files

* Fixed the unit-test method for completion-API echo support

* Added unit-test files

* Added unit tests

* unittest

* Added unit tests

* Fixed unit tests

* Removed unnecessary asserts.

* Resubmitted

* Updated test methods

* ut

* Unit test to verify the approach

* Unit test to verify the approach

* Unit test to verify the approach (3)

* Optimized unit-test code to narrow the test scope.

* Optimized unit-test code to narrow the test scope (2).

* Optimized unit-test code to narrow the test scope (3).

* support 'echo' in chat/completion (see the request sketch after this commit list).

* update

* update

* update

* update

* update

* update

* Added unit tests covering token IDs

* update

* Fixed index error

* Fixed index error
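
The fix above concerns the `echo` parameter of the OpenAI-style /v1/completions endpoint. Below is a minimal sketch of the resulting behavior using the OpenAI Python SDK; the base URL, API key, and model name are illustrative assumptions, not values taken from this PR:

```python
# Sketch of the fixed behavior: echoing prompts back from /v1/completions.
# Assumes a FastDeploy OpenAI-compatible server is running locally;
# base_url, api_key, and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

# Several prompts in one request; with echo=True every returned choice
# should start with the text of its corresponding prompt.
resp = client.completions.create(
    model="ERNIE-4.5-0.3B",
    prompt=["Hello, my name is", "The capital of France is"],
    max_tokens=16,
    echo=True,
)
for choice in resp.choices:
    print(choice.index, repr(choice.text))
```

Per the commit list, the same echoing should also apply with stream=True, where the prompt text is expected at the start of the streamed output for each prompt.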

Installation | Quick Start | Supported Models


FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2025-08] 🔥 Released FastDeploy v2.1: introduces a brand-new KV Cache scheduling strategy, extends PD disaggregation and CUDA Graph support to more models, and adds enhanced support for hardware platforms such as Kunlunxin and Hygon, along with comprehensive optimizations to the performance of both the service layer and the inference engine.

[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live! Complete the inference-deployment task for the ERNIE 4.5 series of open-source models to win official FastDeploy 2.0 merchandise and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌 Sign up here 📌 Event details

[2025-06] 🔥 Released FastDeploy v2.0: supports inference and deployment for ERNIE 4.5. In addition, we open-sourced an industrial-grade PD disaggregation solution with context caching and dynamic role switching for efficient resource utilization, further enhancing inference performance for MoE models.

About

FastDeploy is an inference and deployment toolkit for large language models and vision-language models, built on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

  • 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching; optimizes resource utilization while balancing SLO compliance and throughput.
  • 🔄 Unified KV Cache Transmission: Lightweight, high-performance transport library with intelligent NVLink/RDMA selection.
  • 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility (see the client sketch after this list).
  • 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
  • Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
  • 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more.
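
To make the OpenAI/vLLM compatibility point concrete, here is a minimal client sketch. The port, API key, and model name are assumptions for illustration, and a FastDeploy server is presumed to be already running (for example, one launched via the OpenAI-compatible API server entry point described in the docs):

```python
# Sketch: using the standard OpenAI SDK against a FastDeploy deployment.
# base_url, api_key, and model are placeholders; point them at your server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",
    messages=[{"role": "user", "content": "Briefly explain PD disaggregation."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Because the interface mirrors vLLM's OpenAI-compatible server, existing OpenAI-SDK clients should work against a FastDeploy deployment by changing only the base URL.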

Requirements

  • OS: Linux
  • Python: 3.10 ~ 3.12

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs, and other hardware. For detailed installation instructions, see the platform-specific guides in the documentation.

Note: We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!

Get Started

Learn how to use FastDeploy through our documentation. For a quick taste of the request flow, see the streaming sketch below.
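
A minimal end-to-end sketch that streams tokens from a running FastDeploy server; the endpoint and model name are again illustrative assumptions:

```python
# Sketch: streaming a chat response token-by-token from a running server.
# Endpoint and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="ERNIE-4.5-21B-A3B",
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```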

Supported Models

Model | Data Type | PD Disaggregation / Chunked Prefill / Prefix Caching / MTP / CUDA Graph | Maximum Context Length
ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | | 128K
ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | | 128K
ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP WIP WIP | 128K
ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | WIP WIP | 128K
ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | | 128K
ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | 128K
ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | 128K

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.
