English | [简体中文](README_CN.md)

--------------------------------------------------------------------------------

# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

## News

**[2025-08] 🔥 Released FastDeploy v2.1:** Introduces a brand-new KV Cache scheduling strategy, extends PD disaggregation and CUDA Graph support to more models, adds enhanced support for hardware platforms such as Kunlun and Hygon, and delivers comprehensive optimizations to both the service layer and the inference engine.

**[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live!** Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎉 You're welcome to try it out and share your feedback! 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📘[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)

**[2025-06] 🔥 Released FastDeploy v2.0:** Supports inference and deployment for ERNIE 4.5. In addition, we open-source an industrial-grade PD disaggregation solution with context caching and dynamic role switching for efficient resource utilization, further enhancing inference performance for MoE models.

## About

**FastDeploy** is an inference and deployment toolkit for large language models and vision-language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:

- 🏆 **Load-Balanced PD Disaggregation**: Industrial-grade solution featuring context caching and dynamic instance role switching; optimizes resource utilization while balancing SLO compliance and throughput.
- 🔄 **Unified KV Cache Transmission**: Lightweight, high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 **OpenAI API Server and vLLM Compatible**: One-command deployment with [vLLM](https://github.com/vllm-project/vllm/) interface compatibility (a minimal client sketch follows the Installation section below).
- 🧮 **Comprehensive Quantization Format Support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ **Advanced Acceleration Techniques**: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more.

## Requirements

- OS: Linux
- Python: 3.10 ~ 3.12

## Installation

FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, **Hygon DCUs**, and other hardware. For detailed installation instructions:

- [NVIDIA GPU](./docs/get_started/installation/nvidia_gpu.md)
- [Kunlunxin XPU](./docs/get_started/installation/kunlunxin_xpu.md)
- [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
- [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
- [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)

**Note:** We are actively working on expanding hardware support. Additional platforms, including Ascend NPU and MetaX GPU, are currently under development and testing. Stay tuned for updates!
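Once installed and a FastDeploy API server has been launched per the online serving documentation, the OpenAI-compatible interface mentioned above can be exercised with any OpenAI-style client. The following is a minimal sketch, not an official example: the endpoint `http://localhost:8180/v1`, the `EMPTY` API key placeholder, and the model name are illustrative assumptions — consult the deployment docs linked below for the actual launch command and defaults.

```python
# Minimal sketch of querying a FastDeploy OpenAI-compatible endpoint.
# Assumed (not FastDeploy defaults): a server already running at
# http://localhost:8180/v1 that serves a model named "baidu/ERNIE-4.5-0.3B-Paddle".
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",  # assumed local FastDeploy endpoint
    api_key="EMPTY",                      # placeholder; local servers typically ignore it
)

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-0.3B-Paddle",  # illustrative model name
    messages=[{"role": "user", "content": "Introduce PaddlePaddle in one sentence."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Because the server speaks the OpenAI protocol, any OpenAI-compatible SDK, or a plain HTTP request to `/v1/chat/completions`, should work the same way.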
## Get Started

Learn how to use FastDeploy through our documentation:

- [10-Minute Quick Deployment](./docs/get_started/quick_start.md)
- [ERNIE-4.5 Large Language Model Deployment](./docs/get_started/ernie-4.5.md)
- [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
- [Best Practices](./docs/best_practices/README.md)

## Supported Models

| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅ | ✅ | ✅ | ✅ | ✅ | 128K |
| ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |
| ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP | 128K |
| ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ✅ | ✅ | WIP | ❌ | WIP | 128K |
| ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ✅ | ✅ | 128K |
| ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |
| ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅ | 128K |

## Advanced Usage

- [Quantization](./docs/quantization/README.md)
- [PD Disaggregation Deployment](./docs/features/disaggregated.md)
- [Speculative Decoding](./docs/features/speculative_decoding.md)
- [Prefix Caching](./docs/features/prefix_caching.md)
- [Chunked Prefill](./docs/features/chunked_prefill.md)

## Acknowledgement

FastDeploy is licensed under the [Apache-2.0 open-source license](./LICENSE). During development, portions of [vLLM](https://github.com/vllm-project/vllm) code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.