celsowm 771e71a24d Feat/blackwell sm100 support (#2670)
* Add initial support for NVIDIA Blackwell (SM100) architecture

This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x)
with '100a' architecture-specific features (e.g., for CUTLASS).

Key changes:
- Updated custom_ops/setup_ops.py to generate appropriate gencode
  flags (arch=compute_100a,code=sm_100a) when '100' is specified
  in FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
    - Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
    - Added BLACKWELL to CandidateConfigTypeParam.
    - Updated the CutlassGemmConfig struct with an is_sm100 flag, a
      tile_config_sm100 field, and a new constructor for SM100
      (sketched after this list).
    - Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
    - Added get_candidate_tiles_sm100() (with placeholder tiles).
    - Added placeholder mcast support functions for SM100.
    - Updated get_candidate_configs() to include SM100 paths using
      the BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100'
  for Blackwell in FD_BUILDING_ARCS.
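
For reference, a minimal sketch of what the gemm_configs.h additions look
like at this stage. The names CutlassTileConfigSM100, CandidateConfigTypeParam,
BLACKWELL, is_sm100, and tile_config_sm100 come from the change above; the
enum entries, flag bit values, and the simplified ClusterShape stand-in are
illustrative assumptions rather than the exact diff.

```cpp
// Minimal sketch (not the verbatim diff) of the gemm_configs.h additions.
// The SM100 tile shapes are placeholders at this stage; the follow-up
// commit fills in the real candidate set.
enum class CutlassTileConfigSM100 {
  Undefined,             // no tile chosen yet
  ChooseWithHeuristic,   // defer the choice to the runtime heuristic
  CtaShape64x64x128B,    // placeholder entries
  CtaShape128x128x128B,
};

enum CandidateConfigTypeParam : int {
  NONE = 0,
  GROUPED_GEMM = 1 << 0,  // existing flag, shown for context (value illustrative)
  BLACKWELL = 1 << 1,     // new: requests the SM100 candidate paths
};

// Simplified stand-in for the cluster-shape enum already used by this header.
enum class ClusterShape {
  ClusterShape_1x1x1,
  ClusterShape_2x1x1,
  ClusterShape_1x2x1,
  ClusterShape_2x2x1,
};

struct CutlassGemmConfig {
  // ...existing SM80/SM90 fields elided...
  bool is_sm100 = false;
  CutlassTileConfigSM100 tile_config_sm100 = CutlassTileConfigSM100::ChooseWithHeuristic;
  ClusterShape cluster_shape = ClusterShape::ClusterShape_1x1x1;

  CutlassGemmConfig() = default;

  // New constructor used when enumerating SM100 candidate configs.
  CutlassGemmConfig(CutlassTileConfigSM100 tile, ClusterShape cluster)
      : is_sm100(true), tile_config_sm100(tile), cluster_shape(cluster) {}
};
```

toString() and fromString() then also carry is_sm100 and tile_config_sm100
alongside the existing fields, presumably so serialized configs round-trip
the SM100 state.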

Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched
  and updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
  SM100-specific versions if Blackwell's hardware features for FP8/TMA
  differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM)
  with Blackwell should be fully verified.

* Feat: Implement detailed Blackwell (SM100) CUTLASS heuristics

This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:

- Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`:
    - Populated `CutlassTileConfigSM100` enum with specific tile shapes
      (e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
    - Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths.

- Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`:
    - Implemented `get_candidate_tiles_sm100` with detailed logic for
      selecting tile configurations based on the GROUPED_GEMM and FP4_ONLY
      flags, using the new SM100 tile enums (see the sketch after this list).
    - Implemented `supports_mcast_along_m_sm100` and
      `supports_mcast_along_n_sm100` with Blackwell-specific tile checks.
    - Updated the `sm == 100` (Blackwell) block in `get_candidate_configs`
      to use these new helper functions and populate candidate kernel
      configurations for the supported cluster shapes (illustrated in the
      sketch below).

- `custom_ops/setup_ops.py` remains configured to compile for
  `arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features.
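
To make the selection flow above concrete, below is a hedged sketch of the
new heuristic helpers, reusing the types from the earlier gemm_configs.h
sketch. The concrete tile sets and multicast conditions are simplified
assumptions; the real tables live in cutlass_heuristic.cu.

```cpp
// Hedged sketch of the SM100 heuristic helpers; tile sets and mcast checks
// are illustrative, not the exact FastDeploy implementation.
#include <vector>

// New flag for the FP4 paths. Declared as a separate constant here because
// the earlier sketch predates it; in the header it is another
// CandidateConfigTypeParam entry (bit value illustrative).
constexpr int FP4_ONLY = 1 << 2;

std::vector<CutlassTileConfigSM100> get_candidate_tiles_sm100(
    CandidateConfigTypeParam const config) {
  if (config & FP4_ONLY) {
    // Dedicated candidate set for the FP4 data path.
    return {CutlassTileConfigSM100::CtaShape128x128x128B /* , ... */};
  }
  if (config & GROUPED_GEMM) {
    // Grouped (MoE) GEMM favours smaller CTA tiles to keep occupancy high.
    return {CutlassTileConfigSM100::CtaShape64x64x128B,
            CutlassTileConfigSM100::CtaShape128x128x128B /* , ... */};
  }
  // Default dense GEMM candidates.
  return {CutlassTileConfigSM100::CtaShape64x64x128B,
          CutlassTileConfigSM100::CtaShape128x128x128B /* , ... */};
}

// Multicast along M or N only pays off when the CTA tile is large enough in
// that dimension; the real checks enumerate the exact tile shapes.
bool supports_mcast_along_m_sm100(CutlassTileConfigSM100 tile) {
  return tile == CutlassTileConfigSM100::CtaShape128x128x128B;
}

bool supports_mcast_along_n_sm100(CutlassTileConfigSM100 tile) {
  return tile != CutlassTileConfigSM100::Undefined;
}
```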

This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.
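
As a rough illustration of how the `sm == 100` branch assembles candidates
from those helpers, the loop below crosses each candidate tile with the
cluster shapes its multicast support permits. `get_candidate_configs_sm100`
is a hypothetical free function used here for clarity; in the actual code
this logic sits inside `get_candidate_configs`.

```cpp
// Hypothetical stand-in for the sm == 100 branch of get_candidate_configs();
// cluster-shape handling is simplified.
#include <vector>

std::vector<CutlassGemmConfig> get_candidate_configs_sm100(
    CandidateConfigTypeParam const config) {
  std::vector<CutlassGemmConfig> candidates;
  for (auto tile : get_candidate_tiles_sm100(config)) {
    // A 1x1x1 cluster is always a valid fallback.
    candidates.emplace_back(tile, ClusterShape::ClusterShape_1x1x1);
    bool const mcast_m = supports_mcast_along_m_sm100(tile);
    bool const mcast_n = supports_mcast_along_n_sm100(tile);
    if (mcast_m) candidates.emplace_back(tile, ClusterShape::ClusterShape_2x1x1);
    if (mcast_n) candidates.emplace_back(tile, ClusterShape::ClusterShape_1x2x1);
    if (mcast_m && mcast_n) candidates.emplace_back(tile, ClusterShape::ClusterShape_2x2x1);
  }
  return candidates;
}
```

Enumerating only the cluster shapes a tile's multicast support allows keeps
the autotuning search space small while still covering the useful SM100
combinations.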

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-09 15:29:42 +08:00

Installation | Quick Start | Supported Models


FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2025-06] 🔥 Released FastDeploy v2.0: Supports inference and deployment for ERNIE 4.5. We also open-source an industrial-grade PD disaggregation solution with context caching and dynamic role switching for efficient resource utilization, further enhancing inference performance for MoE models.

About

FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

  • 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
  • 🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
  • 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
  • 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
  • Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
  • 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.

Requirements

  • OS: Linux
  • Python: 3.10 ~ 3.12

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, and other hardware. For detailed installation instructions, refer to the documentation for each platform.

Note: We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU, Hygon DCU, and MetaX GPU are currently under development and testing. Stay tuned for updates!

Get Started

Learn how to use FastDeploy through our documentation.

Supported Models

Model Data Type PD Disaggregation Chunked Prefill Prefix Caching MTP CUDA Graph Maximum Context Length
ERNIE-4.5-300B-A47B BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 (WINT4) WIP 128K
ERNIE-4.5-300B-A47B-Base BF16/WINT4/WINT8 (WINT4) WIP 128K
ERNIE-4.5-VL-424B-A47B BF16/WINT4/WINT8 WIP WIP WIP 128K
ERNIE-4.5-VL-28B-A3B BF16/WINT4/WINT8 WIP WIP 128K
ERNIE-4.5-21B-A3B BF16/WINT4/WINT8/FP8 WIP 128K
ERNIE-4.5-21B-A3B-Base BF16/WINT4/WINT8/FP8 WIP 128K
ERNIE-4.5-0.3B BF16/WINT8/FP8 128K

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.
