FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-10-04 08:16:42 +08:00


celsowm 771e71a24d Feat/blackwell sm100 support (#2670)

* Add initial support for NVIDIA Blackwell (SM100) architecture

This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x)
with '100a' architecture-specific features (e.g., for CUTLASS).

Key changes:
- Updated custom_ops/setup_ops.py to generate the appropriate gencode
  flags (arch=compute_100a,code=sm_100a) when '100' is specified
  in FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
  - Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
  - Added BLACKWELL to CandidateConfigTypeParam.
  - Updated CutlassGemmConfig struct with is_sm100 flag,
    tile_config_sm100, and a new constructor for SM100.
  - Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
  - Added get_candidate_tiles_sm100() (with placeholder tiles).
  - Added placeholder mcast support functions for SM100.
  - Updated get_candidate_configs() to include SM100 paths using
    the BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100'
  for Blackwell in FD_BUILDING_ARCS.
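
The gencode mapping described for custom_ops/setup_ops.py can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the helper names `parse_archs` and `gencode_flags` are hypothetical, and the accepted format of FD_BUILDING_ARCS (a list of integers such as "[80, 100]") is an assumption.

```python
import re


def parse_archs(env_value):
    """Extract compute capabilities from a string such as "[80, 100]".

    Assumption: the value is a list of integers; the real parsing in
    setup_ops.py may differ.
    """
    return [int(token) for token in re.findall(r"\d+", env_value)]


def gencode_flags(archs):
    """Build nvcc -gencode flags for each requested architecture."""
    flags = []
    for arch in archs:
        # Per the commit message, '100' (Blackwell) maps to the
        # architecture-specific 'a' target: compute_100a / sm_100a.
        suffix = "a" if arch == 100 else ""
        flags += ["-gencode", f"arch=compute_{arch}{suffix},code=sm_{arch}{suffix}"]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags(parse_archs("[80, 100]"))))
```

Only the SM100 mapping is taken from the commit message; how other architectures are suffixed in the real script is not covered here.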

Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched
and updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
SM100-specific versions if Blackwell's hardware features for FP8/TMA
differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM)
with Blackwell should be fully verified.

* Feat: Implement detailed Blackwell (SM100) CUTLASS heuristics

This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:

- Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`:
  - Populated `CutlassTileConfigSM100` enum with specific tile shapes
    (e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
  - Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths.

- Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`:
  - Implemented `get_candidate_tiles_sm100` with detailed logic for
    selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
    using the new SM100 tile enums.
  - Implemented `supports_mcast_along_m_sm100` and
    `supports_mcast_along_n_sm100` with specific tile checks for Blackwell.
  - Updated the `sm == 100` (Blackwell) block in `get_candidate_configs`
    to use these new helper functions and accurately populate candidate
    kernel configurations for various cluster shapes.

- `custom_ops/setup_ops.py` remains configured to compile for
  `arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features.

This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.
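
The overall shape of the selection flow described above (pick candidate tiles from the flags, then keep only the cluster shapes a tile can multicast across) might look like the Python model below. It is a sketch under stated assumptions, not a port of cutlass_heuristic.cu: the two tile names come from the commit message, but the selection rule for FP4_ONLY, the multicast thresholds, and the list of cluster shapes are all invented for illustration.

```python
from enum import Flag, auto


class ConfigType(Flag):
    """Stand-in for CandidateConfigTypeParam (only the flags named in the commit)."""
    NONE = 0
    GROUPED_GEMM = auto()
    FP4_ONLY = auto()
    BLACKWELL = auto()


# Tile names quoted in the commit message; the real enum has more entries.
TILES = ("CtaShape64x64x128B", "CtaShape128x128x128B")
# Hypothetical cluster shapes (cluster_m, cluster_n).
CLUSTER_SHAPES = ((1, 1), (2, 1), (1, 2))


def tile_mn(tile):
    """Parse "CtaShape128x128x128B" into its (M, N) extents."""
    dims = tile[len("CtaShape"):].rstrip("B").split("x")
    return int(dims[0]), int(dims[1])


def candidate_tiles(flags):
    # Hypothetical rule: FP4-only paths are restricted to the larger tile.
    if ConfigType.FP4_ONLY in flags:
        return [TILES[1]]
    return list(TILES)


def supports_mcast_along_m(tile):
    # Hypothetical threshold, not the check used in the repository.
    return tile_mn(tile)[0] >= 128


def supports_mcast_along_n(tile):
    # Hypothetical threshold, not the check used in the repository.
    return tile_mn(tile)[1] >= 128


def candidate_configs(flags):
    """Return (tile, cluster_shape) pairs for the SM100 path."""
    if ConfigType.BLACKWELL not in flags:
        return []
    configs = []
    for tile in candidate_tiles(flags):
        for cluster_m, cluster_n in CLUSTER_SHAPES:
            ok_m = cluster_m == 1 or supports_mcast_along_m(tile)
            ok_n = cluster_n == 1 or supports_mcast_along_n(tile)
            if ok_m and ok_n:
                configs.append((tile, (cluster_m, cluster_n)))
    return configs
```

Separating the multicast predicates from the enumeration loop mirrors the helper-function structure the commit describes, so the placeholder rules can be replaced without touching the loop.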

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

2025-07-09 15:29:42 +08:00

arch

Sync v2.0 version of code to github repo

2025-06-29 23:29:37 +00:00

epilogue

Sync v2.0 version of code to github repo

2025-06-29 23:29:37 +00:00

gemm

Sync v2.0 version of code to github repo

2025-06-29 23:29:37 +00:00

transform/threadblock

[LLM] First commit the llm deployment code