* Change wint2 to ColumnMajor.
Change-Id: I6b44d02946a685f8fe24d9f2c7be258b51e16da2
* Unify default_wint2x_mma.
Change-Id: I9e77b0e8e6cecab01fedc0b24b536ee0a1a89ff7
* Change wint2 to ColumnMajorTileInterleave.
Change-Id: I593cbe36f991c0c5044989d65f0014087587c624
* Enable async copy for B.
Change-Id: Ia3ac37ad162a8cf3ccce4f268e81bd06c8ac3c46
* Add wint2x Dequantizer.
* Remove TileDequanterB-related code.
Change-Id: Id8e65703b72a8984d367f584ff41b7726017fbb8
* Implement FastInterleavedAndBiasedNumericArrayConverter for wint2.
Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca
* Implement Wint2ParamsAccessor to load extra quant params from global memory.
Change-Id: Ic3750cd9b767df8893501820880c3342a4b47233
* Use async copy for local_scale.
Change-Id: Ib882ba41c3d2354bda4d25b40e2408ad3b2f7893
* Check and correct the loading and dequantization of weights.
Change-Id: Ie8dca505b39987144964fe6407d465b3b5953790
* Adjust code for performance tuning.
Change-Id: I1da026fb1d1533a9d70350c7ba23c27e896cfc29
* Optimize the global memory access size of local_scale reading.
Change-Id: I4cbe3a2ef5951723d415c2d3252ce912394beaf5
* Specialize mma_tensor_op for wint2 to enable fine-grained pipeline.
Change-Id: Icbb4d48f90a41136f42d6ffff42d68de32f408da
* Minor fix.
Change-Id: I14d4ac9d267ee05442a3b47f00c26bee13d79e6f
* Optimize dequant performance with LOP3.
* Avoid redundant dequantization of local_scale and use bf16 as computing type.
Change-Id: I63239ebc8f8e4a92d6281af59840ba50600b4334
* Add Multiplier and remove some logs.
Change-Id: Ifa199d81e6aeb472d2247c63f85ef30213684bcd
* Optimize dequant performance with LOP3.
* Use __byte_perm to implement int8-to-float32 conversion for better performance.
* Use LOP3 to optimize the dequantization of local_scale.
Change-Id: I6189759970cb5b8dcbef769724784b8a7533b63c
* Minor fix and remove some logs.
Change-Id: I6279ba9926d5041093b1c6aea200acf2e4c49d46
* Fix stages for testing.
Change-Id: I6f7b7cac612ef2c678e9d49f5ffa60eb53d3ae29
* Fix stages for testing and add clock64-based profiling.
Change-Id: Iffaf7324beaa910ce9ee56f47ae289de98f1a267
* Use __byte_perm to replace shift-and-or operations for faster integer merging.
* Split the uint2b conversion.
Change-Id: I78da672ce8968e21f685285140ba546a161521b4
* Optimize the conversion of unscale.
Change-Id: I6795da1cdf5e8ab38ddaa9836240921b5312913a
* Minor optimization.
Change-Id: I1800aec34c3f4621abb02658208108f54da44d88
* Optimize the mma pipeline and refine the code.
Change-Id: Id3075cf7b88f2813a11ccd1d3b49c62c978f36b8
* Add missing support.
Change-Id: Id65b7bc2c25fbb1a5b232c6bc9fb8c9093f691a8
* Accelerate FP16 dequantization.
* Support tile shapes of the form Xx64x64.
Change-Id: Ib8fd37e1ba1d06f7d11f2956e7f1367b0a92bcac
* Remove debugging code and apply minor optimizations.
Change-Id: I6b79bd56a6e8dd823efc169967ecd3cc9a43baf4
* Fix offset bug.
Change-Id: Id7aeb91e99d6f51836f2aff22187b4f79607395e
* Fix typo.
Change-Id: I19dde93fc1c1f7e19605905c90dc46298e203952
* Restore some code and remove some debugging logs.
Change-Id: I8d44daf82ad1c6f8174134d195e7b3fe9a3afdfb
---------
Co-authored-by: baoqiwen <baoqiwen@baidu.com>
* Add initial support for NVIDIA Blackwell (SM100) architecture
This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x)
with '100a' architecture-specific features (e.g., for CUTLASS).
Key changes:
- Updated custom_ops/setup_ops.py to generate appropriate gencode
flags (arch=compute_100a,code=sm_100a) when '100' is specified
in FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
- Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
- Added BLACKWELL to CandidateConfigTypeParam.
- Updated CutlassGemmConfig struct with is_sm100 flag,
tile_config_sm100, and new constructor for SM100.
- Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
- Added get_candidate_tiles_sm100() (with placeholder tiles).
- Added placeholder mcast support functions for SM100.
- Updated get_candidate_configs() to include SM100 paths using
the BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100'
for Blackwell in FD_BUILDING_ARCS.
Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched
and updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
SM100-specific versions if Blackwell's hardware features for FP8/TMA
differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM)
with Blackwell should be fully verified.
* Feat: Implement detailed Blackwell (SM100) CUTLASS heuristics
This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:
- Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`:
- Populated `CutlassTileConfigSM100` enum with specific tile shapes
(e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
- Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths.
- Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`:
- Implemented `get_candidate_tiles_sm100` with detailed logic for
selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
using the new SM100 tile enums.
- Implemented `supports_mcast_along_m_sm100` and
`supports_mcast_along_n_sm100` with specific tile checks for Blackwell.
- Updated the `sm == 100` (Blackwell) block in `get_candidate_configs`
to use these new helper functions and accurately populate candidate
kernel configurations for various cluster shapes.
- `custom_ops/setup_ops.py` remains configured to compile for
`arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features.
This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.
---------
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>