Commit Graph

50 Commits

Author SHA1 Message Date
AIbin
22fe695f1c 【Inference Optimize】Support automatic generation of marlin kernel (#3149)
* Support automatic generation of marlin kernel
2025-08-01 22:43:18 +08:00
yangjianfengo1
64d7a3194d 集中式支持fa3 (#3112) 2025-08-01 18:03:36 +08:00
Ryan
94264bbf60 [Code Simplification] Refactor Post-processing in VL Model Forward Method (#2937)
* rm sth useless

* refactor model forward

* mv bool index to kernel
2025-08-01 17:28:07 +08:00
chen
a2f5cc54f8 moe preprocess op support 160 experts and fused_moe triton kernel name add K (#3121) 2025-08-01 10:46:20 +08:00
RAM
d850660872 [Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel (#2989)
* reset decoder_block_shape_q buffer

* refactor GetBlockShapeAndSplitKVBlock Kernel and cudagraph padding batch

* update decode_max_tile_size

* fix pre-commit

* update block_multihead_attn_backend

* update flas attn backend

* update MLA Attention

* update XPU Attention

* update gcu,iluvatar model runner

* Update MTP

* fix MTP bug
2025-07-31 00:09:31 +08:00
ming1753
5acde4eb43 [Feature] Multimodal Scheduler V1 (#3019)
* [Feature] Support multimodal scheduler v1

* remove debug log

* fix bug

* fix format

* modify code

* fix bug

* fix bug

* fix bug

* modify code
2025-07-30 16:05:55 +08:00
Sunny-bot1
74aa31d15b [Feature] support bad_words (#3055)
* support bad_words

* support online infer bad_words

* update

* add CI test

* update

* update

* update

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-30 09:31:29 +08:00
JYChen
dafe02a7b9 [stop sequence] support stop sequence (#3025)
* stop seqs in multi-ends

* unittest for gpu stop op

* kernel tid==0
2025-07-29 14:17:37 +08:00
yinwei
f2a528f9ae [XPU] Support kvblock centralized management (#3017) 2025-07-29 10:40:55 +08:00
Yuan Xiaolan
7d87aaace8 optimize w4a8 decoding (#3050) 2025-07-28 22:20:13 +08:00
lizhenyun01
e80ea8a71b remove Synchronize in hadamard 2025-07-28 19:22:46 +08:00
lizhenyun01
238766e403 fix c4 prompt_cache 2025-07-28 14:31:37 +08:00
Yiqun Liu
8f426c1690 Optimize the performance of moe_expert_ffn_wint2 (#2990)
* Change wint2 to ColumnMajor.

Change-Id: I6b44d02946a685f8fe24d9f2c7be258b51e16da2

* Unify default_wint2x_mma.

Change-Id: I9e77b0e8e6cecab01fedc0b24b536ee0a1a89ff7

* Change wint2 to ColumnMajorTileInterleave.

Change-Id: I593cbe36f991c0c5044989d65f0014087587c624

* Enable async copy for B.

Change-Id: Ia3ac37ad162a8cf3ccce4f268e81bd06c8ac3c46

* Add wint2x Dequantizer

* Remove TileDequanterB related codes.

Change-Id: Id8e65703b72a8984d367f584ff41b7726017fbb8

* Implement FastInterleavedAndBiasedNumericArrayConverter for wint2.

Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca

* Implement Wint2ParamsAccessor to load extra quant params from global memory.

Change-Id: Ic3750cd9b767df8893501820880c3342a4b47233

* Implement FastInterleavedAndBiasedNumericArrayConverter for wint2.

Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca

* Use async copy for local_scale.

Change-Id: Ib882ba41c3d2354bda4d25b40e2408ad3b2f7893

* Check and correct the load and dequantize of weights.

Change-Id: Ie8dca505b39987144964fe6407d465b3b5953790

* Change for performance tuning.

Change-Id: I1da026fb1d1533a9d70350c7ba23c27e896cfc29

* Optimize the global memory access size of local_scale reading.

Change-Id: I4cbe3a2ef5951723d415c2d3252ce912394beaf5

* Specialize mma_tensor_op for wint2 to enable fine-grained pipeline.

Change-Id: Icbb4d48f90a41136f42d6ffff42d68de32f408da

* Minor fix.

Change-Id: I14d4ac9d267ee05442a3b47f00c26bee13d79e6f

* optimizing dequant performance with LOP3

* optimizing dequant performance with LOP3

* Avoid redundant dequantization of local_scale and use bf16 as computing type.

Change-Id: I63239ebc8f8e4a92d6281af59840ba50600b4334

* Add Multiplier and remove some logs.

Change-Id: Ifa199d81e6aeb472d2247c63f85ef30213684bcd

* optimizing dequant performance with LOP3

* Use __byte_perm to implement int8 to float32 conversion for performance improvement.

* Use lop3 to optimize the dequantize of local_scale.

Change-Id: I6189759970cb5b8dcbef769724784b8a7533b63c

* Minor fix and remove some logs.

Change-Id: I6279ba9926d5041093b1c6aea200acf2e4c49d46

* Fix stages for test.

Change-Id: I6f7b7cac612ef2c678e9d49f5ffa60eb53d3ae29

* Fix stages for test and add clock64 to profile.

Change-Id: Iffaf7324beaa910ce9ee56f47ae289de98f1a267

* Use __byte_perm to replace shift-and-or operations for faster integer merging.

* Split the uint2b convert.

Change-Id: I78da672ce8968e21f685285140ba546a161521b4

* Optimize convert of unscale.

Change-Id: I6795da1cdf5e8ab38ddaa9836240921b5312913a

* Minor optimization.

Change-Id: I1800aec34c3f4621abb02658208108f54da44d88

* Optimize mma pipeline and refine codes.

Change-Id: Id3075cf7b88f2813a11ccd1d3b49c62c978f36b8

* Add missing support.

Change-Id: Id65b7bc2c25fbb1a5b232c6bc9fb8c9093f691a8

* Accelerate FP16 dequantization performance

* Support tile shape as Xx64x64.

Change-Id: Ib8fd37e1ba1d06f7d11f2956e7f1367b0a92bcac

* Remove debugging codes and minor optimization.

Change-Id: I6b79bd56a6e8dd823efc169967ecd3cc9a43baf4

* Fix offset bug.

Change-Id: Id7aeb91e99d6f51836f2aff22187b4f79607395e

* Fix typo.

Change-Id: I19dde93fc1c1f7e19605905c90dc46298e203952

* Restore some codes and remove some debugging logs.

Change-Id: I8d44daf82ad1c6f8174134d195e7b3fe9a3afdfb

---------

Co-authored-by: baoqiwen <baoqiwen@baidu.com>
2025-07-28 10:32:43 +08:00
xiaoxiaohehe001
2970b00dfa [Feature] Support_eplb (#2997)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* [Feature] support_eplb

* [Feature] support_eplb

* [Fix] fix mm ep
2025-07-24 20:22:45 +08:00
Zero Rains
0fb37ab7e4 update flake8 version to support pre-commit in python3.12 (#3000)
* update flake8 version to support pre-commit in python3.12

* polish code
2025-07-24 01:43:31 -07:00
lizhenyun01
6235ef3881 fix chunk_prefill 2025-07-24 12:00:52 +08:00
lizhenyun01
29c3292f02 support c4 attn && fix cache 2025-07-24 12:00:52 +08:00
chenjian
85a78d695d [Feature] Support block scheduler v1 for FD (#2928)
* Support FD block scheduler v1

* Support FD block scheduler v1

* Support FD block scheduler v1

* Fix according to copilot review

* Fix according to review

* Remove is_dummy

* Fix bug when real_bsz=1

* Fix infer first token cost time

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-23 20:31:31 +08:00
lizhenyun01
e51f018577 support chunk_prefill in fa3 2025-07-23 12:19:20 +08:00
Sunny-bot1
7c5e34e72d [FIX]fix rejection sampling when topp=0 using _SAMPLING_EPS (#2967)
* fix rejection sampling when topp=0

* fix
2025-07-22 05:53:37 -07:00
GoldPancake
9b84d51e25 [MTP Fix] Fix code and register cpp operators (#2965) 2025-07-22 19:36:24 +08:00
K11OntheBoat
e991777757 [Feature] DeepseekV3 use pd_build_static_op (#2948)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-07-22 15:03:41 +08:00
K11OntheBoat
8020927f50 [BugFix] Rename attention params of deepseekv3 (#2939)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-07-22 14:01:30 +08:00
lizexu123
67990e0572 [Feature] support min_p_sampling (#2872)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* Fastdeploy support min_p

* add test_min_p

* fix

* min_p_sampling

* update

* delete vl_gpu_model_runner.py

* fix

* Align usage of min_p with vLLM

* fix

* modified unit test

* fix test_min_sampling

* pre-commit all files

* fix

* fix

* fix

* fix xpu_model_runner.py
2025-07-20 23:17:59 -07:00
Zero Rains
25698d56d1 polish code with new pre-commit rule (#2923) 2025-07-19 23:19:27 +08:00
周周周
d306944f4f remove cum_offsets from get_block_shape_and_split_kv_block (#2913)
* remove padding_offsets from get_padding_offset.cu

* remove padding_offsets from get_padding_offset.cu

* remove padding_offsets from get_padding_offset.cu

* remove cum_offsets from get_block_shape_and_split_kv_block

* remove cum_offsets from get_block_shape_and_split_kv_block
2025-07-18 16:13:32 +08:00
周周周
1339e56282 [XPU] Remove padding_offsets from get_padding_offset.cu (#2911) 2025-07-18 14:16:44 +08:00
周周周
ddb10ac509 [Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880)
* remove padding_offsets from atten
2025-07-17 18:41:31 +08:00
freeliuzc
d49f8fb30a [Feature][MTP] Support cacheKV transfer in per_chunk mode (#2890)
* support chunk_prefill both normal and speculative_decoding(mtp)

* optimize pd-disaggregation config

* fix bug
2025-07-17 17:58:08 +08:00
ming1753
1f15ca21e4 [Feature] support prompt repetition_penalty (#2806)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-17 12:05:52 +08:00
周周周
aa76085d1f [Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
[Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)
2025-07-16 20:10:57 +08:00
yulangz
17314ee126 [XPU] Update doc and add scripts for downloading dependencies (#2845)
* [XPU] update xvllm download

* update supported models

* fix xpu model runner in huge memory with small model

* update doc
2025-07-16 11:05:56 +08:00
Yuanle Liu
61b3997b85 refactor rl get_name_mappings_to_training (#2847)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* refactor rl get_name_mappings_to_training

* fix tp>1

* change variable name(ffn1->up_gate_proj/ffn2->down_proj)

* change variable name(linear_weight->weight/linear_bias->bias)

* add rl names mapping for vl

* fix ernie 0.3B error

* fix develop code

* fix
2025-07-15 07:31:42 -07:00
freeliuzc
7cdd8d290d [MTP] optimize mtp infer speed (#2840)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-14 19:50:22 +08:00
chen
d33105baeb [Feature] Online Chat API Support Return logprobs (#2777)
* online chat support logprobs

* check xpu

* check vl_gpu_model_runner and xpu_model_runner

* get_worker() check platform
2025-07-10 16:33:40 +08:00
Sunny-bot1
e45050cae3 [Feature] support top_k_top_p sampling (#2753)
* support top_k_top_p sampling

* fix

* add api param

* add api para

* fix

* fix

* fix

* fix

* fix

* fix

* fix
2025-07-09 20:58:58 -07:00
lifulll
1f28bdf994 dcu adapter ernie45t (#2756)
Co-authored-by: lifu <lifu@sugon.com>
Co-authored-by: yongqiangma <xing.wo@163.com>
2025-07-09 18:56:27 +08:00
zhink
b89180f1cd [Feature] support custom all-reduce (#2758)
* [Feature] support custom all-reduce

* add vllm adapted
2025-07-09 16:00:27 +08:00
celsowm
771e71a24d Feat/blackwell sm100 support (#2670)
* Add initial support for NVIDIA Blackwell (SM100) architecture

This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x)
with '100a' architecture-specific features (e.g., for CUTLASS).

Key changes:
- Updated custom_ops/setup_ops.py to generate appropriate gencode
  flags (arch=compute_100a,code=sm_100a) when '100' is specified
  in FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
    - Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
    - Added BLACKWELL to CandidateConfigTypeParam.
    - Updated CutlassGemmConfig struct with is_sm100 flag,
      tile_config_sm100, and new constructor for SM100.
    - Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
    - Added get_candidate_tiles_sm100() (with placeholder tiles).
    - Added placeholder mcast support functions for SM100.
    - Updated get_candidate_configs() to include SM100 paths using
      the BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100'
  for Blackwell in FD_BUILDING_ARCS.

Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched
  and updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
  SM100-specific versions if Blackwell's hardware features for FP8/TMA
  differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM)
  with Blackwell should be fully verified.

* Feat: Implement detailed Blackwell (SM100) CUTLASS heuristics

This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:

- Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`:
    - Populated `CutlassTileConfigSM100` enum with specific tile shapes
      (e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
    - Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths.

- Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`:
    - Implemented `get_candidate_tiles_sm100` with detailed logic for
      selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
      using the new SM100 tile enums.
    - Implemented `supports_mcast_along_m_sm100` and
      `supports_mcast_along_n_sm100` with specific tile checks for Blackwell.
    - Updated the `sm == 100` (Blackwell) block in `get_candidate_configs`
      to use these new helper functions and accurately populate candidate
      kernel configurations for various cluster shapes.

- `custom_ops/setup_ops.py` remains configured to compile for
  `arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features.

This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-09 15:29:42 +08:00
RichardWooSJTU
fee544e808 fix ep prefill (#2762) 2025-07-09 14:03:05 +08:00
GoldPancake
f7cad30a38 [Feature] Add speculative decoding simulation benchmark. (#2751)
* Add speculative decoding simulation benchmark

* Fix the name of the parameter
2025-07-09 12:08:43 +08:00
EnflameGCU
d0f4d6ba3a [GCU] Support gcu platform (#2702)
baseline: e7fa57ebae

Co-authored-by: yongqiangma <xing.wo@163.com>
2025-07-08 13:00:52 +08:00
ming1753
1eb8ea7328 [Bug fix] fix complie bug when sm < 89 (#2738) 2025-07-08 11:24:52 +08:00
ming1753
ef6649a577 [Optimize] Optimize tensorwise fp8 performance (#2729)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* [Optimize] Optimize tensorwise fp8 performance
2025-07-07 20:06:28 +08:00
liddk1121
1b54a2831e Adapt for iluvatar gpu (#2684) 2025-07-07 16:53:14 +08:00
Yuanle Liu
240bdac2a4 [feat] support fa3 backend for pd disaggregated (#2695)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* support fa3 backend run in pd disaggregated

* support fa3 backend run in pd disaggregated

* support fa3 backend run in pd disaggregated

* support fa3 backend run in pd disaggregated

* delete use_fast_ffn
2025-07-03 22:33:27 +08:00
Jiang-Jia-Jun
05c670e593 [Sync] Update to latest code (#2679)
* [Sync] Update to latest code

* Add new code files

* Add new code files

* update code

* Try to fix build.sh

* Try to fix build.sh

* Update code

* Update requirements.txt

* Update code

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-03 15:43:53 +08:00
MARD1NO
ac5f860536 use shfl_xor_sync to reduce redundant shfl broadcast 2025-06-30 13:12:21 +08:00
Jiang-Jia-Jun
92c2cfa2e7 Sync v2.0 version of code to github repo 2025-06-29 23:29:37 +00:00
jiangjiajun
684703fd72 [LLM] First commit the llm deployment code 2025-06-09 19:20:15 +08:00