AIbin | a7392a0ff9 | 2025-09-11 10:46:09 +08:00
【Inference Optimize】DeepSeek-V3-model MLA Optimize (#3886)
* support MLA chunk_size auto search & cuda_graph

Jundong Liu | 3d0aaa5923 | 2025-09-08 13:12:24 +08:00
[Executor] Experimental feature: support prefill in CUDA Graph (#3459)
* Support prefill in CUDA Graph
* Refactor GetBlockShapeAndSplitKVBlock kernel (V2 through V2.5)
* Fix problem with encoder_num_blocks_x_cpu
* Add early-exit mechanism for attention kernel
* Fix test case for append-attention
* Update test code; add annotations to related tensors
* Move get_input_length_list
* Fix test code
* Add annotations about early-exit for attention kernel
* Address review comments
* Fix MTP
Co-authored-by: RAM <gstian5555@outlook.com>

Yuan Xiaolan | fa58a9fa8f | 2025-09-03 14:53:56 +08:00
qk norm for speculative decode C16 (#3637)

YuanRisheng | 642480f5f6 | 2025-08-26 19:03:11 +08:00
[CI] Standard unittest (#3606)
* standard unittest
* fix bugs
* fix script

freeliuzc | 76759108c9 | 2025-08-22 13:36:41 +08:00
[Feature][SpeculativeDecoding] Support tree-attention (#3514)
* support tree-attention
* fix merge bug
* fix unit-test api
* fix merge bug