chen
ad202272ed
【Infer】Improve the performance block_wise_fp8 of triton_moe_backend ( #2942 )
2025-07-23 13:02:50 +08:00
lizhenyun01
e51f018577
support chunk_prefill in fa3
2025-07-23 12:19:20 +08:00
K11OntheBoat
93bb68aa71
[Feature] Marlin MoE backend supports DeepseekV3 ( #2962 )
...
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2025-07-22 18:11:15 +08:00
Nyakku Shigure
48e6a0ca26
[SOT] Mark dynamic dims by type annotations ( #2771 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* [SOT] Mark dynamic dims by type annotations
* fix conflict of forward_meta
* mark more attn backend
* fix missing annotated and add env SOT_SPECIALIZED_DIM_NUMBERS
* auto infer implicit 0 dim dynamic dim
* revert manual marked dims
* revert missing update
* auto infer can use unsafe code in warmup stage
* check -> type_match
* fix codestyle
* restore blank line
* empty commit
* add need_warmup nonlocal;
* add doc for resolver
* add missing type hints
* unquote "ForwardMeta"
2025-07-22 00:23:52 -07:00
lifulll
2c6a9e887e
native top_p_sampling ( #2901 )
2025-07-22 14:09:59 +08:00
K11OntheBoat
8020927f50
[BugFix] Rename attention params of deepseekv3 ( #2939 )
...
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2025-07-22 14:01:30 +08:00
zhink
0262ef7eb3
custom all reduce support cuda graph ( #2938 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* Support enabling cuda graph and custom all reduce at the same time, and fix the overwritten custom all reduce flag
* rename communication_op to communication
2025-07-21 22:52:03 +08:00
周周周
ff4569f135
remove some code in ep.py ( #2947 )
2025-07-21 22:44:57 +08:00
lizexu123
67990e0572
[Feature] support min_p_sampling ( #2872 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* Fastdeploy support min_p
* add test_min_p
* fix
* min_p_sampling
* update
* delete vl_gpu_model_runner.py
* fix
* Align usage of min_p with vLLM
* fix
* modified unit test
* fix test_min_sampling
* pre-commit all files
* fix
* fix
* fix
* fix xpu_model_runner.py
2025-07-20 23:17:59 -07:00
Zero Rains
25698d56d1
polish code with new pre-commit rule ( #2923 )
2025-07-19 23:19:27 +08:00
周周周
d306944f4f
remove cum_offsets from get_block_shape_and_split_kv_block ( #2913 )
...
* remove padding_offsets from get_padding_offset.cu
* remove padding_offsets from get_padding_offset.cu
* remove padding_offsets from get_padding_offset.cu
* remove cum_offsets from get_block_shape_and_split_kv_block
* remove cum_offsets from get_block_shape_and_split_kv_block
2025-07-18 16:13:32 +08:00
周周周
ddb10ac509
[Inference, rename] remove padding_offsets from atten use batch_id_per_token ( #2880 )
...
* remove padding_offsets from atten
2025-07-17 18:41:31 +08:00
freeliuzc
d49f8fb30a
[Feature][MTP] Support cacheKV transfer in per_chunk mode ( #2890 )
...
* support chunk_prefill both normal and speculative_decoding(mtp)
* optimize pd-disaggregation config
* fix bug
2025-07-17 17:58:08 +08:00
Yuanle Liu
dbb9e2506b
Fix rollout_model init ( #2881 )
2025-07-16 22:36:21 -07:00
ming1753
1f15ca21e4
[Feature] support prompt repetition_penalty ( #2806 )
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-17 12:05:52 +08:00
Yuanle Liu
63d6e7ce06
fix and refine vl ( #2866 )
...
* refine vl config
* delete attn_sep
* fix vl accuracy
2025-07-16 05:59:28 -07:00
周周周
aa76085d1f
[Attention] remove cum_offsets from atten, and use cu_seqlens_q ( #2870 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
[Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870 )
2025-07-16 20:10:57 +08:00
Yuanle Liu
dda4a9f848
rl update ( #2861 )
2025-07-16 00:33:10 -07:00
freeliuzc
2d1184aefe
[Fix] fix expert_parallel bug in decoder stage ( #2848 )
2025-07-16 11:08:18 +08:00
RAM
0fad10b35a
[Executor] CUDA Graph support padding batch ( #2844 )
...
* cuda graph support padding batch
* Integrate the startup parameters for the graph optimization backend and provide support for user - defined capture sizes.
* Do not insert max_num_seqs when the user specifies a capture list
* Support set graph optimization config from YAML file
* update cuda graph ci
* fix ci bug
* fix ci bug
2025-07-15 19:49:01 -07:00
Yuanle Liu
61b3997b85
refactor rl get_name_mappings_to_training ( #2847 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* refactor rl get_name_mappings_to_training
* fix tp>1
* change variable name(ffn1->up_gate_proj/ffn2->down_proj)
* change variable name(linear_weight->weight/linear_bias->bias)
* add rl names mapping for vl
* fix ernie 0.3B error
* fix develop code
* fix
2025-07-15 07:31:42 -07:00
AIbin
fd91da7b41
【Inference Optimize】Support wint2 triton kernel about triton_utils_v2 ( #2842 )
...
* update supported_models doc
2025-07-15 14:35:40 +08:00
freeliuzc
7cdd8d290d
[MTP] optimize mtp infer speed ( #2840 )
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-14 19:50:22 +08:00
YuanRisheng
4c7b8bc458
Simplify the Config code ( #2770 )
...
* simplify the code
* fix vl
* delete config
* fix
* perfect code
* fix ci
* fix xpu
* fix xpu
* fix server
* resolve conflict
* fix mtp
* resolve conflict
* fix xpu
* fix xpu
* fix vl
* fix log
* fix qwen moe
* fix qwen moe
* fix qwen moe
2025-07-14 19:50:05 +08:00
zhink
c08561c13a
[Feature] support tensor-parallel-size>num_key_value_heads for qwen3 ( #2799 )
2025-07-11 15:09:43 +08:00
Sunny-bot1
240d6236bc
[Fix]fix top_k_top_p sampling ( #2801 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* fix topk-topp
* update
* add base_non_truncated
2025-07-10 22:35:10 +08:00
littledgg
59071268b6
[Executor] Move forward_meta.py to fastdeploy/model_executor ( #2774 )
...
* Use PEP 563 in attention.py and fix conflict
* merge commit
* Change what was left out last time
2025-07-10 20:36:51 +08:00
chen
d33105baeb
[Feature] Online Chat API Support Return logprobs ( #2777 )
...
* online chat support logprobs
* check xpu
* check vl_gpu_model_runner and xpu_model_runner
* get_worker() check platform
2025-07-10 16:33:40 +08:00
K11OntheBoat
24f934f1f9
[BugFix] Fix low prediction accuracy of deepseekv3 ( #2798 )
2025-07-10 16:16:44 +08:00
Sunny-bot1
1e2319cbef
Rename top_p_sampling to top_k_top_p_sampling ( #2791 )
2025-07-10 00:09:25 -07:00
Sunny-bot1
e45050cae3
[Feature] support top_k_top_p sampling ( #2753 )
...
* support top_k_top_p sampling
* fix
* add api param
* add api para
* fix
* fix
* fix
* fix
* fix
* fix
* fix
2025-07-09 20:58:58 -07:00
Ryan
b0f525955c
[SOT] Remove breakgraph in post processing && fix datatype ( #2780 )
2025-07-10 11:26:00 +08:00
chen
888780ffde
[Feature] block_wise_fp8 support triton_moe_backend ( #2767 )
2025-07-09 19:22:47 +08:00
lifulll
1f28bdf994
dcu adapter ernie45t ( #2756 )
...
Co-authored-by: lifu <lifu@sugon.com >
Co-authored-by: yongqiangma <xing.wo@163.com >
2025-07-09 18:56:27 +08:00
yulangz
be21ef5047
[XPU] Supports BF16 for ERNIE-4.5-21B-A3B and ERNIE-4.5-0.3B ( #2765 )
...
* fix no quant xpu moe
* change dir of xpu moe weight only
2025-07-09 15:57:51 +08:00
RichardWooSJTU
fee544e808
fix ep prefill ( #2762 )
2025-07-09 14:03:05 +08:00
GoldPancake
f7cad30a38
[Feature] Add speculative decoding simulation benchmark. ( #2751 )
...
* Add speculative decoding simulation benchmark
* Fix the name of the parameter
2025-07-09 12:08:43 +08:00
RichardWooSJTU
6610aa29d0
Revert "[Bug fix] fix attention rank init ( #2743 )" ( #2761 )
...
This reverts commit e8bbe7244b
.
2025-07-09 10:38:12 +08:00
RichardWooSJTU
e8bbe7244b
[Bug fix] fix attention rank init ( #2743 )
...
* fix attention rank init
* fix attention rank init
2025-07-08 17:19:49 +08:00
EnflameGCU
d0f4d6ba3a
[GCU] Support gcu platform ( #2702 )
...
baseline: e7fa57ebae
Co-authored-by: yongqiangma <xing.wo@163.com >
2025-07-08 13:00:52 +08:00
gaoziyuan
26d5d737dd
【Fearture】support qwen2 some func ( #2740 )
...
* add rl qwen model support
* fix
* fix
2025-07-08 12:03:04 +08:00
ming1753
1eb8ea7328
[Bug fix] fix complie bug when sm < 89 ( #2738 )
2025-07-08 11:24:52 +08:00
ming1753
ef6649a577
[Optimize] Optimize tensorwise fp8 performance ( #2729 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* [Optimize] Optimize tensorwise fp8 performance
2025-07-07 20:06:28 +08:00
liddk1121
1b54a2831e
Adapt for iluvatar gpu ( #2684 )
2025-07-07 16:53:14 +08:00
GoldPancake
e7fa57ebae
Extract eh_proj Layer from ParallelLMHead for MTP to Avoid Weight Transposition Issue ( #2707 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* fix mtp eh_proj layer
* fix mtp update_cfg function
* fix stringdoc
* simplify class name
2025-07-04 14:15:04 +08:00
Yuanle Liu
240bdac2a4
[feat] support fa3 backend for pd disaggregated ( #2695 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* support fa3 backend run in pd disaggregated
* support fa3 backend run in pd disaggregated
* support fa3 backend run in pd disaggregated
* support fa3 backend run in pd disaggregated
* delete use_fast_ffn
2025-07-03 22:33:27 +08:00
Jiang-Jia-Jun
05c670e593
[Sync] Update to latest code ( #2679 )
...
* [Sync] Update to latest code
* Add new code files
* Add new code files
* update code
* Try to fix build.sh
* Try to fix build.sh
* Update code
* Update requirements.txt
* Update code
---------
Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com >
2025-07-03 15:43:53 +08:00
AIbin
a197dcd729
【Inference Optimize】Support ERNIE-4_5-300B-A47B-2BITS-Paddle model TP2/TP4 Inference ( #2666 )
...
* Support TP2&TP4 Wint
* Support TP2&TP4 Wint2 Inference
2025-07-01 18:29:11 +08:00
Jiang-Jia-Jun
92c2cfa2e7
Sync v2.0 version of code to github repo
2025-06-29 23:29:37 +00:00
jiangjiajun
684703fd72
[LLM] First commit the llm deployment code
2025-06-09 19:20:15 +08:00