Commit Graph

71 Commits

Author SHA1 Message Date
bukejiyu
db698bda01 qwen loader (#3057) 2025-07-30 19:09:38 +08:00
bukejiyu
8e203666d9 w4a8 offline (#3074)
* w4a8 offline

* update

* update

* update
2025-07-30 16:33:30 +08:00
Yuan Xiaolan
35935da9e5 support W4A8 EPLB (#3075) 2025-07-30 14:34:12 +08:00
Zero Rains
b2f9a42d87 [Feature] Support repetition early stop (#3024)
* support repetition early stop and support user to set the parameter

* remove log

* fix codestyle

* add the early_stop_config to rollout_config

* update config and EarlyStopper class

* fix the bug for triton

* modify the stop method

* update description

* modify the usage for stop_flags

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-29 22:42:54 +08:00
Yuan Xiaolan
3214fb5393 support model loading for w4a8 offline quant (#3064)
支持W4A8 EP 对离线量化权重的load
2025-07-29 21:54:37 +08:00
Longzhi Wang
be0a0f2bb2 fix arguement error in ep when pd (#3060) 2025-07-29 17:17:24 +08:00
YuanRisheng
502ee92a0a Unify server-side and model-side Config (Part3) (#3047)
* merge model config

* fix arch

* fix rl
2025-07-29 17:07:44 +08:00
Longzhi Wang
907d561523 fix ep when paddle version mismatch (#3056) 2025-07-29 15:06:49 +08:00
Yuan Xiaolan
b1d787a272 [fix] w4a8 model loading and hadamard config (#3013) 2025-07-28 18:17:59 +08:00
AIbin
ec52d39e68 【Inference Optimize】Update wint2 weight n-dim reorder (#3042) 2025-07-28 16:31:56 +08:00
Longzhi Wang
247010d298 fix arguement error (#3030) 2025-07-28 11:03:29 +08:00
YuanRisheng
6ccc10ad47 Unify server-side and model-side Config (Part1) (#3018)
* move cache config

* fix mtp
2025-07-28 10:51:52 +08:00
Longzhi Wang
0700c90caa [Feat] support mixed ep (#2969)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* Support mixed ep

* fix comment

* fix comment

* update mixep

* fix conflict

* fix typo

* update

* fix typo

* fix code style

* fix conflict
2025-07-25 15:29:30 +08:00
chen
332154f504 [feature] Support FA2 (#3009) 2025-07-25 14:09:00 +08:00
xiaoxiaohehe001
2970b00dfa [Feature] Support_eplb (#2997)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* [Feature] support_eplb

* [Feature] support_eplb

* [Fix] fix mm ep
2025-07-24 20:22:45 +08:00
EnflameGCU
c40df1802e [GCU] Update to develop (#2988) 2025-07-24 19:30:52 +08:00
Zero Rains
0fb37ab7e4 update flake8 version to support pre-commit in python3.12 (#3000)
* update flake8 version to support pre-commit in python3.12

* polish code
2025-07-24 01:43:31 -07:00
lizhenyun01
29c3292f02 support c4 attn && fix cache 2025-07-24 12:00:52 +08:00
lizexu123
832d25334a [Code Simplification] fix init_distributed_environment() (#2982) 2025-07-24 11:43:28 +08:00
bukejiyu
bfeb664ab8 update (#2978)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-24 00:16:42 +08:00
chen
172e69fe17 FA3 fix bug (#2987) 2025-07-23 19:07:43 +08:00
chen
ad202272ed 【Infer】Improve the performance block_wise_fp8 of triton_moe_backend (#2942) 2025-07-23 13:02:50 +08:00
lizhenyun01
e51f018577 support chunk_prefill in fa3 2025-07-23 12:19:20 +08:00
K11OntheBoat
93bb68aa71 [Feature] Marlin MoE backend supports DeepseekV3 (#2962)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-07-22 18:11:15 +08:00
Nyakku Shigure
48e6a0ca26 [SOT] Mark dynamic dims by type annotations (#2771)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* [SOT] Mark dynamic dims by type annotations

* fix conflict of forward_meta

* mark more attn backend

* fix missing annotated and add env SOT_SPECIALIZED_DIM_NUMBERS

* auto infer implicit 0 dim dynamic dim

* revert manual marked dims

* revert missing update

* auto infer can use unsafe code in warmup stage

* check -> type_match

* fix codestyle

* restore blank line

* empty commit

* add need_warmup nonlocal;

* add doc for resolver

* add missing type hints

* unquote "ForwardMeta"
2025-07-22 00:23:52 -07:00
lifulll
2c6a9e887e native top_p_sampling (#2901) 2025-07-22 14:09:59 +08:00
K11OntheBoat
8020927f50 [BugFix] Rename attention params of deepseekv3 (#2939)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-07-22 14:01:30 +08:00
zhink
0262ef7eb3 custom all reduce support cuda graph (#2938)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* Support enabling cuda graph and custom all reduce at the same time, and fix the overwritten custom all reduce flag

* rename communication_op to communication
2025-07-21 22:52:03 +08:00
周周周
ff4569f135 remove some code in ep.py (#2947) 2025-07-21 22:44:57 +08:00
lizexu123
67990e0572 [Feature] support min_p_sampling (#2872)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* Fastdeploy support min_p

* add test_min_p

* fix

* min_p_sampling

* update

* delete vl_gpu_model_runner.py

* fix

* Align usage of min_p with vLLM

* fix

* modified unit test

* fix test_min_sampling

* pre-commit all files

* fix

* fix

* fix

* fix xpu_model_runner.py
2025-07-20 23:17:59 -07:00
Zero Rains
25698d56d1 polish code with new pre-commit rule (#2923) 2025-07-19 23:19:27 +08:00
周周周
d306944f4f remove cum_offsets from get_block_shape_and_split_kv_block (#2913)
* remove padding_offsets from get_padding_offset.cu

* remove padding_offsets from get_padding_offset.cu

* remove padding_offsets from get_padding_offset.cu

* remove cum_offsets from get_block_shape_and_split_kv_block

* remove cum_offsets from get_block_shape_and_split_kv_block
2025-07-18 16:13:32 +08:00
周周周
ddb10ac509 [Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880)
* remove padding_offsets from atten
2025-07-17 18:41:31 +08:00
freeliuzc
d49f8fb30a [Feature][MTP] Support cacheKV transfer in per_chunk mode (#2890)
* support chunk_prefill both normal and speculative_decoding(mtp)

* optimize pd-disaggregation config

* fix bug
2025-07-17 17:58:08 +08:00
Yuanle Liu
dbb9e2506b Fix rollout_model init (#2881) 2025-07-16 22:36:21 -07:00
ming1753
1f15ca21e4 [Feature] support prompt repetition_penalty (#2806)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-17 12:05:52 +08:00
Yuanle Liu
63d6e7ce06 fix and refine vl (#2866)
* refine vl config

* delete attn_sep

* fix vl accuracy
2025-07-16 05:59:28 -07:00
周周周
aa76085d1f [Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
[Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)
2025-07-16 20:10:57 +08:00
Yuanle Liu
dda4a9f848 rl update (#2861) 2025-07-16 00:33:10 -07:00
freeliuzc
2d1184aefe [Fix] fix expert_parallel bug in decoder stage (#2848) 2025-07-16 11:08:18 +08:00
RAM
0fad10b35a [Executor] CUDA Graph support padding batch (#2844)
* cuda graph support padding batch

* Integrate the startup parameters for the graph optimization backend and provide support for user - defined capture sizes.

* Do not insert max_num_seqs when the user specifies a capture list

* Support set graph optimization config from YAML file

* update cuda graph ci

* fix ci bug

* fix ci bug
2025-07-15 19:49:01 -07:00
Yuanle Liu
61b3997b85 refactor rl get_name_mappings_to_training (#2847)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* refactor rl get_name_mappings_to_training

* fix tp>1

* change variable name(ffn1->up_gate_proj/ffn2->down_proj)

* change variable name(linear_weight->weight/linear_bias->bias)

* add rl names mapping for vl

* fix ernie 0.3B error

* fix develop code

* fix
2025-07-15 07:31:42 -07:00
AIbin
fd91da7b41 【Inference Optimize】Support wint2 triton kernel about triton_utils_v2 (#2842)
* update supported_models doc
2025-07-15 14:35:40 +08:00
freeliuzc
7cdd8d290d [MTP] optimize mtp infer speed (#2840)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-14 19:50:22 +08:00
YuanRisheng
4c7b8bc458 Simplify the Config code (#2770)
* simplify the code

* fix vl

* delete config

* fix

* perfect code

* fix ci

* fix xpu

* fix xpu

* fix server

* resolve conflict

* fix mtp

* resolve conflict

* fix xpu

* fix xpu

* fix vl

* fix log

* fix qwen moe

* fix qwen moe

* fix qwen moe
2025-07-14 19:50:05 +08:00
zhink
c08561c13a [Feature] support tensor-parallel-size>num_key_value_heads for qwen3 (#2799) 2025-07-11 15:09:43 +08:00
Sunny-bot1
240d6236bc [Fix]fix top_k_top_p sampling (#2801)
Some checks failed
Deploy GitHub Pages / deploy (push) Has been cancelled
* fix topk-topp

* update

* add base_non_truncated
2025-07-10 22:35:10 +08:00
littledgg
59071268b6 [Executor] Move forward_meta.py to fastdeploy/model_executor (#2774)
* Use PEP 563 in attention.py and fix conflict

* merge commit

* Change what was left out last time
2025-07-10 20:36:51 +08:00
chen
d33105baeb [Feature] Online Chat API Support Return logprobs (#2777)
* online chat support logprobs

* check xpu

* check vl_gpu_model_runner and xpu_model_runner

* get_worker() check platform
2025-07-10 16:33:40 +08:00
K11OntheBoat
24f934f1f9 [BugFix] Fix low prediction accuracy of deepseekv3 (#2798) 2025-07-10 16:16:44 +08:00