FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-10-05 16:48:03 +08:00

Author	SHA1	Message	Date
xiaoxiaohehe001	2970b00dfa	[Feature] Support_eplb (#2997 ) * [Feature] support_eplb * [Feature] support_eplb * [Fix] fix mm ep	2025-07-24 20:22:45 +08:00
littledgg	f37d00e856	[Model] Provide clearer error for missing KV cache quantization scales (#3007 )	2025-07-24 20:15:00 +08:00
EnflameGCU	c40df1802e	[GCU] Update to develop (#2988 )	2025-07-24 19:30:52 +08:00
Zero Rains	0fb37ab7e4	update flake8 version to support pre-commit in python3.12 (#3000 ) * update flake8 version to support pre-commit in python3.12 * polish code	2025-07-24 01:43:31 -07:00
xiaoxiaohehe001	2c0ff068e2	[Fix] fix mm ep empty run (#2999 )	2025-07-24 14:15:55 +08:00
lizhenyun01	29c3292f02	support c4 attn && fix cache	2025-07-24 12:00:52 +08:00
lizexu123	832d25334a	[Code Simplification] fix init_distributed_environment() (#2982 )	2025-07-24 11:43:28 +08:00
bukejiyu	bfeb664ab8	update (#2978 )	2025-07-24 00:16:42 +08:00
chenjian	85a78d695d	[Feature] Support block scheduler v1 for FD (#2928 ) * Support FD block scheduler v1 * Support FD block scheduler v1 * Support FD block scheduler v1 * Fix according to copilot review * Fix according to review * Remove is_dummy * Fix bug when real_bsz=1 * Fix infer first token cost time --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2025-07-23 20:31:31 +08:00
chen	172e69fe17	FA3 fix bug (#2987 )	2025-07-23 19:07:43 +08:00
lizexu123	9b22b8d2c3	delete max-len (#2959 )	2025-07-23 15:11:39 +08:00
chen	ad202272ed	【Infer】Improve the performance block_wise_fp8 of triton_moe_backend (#2942 )	2025-07-23 13:02:50 +08:00
lizhenyun01	e51f018577	support chunk_prefill in fa3	2025-07-23 12:19:20 +08:00
Ryan	95b5af24db	[SOT] Add sot warmup (NVIDIA GPU Only) (#2929 ) * add sot warmup * fix code style * change batch_size list * add param to config * rm free_list settings && set sot_warmup_sizes * finish debug with dynamic dims by type annotations * add profile_run guard * rm sth useless	2025-07-22 21:36:14 +08:00
K11OntheBoat	93bb68aa71	[Feature] Marlin MoE backend supports DeepseekV3 (#2962 ) Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>	2025-07-22 18:11:15 +08:00
Nyakku Shigure	48e6a0ca26	[SOT] Mark dynamic dims by type annotations (#2771 ) * [SOT] Mark dynamic dims by type annotations * fix conflict of forward_meta * mark more attn backend * fix missing annotated and add env SOT_SPECIALIZED_DIM_NUMBERS * auto infer implicit 0 dim dynamic dim * revert manual marked dims * revert missing update * auto infer can use unsafe code in warmup stage * check -> type_match * fix codestyle * restore blank line * empty commit * add need_warmup nonlocal; * add doc for resolver * add missing type hints * unquote "ForwardMeta"	2025-07-22 00:23:52 -07:00
lifulll	2c6a9e887e	native top_p_sampling (#2901 )	2025-07-22 14:09:59 +08:00
K11OntheBoat	8020927f50	[BugFix] Rename attention params of deepseekv3 (#2939 ) Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>	2025-07-22 14:01:30 +08:00
zhink	0262ef7eb3	custom all reduce support cuda graph (#2938 ) * Support enabling cuda graph and custom all reduce at the same time, and fix the overwritten custom all reduce flag * rename communication_op to communication	2025-07-21 22:52:03 +08:00
周周周	ff4569f135	remove some code in ep.py (#2947 )	2025-07-21 22:44:57 +08:00
Yuanle Liu	2f74e93d7e	use dist.all_reduce(min) to sync num_blocks_local (#2933 ) * pre-commit all files check * reduce min num_blocks_local * fix nranks=1 * pre-commit when commit-msg	2025-07-21 01:23:36 -07:00
lizexu123	67990e0572	[Feature] support min_p_sampling (#2872 ) * Fastdeploy support min_p * add test_min_p * fix * min_p_sampling * update * delete vl_gpu_model_runner.py * fix * Align usage of min_p with vLLM * fix * modified unit test * fix test_min_sampling * pre-commit all files * fix * fix * fix * fix xpu_model_runner.py	2025-07-20 23:17:59 -07:00
周周周	8c5407d9e4	remove cum_offsets from ForwardMeta (#2925 )	2025-07-19 23:57:27 +08:00
Zero Rains	25698d56d1	polish code with new pre-commit rule (#2923 )	2025-07-19 23:19:27 +08:00
ming1753	5328daa333	[Bug Fix] fix ep config bug (#2920 )	2025-07-18 19:12:56 +08:00
xiaoxiaohehe001	a42fc3f40b	[Feature] Support 45tVL EP FP8 Infer. (#2909 ) * support_mm_ep_fp8 * support_mm_ep	2025-07-18 17:57:15 +08:00
gaoziyuan	6efad14b95	support vl ori_vacab_size (#2900 )	2025-07-18 16:26:14 +08:00
周周周	d306944f4f	remove cum_offsets from get_block_shape_and_split_kv_block (#2913 ) * remove padding_offsets from get_padding_offset.cu * remove padding_offsets from get_padding_offset.cu * remove padding_offsets from get_padding_offset.cu * remove cum_offsets from get_block_shape_and_split_kv_block * remove cum_offsets from get_block_shape_and_split_kv_block	2025-07-18 16:13:32 +08:00
周周周	ddb10ac509	[Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880 ) * remove padding_offsets from atten	2025-07-17 18:41:31 +08:00
freeliuzc	d49f8fb30a	[Feature][MTP] Support cacheKV transfer in per_chunk mode (#2890 ) * support chunk_prefill both normal and speculative_decoding(mtp) * optimize pd-disaggregation config * fix bug	2025-07-17 17:58:08 +08:00
Yuanle Liu	dbb9e2506b	Fix rollout_model init (#2881 )	2025-07-16 22:36:21 -07:00
ming1753	1f15ca21e4	[Feature] support prompt repetition_penalty (#2806 )	2025-07-17 12:05:52 +08:00
Yuanle Liu	63d6e7ce06	fix and refine vl (#2866 ) * refine vl config * delete attn_sep * fix vl accuracy	2025-07-16 05:59:28 -07:00
周周周	aa76085d1f	[Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870 ) [Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)	2025-07-16 20:10:57 +08:00
Yuanle Liu	dda4a9f848	rl update (#2861 )	2025-07-16 00:33:10 -07:00
xiaoxiaohehe001	0d0340392f	[Fix] Fix mm ep weight init. (#2855 ) * fix_45t_mm * Update load_weight_utils.py * Update load_weight_utils.py	2025-07-16 12:02:39 +08:00
freeliuzc	2d1184aefe	[Fix] fix expert_parallel bug in decoder stage (#2848 )	2025-07-16 11:08:18 +08:00
RAM	0fad10b35a	[Executor] CUDA Graph support padding batch (#2844 ) * cuda graph support padding batch * Integrate the startup parameters for the graph optimization backend and provide support for user - defined capture sizes. * Do not insert max_num_seqs when the user specifies a capture list * Support set graph optimization config from YAML file * update cuda graph ci * fix ci bug * fix ci bug	2025-07-15 19:49:01 -07:00
Yuanle Liu	61b3997b85	refactor rl get_name_mappings_to_training (#2847 ) * refactor rl get_name_mappings_to_training * fix tp>1 * change variable name(ffn1->up_gate_proj/ffn2->down_proj) * change variable name(linear_weight->weight/linear_bias->bias) * add rl names mapping for vl * fix ernie 0.3B error * fix develop code * fix	2025-07-15 07:31:42 -07:00
Zero Rains	e7bcbbab52	Merge vl execution path into normal execution path (#2829 ) * merge vl model into gpu_model runner Change-Id: I9f4691a3d5f135e8d72b1d58abcd15ef3aa3f2a6 * fix chinese Change-Id: Ic7405109b984c21e076fb3b01ff6feb571d0119a * fix the parse parameter Change-Id: I4cd62ee87c06220af580d91e347145d4394917fe * fix the bug in online_inference Change-Id: Idb111bb2114e83017c4050b2a68cf039c6d3c559 * polish code Change-Id: I7d4194102c2f1b0743b74fbd5fc284eb8ef4d17c	2025-07-15 22:20:03 +08:00
AIbin	fd91da7b41	【Inference Optimize】Support wint2 triton kernel about triton_utils_v2 (#2842 ) * update supported_models doc	2025-07-15 14:35:40 +08:00
freeliuzc	7cdd8d290d	[MTP] optimize mtp infer speed (#2840 )	2025-07-14 19:50:22 +08:00
YuanRisheng	4c7b8bc458	Simplify the Config code (#2770 ) * simplify the code * fix vl * delete config * fix * perfect code * fix ci * fix xpu * fix xpu * fix server * resolve conflict * fix mtp * resolve conflict * fix xpu * fix xpu * fix vl * fix log * fix qwen moe * fix qwen moe * fix qwen moe	2025-07-14 19:50:05 +08:00
freeliuzc	7f64d408a9	[MTP] support expert-parellel in mtp (#2835 )	2025-07-14 14:28:50 +08:00
bukejiyu	bad53c6b6e	[vl]remove duplicated load logic (#2744 )	2025-07-13 07:36:26 +08:00
gaoziyuan	749b2e9c89	support qwen3moe name_mapping (#2820 )	2025-07-12 12:05:54 +08:00
zhink	c08561c13a	[Feature] support tensor-parallel-size>num_key_value_heads for qwen3 (#2799 )	2025-07-11 15:09:43 +08:00
Sunny-bot1	240d6236bc	[Fix]fix top_k_top_p sampling (#2801 ) * fix topk-topp * update * add base_non_truncated	2025-07-10 22:35:10 +08:00
littledgg	59071268b6	[Executor] Move forward_meta.py to fastdeploy/model_executor (#2774 ) * Use PEP 563 in attention.py and fix conflict * merge commit * Change what was left out last time	2025-07-10 20:36:51 +08:00
lizexu123	8c660a0dfb	[BugFix] fix RMSNorm rms_norm_esp (#2797 ) * fix rms * add vl * fix * add vl * fix * fix	2025-07-10 20:02:24 +08:00

79 Commits