FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-10-10 02:50:19 +08:00

Author	SHA1	Message	Date
Liumengyuan	e93d4cfcdd	Add with_output version AppendAttention (#3302 ) * get use_output from fd_config * add clear TODO description * add mask_offset para to align with develop * fix bug * fix use_output logic * fix sot bug	2025-08-28 17:10:18 +08:00
yangjianfengo1	e81046fdad	【New Feature】Support w4afp8 in centralized mode (#3644 ) * support tp w4afp8 * code style	2025-08-28 10:53:24 +08:00
周周周	76513f6416	Support 45t fp8 8 GPU (#3659 )	2025-08-28 10:52:53 +08:00
Sunny-bot1	479c8b85d3	[Optimize]support machete weight only gemm (#3561 ) * support machete weight only gemm * add generate * update * fix * change file location * add sm_version limit * fix * fix * fix ci * fix coverage * fix xpu	2025-08-28 09:49:58 +08:00
Liumengyuan	11803e0907	fix undefined cuPointerGetAttribute symbol error (#3628 )	2025-08-27 20:24:59 +08:00
Jiang-Jia-Jun	c694fa2879	Revert "[Feature] block sparse attention (#3209 )" (#3647 ) This reverts commit `646a0c2fd8`.	2025-08-27 17:35:04 +08:00
xiaoxiaohehe001	ad319a87cc	support fa3 rope3d (#3622 )	2025-08-27 11:31:29 +08:00
yangjianfengo1	646a0c2fd8	[Feature] block sparse attention (#3209 ) * support sparse attn * fix bug * code style * fix moba attn get kv shape * fix a100 compilation * codestyle * code style * code style * code style * fix conflict * add unit test * code style * increase eblite load time * fix bug * for ci * for ci * for ci * for ci * support mlp block size 128 * add unit tests for small ops * fix mlp unit test * add environment variables to config * fix rollout config	2025-08-26 07:16:04 -07:00
gaoziyuan	82e64b13e1	[NewFeature]Support dp multi api server && Fix some bug in mixed ep && merge develop (#3598 ) * [Feature] update ep * fix ci * fix ci * fix ci * fix ci * fix ci * fix ci * fix ci * fix queue ports idx * fix ci * fix ci * fix ci * fix ci * fix ci * fix ci * fix ci * fix ci * Update engine.py * fix ci * fix some bug in mixed ep * add server fix and op fix * rm some log * fix code style * ltd fix * fix * fix * fix some bug * fix bug * fix bug * fix style * Update config.py * Update splitwise_connector.py * Update cache_messager.py * Update __init__.py * merge and fix * Update engine.py * Update common_engine.py * Update run_ci_xpu.sh * Update ernie_processor.py * Update ernie_processor.py --------- Co-authored-by: ltd0924 <ltd0924@sina.com> Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com>	2025-08-26 19:59:02 +08:00
lzy	d339df2e90	Supports DP+TP+EP hybrid parallel deployment strategy (#3489 ) * Support DP+TP+EP hybrid parallel deployment strategy * Support DP+TP+EP hybrid parallel deployment strategy * fix conflict * add moe_tp_ep function split_allgather_out * del tp_group in moe_cutlass_backend * for ci * fix parallel_config for ci * del log	2025-08-26 00:04:01 -07:00
freeliuzc	52eda7fdb3	[Feature][MTP]support new speculative decoding method named hybrid mtp with ngram (#3610 )	2025-08-26 14:29:22 +08:00
xiaoxiaohehe001	70c75798a7	[NewFeatures] support noex rope3d (#3542 ) * [NewFeatures] support noex rope3d * [NewFeatures] support noex rope3d encoder	2025-08-26 11:44:57 +08:00
Kane2011	2ae7ab28d2	[MetaxGPU] adapt to the latest fastdeploy on metax gpu (#3492 )	2025-08-25 17:44:20 +08:00
Yuan Xiaolan	9205c88da1	support w4afp8 EP inference (#3044 )	2025-08-25 11:27:45 +08:00
lengxia	137e539456	[Feature][XPU] add custom kernels for mtp (#3537 )	2025-08-25 10:14:17 +08:00
freeliuzc	76759108c9	[Feature][SpeculativeDecoding]Support tree-attention (#3514 ) * support tree-attention * fix merge bug * fix unit-test api * fix merge bug	2025-08-22 13:36:41 +08:00
yangjianfengo1	e5aa7087db	【bug fix】Fix slow w4a8 compilation (#3510 ) * fix w4a8 compilation * code style * fix tma copy	2025-08-21 18:50:14 +08:00
Ryan	bcdfc1d6b9	Add custom op declaration for `all_reduce` (#3473 ) * add custom op declaration * roll back try except	2025-08-20 20:29:58 +08:00
yangjianfengo1	b047681c5d	【New Feature】Support 2:4 sparsity for Fp8 group Gemm (#3463 ) * support 2:4 sparsity * code style * add stmatrix macro definition check * code style	2025-08-19 02:54:47 -07:00
lizexu123	32b39620bc	[Code Simplification] remove cum_offsets (#3410 )	2025-08-18 20:21:25 +08:00
Jundong Liu	ea4a3b479c	[Excutor] Increase buffer size to prevent address corruption; add forward metadata debug tool (#3404 ) * fix undersized buffer allocation; add a tool for printing forwardmetadata * fix mistake * Make CPU tensor in CPUPlace * Add test about forward_meta_str and Add unitest_requirement --------- Co-authored-by: RAM <gstian5555@outlook.com>	2025-08-18 16:14:09 +08:00
freeliuzc	a12d0bc549	[Feature][MTP]update multi-draft-token strategy (#3369 ) * update multi-draft-token strategy * fix format --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2025-08-18 13:59:56 +08:00
Sunny-bot1	6c1f3ff897	topk_gating_softmax support bias (#3405 )	2025-08-15 11:57:45 +08:00
chen	f0f00a6025	[OPs] Universal optimization and Fix early_stop cuda 700 (#3375 ) * delete nonzero * delete setup_ops_base.py * check if * check gcp infer_seed.cpu() * fix repetition_early_stopper_kernel cuda 700	2025-08-14 22:40:44 +08:00
lzy	1e06b9fa6d	make append_attn supports mask_offset (#3138 ) * make append_attn supports mask_offset * add unittest	2025-08-14 03:40:55 -07:00
Sunny-bot1	2e7831185f	[Optimize]Add norm_weights feature for topk_gating_softmax (#3372 )	2025-08-14 15:05:23 +08:00
Kane2011	b4fef2cf29	[MetaxGPU] Support FastDeploy on metax gpu (#3241 ) * [MetaxGPU] Support FastDeploy on metax gpu * Update metax_worker.py 1. change worker log; 2. remove custom allreduce, adapt it later; 3. remove cuda graph; * Update __init__.py 1. remove metax's key work comment * Update __init__.py 1. remove metax's key word comment; 2. add fused_moe_kernel_paddle import --------- Co-authored-by: yongqiangma <xing.wo@163.com>	2025-08-13 11:11:54 +08:00
Ryan	ed6bff215a	fix custom op order rms_norm_eps (#3348 )	2025-08-13 10:12:49 +08:00
Sunny-bot1	8224b21525	Refactor moe_topk_select op to use apply_norm_weight as a template parameter (#3345 ) * Refactor moe_topk_select op to use apply_norm_weight as a template parameter * update test	2025-08-13 08:44:16 +08:00
Jundong Liu	1e4968e810	[Excutor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding (#3223 ) * fully resolve the decode chunking issue * update C8 and C4 kernel * fix problem * fix with pre-commit * retain branch for mtp	2025-08-09 07:37:19 +08:00
gaoziyuan	a799d14df1	[Bugfix] Fix model accuracy in some ops (#3231 ) * fix noaux_tc op * fix * update * fix qk norm * fix linear for prequant loader * test * fix * fix * rm some print * fix noaux_tc op * test * Fix the confused enable_early_stop when only set early_stop_config (#3214) * fix the confused early_stop_config when only set early_stop_config * pre-commit * write a general method * Add ci case for min token and max token (#3229) Co-authored-by: xujing43 <xujing43@baidu.com> * add some evil cases (#3240) * add repitation early stop cases * add repitation early stop cases * add bad cases * add bad cases * add evil cases * qwen3_moe (#3084) * [Feature] support seed parameter (#3161) * support seed * fix * add SamplingMetadata seed test * The next_tokens values are inconsistent! * add air and rejection seed test * fix * add SamplingParams seed test * fix seed=0 * Default to defualt * fix * fix args_utils * fix review * fix review * fix * fix * add xpu,gcu,iluvatar support seed * fix * 【Fix Bug】Fix fa3 centralized-mode support bug (#3235) * fix fa3 centralized-mode bug * add qknorm parameter * fix qk norm * fix * update * fix linear for prequant loader * fix * fix * rm some print * fix * fix moe init weight&scale * fix moe init weight&scale --------- Co-authored-by: bukejiyu <395822456@qq.com> Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com> Co-authored-by: Zero Rains <linjunlu@zerorains.top> Co-authored-by: xjkmfa <108254620+xjkmfa@users.noreply.github.com> Co-authored-by: xujing43 <xujing43@baidu.com> Co-authored-by: Divano <dddivano@outlook.com> Co-authored-by: bukejiyu <52310069+bukejiyu@users.noreply.github.com> Co-authored-by: lizexu123 <39205361+lizexu123@users.noreply.github.com> Co-authored-by: yangjianfengo1 <125249383+yangjianfengo1@users.noreply.github.com> Co-authored-by: qingqing01 <dangqingqing@baidu.com>	2025-08-08 17:30:37 +08:00
yzwu	fbdd6b0663	[Iluvatar GPU] Optimze attention and moe performance (#3234 )	2025-08-08 10:51:24 +08:00
yangjianfengo1	89397516a8	[New Feature] Support W4Afp8 MoE GroupGemm (#3171 ) * init * add multi-threaded compilation * fix bug * fix bug * code style * add fp16 * replace print with assert * fix stmatrix * reduce unit test shape * reduce unit test shape	2025-08-06 10:34:05 +08:00
Yuan Xiaolan	7ce00e597c	support qk norm (#3145 )	2025-08-05 16:46:14 +08:00
Yuan Xiaolan	af543b7f0f	revise get_moe_scores (#3164 )	2025-08-05 16:43:07 +08:00
yangjianfengo1	40f7f3e0d8	[New Feature] fa3 supports flash mask (#3184 ) * support flash mask * modify test_flash_mask * modify test.sh	2025-08-05 12:20:48 +08:00
RichardWooSJTU	e39159f3bd	Add switch to apply fine-grained per token quant fp8 (#3192 ) Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com>	2025-08-04 19:54:03 -07:00
AIbin	22fe695f1c	【Inference Optimize】Support automatic generation of marlin kernel (#3149 ) * Support automatic generation of marlin kernel	2025-08-01 22:43:18 +08:00
yangjianfengo1	64d7a3194d	Support fa3 in centralized mode (#3112 )	2025-08-01 18:03:36 +08:00
Ryan	94264bbf60	[Code Simplification] Refactor Post-processing in VL Model Forward Method (#2937 ) * rm sth useless * refactor model forward * mv bool index to kernel	2025-08-01 17:28:07 +08:00
chen	a2f5cc54f8	moe preprocess op support 160 experts and fused_moe triton kernel name add K (#3121 )	2025-08-01 10:46:20 +08:00
RAM	d850660872	[Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel (#2989 ) * reset decoder_block_shape_q buffer * refactor GetBlockShapeAndSplitKVBlock Kernel and cudagraph padding batch * update decode_max_tile_size * fix pre-commit * update block_multihead_attn_backend * update flas attn backend * update MLA Attention * update XPU Attention * update gcu,iluvatar model runner * Update MTP * fix MTP bug	2025-07-31 00:09:31 +08:00
ming1753	5acde4eb43	[Feature] Multimodal Scheduler V1 (#3019 ) * [Feature] Support multimodal scheduler v1 * remove debug log * fix bug * fix format * modify code * fix bug * fix bug * fix bug * modify code	2025-07-30 16:05:55 +08:00
Sunny-bot1	74aa31d15b	[Feature] support bad_words (#3055 ) * support bad_words * support online infer bad_words * update * add CI test * update * update * update --------- Co-authored-by: Yuanle Liu <yuanlehome@163.com>	2025-07-30 09:31:29 +08:00
JYChen	dafe02a7b9	[stop sequence] support stop sequence (#3025 ) * stop seqs in multi-ends * unittest for gpu stop op * kernel tid==0	2025-07-29 14:17:37 +08:00
yinwei	f2a528f9ae	[XPU] Support kvblock centralized management (#3017 )	2025-07-29 10:40:55 +08:00
Yuan Xiaolan	7d87aaace8	optimize w4a8 decoding (#3050 )	2025-07-28 22:20:13 +08:00
lizhenyun01	e80ea8a71b	remove Synchronize in hadamard	2025-07-28 19:22:46 +08:00
lizhenyun01	238766e403	fix c4 prompt_cache	2025-07-28 14:31:37 +08:00
Yiqun Liu	8f426c1690	Optimize the performance of moe_expert_ffn_wint2 (#2990 ) * Change wint2 to ColumnMajor. Change-Id: I6b44d02946a685f8fe24d9f2c7be258b51e16da2 * Unify default_wint2x_mma. Change-Id: I9e77b0e8e6cecab01fedc0b24b536ee0a1a89ff7 * Change wint2 to ColumnMajorTileInterleave. Change-Id: I593cbe36f991c0c5044989d65f0014087587c624 * Enable async copy for B. Change-Id: Ia3ac37ad162a8cf3ccce4f268e81bd06c8ac3c46 * Add wint2x Dequantizer * Remove TileDequanterB related codes. Change-Id: Id8e65703b72a8984d367f584ff41b7726017fbb8 * Implement FastInterleavedAndBiasedNumericArrayConverter for wint2. Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca * Implement Wint2ParamsAccessor to load extra quant params from global memory. Change-Id: Ic3750cd9b767df8893501820880c3342a4b47233 * Implement FastInterleavedAndBiasedNumericArrayConverter for wint2. Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca * Use async copy for local_scale. Change-Id: Ib882ba41c3d2354bda4d25b40e2408ad3b2f7893 * Check and correct the load and dequantize of weights. Change-Id: Ie8dca505b39987144964fe6407d465b3b5953790 * Change for performance tuning. Change-Id: I1da026fb1d1533a9d70350c7ba23c27e896cfc29 * Optimize the global memory access size of local_scale reading. Change-Id: I4cbe3a2ef5951723d415c2d3252ce912394beaf5 * Specialize mma_tensor_op for wint2 to enable fine-grained pipeline. Change-Id: Icbb4d48f90a41136f42d6ffff42d68de32f408da * Minor fix. Change-Id: I14d4ac9d267ee05442a3b47f00c26bee13d79e6f * optimizing dequant performance with LOP3 * optimizing dequant performance with LOP3 * Avoid redundant dequantization of local_scale and use bf16 as computing type. Change-Id: I63239ebc8f8e4a92d6281af59840ba50600b4334 * Add Multiplier and remove some logs. Change-Id: Ifa199d81e6aeb472d2247c63f85ef30213684bcd * optimizing dequant performance with LOP3 * Use __byte_perm to implement int8 to float32 conversion for performance improvement. * Use lop3 to optimize the dequantize of local_scale. Change-Id: I6189759970cb5b8dcbef769724784b8a7533b63c * Minor fix and remove some logs. Change-Id: I6279ba9926d5041093b1c6aea200acf2e4c49d46 * Fix stages for test. Change-Id: I6f7b7cac612ef2c678e9d49f5ffa60eb53d3ae29 * Fix stages for test and add clock64 to profile. Change-Id: Iffaf7324beaa910ce9ee56f47ae289de98f1a267 * Use __byte_perm to replace shift-and-or operations for faster integer merging. * Split the uint2b convert. Change-Id: I78da672ce8968e21f685285140ba546a161521b4 * Optimize convert of unscale. Change-Id: I6795da1cdf5e8ab38ddaa9836240921b5312913a * Minor optimization. Change-Id: I1800aec34c3f4621abb02658208108f54da44d88 * Optimize mma pipeline and refine codes. Change-Id: Id3075cf7b88f2813a11ccd1d3b49c62c978f36b8 * Add missing support. Change-Id: Id65b7bc2c25fbb1a5b232c6bc9fb8c9093f691a8 * Accelerate FP16 dequantization performance * Support tile shape as Xx64x64. Change-Id: Ib8fd37e1ba1d06f7d11f2956e7f1367b0a92bcac * Remove debugging codes and minor optimization. Change-Id: I6b79bd56a6e8dd823efc169967ecd3cc9a43baf4 * Fix offset bug. Change-Id: Id7aeb91e99d6f51836f2aff22187b4f79607395e * Fix typo. Change-Id: I19dde93fc1c1f7e19605905c90dc46298e203952 * Restore some codes and remove some debugging logs. Change-Id: I8d44daf82ad1c6f8174134d195e7b3fe9a3afdfb --------- Co-authored-by: baoqiwen <baoqiwen@baidu.com>	2025-07-28 10:32:43 +08:00

87 Commits