FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2025-10-04 16:22:57 +08:00

Author	SHA1	Message	Date
chenjian	85a78d695d	[Feature] Support block scheduler v1 for FD (#2928 ) * Support FD block scheduler v1 * Support FD block scheduler v1 * Support FD block scheduler v1 * Fix according to copilot review * Fix according to review * Remove is_dummy * Fix bug when real_bsz=1 * Fix infer first token cost time --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2025-07-23 20:31:31 +08:00
lizhenyun01	e51f018577	support chunk_prefill in fa3	2025-07-23 12:19:20 +08:00
Sunny-bot1	7c5e34e72d	[FIX]fix rejection sampling when topp=0 using _SAMPLING_EPS (#2967 ) * fix rejection sampling when topp=0 * fix	2025-07-22 05:53:37 -07:00
GoldPancake	9b84d51e25	[MTP Fix] Fix code and register cpp operators (#2965 )	2025-07-22 19:36:24 +08:00
K11OntheBoat	e991777757	[Feature] DeepseekV3 use pd_build_static_op (#2948 ) Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>	2025-07-22 15:03:41 +08:00
K11OntheBoat	8020927f50	[BugFix] Rename attention params of deepseekv3 (#2939 ) Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>	2025-07-22 14:01:30 +08:00
lizexu123	67990e0572	[Feature] support min_p_sampling (#2872 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details * Fastdeploy support min_p * add test_min_p * fix * min_p_sampling * update * delete vl_gpu_model_runner.py * fix * Align usage of min_p with vLLM * fix * modified unit test * fix test_min_sampling * pre-commit all files * fix * fix * fix * fix xpu_model_runner.py	2025-07-20 23:17:59 -07:00
Zero Rains	25698d56d1	polish code with new pre-commit rule (#2923 )	2025-07-19 23:19:27 +08:00
周周周	d306944f4f	remove cum_offsets from get_block_shape_and_split_kv_block (#2913 ) * remove padding_offsets from get_padding_offset.cu * remove padding_offsets from get_padding_offset.cu * remove padding_offsets from get_padding_offset.cu * remove cum_offsets from get_block_shape_and_split_kv_block * remove cum_offsets from get_block_shape_and_split_kv_block	2025-07-18 16:13:32 +08:00
周周周	1339e56282	[XPU] Remove padding_offsets from get_padding_offset.cu (#2911 )	2025-07-18 14:16:44 +08:00
周周周	ddb10ac509	[Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880 ) * remove padding_offsets from atten	2025-07-17 18:41:31 +08:00
freeliuzc	d49f8fb30a	[Feature][MTP] Support cacheKV transfer in per_chunk mode (#2890 ) * support chunk_prefill both normal and speculative_decoding(mtp) * optimize pd-disaggregation config * fix bug	2025-07-17 17:58:08 +08:00
ming1753	1f15ca21e4	[Feature] support prompt repetition_penalty (#2806 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details	2025-07-17 12:05:52 +08:00
周周周	aa76085d1f	[Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details [Attention] remove cum_offsets from atten, and use cu_seqlens_q (#2870)	2025-07-16 20:10:57 +08:00
Yuanle Liu	61b3997b85	refactor rl get_name_mappings_to_training (#2847 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details * refactor rl get_name_mappings_to_training * fix tp>1 * change variable name(ffn1->up_gate_proj/ffn2->down_proj) * change variable name(linear_weight->weight/linear_bias->bias) * add rl names mapping for vl * fix ernie 0.3B error * fix develop code * fix	2025-07-15 07:31:42 -07:00
freeliuzc	7cdd8d290d	[MTP] optimize mtp infer speed (#2840 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details	2025-07-14 19:50:22 +08:00
chen	d33105baeb	[Feature] Online Chat API Support Return logprobs (#2777 ) * online chat support logprobs * check xpu * check vl_gpu_model_runner and xpu_model_runner * get_worker() check platform	2025-07-10 16:33:40 +08:00
Sunny-bot1	e45050cae3	[Feature] support top_k_top_p sampling (#2753 ) * support top_k_top_p sampling * fix * add api param * add api para * fix * fix * fix * fix * fix * fix * fix	2025-07-09 20:58:58 -07:00
lifulll	1f28bdf994	dcu adapter ernie45t (#2756 ) Co-authored-by: lifu <lifu@sugon.com> Co-authored-by: yongqiangma <xing.wo@163.com>	2025-07-09 18:56:27 +08:00
zhink	b89180f1cd	[Feature] support custom all-reduce (#2758 ) * [Feature] support custom all-reduce * add vllm adapted	2025-07-09 16:00:27 +08:00
celsowm	771e71a24d	Feat/blackwell sm100 support (#2670 ) * Add initial support for NVIDIA Blackwell (SM100) architecture This change introduces initial support for the NVIDIA Blackwell GPU architecture, specifically targeting SM100 (Compute Capability 10.x) with '100a' architecture-specific features (e.g., for CUTLASS). Key changes: - Updated custom_ops/setup_ops.py to generate appropriate gencode flags (arch=compute_100a,code=sm_100a) when '100' is specified in FD_BUILDING_ARCS. Requires CUDA 12.9+. - Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h: - Added CutlassTileConfigSM100 enum (with placeholder tile shapes). - Added BLACKWELL to CandidateConfigTypeParam. - Updated CutlassGemmConfig struct with is_sm100 flag, tile_config_sm100, and new constructor for SM100. - Modified toString() and fromString() for SM100 support. - Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu: - Added get_candidate_tiles_sm100() (with placeholder tiles). - Added placeholder mcast support functions for SM100. - Updated get_candidate_configs() to include SM100 paths using the BLACKWELL flag and new SM100 config types. - Updated build.sh with comments to guide users on specifying '100' for Blackwell in FD_BUILDING_ARCS. Further work: - Optimal CUTLASS tile configurations for SM100 need to be researched and updated in cutlass_heuristic.cu. - Kernel auto-generation scripts in custom_ops/utils/ may need SM100-specific versions if Blackwell's hardware features for FP8/TMA differ significantly from SM90. - Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM) with Blackwell should be fully verified. * Feat: Implement detailed Blackwell (SM100) CUTLASS heuristics This change integrates specific, expert-provided CUTLASS heuristic configurations for the NVIDIA Blackwell (SM100) GPU architecture, replacing previous placeholders. This includes: - Updated `custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h`: - Populated `CutlassTileConfigSM100` enum with specific tile shapes (e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100. - Added `FP4_ONLY` to `CandidateConfigTypeParam` for new FP4 paths. - Updated `custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu`: - Implemented `get_candidate_tiles_sm100` with detailed logic for selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags, using the new SM100 tile enums. - Implemented `supports_mcast_along_m_sm100` and `supports_mcast_along_n_sm100` with specific tile checks for Blackwell. - Updated the `sm == 100` (Blackwell) block in `get_candidate_configs` to use these new helper functions and accurately populate candidate kernel configurations for various cluster shapes. - `custom_ops/setup_ops.py` remains configured to compile for `arch=compute_100a,code=sm_100a` with CUDA 12.9+ for these features. This aligns the codebase with heuristic configurations similar to those in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more performant kernel selection on this new architecture. --------- Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2025-07-09 15:29:42 +08:00
RichardWooSJTU	fee544e808	fix ep prefill (#2762 )	2025-07-09 14:03:05 +08:00
GoldPancake	f7cad30a38	[Feature] Add speculative decoding simulation benchmark. (#2751 ) * Add speculative decoding simulation benchmark * Fix the name of the parameter	2025-07-09 12:08:43 +08:00
ming1753	1eb8ea7328	[Bug fix] fix complie bug when sm < 89 (#2738 )	2025-07-08 11:24:52 +08:00
ming1753	ef6649a577	[Optimize] Optimize tensorwise fp8 performance (#2729 ) Some checks failed Deploy GitHub Pages / deploy (push) Has been cancelled Details * [Optimize] Optimize tensorwise fp8 performance	2025-07-07 20:06:28 +08:00
liddk1121	1b54a2831e	Adapt for iluvatar gpu (#2684 )	2025-07-07 16:53:14 +08:00
Jiang-Jia-Jun	05c670e593	[Sync] Update to latest code (#2679 ) * [Sync] Update to latest code * Add new code files * Add new code files * update code * Try to fix build.sh * Try to fix build.sh * Update code * Update requirements.txt * Update code --------- Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>	2025-07-03 15:43:53 +08:00
MARD1NO	ac5f860536	use shfl_xor_sync to reduce redundant shfl broadcast	2025-06-30 13:12:21 +08:00
Jiang-Jia-Jun	92c2cfa2e7	Sync v2.0 version of code to github repo	2025-06-29 23:29:37 +00:00
jiangjiajun	684703fd72	[LLM] First commit the llm deployment code	2025-06-09 19:20:15 +08:00

30 Commits