FastDeploy

Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2025-10-05 16:48:03 +08:00

Files

AIbin a7392a0ff9 【Inference Optimize】DeepSeek-V3-model MLA Optimize (#3886)

* support MLA chunk_size auto search & cuda_graph

2025-09-11 10:46:09 +08:00

template_instantiation

supports dynamic Cfp8 (#3767)

2025-09-07 20:41:29 -07:00

append_attention_c4_impl.cuh

[Excutor] Experiment Feature-Support Prefill in cudagraph (#3459)

2025-09-08 13:12:24 +08:00

append_attention_c8_impl.cuh

[Excutor] Experiment Feature-Support Prefill in cudagraph (#3459)

2025-09-08 13:12:24 +08:00

append_attention_c16_impl.cuh

[Excutor] Experiment Feature-Support Prefill in cudagraph (#3459)

2025-09-08 13:12:24 +08:00

append_attention_func.cuh

supports dynamic Cfp8 (#3767)

2025-09-07 20:41:29 -07:00

append_attention_kernel.h

supports dynamic Cfp8 (#3767)

2025-09-07 20:41:29 -07:00

decode_attention_func.cuh

[Sync] Update to latest code (#2679)

2025-07-03 15:43:53 +08:00

decode_attention_kernel.cu

[Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880)

2025-07-17 18:41:31 +08:00

decoder_write_cache_with_rope_impl.cuh

[Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

2025-09-10 19:36:10 +08:00

decoder_write_cache_with_rope_kernel.cu

[Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

2025-09-10 19:36:10 +08:00

decoder_write_cache_with_rope_kernel.h

clean code in sttantion (#3917)

2025-09-05 20:49:01 +08:00

encoder_write_cache_with_rope_impl.cuh

[Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

2025-09-10 19:36:10 +08:00

encoder_write_cache_with_rope_kernel.h

[Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

2025-09-10 19:36:10 +08:00

get_block_shape_and_split_kv_block.cu

【Inference Optimize】DeepSeek-V3-model MLA Optimize (#3886)

2025-09-11 10:46:09 +08:00

gqa_rope_write_cache.cu

fix typos (#3951)

2025-09-08 15:22:41 +08:00

mem_util.cuh

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

mla_cache_kernel.cu

[Feature] DeepseekV3 use pd_build_static_op (#2948)

2025-07-22 15:03:41 +08:00

mla_cache_kernel.cuh

[Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880)

2025-07-17 18:41:31 +08:00

mma_tensor_op.cuh

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

multi_head_latent_attention_kernel.h

[Inference, rename] remove padding_offsets from atten use batch_id_per_token (#2880)

2025-07-17 18:41:31 +08:00

pre_cache_len_concat.cu

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

speculate_write_cache_with_rope_impl.cuh

support rope_3d in spec mode (#4034)

2025-09-10 03:15:05 -07:00

speculate_write_cache_with_rope_kernel.cu

support rope_3d in spec mode (#4034)

2025-09-10 03:15:05 -07:00

speculate_write_cache_with_rope_kernel.h

support mtp rope_3d (#3791)

2025-09-04 17:18:05 +08:00

utils.cuh

supports dynamic Cfp8 (#3767)

2025-09-07 20:41:29 -07:00