chen
b491dcd23c
[Optimization] compulte real max_logprobs in batch ( #5430 ) ( #5448 )
2025-12-09 16:48:06 +08:00
cmcamdy
9f4977eb74
[xpu] support mtp for xpu(mix) ( #5274 )
...
* [XPU] support kernel for mtp(base)
* [XPU] support kernel for mtp(base)
* format
* format
* format
* fix gather next token
* fix step && add test
* fix
* mv pre/post process
* add adjust batch / gather next token for mtp
* fix code style
* fix mtp kenrel name
* fix mtp kernel test
* mv xpu pre/post process
* mv xpu pre/post process
* [xpu] support mtp
* fix code style
2025-12-01 11:03:14 +08:00
Daci
eab8384da6
[Feature] ThreadPoolExecutor async fill_token_bitmask ( #5083 )
...
* ThreadPoolExecutor async fill_token_bitmask
* ThreadPoolExecutor async fill_token_bitmask logging
* fix test_guided_decoding
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* add fill_bitmask_parallel_batch_size ENV
* FD_FILL_BITMASK_BATCH fastdeploy.envs
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-19 10:04:16 +08:00
Daci
5fc12eddfe
[Optimization] xgrammar async compile, multi thread, speed up ( #4835 )
...
* xgrammar async compile, multi thread, speed up
* fix test_sampler.py & pre-commit err
* add redis version check && fix request.llm_engine_recv_req_timestamp
* xgrammar prefill & decode & v0
* fix test_gpu_prompt_logprobs.py
* add test_guided_decoding.py
* Update fastdeploy/scheduler/splitwise_scheduler.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/guided_decoding/xgrammar_backend.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/guided_decoding/xgrammar_backend.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix torch xgrammar unittest env
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-14 18:05:26 +08:00
SunLei
3098aee05f
[Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency ( #4839 )
...
* feat(zmq): support tensor transmission with zero-copy for improved efficiency
* perf: zmq.send disable copy
* zmq recv data for debug
* convert logprobs tensor to cpu
2025-11-11 15:43:11 +08:00
chen
1c3ca48128
[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs ( #4769 )
2025-11-05 10:43:25 +08:00
GoldPancake
1f3ce65b58
[Feature] support mtp distribution equivalence verification ( #4699 )
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-10-31 11:45:04 +08:00
GoldPancake
fddda50cb9
Add ut for speculative sampler ( #4650 )
2025-10-30 10:37:49 +08:00
李泳桦
a012e3608b
[Feature] support logits processors ( #4515 )
...
* [feat] provide an interface for logits processors and a builtin LogitBiasLogitsProcessor
* [chore] fix code style
* [fix] add unit test & fix existing bugs
* [feat] add engine/worker arg --logits-processors
* [fix] redefine user args as logits_processors_args and fix some bugs
* [fix] fix test_sampler
* Update fastdeploy/model_executor/logits_processor/builtin.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/logits_processor/__init__.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/model_executor/test_logits_processor.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* [fix] fix typo
* Update fastdeploy/engine/sampling_params.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* [fix] fix bracelet
* [chore] redefine logits processor interface: pass the entire share_inputs into LP, do not copy share_inputs and logits
* [doc] add docs
* [fix] fix logit bias processor not applied when decoding is too fast & add docs and tests
* [fix] fix redundant code
* [feat] skip apply() if no bias is specified
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-10-29 00:08:53 +08:00
RAM
25a983ba9c
1.fix the bug of draft model with ep 2.fix sampler bug ( #4589 )
2025-10-27 17:47:34 +08:00
chen
5c63a089f6
[Feature] Support logprobs_mode ( #4567 )
2025-10-27 14:27:48 +08:00
GoldPancake
47595a2480
[Feature] support mtp logprob ( #4464 )
...
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
* support mtp logprob
* fix unitest
2025-10-20 15:18:12 +08:00
Jianyu Li
3bbe99eae7
[Intel HPU] Enable dist sampler on intel hpu platform ( #4445 )
2025-10-16 19:02:27 +08:00
RAM
aa27b03bc0
[Executor]CUDAGraph support Speculate Decode ( #3769 )
...
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
* success run ngram
* Revert "[Code Simplification] remove cum_offsets (#3410 )"
This reverts commit 32b39620bc .
* success run ngram5 tp4 42bs
* success run ngram5 tp4 42bs
* mtp draft commit
* add decorator for target model
* enable draft model in cudagraph v0.5
* revert revrt cum_offset
* enable target model in cudagraph v0.9 And clean debug code
* Revert "success run ngram"
This reverts commit 8351e83993 .
* add reverted code
* enable target model in cudagraph v0.9
* solve comment
* fix bid < 0
* Enable Target Model Padding And Draft Model in cudagraph
* solve problem
* delete rebuild padding debug note
* fast compile
* Add capture list for mtp
* success run 256 tp1 mtp
* Enable Lite TP2 Bsz256
* realy enable tp2 bsz 256
* fix problem
* Solve problem for Draft model in cudagraph
* Solve comment
* replace emptytensor as zeros
* Solve comments
* Revert "fast compile"
This reverts commit 834639a7ff .
* fix bug
* fix merge bug
* fix typo
* fix bug
---------
Co-authored-by: lizexu <2694294196@qq.com >
Co-authored-by: littledgg <1658565283@qq.com >
Co-authored-by: zeroRains <linjunlu@zerorains.top >
Co-authored-by: gongshaotian <gstain5555@outlook.com >
2025-10-09 21:18:29 +08:00
fmiao2372
f1b5392e20
[Intel HPU] Support intel hpu platform ( #4161 )
...
* [Intel HPU] Support intel hpu platform
* fix some issues
* apply precommit and move AttentionBackend_HPU
* fix format issue
* correct ops import
* fix ci issue
* update code in layers
* fix code style issue
* remove dense tp moe ep mode
* fix enc_dec_block_num
* fix rebase issue
* rename hpu to gaudi in readme
* rename ForwardMeta_HPU to HPUForwardMeta
2025-09-24 12:27:50 +08:00
YuanRisheng
2e9e53ff7e
[FDConfig]Remove max_num_batched_tokens/max_num_seqs in parallel config ( #4116 )
...
* remove max_num_batched_tokens in parallel config
* remove max_num_seqs
* update test case
* fix test
* fix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2025-09-17 10:43:35 +08:00
co63oc
8466219ec8
fix typos ( #3840 )
...
* fix typos
* ci
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-09-12 11:04:38 +08:00
kevin
1908465542
[Feature] mm and thinking model support structred output ( #2749 )
...
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
* mm support structured output
* update code
* update code
* update format
* update code
* update code
* add enable_thinking default
* update code
* add structured_outputs test case
* add ci install xgrammar
* add ci timeout time
* update test for structured_outputs
* update code
* add error traceback info
* update error msg
* update structred output code
* update code
* update code
* update config
* update torch version
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2025-09-02 16:21:09 +08:00
co63oc
d6369b4d51
fix typos ( #3684 )
2025-09-01 17:50:17 +08:00
chen
9cab3f47ff
[Feature] Add temp_scaled_logprobs and top_p_normalized_logprobs parameters for logits and logprobs post processing ( #3552 )
...
* [feature] Add temp_scaled_logprobs and top_p_normalized_logprobs parameters for logits and logprobs post processing
* infer engine support temp_scaled_logprobs and top_p_normalized_logprobs
* delete some code
* code check
* code check and add doc
* fix tokenizer.decoder(-1), return 'Invalid Token'
* add ci for temp_scaled and top_p logprobs
* check test
* check seq len time shape
* logprob clip inf
---------
Co-authored-by: sunlei1024 <sunlei5788@gmail.com >
2025-08-25 14:11:49 +08:00
Zero Rains
8b12c80f90
[FixBug] compute early stopping with real batch size ( #3418 )
...
* [FixBug] compute early stopping with real batch size
* update
* fix test_sampler
2025-08-18 22:09:21 -07:00
chen
5585cf7aa5
fix mtp_rej_topp input ( #3450 )
2025-08-18 16:12:42 +08:00
chen
f0f00a6025
[OPs] Universal optimization and Fix early_stop cuda 700 ( #3375 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* delete nonzero
* delete setup_ops_base.py
* check if
* check gcp infer_seed.cpu()
* fix repetition_early_stopper_kernel cuda 700
2025-08-14 22:40:44 +08:00
Kane2011
b4fef2cf29
[MetaxGPU] Support FastDeploy on metax gpu ( #3241 )
...
* [MetaxGPU] Support FastDeploy on metax gpu
* Update metax_worker.py
1. change worker log;
2. remove custom allreduce, adapt it later;
3. remove cuda graph;
* Update __init__.py
1. remove metax's key work comment
* Update __init__.py
1. remove metax's key word comment;
2. add fused_moe_kernel_paddle import
---------
Co-authored-by: yongqiangma <xing.wo@163.com >
2025-08-13 11:11:54 +08:00
freeliuzc
71267840f7
【Fix】fix mtp bug ( #3139 )
2025-08-08 13:30:12 +08:00
yzwu
fbdd6b0663
[Iluvatar GPU] Optimze attention and moe performance ( #3234 )
2025-08-08 10:51:24 +08:00
lizexu123
afff4d37ea
[Feature] support seed parameter ( #3161 )
...
* support seed
* fix
* add SamplingMetadata seed test
* The next_tokens values are inconsistent!
* add air and rejection seed test
* fix
* add SamplingParams seed test
* fix seed=0
* Default to defualt
* fix
* fix args_utils
* fix review
* fix review
* fix
* fix
* add xpu,gcu,iluvatar support seed
* fix
2025-08-06 15:20:47 +08:00
Zero Rains
b2f9a42d87
[Feature] Support repetition early stop ( #3024 )
...
* support repetition early stop and support user to set the parameter
* remove log
* fix codestyle
* add the early_stop_config to rollout_config
* update config and EarlyStopper class
* fix the bug for triton
* modify the stop method
* update description
* modify the usage for stop_flags
---------
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2025-07-29 22:42:54 +08:00
Zero Rains
0fb37ab7e4
update flake8 version to support pre-commit in python3.12 ( #3000 )
...
* update flake8 version to support pre-commit in python3.12
* polish code
2025-07-24 01:43:31 -07:00
lifulll
2c6a9e887e
native top_p_sampling ( #2901 )
2025-07-22 14:09:59 +08:00
lizexu123
67990e0572
[Feature] support min_p_sampling ( #2872 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* Fastdeploy support min_p
* add test_min_p
* fix
* min_p_sampling
* update
* delete vl_gpu_model_runner.py
* fix
* Align usage of min_p with vLLM
* fix
* modified unit test
* fix test_min_sampling
* pre-commit all files
* fix
* fix
* fix
* fix xpu_model_runner.py
2025-07-20 23:17:59 -07:00
Zero Rains
25698d56d1
polish code with new pre-commit rule ( #2923 )
2025-07-19 23:19:27 +08:00
ming1753
1f15ca21e4
[Feature] support prompt repetition_penalty ( #2806 )
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-17 12:05:52 +08:00
freeliuzc
7cdd8d290d
[MTP] optimize mtp infer speed ( #2840 )
Deploy GitHub Pages / deploy (push) Has been cancelled
2025-07-14 19:50:22 +08:00
Sunny-bot1
240d6236bc
[Fix]fix top_k_top_p sampling ( #2801 )
...
Deploy GitHub Pages / deploy (push) Has been cancelled
* fix topk-topp
* update
* add base_non_truncated
2025-07-10 22:35:10 +08:00
chen
d33105baeb
[Feature] Online Chat API Support Return logprobs ( #2777 )
...
* online chat support logprobs
* check xpu
* check vl_gpu_model_runner and xpu_model_runner
* get_worker() check platform
2025-07-10 16:33:40 +08:00
Sunny-bot1
1e2319cbef
Rename top_p_sampling to top_k_top_p_sampling ( #2791 )
2025-07-10 00:09:25 -07:00
Sunny-bot1
e45050cae3
[Feature] support top_k_top_p sampling ( #2753 )
...
* support top_k_top_p sampling
* fix
* add api param
* add api para
* fix
* fix
* fix
* fix
* fix
* fix
* fix
2025-07-09 20:58:58 -07:00
GoldPancake
f7cad30a38
[Feature] Add speculative decoding simulation benchmark. ( #2751 )
...
* Add speculative decoding simulation benchmark
* Fix the name of the parameter
2025-07-09 12:08:43 +08:00
EnflameGCU
d0f4d6ba3a
[GCU] Support gcu platform ( #2702 )
...
baseline: e7fa57ebae
Co-authored-by: yongqiangma <xing.wo@163.com >
2025-07-08 13:00:52 +08:00
liddk1121
1b54a2831e
Adapt for iluvatar gpu ( #2684 )
2025-07-07 16:53:14 +08:00
Jiang-Jia-Jun
92c2cfa2e7
Sync v2.0 version of code to github repo
2025-06-29 23:29:37 +00:00
jiangjiajun
684703fd72
[LLM] First commit the llm deployment code
2025-06-09 19:20:15 +08:00