Commit Graph

766 Commits

Author SHA1 Message Date
chenjian
c487b62ee0 [Bug fix] Fix memory allocation (#3475)
* Support batched tokens for EP

* Support batched tokens for EP and fix bug

* Fix bug for memory allocation
2025-08-19 19:48:24 +08:00
chenjian
d2f6c3b998 [Bug fix] Fix bug when seq_len_encoder is 1 (#3467) 2025-08-19 15:21:32 +08:00
chenjian
aba94169dc [Feature] Support batched tokens for EP (#3415)
* Support batched tokens for EP

* Support batched tokens for EP and fix bug
2025-08-18 11:43:36 +08:00
chenjian
3f86ae0007 fix cache messager bug when decode node restarts (#3386) 2025-08-14 11:43:59 +08:00
chenjian
89177d881c [Bug fix] Fix zmq core bug (#3357)
* [Bug fix] Fix zmq core bug caused by concurrent use across threads

* Fix zmq core bug caused by concurrent use across threads
2025-08-13 20:24:39 +08:00
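The zmq core fix above concerns a socket shared across threads; ZMQ sockets are not thread-safe, and the usual remedy is to serialize access with a lock (or give each thread its own socket). A minimal sketch of the locking pattern, with a plain list standing in for the socket (names are illustrative, not the repo's code):

```python
import threading

class LockedSender:
    """Serialize sends on a transport that is not thread-safe (e.g. a zmq socket)."""

    def __init__(self, transport):
        self._transport = transport
        self._lock = threading.Lock()  # only one thread may send at a time

    def send(self, msg):
        with self._lock:
            self._transport.append(msg)  # stand-in for socket.send(msg)

sent = []
sender = LockedSender(sent)
threads = [threading.Thread(target=sender.send, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```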
chenjian
7573802a88 [Feature] Support mtp ep in fd (#3340)
* [Optimize] Add metrics for analysing perf

* Fix bug in mtp
2025-08-11 21:49:44 +08:00
chenjian
110f33a530 [Bug fix] Test td cache messager (#3242)
* support disable cache task in decode node

* fix bugs

* Update engine.py

* Update expert_service.py

* Update splitwise_connector.py

* Optimize log for debug

* fix bug

---------

Co-authored-by: ltd0924 <ltd0924@sina.com>
Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com>
2025-08-06 15:52:45 +08:00
chenjian
a4572a5e5d fix bug for pd step signal (#3230) 2025-08-06 10:41:52 +08:00
chenjian
a9d231c900 Fix bug for concurrent access to zmq (#3233) 2025-08-06 10:41:10 +08:00
ltd0924
b20ffe3697 [Feature] optimize expert parallel (#3196)
* optimize

* Update expert_service.py

* Update worker_process.py

* optimize
2025-08-05 17:34:24 +08:00
ltd0924
dcf9c2daff [Feature] Optimize prefix cache (#3208)
* [LLM] support ep

* Update worker_process.py

* Update expert_service.py

* Update worker_process.py

* format files

* optimize prefix cache

* pre commit format

* Update cache_messager.py
2025-08-05 17:13:11 +08:00
chenjian
9f9971844f [Feature] Support ep pd with external module (#3194)
* Support external module

* refactor code to make it more clear

* fix according to review

* fix bug

* merge

---------

Co-authored-by: root <root@tjdm-inf-sci-k8s-hzz2-h12ni8-0202.tjdm.baidu.com>
2025-08-04 20:32:41 +08:00
gaoziyuan
0443587a57 [Feature] support qwen3 name_mapping (#3179)
* add fd plugins && rm model_classes

* fix reviews

* add docs

* fix

* fix unittest ci

* support qwen3 name_mapping
2025-08-04 01:34:07 -07:00
ltd0924
c9e6ce1518 Update cache_messager.py (#3172) 2025-08-04 14:32:34 +08:00
gaoziyuan
4021d66ea5 [Feature] add fd plugins && rm model_classes (#3123)
* add fd plugins && rm model_classes

* fix reviews

* add docs

* fix

* fix unittest ci
2025-08-03 19:53:20 -07:00
bukejiyu
1582814905 fix load_pre_sharded_checkpoint (#3152)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-04 10:44:20 +08:00
ApplEOFDiscord
b71cbb466d [Feature] remove dependency on enable_mm and refine multimodal's code (#3014)
* remove dependency on enable_mm

* fix codestyle check error

* update docs

* resolve conflicts on model config

* fix unit test error

* fix code style check error

---------

Co-authored-by: shige <1021937542@qq.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-01 20:01:18 +08:00
yangjianfengo1
64d7a3194d Support FA3 in centralized deployment (#3112) 2025-08-01 18:03:36 +08:00
Ryan
94264bbf60 [Code Simplification] Refactor Post-processing in VL Model Forward Method (#2937)
* remove unused code

* refactor model forward

* mv bool index to kernel
2025-08-01 17:28:07 +08:00
yinwei
3a4db15765 Fix out-of-memory issue during single-XPU deployment (#3133) 2025-08-01 17:12:03 +08:00
chen
a2f5cc54f8 moe preprocess op support 160 experts and fused_moe triton kernel name add K (#3121) 2025-08-01 10:46:20 +08:00
SunLei
dade19d7a4 [Feature] General support for logprobs (#2974)
* [Feature] support logprobs in chat/completions and completions endpoints

* Temporarily comment out text_offset due to incorrect logic

* Clean up temporary debug prints

* [Feature] support logprobs in offline mode via SamplingParams

* fix: serialize Logprob as dict before zmq send to fix msgpack error

* refactor: remove redundant methods to simplify codebase

* Fix missing fields in CompletionOutput.to_dict affecting msgpack serialization

* refactor: centralize param validation in engine_client to reduce duplication

* revert: rollback changes in offline_demo.py

* [bugfix] fix parameter validation for logprobs

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 20:25:56 +08:00
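One bullet in the logprobs PR above notes that Logprob objects had to be serialized as dicts before being sent over zmq, since msgpack encodes plain dicts but not arbitrary classes. A sketch of that pattern using dataclasses (the actual Logprob fields in the repo may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class Logprob:
    logprob: float
    rank: int
    decoded_token: str

def to_wire(obj: Logprob) -> dict:
    # wire formats like msgpack handle plain dicts, not custom classes
    return asdict(obj)

payload = to_wire(Logprob(logprob=-0.25, rank=1, decoded_token="hello"))
```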
chenjian
fe17410f9c [BUG] Fix bug for pd in fd (#3034)
* Fix bug for pd in fd

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 20:17:27 +08:00
Yuan Xiaolan
5f56d289a7 fix is_permuted (#3098)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:58:05 +08:00
LiqinruiG
25005fee30 [Doc] add chat_template_kwargs and update params docs (#3103)
* add chat_template_kwargs and update params docs

* update enable_thinking

* pre-commit

* update test case

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:44:06 +08:00
kevin
22cab724e8 [Feature] block scheduler v1 support prefix caching (#3061)
* block scheduler v1 support prefix cache

* update code

* fix code bug

* add timeout time

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:29:19 +08:00
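Prefix caching in a block scheduler, as in the PR above, typically keys fixed-size token blocks by a chained hash of the prefix, so a hash match implies the entire prefix matches and its KV-cache blocks can be reused. A simplified sketch of the scheme (assumed structure, not the repo's implementation):

```python
import hashlib

BLOCK_SIZE = 4

def block_hashes(token_ids):
    """Hash each full block together with the previous hash, chaining the prefix."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode("utf-8")).digest()
        hashes.append(prev)
    return hashes

cache = {}  # block hash -> physical block id

def match_prefix(token_ids):
    """Return how many leading blocks were already cached; cache the rest."""
    hits = 0
    for h in block_hashes(token_ids):
        if h in cache:
            hits += 1
        else:
            cache[h] = len(cache)  # allocate a new block id
    return hits

first = match_prefix([1, 2, 3, 4, 5, 6, 7, 8])   # cold cache
second = match_prefix([1, 2, 3, 4, 5, 6, 9, 9])  # shares the first block
```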
chenjian
32307283f1 Fix bug for offline inference in scheduler v1 (#3117) 2025-07-31 17:54:24 +08:00
RAM
d850660872 [Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel (#2989)
* reset decoder_block_shape_q buffer

* refactor GetBlockShapeAndSplitKVBlock Kernel and cudagraph padding batch

* update decode_max_tile_size

* fix pre-commit

* update block_multihead_attn_backend

* update flash attn backend

* update MLA Attention

* update XPU Attention

* update gcu,iluvatar model runner

* Update MTP

* fix MTP bug
2025-07-31 00:09:31 +08:00
chenjian
fe0e3f508b [BUG FIX] Fix bug when preempted request rescheduled (#3080)
* Fix bug when preempted request rescheduled
2025-07-30 22:25:47 +08:00
Jiang-Jia-Jun
0616c208d2 [Feature] Support include_stop_str_in_output in completion api (#3096)
* [Feature] Support include_stop_str_in_output in completion api

* Fix ci test

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-30 22:18:48 +08:00
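The include_stop_str_in_output flag added above controls whether the matched stop string is kept in the returned text. A sketch of the trimming logic (hypothetical helper, not the repo's code):

```python
def finalize_text(text: str, stop: str, include_stop_str_in_output: bool) -> str:
    """Cut generation at the stop string, optionally keeping the stop string itself."""
    idx = text.find(stop)
    if idx == -1:
        return text  # stop string never matched
    end = idx + len(stop) if include_stop_str_in_output else idx
    return text[:end]

kept = finalize_text("Hello world.END extra", "END", True)
trimmed = finalize_text("Hello world.END extra", "END", False)
```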
YuanRisheng
7dfdd157ac [BugFix]Fix ep size (#3092)
* fix ep

* fix num_layer
2025-07-30 21:03:12 +08:00
ltd0924
d17886de19 [Feature] support ep in mixed mode (#3001)
* [LLM] support ep

* Update worker_process.py

* Update expert_service.py

* Update worker_process.py

* format files
2025-07-30 20:43:39 +08:00
Zhida Hu
3f8a41e68c [*] fix the memory leak when modifying QP to RTS fails (#3051)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-30 19:49:07 +08:00
李泳桦
b242150f94 [feat] extra parameters are now passed directly via the http payload, or via extra_body when using the openai client (#3058)
* [feat] extra parameters are now passed directly via the http payload, or via extra_body when using the openai client

* [fix] delete ci test case for enable_thinking

* [fix] add reasoning_parser when server starts

* [fix] fix ci consistency test error with reasoning parser

* [doc] update docs related to metadata

* [fix] cancel enable_thinking default value
2025-07-30 19:25:20 +08:00
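With the change above, engine-specific parameters ride in the request body itself; the OpenAI client only forwards unknown fields if they are placed under extra_body, which the server then flattens into the top-level payload. A sketch of that server-side merge (hypothetical field names):

```python
def flatten_request(payload: dict) -> dict:
    """Merge extra_body fields into the top-level payload, as sent by an OpenAI client."""
    merged = dict(payload)
    merged.update(merged.pop("extra_body", {}))
    return merged

req = flatten_request({"model": "m", "extra_body": {"enable_thinking": True}})
```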
bukejiyu
db698bda01 qwen loader (#3057) 2025-07-30 19:09:38 +08:00
zhink
d89b6dd43f adapt qwen3 moe attr for init (#3066)
2025-07-30 16:49:28 +08:00
bukejiyu
8e203666d9 w4a8 offline (#3074)
* w4a8 offline

* update
2025-07-30 16:33:30 +08:00
ming1753
5acde4eb43 [Feature] Multimodal Scheduler V1 (#3019)
* [Feature] Support multimodal scheduler v1

* remove debug log

* fix bug

* fix format

* modify code

* fix bug

* modify code
2025-07-30 16:05:55 +08:00
Jiang-Jia-Jun
ffa0f4d99b [Fix] Fix version function (#3076)
* [Fix] Fix version function

* Fix commit

* fix code sync

* Update coverage_run.sh

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-30 16:05:24 +08:00
ltd0924
ecf2fd5b9a [BugFix] vl encoder tokens dtype problem (#3069) 2025-07-30 15:20:53 +08:00
Yuan Xiaolan
35935da9e5 support W4A8 EPLB (#3075) 2025-07-30 14:34:12 +08:00
Yzc216
159767717d [Feature] multi source download (#3072)
* multi-source download

* multi-source download

* huggingface download revision

* requirement

* style

* add revision arg

* test

* pre-commit

* Change default download

* change requirements.txt

* modify English Documentation

* documentation

* modify model download path
2025-07-30 14:10:13 +08:00
YuanRisheng
99a70fc722 unify parallel config (#3070) 2025-07-30 11:41:23 +08:00
Sunny-bot1
74aa31d15b [Feature] support bad_words (#3055)
* support bad_words

* support online infer bad_words

* update

* add CI test

* update

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-30 09:31:29 +08:00
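bad_words support, as added above, is usually implemented by masking the logits of banned token ids before sampling so those tokens can never be chosen. A minimal sketch with plain Python floats (illustrative; real implementations operate on tensors):

```python
import math

def apply_bad_words(logits, bad_token_ids):
    """Set banned tokens' logits to -inf so they are unsampleable."""
    out = list(logits)
    for tid in bad_token_ids:
        out[tid] = -math.inf
    return out

masked = apply_bad_words([0.1, 2.3, -0.5, 1.7], bad_token_ids=[1, 3])
best = max(range(len(masked)), key=masked.__getitem__)  # greedy pick ignores banned ids
```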
zhuzixuan
ad7bb52a28 Fix error when max_tokens=1 is passed (#3068)
* Fix error when max_tokens=1 is passed
2025-07-29 23:49:28 +08:00
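The max_tokens=1 error above was a parameter-validation edge case; a typical guard checks that max_tokens is a positive integer and caps it against the remaining context window. A sketch of such validation (hypothetical checks, not the repo's exact logic):

```python
def validate_max_tokens(max_tokens, context_len=8192, prompt_len=0):
    """Reject non-positive values; cap against the remaining context window."""
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return min(max_tokens, context_len - prompt_len)

ok = validate_max_tokens(1, context_len=8192, prompt_len=100)  # 1 is a legal value
```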
Ryan
73cfe1fd37 [SOT] Extend SOT warmup support to new hardware (#3032)
* add new hardware

* add_sot_warmup4new_hardware

* fix conflict

* rm Optional
2025-07-29 22:45:20 +08:00
Zero Rains
b2f9a42d87 [Feature] Support repetition early stop (#3024)
* support repetition early stop and support user to set the parameter

* remove log

* fix codestyle

* add the early_stop_config to rollout_config

* update config and EarlyStopper class

* fix the bug for triton

* modify the stop method

* update description

* modify the usage for stop_flags

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-29 22:42:54 +08:00
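Repetition early stop, as introduced above, terminates generation when the model keeps emitting the same output. A toy stopper that fires after the same token id repeats a configurable number of times in a row (the PR's actual EarlyStopper criteria and config may differ):

```python
class RepetitionStopper:
    """Signal stop once one token id is generated `window` times consecutively."""

    def __init__(self, window=3):
        self.window = window
        self.last = None
        self.count = 0

    def step(self, token_id) -> bool:
        if token_id == self.last:
            self.count += 1
        else:
            self.last, self.count = token_id, 1
        return self.count >= self.window

stopper = RepetitionStopper(window=3)
flags = [stopper.step(t) for t in [5, 7, 7, 7, 8]]  # fires on the third 7
```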
Yuan Xiaolan
3214fb5393 support model loading for w4a8 offline quant (#3064)
Support loading offline-quantized weights for W4A8 EP
2025-07-29 21:54:37 +08:00
Longzhi Wang
be0a0f2bb2 fix argument error in ep when pd (#3060) 2025-07-29 17:17:24 +08:00
YuanRisheng
502ee92a0a Unify server-side and model-side Config (Part3) (#3047)
* merge model config

* fix arch

* fix rl
2025-07-29 17:07:44 +08:00