Compare commits

161 Commits

Author SHA1 Message Date
chenjian
c49c43d51c [Bug fix] Fix perf in mixed deployment with yiyan adpater (#3703)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-09-01 14:06:09 +08:00
chenjian
a424ab907f [Bug fix] Fix prefix cache in v1 (#3710)
* [Bug fix] Fix prefix cache in V1

* add comment
2025-09-01 10:14:25 +08:00
chenjian
10a95f8ed5 [Fix] Do not drop result when request result slowly (#3704)
* [Fix] Do not drop result when request result slowly

* set default FD_ZMQ_SNDHWM to 64k
2025-09-01 10:14:04 +08:00
RAM
b9af800edd [Optimize] Increase zmq buffer size to prevent apiserver too slowly to consume (#3723) (#3728)
Co-authored-by: chenjian <1435317881@qq.com>
2025-08-30 15:58:18 +08:00
Zero Rains
64cf769bee fix the bug when num_key_value_heads < tensor_parallel_size (#3722) 2025-08-30 12:40:29 +08:00
Jiang-Jia-Jun
3364af767b Revert "[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHE…" (#3719)
This reverts commit 578b8c5da2.
2025-08-29 19:55:50 +08:00
lizexu123
578b8c5da2 [BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. (#3670)
* merge 2.1

* fix

* pre-commit

* fix
2025-08-29 19:53:44 +08:00
ltd0924
8517e04956 [bugfix]PR3663 parameter is 0 (#3679)
* Update engine.py

* Update engine_client.py

* Update engine.py

* Update engine.py
2025-08-29 11:46:42 +08:00
李泳桦
aad9d3564e [feat] add metrics for yiyan adapter (#3615)
* [feat] add metrics for yiyan adapter (#3219)

* [feat] add metrics for yiyan adapter

* [fix] fix metrics num_requests_waiting and num_requests_running

* [fix] fix metrics gpu_cache_usage_perc

* [refactor] change where requests_number increases

* [chore] rename xxx_block_num as xxx_gpu_block_num, and update their values accordingly

* [chore] delete useless code

* [fix] fix error
2025-08-28 21:16:58 +08:00
Jiang-Jia-Jun
6039cdc2c5 Revert "[BugFix] fix parameter is 0 (#3663)" (#3681)
This reverts commit 6a90cfd144.
2025-08-28 15:55:55 +08:00
李泳桦
6545994c58 [fix] qwen output inconsistency when top_p=0 (#3634) (#3662)
* [fix] qwen output inconsistency when top_p=0

* [fix] remove decode pre_id code
2025-08-28 09:54:17 +08:00
ltd0924
6a90cfd144 [BugFix] fix parameter is 0 (#3663)
* Update engine.py

* Update engine_client.py
2025-08-28 09:52:17 +08:00
YuBaoku
47e6270dec [CI] add container naming and cleanup logic in workflows (#3655) 2025-08-27 21:51:56 +08:00
zhuzixuan
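80db7fce05 [Bugfix] Fix significant performance degradation of the 0.3B model on the 2.1 branch (#3624)
* Restore the async method.
[BugFix] Support echo in the completion API (#3245)

* wenxin-tools-511: fix v1/completion being unable to echo the prompt.

* Support echo for multiple prompts

* Support streaming echo with multiple prompts

* Add unit tests for echo support in the completion API

* pre-commit

* Remove redundant test files

* Fix the unit test method for echo support in the completion API

* Add unit test files

* Add unit tests

* unittest

* Add unit tests

* Fix unit tests

* Remove unnecessary assert.

* Resubmit

* Update test method

* ut

* Verify the approach with a unit test

* Verify the approach with a unit test

* Verify the approach with a unit test 3

* Optimize the unit test code and narrow the test scope in a targeted way.

* Optimize the unit test code 2 and narrow the test scope in a targeted way.

* Optimize the unit test code 3 and narrow the test scope in a targeted way.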

* support 'echo' in chat/completion.

* update

* update

* update

* update

* update

* update

* Add unit tests for token IDs

* update

* Fix index error

* Fix index error

* [Bugfix] Significant performance degradation of 0.3B model on branch 2.1
2025-08-27 15:29:01 +08:00
ltd0924
96aed92e4a [BugFix] ep mixed mode offline exit failed (#3623) 2025-08-26 20:12:44 +08:00
SunLei
d8444e22ca fix: replace list * n initialization with list comprehension to avoid shared references (#3620) 2025-08-26 17:53:09 +08:00
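The shared-reference pitfall named in this commit title can be reproduced in a few lines. This is a minimal sketch, not the code touched by #3620; the variable names are illustrative assumptions.

```python
# Illustrative only: names are hypothetical, not taken from the FastDeploy code.
n = 3

# `[[]] * n` repeats a reference to ONE inner list n times.
shared = [[]] * n
shared[0].append("x")

# A list comprehension evaluates `[]` once per iteration, giving n distinct lists.
independent = [[] for _ in range(n)]
independent[0].append("x")

print(shared)       # [['x'], ['x'], ['x']]
print(independent)  # [['x'], [], []]
```

The repetition form is only safe when the repeated element is immutable (for example `[0] * n`); for nested lists or dicts the comprehension is the usual choice.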
李泳桦
df27a488b1 [fix] fix ZmqIpcClient.close() error (#3600) 2025-08-26 10:16:41 +08:00
李泳桦
b1f8f1aa07 [fix] fix completion stream api output_tokens not in usage (#3588) 2025-08-25 18:31:57 +08:00
zhuzixuan
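4e369c7fa7 [BugFix] Support echo in the completion API (#3477)
* update
[BugFix] Support echo in the completion API (#3245)

* wenxin-tools-511: fix v1/completion being unable to echo the prompt.

* Support echo for multiple prompts

* Support streaming echo with multiple prompts

* Add unit tests for echo support in the completion API

* pre-commit

* Remove redundant test files

* Fix the unit test method for echo support in the completion API

* Add unit test files

* Add unit tests

* unittest

* Add unit tests

* Fix unit tests

* Remove unnecessary assert.

* Resubmit

* Update test method

* ut

* Verify the approach with a unit test

* Verify the approach with a unit test

* Verify the approach with a unit test 3

* Optimize the unit test code and narrow the test scope in a targeted way.

* Optimize the unit test code 2 and narrow the test scope in a targeted way.

* Optimize the unit test code 3 and narrow the test scope in a targeted way.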

* support 'echo' in chat/completion.

* update

* update

* update

* update

* update

* update

* Add unit tests for token IDs

* update

* Fix index error

* Fix index error

* Resolve conflicts

* Resolve conflicts

* Resolve conflicts

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-23 13:08:48 +08:00
Zero Rains
f8d3255520 [Cherry-Pick] Launch expert_service before kv_cache initialization in worker_process (#3558)
* launch expert_service before kv_cache initialization

* update code

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-23 13:08:34 +08:00
chenjian
e8af92aab7 [Feature] Support mixed deployment with yiyan adapter (#3533)
* [Feature] Support mixed deployment with yiyan adapter

* [Feature] Support mixed deployment with yiyan adapter

* fix merge

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-23 09:56:47 +08:00
K11OntheBoat
8b9f167ccc Avoid tokenizer bug for XPU CI (#3563)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-08-23 00:09:56 +08:00
K11OntheBoat
93d999b830 [Feature] Support limit thinking len for text models (#3527)
* support limit thinking len

* remove default think_end_id

* remove reasoning_max_tokens

* update think_end_id for ernie

* update think_end_id for ernie.

---------

Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: luukunn <981429396@qq.com>
2025-08-22 14:48:15 +08:00
ltd0924
4d6fb96cd6 [BugFix] Api server bugs (#3530)
* Update serving_chat.py

* Update serving_completion.py

* Update serving_completion.py
2025-08-22 14:01:14 +08:00
ltd0924
c18975366e [BUGFIX] fix ep mixed bug (#3513)
* Update expert_service.py

* Update engine.py

* Update engine.py

* Update engine.py

* Update expert_service.py

* Update engine.py
2025-08-22 11:35:50 +08:00
luukunn
4a9c04a746 [Feature] add tool parser (#3518)
* [Feature] Pass through the `chat_template_kwargs` to the data processing module (#3421)

* fix chat_template_args

* fix args

* add offline

* add offline

* fix

* fix

* fix default enable_thinking value

* fix default enable_thinking value

* modify condition

* Revert "modify condition"

This reverts commit 26430bdeb1.

* fix unit test

* add Tool Parser (#3272)

* add tool-parser

* add tool-parser

* add tool parser

* add tool parser

* fix

* add offline

* add offline

* fix

* parsers:tool&reasoning

* Rename the tool parser

* update

* fix reasoning-parser

* add requirements

* fix finish reason

* fix

* fix reasoning-parser

* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: zhuzixuan <zhuzixuan@baidu.com>

* [Feature] add tool parser (#3483)

* add tool parser

* add x1 enable_thinking

* restart ci

* fix vl reasoning parser

* modify call style

* modify call style

* add offline enablethinking

* fix completion

* fix

* fix unit test

* fix unit test

* fix unit test

* fix vl reasoning parser

* fix vl reasoning parser

* fix unit test

---------

Co-authored-by: zhuzixuan <zhuzixuan@baidu.com>
2025-08-22 11:14:35 +08:00
RAM
d97aab25bc [Excutor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding (#3223) (#3512)
* Fully resolve the decode block-splitting issue

* update C8 and C4 kernel

* fix problem

* fix with pre-commit

* retain branch for mtp

Co-authored-by: Jundong Liu <61149469+littledgg@users.noreply.github.com>
2025-08-21 20:58:47 +08:00
李泳桦
1b399b91c0 [fix] setting disable_chat_template while passing prompt_token_ids led to response error (#3511)
* [fix] setting disable_chat_template while passing prompt_token_ids led to response error

* [fix] code syntax

* [test] add test case for this bug

* [test] add test case for empty message list

* [test] fix test case for empty message list
2025-08-21 17:33:10 +08:00
memoryCoderC
8bf48dfab8 [Feature] add prompt_tokens and completion_tokens (#3505)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-08-21 14:10:06 +08:00
lizexu123
fcdc5c2c54 fix num_seqs (#3396) 2025-08-21 14:03:11 +08:00
YuBaoku
5d4d38674f [CI] fix run_ci error in release/2.1 (#3499) 2025-08-21 10:07:20 +08:00
luukunn
d07338f932 [Feature] Pass through the chat_template_kwargs to the data processing module (#3421) (#3469)
* fix chat_template_args

* fix args

* add offline

* add offline

* fix

* fix

* fix default enable_thinking value

* fix default enable_thinking value

* modify condition

* Revert "modify condition"

This reverts commit 26430bdeb1.

* fix unit test
2025-08-19 17:40:12 +08:00
gaoziyuan
3ffbc98179 fix dynamic_weight config bug (#3432) 2025-08-18 14:36:53 +08:00
chenjian
edd13aad66 support logprob in v1 for release/2.1 (#3446) 2025-08-17 08:16:00 +08:00
RAM
1065406ed3 [Docs]Updata docs of graph opt backend (#3443)
* Updata docs of graph opt backend

* update best_practices

* update mkdocs.yaml

* [Docs]Update link
2025-08-15 22:10:54 +08:00
ming1753
570ad54b51 [Docs] release 2.1 (#3441)
* [Docs] release 2.1

* sync gh-pages.yml
2025-08-15 19:32:29 +08:00
yongqiangma
9af57513b3 update installation readme (#3435) 2025-08-15 18:44:39 +08:00
JYChen
2e6d97f5eb cherry-pick update docs (#3422) 2025-08-15 13:00:03 +08:00
Jiang-Jia-Jun
ff030d9090 Update Dockerfile.gpu 2025-08-15 12:29:37 +08:00
ltd0924
5a829fc7af [Docs] Add Multinode deployment document (#3416)
* Create multi-node_deployment.md

* Create multi-node_deployment.md
2025-08-15 09:55:34 +08:00
yinwei
d998efbc17 [Doc]Release fastdeploy-xpu 2.0.3 (#3408)
* fix v1 schedule oom bug

* fix v1 schedule oom bug

* update release note

* update info
2025-08-14 19:19:54 +08:00
yinwei
8a15bdc0c8 [Doc]Release fastdeploy-xpu 2.1.0 (#3407)
* fix v1 schedule oom bug

* fix v1 schedule oom bug

* update release note
2025-08-14 19:11:16 +08:00
memoryCoderC
ad8ea68906 [BugFix] fix ErnieProcessor not set raw_prediction (#3401) 2025-08-14 19:10:07 +08:00
yinwei
101605869c [XPU] Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER (#3393)
* fix v1 schedule oom bug

* fix v1 schedule oom bug
2025-08-14 17:41:40 +08:00
Jiang-Jia-Jun
28918702c2 Revert "Merge branch 'feature/online/vs_think_20250813' into release/2.1"
This reverts commit 02596fc537, reversing
changes made to 03347626a6.
2025-08-14 17:20:29 +08:00
Jiang-Jia-Jun
02596fc537 Merge branch 'feature/online/vs_think_20250813' into release/2.1 2025-08-14 17:13:36 +08:00
ltd0924
03347626a6 [BugFix] fix control signal release failed (#3374)
* [BugFix]

* [BugFix]

* [BugFix]

* [BugFix]

* fix

* fix

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-14 17:01:25 +08:00
YUNSHEN XIE
b2df0311b8 Optimize CI execution workflow. (#3371) (#3384)
* fix
2025-08-14 14:51:15 +08:00
xiaolei373
d1d321bafd feat(log):add_request_and_response_log (#3392) 2025-08-14 14:50:48 +08:00
Jiang-Jia-Jun
dc5d3ff5a0 [Polish Code] Remove useless notes 2025-08-14 14:05:29 +08:00
Jiang-Jia-Jun
f0a707e06f [BugFix] Fix default log level of paddleformers (#3377)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-08-14 11:36:13 +08:00
JYChen
4870919682 fix stopseq error info (#3342)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-08-14 10:45:05 +08:00
ming1753
a375378cc1 [Bug Fix] Fix V1 video bug (#3387) 2025-08-14 09:49:22 +08:00
YUNSHEN XIE
192f9caab4 Pre ce modified (#3335) (#3360)
* Pre ce modified (#3335)

* update

* update

* fix

* fix

* update

* update

* update

* fix

* update

* update

* update

* add ut fix pr(3367)
2025-08-13 18:50:52 +08:00
luukunn
81092c0fe3 add tool parser 2025-08-13 16:06:22 +08:00
YUNSHEN XIE
ad816f20f4 Use latest PaddlePaddle package (#3347) (#3352)
* Use latest PaddlePaddle package

* fix
2025-08-13 11:06:01 +08:00
memoryCoderC
37b76158f9 Completion add raw_prediction/text_after_process (#3362) 2025-08-12 23:20:36 +08:00
memoryCoderC
fe2094609f Release/2.1 (#3361)
* [BugFix] v1/completions add finish_reason

* update TestOpenAIServingCompletion for merge
2025-08-12 23:06:51 +08:00
gaoziyuan
b4bb54b56b bugfix (#3322) 2025-08-12 16:16:37 +08:00
Jiang-Jia-Jun
eeec4bd15e Remove useless code release/2.1 (#3338) 2025-08-12 11:32:50 +08:00
chenjian
d2592750f7 fix bug for scheduler v0 (#3306)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YUNSHEN XIE <1084314248@qq.com>
2025-08-12 00:41:15 +08:00
chenjian
25f51b0611 Fix block num in schduelr v1 for release 2.1 (#3315)
* fix bug for scheduler v0

* fix block num setting in scheduler v1 for release 2.1

* fix block num setting in scheduler v1 for release 2.1

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YUNSHEN XIE <1084314248@qq.com>
2025-08-12 00:41:05 +08:00
ming1753
9b07f85f6d [Bug Fix] fix vl V1 schedule bug (#3284)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YUNSHEN XIE <1084314248@qq.com>
2025-08-12 00:40:45 +08:00
Sunny-bot1
2fe31c6f0f [Docs]fix sampling docs 2.1 (#3333)
* [Docs]fix sampling docs (#3113)

* fix sampling docs

* fix sampling docs

* update

* fix docs
2025-08-11 21:04:10 +08:00
YUNSHEN XIE
a33e557732 fix ci pypi index error (#3327) 2025-08-11 20:24:27 +08:00
kevin
054c790642 fix uvicorn multi worker error (#3309)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-11 20:19:31 +08:00
Jiang-Jia-Jun
ca4e4ab911 Revert "[BugFix] fix ep (#3290)" (#3317)
This reverts commit 86ff68be4b.
2025-08-11 16:17:58 +08:00
chenjian
c000cff744 fix scheduler bug in release2.1 (#3295) 2025-08-10 13:55:22 +08:00
lizexu123
86ff68be4b [BugFix] fix ep (#3290)
* fix ep

* fix
2025-08-09 16:32:35 +08:00
yinwei
702c313ed1 revert pr (#3286) 2025-08-09 16:29:35 +08:00
ltd0924
6706ccb37e [BugFix] fix too many open files problem (#3275) 2025-08-08 20:11:32 +08:00
JYChen
1b6f482c15 [Cherry-pick] fix stop seq (#3263)
* fix out-bound value for stop sequence

* catch error if there are out-of-bounds value

* check in offline mode
2025-08-07 19:11:37 +08:00
sg263
5d3bf308f6 merge develop trace FD_START (#3253)
Co-authored-by: shige <shige@baidu.com>
2025-08-07 11:10:55 +08:00
Sunny-bot1
f672a34f95 [FIX 2.1]fix bad_words when sending requests consecutively (#3199)
* fix bad_words

* fix log

* fix log
2025-08-06 15:47:27 +08:00
lizexu123
bc0b92bba4 [BugFix] support real batch_size (#3109) (#3217)
* support real bsz

* fix

* fix xpu_model_runner.py,gpu_model_runner.py,gcu_model_runner.py,iluvatar_model_runner.py

* add event_loop_ep

* fix

* Add comments

* fix

* support mtp real_batch_size

* fix

* self.tmp_seq_lens_this_time->self.seq_lens_this_time_buffer

* fix

* fix VL real_seq_lens_this_time

* fix

* fix mtp

* fix

* fix mtp

* fix xpu

* fix
2025-08-06 14:30:33 +08:00
SunLei
3dd8492601 [Bugfix] Fix uninitialized decoded_token and add corresponding unit test (#3201)
* Update test_base_chat.py (#3183)

* [Bugfix] Fix uninitialized decoded_token and add corresponding unit test.

---------

Co-authored-by: Divano <dddivano@outlook.com>
2025-08-05 10:55:22 +08:00
RAM
bd77a3a643 [Bug Fix] Fix bug of MLA Attention Backend (#3178)
* fix typo

* fix mla attention backend
2025-08-05 10:53:27 +08:00
YUNSHEN XIE
9561603ed9 Apply CI fix from Develop (#3151)
* fix ci approve

* Describe PR diff coverage using JSON file (#3114)

* Refactored ci pipeline

* update

* Describe PR diff coverage using JSON file

* remove pip cache setting from Approve

* fix

* update

* fix ci (#3141)

* fix
2025-08-04 16:30:56 +08:00
plusNew001
e26313a355 Update Dockerfile.xpu (#3147) 2025-08-04 16:25:33 +08:00
yinwei
4367c09a5f Fix out-of-memory issue during single-XPU deployment (#3131) 2025-08-04 16:02:43 +08:00
bukejiyu
8e789dcb67 fix load_pre_sharded_checkpoint (#3152) (#3169)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-08-04 15:44:10 +08:00
ltd0924
5f6fc7f7b9 Update cache_messager.py (#3173) 2025-08-04 15:09:17 +08:00
RAM
d4059cabf0 fix typo (#3153) 2025-08-01 22:34:59 +08:00
chen
c8dd5976ae fix request_output sampling_params (#3154) 2025-08-01 22:34:33 +08:00
Jiang-Jia-Jun
4880c16be3 Update setup.py 2025-07-31 20:30:24 +08:00
SunLei
dade19d7a4 [Feature] General support for logprobs (#2974)
* [Feature] support logprobs in chat/completions and completions endpoints

* Temporarily comment out text_offset due to incorrect logic

* Clean up temporary debug prints

* [Feature] support logprobs in offline mode via SamplingParams

* fix: serialize Logprob as dict before zmq send to fix msgpack error

* refactor: remove redundant methods to simplify codebase

* Fix missing fields in CompletionOutput.to_dict affecting msgpack serialization

* refactor: centralize param validation in engine_client to reduce duplication

* revert: rollback changes in offline_demo.py

* revert: rollback changes in offline_demo.py

* [bugfix] fix parameter validation for logprobs

* [bugfix] fix parameter validation for logprobs

* [bugfix] fix parameter validation for logprobs

* [bugfix] fix parameter validation for logprobs

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 20:25:56 +08:00
chenjian
fe17410f9c [BUG] Fix bug for pd in fd (#3034)
* Fix bug for pd in fd

* Fix bug for pd in fd

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 20:17:27 +08:00
Zhang Yulong
1a543bca29 Fix test_EB_Lite_serving.py (#3119)
* Fix test_EB_Lite_serving.py

* fix test_EB_Lite_serving.py
2025-07-31 20:15:25 +08:00
Yuan Xiaolan
5f56d289a7 fix is_permuted (#3098)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:58:05 +08:00
LiqinruiG
25005fee30 [Doc] add chat_template_kwagrs and update params docs (#3103)
* add chat_template_kwagrs and update params docs

* add chat_template_kwagrs and update params docs

* update enable_thinking

* pre-commit

* update test case

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:44:06 +08:00
kevin
22cab724e8 [Feature] block scheduler v1 support prefix caching (#3061)
* block scheduler v1 support prefix cache

* update code

* update code

* fix code bug

* add timeout time

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-31 19:29:19 +08:00
chenjian
32307283f1 Fix bug for offline inference in scheduler v1 (#3117) 2025-07-31 17:54:24 +08:00
YUNSHEN XIE
583eae2fd1 fix ci (#3106)
* fix ci

* disable test_non_streaming_chat_with_min_tokens
2025-07-31 17:25:08 +08:00
JYChen
1ef38b1563 [doc] best practice for eb45 text models (#3002)
* [doc] best practice for eb45 text models

* fix docs
2025-07-31 17:21:55 +08:00
Jiang-Jia-Jun
4498058722 Update README.md 2025-07-31 15:33:12 +08:00
Jiang-Jia-Jun
66304cf921 Update sampling.md 2025-07-31 15:02:57 +08:00
yinwei
5b9aec1f10 xpu release 2.0.3 (#3105) 2025-07-31 14:26:07 +08:00
YUNSHEN XIE
66c3835a46 add approve ci (#3093)
* add approve ci

* fix

* fix
2025-07-31 10:10:10 +08:00
RAM
d850660872 [Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel (#2989)
* reset decoder_block_shape_q buffer

* refactor GetBlockShapeAndSplitKVBlock Kernel and cudagraph padding batch

* update decode_max_tile_size

* fix pre-commit

* update block_multihead_attn_backend

* update flas attn backend

* update MLA Attention

* update XPU Attention

* update gcu,iluvatar model runner

* Update MTP

* fix MTP bug
2025-07-31 00:09:31 +08:00
Jiang-Jia-Jun
998968f1e8 [Doc] Update parameters of serving 2025-07-30 22:35:01 +08:00
chenjian
fe0e3f508b [BUG FIX] Fix bug when preempted request rescheduled (#3080)
* Fix bug when preempted request rescheduled

* Fix bug when preempted request rescheduled

* Fix bug when preempted request rescheduled
2025-07-30 22:25:47 +08:00
Jiang-Jia-Jun
0616c208d2 [Feature] Support include_stop_str_in_output in completion api (#3096)
* [Feature] Support include_stop_str_in_output in completion api

* Fix ci test

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-30 22:18:48 +08:00
YuanRisheng
7dfdd157ac [BugFix]Fix ep size (#3092)
* fix ep

* fix num_layer
2025-07-30 21:03:12 +08:00
ltd0924
d17886de19 [Feature] support ep in mixed mode (#3001)
* [LLM] support ep

* Update worker_process.py

* Update expert_service.py

* Update worker_process.py

* format files
2025-07-30 20:43:39 +08:00
JYChen
bd29b2aaca add stop_seqs doc (#3090) 2025-07-30 20:36:18 +08:00
Jiang-Jia-Jun
6ead7a3a49 Update setup.py 2025-07-30 20:21:41 +08:00
YUNSHEN XIE
e4ba9a0dde debug use (#3095) 2025-07-30 20:18:36 +08:00
Zhida Hu
3f8a41e68c [*] fix the memory leak when modify qp to rts failed (#3051)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-07-30 19:49:07 +08:00
李泳桦
b242150f94 [feat] extra parameters are all passed directly via http payload now, or in extra_body if using openai client (#3058)
* [feat] extra parameters are all passed directly via http payload now, or in extra_body if using openai client

* [fix] delete ci test case for enable_thinking

* [fix] add reasoning_parser when server starts

* [fix] fix ci consistency test error with reasoning parser

* [doc] update docs related to metadata

* [fix] cancel enable_thinking default value
2025-07-30 19:25:20 +08:00
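A rough sketch of the two request shapes this commit title describes. It assumes `enable_thinking` is one such extra parameter (it is named in the bullets above, but treating it as the example here is an assumption). The helper names, model name, and message are hypothetical, and nothing is sent over the network.

```python
import json

# Hypothetical example values; only `enable_thinking` appears in the commit notes.
BASE = {"model": "example-model", "messages": [{"role": "user", "content": "hi"}]}
EXTRA = {"enable_thinking": False}


def raw_http_payload(base: dict, extra: dict) -> dict:
    """Raw HTTP: extra parameters sit at the top level of the JSON body."""
    return {**base, **extra}


def openai_client_kwargs(base: dict, extra: dict) -> dict:
    """OpenAI Python client: extra parameters go under `extra_body`,
    which the client merges into the JSON body it sends."""
    return {**base, "extra_body": dict(extra)}


print(json.dumps(raw_http_payload(BASE, EXTRA)))
print(openai_client_kwargs(BASE, EXTRA))
```

With the OpenAI client, the second dict would be unpacked into `client.chat.completions.create(**kwargs)`; the server should then receive the same top-level fields as in the raw HTTP case.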
bukejiyu
db698bda01 qwen loader (#3057) 2025-07-30 19:09:38 +08:00
AIbin
28fff1b035 Revert "Add uinttest for moe_ffn_wint2. (#3037)" (#3085)
This reverts commit 327e1943fa.
2025-07-30 19:04:07 +08:00
YuanRisheng
acc5c0aa85 add ci for custom op approve (#3079) 2025-07-30 16:50:20 +08:00
zhink
d89b6dd43f adapter qwen3 moe attr for init (#3066)
adapter qwen3 moe attr for init
2025-07-30 16:49:28 +08:00
bukejiyu
8e203666d9 w4a8 offline (#3074)
* w4a8 offline

* update

* update

* update
2025-07-30 16:33:30 +08:00
ming1753
5acde4eb43 [Feature] Multimodal Scheduler V1 (#3019)
* [Feature] Support multimodal scheduler v1

* remove debug log

* fix bug

* fix format

* modify code

* fix bug

* fix bug

* fix bug

* modify code
2025-07-30 16:05:55 +08:00
Jiang-Jia-Jun
ffa0f4d99b [Fix] Fix version function (#3076)
* [Fix] Fix version function

* Fix commit

* Fix commit

* fix code sync

* Update coverage_run.sh

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-30 16:05:24 +08:00
ltd0924
ecf2fd5b9a [BugFix] vl encoder tokens dtype problem (#3069) 2025-07-30 15:20:53 +08:00
YuanRisheng
eeadbf332a delete unused unittest (#3065) 2025-07-30 15:11:58 +08:00
Yiqun Liu
327e1943fa Add uinttest for moe_ffn_wint2. (#3037)
Change-Id: Ifd452527eaf87ea96c3fa4fa9aeb17729b33c2de
2025-07-30 15:03:09 +08:00
Yuan Xiaolan
35935da9e5 support W4A8 EPLB (#3075) 2025-07-30 14:34:12 +08:00
Yzc216
159767717d [Feature] multi source download (#3072)
* multi-source download

* multi-source download

* huggingface download revision

* requirement

* style

* add revision arg

* test

* pre-commit

* Change default download

* change requirements.txt

* modify English Documentation

* documentation

* modify model download path
2025-07-30 14:10:13 +08:00
Zero Rains
4dc130c5a9 [Doc] add repetition early stopping doc (#3078)
* add repetition early stop doc

* add the early_stop.md
2025-07-29 22:01:57 -07:00
YuanRisheng
99a70fc722 unify parallel config (#3070) 2025-07-30 11:41:23 +08:00
lddfym
5ca684c762 update doc: load_balance.md (#3008)
* update doc of load_balance

* update doc: load_balance.md
2025-07-30 10:27:56 +08:00
Sunny-bot1
74aa31d15b [Feature] support bad_words (#3055)
* support bad_words

* support online infer bad_words

* update

* add CI test

* update

* update

* update

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-30 09:31:29 +08:00
Sunny-bot1
9c962343f2 [Docs] add sampling docs (#2973)
* add sampling docs

* add minp sampling docs

* update sample docs

* update

* update

* add bad words desc

* update
2025-07-30 02:24:16 +08:00
zhuzixuan
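ad7bb52a28 Fix error when max_tokens=1 is passed (#3068)
* Fix error when max_tokens=1 is passed

* Fix error when max_tokens=1 is passed

* Fix error when max_tokens=1 is passed

* Fix error when max_tokens=1 is passed

* Fix error when max_tokens=1 is passed

* Fix error when max_tokens=1 is passed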
2025-07-29 23:49:28 +08:00
Ryan
73cfe1fd37 [SOT] Extend SOT warmup support to new hardware (#3032)
* add new hardware

* add_sot_warmup4new_hardware

* fix conflict

* rm Optional
2025-07-29 22:45:20 +08:00
Zero Rains
b2f9a42d87 [Feature] Support repetition early stop (#3024)
* support repetition early stop and support user to set the parameter

* remove log

* fix codestyle

* add the early_stop_config to rollout_config

* update config and EarlyStopper class

* fix the bug for triton

* modify the stop method

* update description

* modify the usage for stop_flags

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2025-07-29 22:42:54 +08:00
Yuan Xiaolan
3214fb5393 support model loading for w4a8 offline quant (#3064)
Support loading offline-quantized weights for W4A8 EP
2025-07-29 21:54:37 +08:00
Longzhi Wang
be0a0f2bb2 fix arguement error in ep when pd (#3060) 2025-07-29 17:17:24 +08:00
YuanRisheng
502ee92a0a Unify server-side and model-side Config (Part3) (#3047)
* merge model config

* fix arch

* fix rl
2025-07-29 17:07:44 +08:00
Longzhi Wang
907d561523 fix ep when paddle version mismatch (#3056) 2025-07-29 15:06:49 +08:00
JYChen
dafe02a7b9 [stop sequence] support stop sequence (#3025)
* stop seqs in multi-ends

* unittest for gpu stop op

* kernel tid==0
2025-07-29 14:17:37 +08:00
YuanRisheng
1a815b7a2a Fix Speculative Config bug (#3049)
* fix speculative bug

* fix rl
2025-07-29 10:50:48 +08:00
yinwei
f2a528f9ae [XPU] Support kvblock centralized management (#3017) 2025-07-29 10:40:55 +08:00
Jiang-Jia-Jun
286802a070 Update ernie-4.5.md 2025-07-29 10:10:09 +08:00
Yuan Xiaolan
7d87aaace8 optimize w4a8 decoding (#3050) 2025-07-28 22:20:13 +08:00
lizhenyun01
e80ea8a71b remove Synchronize in hadamard 2025-07-28 19:22:46 +08:00
Yuan Xiaolan
b1d787a272 [fix] w4a8 model loading and hadamard config (#3013) 2025-07-28 18:17:59 +08:00
YUNSHEN XIE
c8bf8b3913 add logprob ci test (#3022)
* add logprob ci test
2025-07-28 17:30:58 +08:00
K11OntheBoat
83048bbe55 [Feature] Deepseekv3 supports cudagraph (#3041)
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-07-28 17:12:54 +08:00
AIbin
ec52d39e68 [Inference Optimize] Update wint2 weight n-dim reorder (#3042) 2025-07-28 16:31:56 +08:00
YuanRisheng
bddf403576 Unify server-side and model-side Config (Part2) (#3035)
* merge speculative and graph opt conifg

* add attr
2025-07-28 15:31:48 +08:00
yinwei
776fb03250 add error info (#3040) 2025-07-28 15:10:28 +08:00
YUNSHEN XIE
60311956e4 fix(ci): correct diff coverage data download URL (#3036) 2025-07-28 14:44:02 +08:00
lizhenyun01
238766e403 fix c4 prompt_cache 2025-07-28 14:31:37 +08:00
chen
01485cd28b MTP rejection_topp add topk input (#3031) 2025-07-28 13:58:45 +08:00
begin2023
dd877f38b1 [Perf] Remove unnecessary operations in non-cuda_graph (#3010)
* [Perf] Remove unnecessary operations in non-cuda_graph

* fix code logic

* use suggestion comment

* reduce function call

* reduce function call

* reduce function call

* reduce function call
2025-07-27 20:38:29 -07:00
Longzhi Wang
247010d298 fix arguement error (#3030) 2025-07-28 11:03:29 +08:00
YuanRisheng
6ccc10ad47 Unify server-side and model-side Config (Part1) (#3018)
* move cache config

* fix mtp
2025-07-28 10:51:52 +08:00
Yiqun Liu
8f426c1690 Optimize the performance of moe_expert_ffn_wint2 (#2990)
* Change wint2 to ColumnMajor.

Change-Id: I6b44d02946a685f8fe24d9f2c7be258b51e16da2

* Unify default_wint2x_mma.

Change-Id: I9e77b0e8e6cecab01fedc0b24b536ee0a1a89ff7

* Change wint2 to ColumnMajorTileInterleave.

Change-Id: I593cbe36f991c0c5044989d65f0014087587c624

* Enable async copy for B.

Change-Id: Ia3ac37ad162a8cf3ccce4f268e81bd06c8ac3c46

* Add wint2x Dequantizer

* Remove TileDequanterB related codes.

Change-Id: Id8e65703b72a8984d367f584ff41b7726017fbb8

* Implement FastInterleavedAndBiasedNumericArrayConverter for wint2.

Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca

* Implement Wint2ParamsAccessor to load extra quant params from global memory.

Change-Id: Ic3750cd9b767df8893501820880c3342a4b47233

* Implement FastInterleavedAndBiasedNumericArrayConverter for wint2.

Change-Id: I438f2b18ab964a04ae1cdb09d9e7d9f7b95eafca

* Use async copy for local_scale.

Change-Id: Ib882ba41c3d2354bda4d25b40e2408ad3b2f7893

* Check and correct the load and dequantize of weights.

Change-Id: Ie8dca505b39987144964fe6407d465b3b5953790

* Change for performance tuning.

Change-Id: I1da026fb1d1533a9d70350c7ba23c27e896cfc29

* Optimize the global memory access size of local_scale reading.

Change-Id: I4cbe3a2ef5951723d415c2d3252ce912394beaf5

* Specialize mma_tensor_op for wint2 to enable fine-grained pipeline.

Change-Id: Icbb4d48f90a41136f42d6ffff42d68de32f408da

* Minor fix.

Change-Id: I14d4ac9d267ee05442a3b47f00c26bee13d79e6f

* optimizing dequant performance with LOP3

* optimizing dequant performance with LOP3

* Avoid redundant dequantization of local_scale and use bf16 as computing type.

Change-Id: I63239ebc8f8e4a92d6281af59840ba50600b4334

* Add Multiplier and remove some logs.

Change-Id: Ifa199d81e6aeb472d2247c63f85ef30213684bcd

* optimizing dequant performance with LOP3

* Use __byte_perm to implement int8 to float32 conversion for performance improvement.

* Use lop3 to optimize the dequantize of local_scale.

Change-Id: I6189759970cb5b8dcbef769724784b8a7533b63c

* Minor fix and remove some logs.

Change-Id: I6279ba9926d5041093b1c6aea200acf2e4c49d46

* Fix stages for test.

Change-Id: I6f7b7cac612ef2c678e9d49f5ffa60eb53d3ae29

* Fix stages for test and add clock64 to profile.

Change-Id: Iffaf7324beaa910ce9ee56f47ae289de98f1a267

* Use __byte_perm to replace shift-and-or operations for faster integer merging.

* Split the uint2b convert.

Change-Id: I78da672ce8968e21f685285140ba546a161521b4

* Optimize convert of unscale.

Change-Id: I6795da1cdf5e8ab38ddaa9836240921b5312913a

* Minor optimization.

Change-Id: I1800aec34c3f4621abb02658208108f54da44d88

* Optimize mma pipeline and refine codes.

Change-Id: Id3075cf7b88f2813a11ccd1d3b49c62c978f36b8

* Add missing support.

Change-Id: Id65b7bc2c25fbb1a5b232c6bc9fb8c9093f691a8

* Accelerate FP16 dequantization performance

* Support tile shape as Xx64x64.

Change-Id: Ib8fd37e1ba1d06f7d11f2956e7f1367b0a92bcac

* Remove debugging codes and minor optimization.

Change-Id: I6b79bd56a6e8dd823efc169967ecd3cc9a43baf4

* Fix offset bug.

Change-Id: Id7aeb91e99d6f51836f2aff22187b4f79607395e

* Fix typo.

Change-Id: I19dde93fc1c1f7e19605905c90dc46298e203952

* Restore some codes and remove some debugging logs.

Change-Id: I8d44daf82ad1c6f8174134d195e7b3fe9a3afdfb

---------

Co-authored-by: baoqiwen <baoqiwen@baidu.com>
2025-07-28 10:32:43 +08:00
YUNSHEN XIE
fb410b5f4c Add unit test run and coverage report generation (#3011)
* Add unit test run and coverage report generation

* fix

* fix: upload coverage report failure

* fix

* update

* fix

* fix

* update
2025-07-27 22:48:34 +08:00
YUNSHEN XIE
1d29dd80f7 modified dockerfile (#3026)
2025-07-25 21:10:23 +08:00
李泳桦
69996a40da [feat] add disable_chat_template in chat api as a substitute for previous raw_request (#3020)
* [feat] add disable_chat_template in chat api as a substitute for previous raw_request

* [fix] pre-commit code check
2025-07-25 20:57:32 +08:00
Longzhi Wang
0700c90caa [Feat] support mixed ep (#2969)
* Support mixed ep

* fix comment

* fix comment

* update mixep

* fix conflict

* fix typo

* update

* fix typo

* fix code style

* fix conflict
2025-07-25 15:29:30 +08:00
chen
332154f504 [feature] Support FA2 (#3009) 2025-07-25 14:09:00 +08:00
YuBaoku
4b02b96467 [CI] fix codestyle_check (#3015) 2025-07-25 14:02:34 +08:00
EnflameGCU
8c167e130c [GCU] Update post_process (#3012) 2025-07-25 11:03:03 +08:00
EnflameGCU
7634ffb709 [GCU] Add CI (#3006) 2025-07-25 10:59:29 +08:00
Jiang-Jia-Jun
6ce3a8a497 Update index.md 2025-07-25 10:32:47 +08:00
249 changed files with 16017 additions and 3395 deletions

View File

@@ -2,7 +2,9 @@ name: Codestyle-Check
on:
pull_request:
branches: ["develop"]
branches:
- develop
- 'release/*'
jobs:
pre-commit:
@@ -11,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
env:
PR_ID: ${{ github.event.pull_request.number }}
BRANCH: develop
BRANCH: ${{ github.event.pull_request.base.ref }}
steps:
- name: Cleanup

View File

@@ -44,7 +44,7 @@ on:
value: ${{ jobs.fd-build.outputs.wheel_path }}
jobs:
fd-build:
runs-on: [self-hosted, GPU-h1z1-4Cards]
runs-on: [self-hosted, GPU-Build]
outputs:
wheel_path: ${{ steps.set_output.outputs.wheel_path }}
steps:
@@ -88,10 +88,10 @@ jobs:
run: |
set -x
runner_name="${{ runner.name }}"
CARD_ID=$(echo "${runner_name}" | cut -d'-' -f2)
CARD_ID=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
gpu_id=$(echo "$CARD_ID" | fold -w1 | paste -sd,)
CACHE_DIR=${CACHE_DIR:-${{ github.workspace }}}
CACHE_DIR="${CACHE_DIR:-$(dirname "$(dirname "${{ github.workspace }}")")}"
echo "CACHE_DIR is set to ${CACHE_DIR}"
if [ ! -f "${CACHE_DIR}/gitconfig" ]; then
touch "${CACHE_DIR}/gitconfig"
@@ -103,6 +103,7 @@ jobs:
-v $(pwd):/workspace -w /workspace \
-v "${CACHE_DIR}/gitconfig:/etc/gitconfig:ro" \
-v "${CACHE_DIR}/.cache:/root/.cache" \
-v "${CACHE_DIR}/.ccache:/root/.ccache" \
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
-e "COMPILE_ARCH=${compile_arch}" \
@@ -123,14 +124,12 @@ jobs:
echo "Date Only: $DATE_ONLY"
export FASTDEPLOY_VERSION="${FASTDEPLOY_VERSION}.dev${DATE_ONLY}"
fi
pip config set global.index-url http://pip.baidu.com/root/baidu/+simple/
pip config set install.trusted-host pip.baidu.com
pip config set global.extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install wheel
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/
# Build with RDMA enabled
export ENABLE_FD_RDMA=1
bash build.sh 1 python false [${COMPILE_ARCH}]
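
The hunk above replaces `cut -d'-' -f2` with `awk -F'-' '{print $NF}'` when extracting the card digits from the runner name. Below is a minimal sketch of why that matters, assuming a hypothetical runner name `GPU-Build-runner-0123` (the real naming scheme is not shown in this diff). Once the runner label contains extra dashes, the second field is no longer the card ID, while the last field still is. `gpu_ids_from_runner` is an illustrative helper, not something the workflow defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): turn the trailing digits of
# a runner name into a comma-separated device list.
gpu_ids_from_runner() {
  local runner_name="$1"
  local card_id
  # Take the last dash-separated field, as the updated workflow line does.
  card_id=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
  # Emit one character per line, then join the lines with commas.
  echo "${card_id}" | fold -w1 | paste -sd, -
}

# Assumed example name; real runner names may differ.
runner="GPU-Build-runner-0123"
echo "old: $(echo "${runner}" | cut -d'-' -f2)"   # second field only
echo "new: $(gpu_ids_from_runner "${runner}")"    # last field, split per digit
```

With the assumed name, the old expression yields `Build`, while the new one yields `0,1,2,3`, which is the shape the later `--gpus "device=..."` argument expects.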

View File

@@ -0,0 +1,177 @@
name: Run FastDeploy LogProb Tests
description: "Run FastDeploy LogProb Tests"
on:
workflow_call:
inputs:
DOCKER_IMAGE:
description: "Build Images"
required: true
type: string
default: "ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:cuda126-py310"
PADDLETEST_ARCHIVE_URL:
description: "URL of the compressed FastDeploy code archive."
required: true
type: string
default: "https://xly-devops.bj.bcebos.com/PaddleTest/PaddleTest.tar.gz"
FASTDEPLOY_WHEEL_URL:
description: "URL of the FastDeploy Wheel."
required: true
type: string
CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
MODEL_CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
jobs:
run_tests_logprob:
runs-on: [self-hosted, GPU-h20-1Cards]
steps:
- name: Code Prepare
shell: bash
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
paddletest_archive_url: ${{ inputs.PADDLETEST_ARCHIVE_URL }}
run: |
# Clean the repository directory before starting
docker run --rm --net=host -v $(pwd):/workspace -w /workspace \
-e "REPO_NAME=${REPO_NAME}" \
-e "BASE_BRANCH=${BASE_BRANCH}" \
${docker_image} /bin/bash -c '
rm -rf /workspace/*
'
wget -q ${paddletest_archive_url}
tar -xf PaddleTest.tar.gz
rm -rf PaddleTest.tar.gz
cd PaddleTest
git config --global user.name "FastDeployCI"
git config --global user.email "fastdeploy_ci@example.com"
git log -n 3 --oneline
- name: logprob test
shell: bash
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
fastdeploy_wheel_url: ${{ inputs.FASTDEPLOY_WHEEL_URL }}
CACHE_DIR: ${{ inputs.CACHE_DIR }}
MODEL_CACHE_DIR: ${{ inputs.MODEL_CACHE_DIR }}
run: |
runner_name="${{ runner.name }}"
CARD_ID=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
DEVICES=$(echo "$CARD_ID" | fold -w1 | paste -sd,)
DEVICE_PORT=$(echo "$DEVICES" | cut -d',' -f1)
FLASK_PORT=$((42068 + DEVICE_PORT * 100))
FD_API_PORT=$((42088 + DEVICE_PORT * 100))
FD_ENGINE_QUEUE_PORT=$((42058 + DEVICE_PORT * 100))
FD_METRICS_PORT=$((42078 + DEVICE_PORT * 100))
echo "Test ENV Parameter:"
echo "========================================================="
echo "FLASK_PORT=${FLASK_PORT}"
echo "FD_API_PORT=${FD_API_PORT}"
echo "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}"
echo "FD_METRICS_PORT=${FD_METRICS_PORT}"
echo "DEVICES=${DEVICES}"
echo "========================================================="
CACHE_DIR="${CACHE_DIR:-$(dirname "$(dirname "${{ github.workspace }}")")}"
echo "CACHE_DIR is set to ${CACHE_DIR}"
if [ ! -f "${CACHE_DIR}/gitconfig" ]; then
touch "${CACHE_DIR}/gitconfig"
fi
if [ ! -d "${MODEL_CACHE_DIR}" ]; then
echo "Error: MODEL_CACHE_DIR '${MODEL_CACHE_DIR}' does not exist."
exit 1
fi
PORTS=($FLASK_PORT $FD_API_PORT $FD_ENGINE_QUEUE_PORT $FD_METRICS_PORT)
LOG_FILE="./port_cleanup_$(date +%Y%m%d_%H%M%S).log"
echo "==== LOG_FILE is ${LOG_FILE} ===="
echo "==== PORT CLEAN BEFORE TASK RUN ====" | tee -a $LOG_FILE
for port in "${PORTS[@]}"; do
PIDS=$(lsof -t -i :$port || true)
if [ -n "$PIDS" ]; then
echo "Port $port is occupied by PID(s): $PIDS" | tee -a $LOG_FILE
echo "$PIDS" | xargs -r kill -9
echo "Port $port cleared" | tee -a $LOG_FILE
else
echo "Port $port is free" | tee -a $LOG_FILE
fi
done
echo "==== PORT CLEAN COMPLETE ====" | tee -a $LOG_FILE
echo "========================================================="
echo "Ensuring no stale container named ${runner_name} ..."
if [ "$(docker ps -a -q -f name=${runner_name})" ]; then
echo "Removing stale container: ${runner_name}"
docker rm -f ${runner_name} || true
fi
docker run --rm --ipc=host --pid=host --net=host \
--name ${runner_name} \
-v $(pwd):/workspace \
-w /workspace \
-e fastdeploy_wheel_url=${fastdeploy_wheel_url} \
-e "FD_API_PORT=${FD_API_PORT}" \
-e "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}" \
-e "FD_METRICS_PORT=${FD_METRICS_PORT}" \
-e "FLASK_PORT=${FLASK_PORT}" \
-v "${MODEL_CACHE_DIR}:/MODELDATA" \
-v "${CACHE_DIR}/gitconfig:/etc/gitconfig:ro" \
-v "${CACHE_DIR}/.cache:/root/.cache" \
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-e TZ="Asia/Shanghai" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -xc '
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install ${fastdeploy_wheel_url}
wget https://paddle-qa.bj.bcebos.com/zhengtianyu/tools/llm-deploy-linux-amd64
chmod +x ./llm-deploy-linux-amd64
./llm-deploy-linux-amd64 -python python3.10 \
-model_name ERNIE-4.5-0.3B-Paddle \
-model_path /MODELDATA \
--skip install
cd PaddleTest/framework/ServeTest
python3.10 deploy.py > dd.log 2>&1 &
sleep 3
curl -X POST http://0.0.0.0:${FLASK_PORT}/start \
-H "Content-Type: application/json" \
-d "{\"--model\": \"/MODELDATA/ERNIE-4.5-0.3B-Paddle\"}"
curl -X POST http://localhost:${FLASK_PORT}/wait_for_infer?timeout=90
set +e
rm -rf ./baseline_output
cp -r baseline/ERNIE-4.5-0.3B-Paddle ./baseline_output
LOGPROB_EXIT_CODE=0
python3.10 lanucher.py --request_template TOKEN_LOGPROB --url http://localhost:${FD_API_PORT}/v1/chat/completions --case ./cases/demo.yaml --concurrency 1 --name demo --exe logprob || LOGPROB_EXIT_CODE=$?
echo "LOGPROB_EXIT_CODE=${LOGPROB_EXIT_CODE}" > /workspace/exit_code.env
curl -X POST http://localhost:${FLASK_PORT}/stop
sleep 10s
cat *result.log
exit 0
'
if [ $? -ne 0 ];then
exit 1
fi
if [ -f exit_code.env ]; then
cat exit_code.env >> $GITHUB_ENV
fi
- name: logprob test result
if: ${{ env.LOGPROB_EXIT_CODE != 0 }}
shell: bash
run: |
echo "logprob test failed with exit code ${{ env.LOGPROB_EXIT_CODE }}"
exit 8
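
The test step above runs the launcher under `set +e`, writes its status to `exit_code.env`, and later appends that file to `$GITHUB_ENV` so that a separate step can fail the job. A rough sketch of the same pattern outside GitHub Actions follows. `run_and_record`, the temporary files, and the failing `sh -c 'exit 3'` command are illustrative stand-ins, not part of the workflow.

```shell
#!/usr/bin/env bash
# Stand-in for $GITHUB_ENV so the sketch also runs outside GitHub Actions.
GITHUB_ENV="${GITHUB_ENV:-$(mktemp)}"

# Hypothetical helper: run a command, capture its exit status without
# aborting the script, and store it as KEY=VALUE in the given file.
run_and_record() {
  local env_file="$1"
  shift
  local exit_code=0
  "$@" || exit_code=$?
  echo "LOGPROB_EXIT_CODE=${exit_code}" > "${env_file}"
}

env_file="$(mktemp)"
run_and_record "${env_file}" sh -c 'exit 3'   # simulated failing test run

# Hand the recorded status to later steps, as the workflow does.
cat "${env_file}" >> "${GITHUB_ENV}"
cat "${env_file}"
```

Recording the status in a file instead of exiting immediately lets the step stop the service and print logs before the job is marked as failed.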

148
.github/workflows/_pre_ce_test.yml vendored Normal file
View File

@@ -0,0 +1,148 @@
name: Pre-CE-Test
on:
workflow_call:
inputs:
DOCKER_IMAGE:
description: "Build Images"
required: true
type: string
default: "ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:fastdeploy-ciuse-cuda126"
FASTDEPLOY_ARCHIVE_URL:
description: "URL of the compressed FastDeploy code archive."
required: true
type: string
FASTDEPLOY_WHEEL_URL:
description: "URL of the FastDeploy Wheel."
required: true
type: string
CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
MODEL_CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
concurrency:
group: ${{ github.event.pull_request.number }}
cancel-in-progress: true
jobs:
run_ce_cases:
runs-on: [self-hosted, PRE_CE_RUN_2Card]
steps:
- name: Print current runner name
run: |
echo "Current runner name: ${{ runner.name }}"
- name: Code Prepare
shell: bash
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
fd_archive_url: ${{ inputs.FASTDEPLOY_ARCHIVE_URL }}
run: |
set -x
REPO="https://github.com/${{ github.repository }}.git"
FULL_REPO="${{ github.repository }}"
REPO_NAME="${FULL_REPO##*/}"
BASE_BRANCH="${{ github.base_ref }}"
# Clean the repository directory before starting
docker run --rm --net=host -v $(pwd):/workspace -w /workspace \
-e "REPO_NAME=${REPO_NAME}" \
${docker_image} /bin/bash -c '
if [ -d ${REPO_NAME} ]; then
echo "Directory ${REPO_NAME} exists, removing it..."
rm -rf ${REPO_NAME}*
fi
'
wget -q ${fd_archive_url}
tar -xf FastDeploy.tar.gz
rm -rf FastDeploy.tar.gz
cd FastDeploy
git config --global user.name "FastDeployCI"
git config --global user.email "fastdeploy_ci@example.com"
git log -n 3 --oneline
- name: Run CI unittest
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
fd_wheel_url: ${{ inputs.FASTDEPLOY_WHEEL_URL }}
CACHE_DIR: ${{ inputs.CACHE_DIR }}
MODEL_CACHE_DIR: ${{ inputs.MODEL_CACHE_DIR }}
run: |
runner_name="${{ runner.name }}"
CARD_ID=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
DEVICES=$(echo "$CARD_ID" | fold -w1 | paste -sd,)
DEVICE_PORT=$(echo "$DEVICES" | cut -d',' -f1)
FLASK_PORT=$((42068 + DEVICE_PORT * 100))
FD_API_PORT=$((42088 + DEVICE_PORT * 100))
FD_ENGINE_QUEUE_PORT=$((42058 + DEVICE_PORT * 100))
FD_METRICS_PORT=$((42078 + DEVICE_PORT * 100))
echo "Test ENV Parameter:"
echo "========================================================="
echo "FLASK_PORT=${FLASK_PORT}"
echo "FD_API_PORT=${FD_API_PORT}"
echo "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}"
echo "FD_METRICS_PORT=${FD_METRICS_PORT}"
echo "DEVICES=${DEVICES}"
echo "========================================================="
CACHE_DIR="${CACHE_DIR:-$(dirname "$(dirname "${{ github.workspace }}")")}"
echo "CACHE_DIR is set to ${CACHE_DIR}"
if [ ! -f "${CACHE_DIR}/gitconfig" ]; then
touch "${CACHE_DIR}/gitconfig"
fi
PORTS=($FLASK_PORT $FD_API_PORT $FD_ENGINE_QUEUE_PORT $FD_METRICS_PORT)
LOG_FILE="./port_cleanup_$(date +%Y%m%d_%H%M%S).log"
echo "==== LOG_FILE is ${LOG_FILE} ===="
echo "==== PORT CLEAN BEFORE TASK RUN ====" | tee -a $LOG_FILE
for port in "${PORTS[@]}"; do
PIDS=$(lsof -t -i :$port || true)
if [ -n "$PIDS" ]; then
echo "Port $port is occupied by PID(s): $PIDS" | tee -a $LOG_FILE
echo "$PIDS" | xargs -r kill -9
echo "Port $port cleared" | tee -a $LOG_FILE
else
echo "Port $port is free" | tee -a $LOG_FILE
fi
done
echo "==== PORT CLEAN COMPLETE ====" | tee -a $LOG_FILE
echo "========================================================="
echo "Ensuring no stale container named ${runner_name} ..."
if [ "$(docker ps -a -q -f name=${runner_name})" ]; then
echo "Removing stale container: ${runner_name}"
docker rm -f ${runner_name} || true
fi
docker run --rm --net=host \
--name ${runner_name} \
-v $(pwd):/workspace \
-w /workspace \
-v "${CACHE_DIR}/gitconfig:/etc/gitconfig:ro" \
-v "${CACHE_DIR}/.cache:/root/.cache" \
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-v "${MODEL_CACHE_DIR}:/ModelData:ro" \
-e "MODEL_PATH=/ModelData" \
-e "FD_API_PORT=${FD_API_PORT}" \
-e "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}" \
-e "FD_METRICS_PORT=${FD_METRICS_PORT}" \
-e "FLASK_PORT=${FLASK_PORT}" \
-e "fd_wheel_url=${fd_wheel_url}" \
--gpus "\"device=${DEVICES}\"" ${docker_image} /bin/bash -c '
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install ${fd_wheel_url}
bash scripts/run_pre_ce.sh
'
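
The workflow above derives four service ports from the first device ID, offsetting each base port by 100 per device so that jobs pinned to different cards on one host do not collide. The arithmetic can be sketched as below. `derive_ports` is a hypothetical wrapper around the same expressions, and the `2,3` device list is an assumed input.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the port arithmetic used in the workflow.
derive_ports() {
  local devices="$1"
  local device_port
  # The first device ID in the comma-separated list selects the port block.
  device_port=$(echo "${devices}" | cut -d',' -f1)
  echo "FLASK_PORT=$((42068 + device_port * 100))"
  echo "FD_API_PORT=$((42088 + device_port * 100))"
  echo "FD_ENGINE_QUEUE_PORT=$((42058 + device_port * 100))"
  echo "FD_METRICS_PORT=$((42078 + device_port * 100))"
}

derive_ports "2,3"
```

For devices `2,3` this prints ports 42268, 42288, 42258, and 42278. A job on devices `0,1` would use 42068, 42088, 42058, and 42078.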

View File

@@ -0,0 +1,282 @@
name: Run FastDeploy Unit Tests and Coverage
description: "Run FastDeploy Unit Tests and Coverage"
on:
workflow_call:
inputs:
DOCKER_IMAGE:
description: "Build Images"
required: true
type: string
default: "ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:cuda126-py310"
FASTDEPLOY_ARCHIVE_URL:
description: "URL of the compressed FastDeploy code archive."
required: true
type: string
FASTDEPLOY_WHEEL_URL:
description: "URL of the FastDeploy Wheel."
required: true
type: string
CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
MODEL_CACHE_DIR:
description: "Cache Dir Use"
required: false
type: string
default: ""
jobs:
run_tests_with_coverage:
runs-on: [self-hosted, GPU-h1z1-2Cards]
outputs:
diff_cov_file_url: ${{ steps.cov_upload.outputs.diff_cov_file_url }}
unittest_failed_url: ${{ steps.cov_upload.outputs.unittest_failed_url }}
diff_cov_result_json_url: ${{ steps.cov_upload.outputs.diff_cov_result_json_url }}
steps:
- name: Code Prepare
shell: bash
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
fd_archive_url: ${{ inputs.FASTDEPLOY_ARCHIVE_URL }}
run: |
set -x
REPO="https://github.com/${{ github.repository }}.git"
FULL_REPO="${{ github.repository }}"
REPO_NAME="${FULL_REPO##*/}"
BASE_BRANCH="${{ github.base_ref }}"
# Clean the repository directory before starting
docker run --rm --net=host -v $(pwd):/workspace -w /workspace \
-e "REPO_NAME=${REPO_NAME}" \
${docker_image} /bin/bash -c '
if [ -d ${REPO_NAME} ]; then
echo "Directory ${REPO_NAME} exists, removing it..."
rm -rf ${REPO_NAME}*
fi
'
wget -q ${fd_archive_url}
tar -xf FastDeploy.tar.gz
rm -rf FastDeploy.tar.gz
cd FastDeploy
git config --global user.name "FastDeployCI"
git config --global user.email "fastdeploy_ci@example.com"
git log -n 3 --oneline
- name: Run FastDeploy Unit Tests and Coverage
shell: bash
env:
docker_image: ${{ inputs.DOCKER_IMAGE }}
fd_wheel_url: ${{ inputs.FASTDEPLOY_WHEEL_URL }}
CACHE_DIR: ${{ inputs.CACHE_DIR }}
BASE_REF: ${{ github.event.pull_request.base.ref }}
MODEL_CACHE_DIR: ${{ inputs.MODEL_CACHE_DIR }}
run: |
set -x
runner_name="${{ runner.name }}"
CARD_ID=$(echo "${runner_name}" | awk -F'-' '{print $NF}')
DEVICES=$(echo "$CARD_ID" | fold -w1 | paste -sd,)
DEVICE_PORT=$(echo "$DEVICES" | cut -d',' -f1)
FLASK_PORT=$((42068 + DEVICE_PORT * 100))
FD_API_PORT=$((42088 + DEVICE_PORT * 100))
FD_ENGINE_QUEUE_PORT=$((42058 + DEVICE_PORT * 100))
FD_METRICS_PORT=$((42078 + DEVICE_PORT * 100))
echo "Test ENV Parameter:"
echo "========================================================="
echo "FLASK_PORT=${FLASK_PORT}"
echo "FD_API_PORT=${FD_API_PORT}"
echo "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}"
echo "FD_METRICS_PORT=${FD_METRICS_PORT}"
echo "DEVICES=${DEVICES}"
echo "========================================================="
CACHE_DIR="${CACHE_DIR:-$(dirname "$(dirname "${{ github.workspace }}")")}"
echo "CACHE_DIR is set to ${CACHE_DIR}"
if [ ! -f "${CACHE_DIR}/gitconfig" ]; then
touch "${CACHE_DIR}/gitconfig"
fi
PORTS=($FLASK_PORT $FD_API_PORT $FD_ENGINE_QUEUE_PORT $FD_METRICS_PORT)
LOG_FILE="./port_cleanup_$(date +%Y%m%d_%H%M%S).log"
echo "==== LOG_FILE is ${LOG_FILE} ===="
echo "==== PORT CLEAN BEFORE TASK RUN ====" | tee -a $LOG_FILE
for port in "${PORTS[@]}"; do
PIDS=$(lsof -t -i :$port || true)
if [ -n "$PIDS" ]; then
echo "Port $port is occupied by PID(s): $PIDS" | tee -a $LOG_FILE
echo "$PIDS" | xargs -r kill -9
echo "Port $port cleared" | tee -a $LOG_FILE
else
echo "Port $port is free" | tee -a $LOG_FILE
fi
done
echo "==== PORT CLEAN COMPLETE ====" | tee -a $LOG_FILE
echo "========================================================="
echo "Ensuring no stale container named ${runner_name} ..."
if [ "$(docker ps -a -q -f name=${runner_name})" ]; then
echo "Removing stale container: ${runner_name}"
docker rm -f ${runner_name} || true
fi
docker run --rm --net=host \
--name ${runner_name} \
--cap-add=SYS_PTRACE --shm-size=64G \
-v $(pwd):/workspace -w /workspace \
-v "${CACHE_DIR}/gitconfig:/etc/gitconfig:ro" \
-v "${CACHE_DIR}/.cache:/root/.cache" \
-v "${CACHE_DIR}/ConfigDir:/root/.config" \
-v "${MODEL_CACHE_DIR}:/ModelData:ro" \
-e "MODEL_PATH=/ModelData" \
-e "FD_API_PORT=${FD_API_PORT}" \
-e "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}" \
-e "FD_METRICS_PORT=${FD_METRICS_PORT}" \
-e "FLASK_PORT=${FLASK_PORT}" \
-e TZ="Asia/Shanghai" \
-e "fd_wheel_url=${fd_wheel_url}" \
-e "BASE_REF=${BASE_REF}" \
--gpus "\"device=${DEVICES}\"" ${docker_image} /bin/bash -c '
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
pip config set global.extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install coverage
python -m pip install diff-cover
python -m pip install ${fd_wheel_url}
if [ -d "test/plugins" ]; then
cd test/plugins
python setup.py install
cd ../..
else
echo "Warning: test/plugins directory not found, skipping setup.py install"
fi
export COVERAGE_FILE=/workspace/FastDeploy/coveragedata/.coverage
export COVERAGE_RCFILE=/workspace/FastDeploy/scripts/.coveragerc
TEST_EXIT_CODE=0
bash scripts/coverage_run.sh || TEST_EXIT_CODE=8
git diff origin/${BASE_REF}..HEAD --unified=0 > diff.txt
echo "TEST_EXIT_CODE=${TEST_EXIT_CODE}" >> exit_code.env
coverage combine coveragedata/
coverage xml -o python_coverage_all.xml
COVERAGE_EXIT_CODE=0
diff-cover python_coverage_all.xml --diff-file=diff.txt --fail-under=80 --json-report diff_coverage.json || COVERAGE_EXIT_CODE=9
echo "COVERAGE_EXIT_CODE=${COVERAGE_EXIT_CODE}" >> exit_code.env
python scripts/generate_diff_coverage_xml.py diff.txt python_coverage_all.xml
'
if [ -f FastDeploy/exit_code.env ]; then
cat FastDeploy/exit_code.env >> $GITHUB_ENV
fi
- name: Upload unit test results and diff coverage to BOS
id: cov_upload
shell: bash
run: |
cd FastDeploy
commit_id=${{ github.event.pull_request.head.sha }}
pr_num=${{ github.event.pull_request.number }}
target_path=paddle-github-action/PR/FastDeploy/${pr_num}/${commit_id}/SM${compile_arch//,/_}
wget -q --no-proxy --no-check-certificate https://paddle-qa.bj.bcebos.com/CodeSync/develop/PaddlePaddle/PaddleTest/tools/bos_tools.py
push_file=$(realpath bos_tools.py)
python -m pip install bce-python-sdk==0.9.29
diff_cov_file="diff_coverage.xml"
if [ -f ${diff_cov_file} ];then
python ${push_file} ${diff_cov_file} ${target_path}/CoverageData
target_path_stripped="${target_path#paddle-github-action/}"
DIFF_COV_FILE_URL=https://paddle-github-action.bj.bcebos.com/${target_path_stripped}/CoverageData/${diff_cov_file}
echo "diff_cov_file_url=${DIFF_COV_FILE_URL}" >> $GITHUB_OUTPUT
echo "diff_cov_file_url=${DIFF_COV_FILE_URL}" >> $GITHUB_ENV
fi
diff_cov_result_json="diff_coverage.json"
if [ -f ${diff_cov_result_json} ];then
python ${push_file} ${diff_cov_result_json} ${target_path}/CoverageData
target_path_stripped="${target_path#paddle-github-action/}"
DIFF_COV_JSON_URL=https://paddle-github-action.bj.bcebos.com/${target_path_stripped}/CoverageData/${diff_cov_result_json}
echo "diff_cov_result_json_url=${DIFF_COV_JSON_URL}" >> $GITHUB_OUTPUT
echo "diff_cov_result_json_url=${DIFF_COV_JSON_URL}" >> $GITHUB_ENV
fi
unittest_result="test/failed_tests.log"
if [ -s ${unittest_result} ];then
python ${push_file} ${unittest_result} ${target_path}/UnitTestResult
target_path_stripped="${target_path#paddle-github-action/}"
UNIT_TEST_RESULT_URL=https://paddle-github-action.bj.bcebos.com/${target_path_stripped}/UnitTestResult/${unittest_result}
echo "unittest_failed_url=${UNIT_TEST_RESULT_URL}" >> $GITHUB_OUTPUT
echo "unittest_failed_url=${UNIT_TEST_RESULT_URL}" >> $GITHUB_ENV
fi
- name: Check Unit Test Success
shell: bash
run: |
cd FastDeploy
if [ "$TEST_EXIT_CODE" -eq 8 ]; then
filename=$(basename "$unittest_failed_url")
if [ -z "${unittest_failed_url}" ]; then
echo "No diff unit failed file URL provided."
else
rm -rf "${filename}"
wget -O ${filename} ${unittest_failed_url} || echo "Download unittest file failed, but continuing..."
fi
echo "Unit tests failed (exit code 8)"
if [ -f "${filename}" ];then
echo "Failed test cases:"
cat "${filename}"
fi
exit "$TEST_EXIT_CODE"
fi
echo "All tests passed"
- name: Verify Code Coverage Threshold (80%)
shell: bash
run: |
cd FastDeploy
if [ "$COVERAGE_EXIT_CODE" -eq 9 ]; then
echo "Diff coverage check failed (exit code 9)"
filename=$(basename "$diff_cov_result_json_url")
if [ -z "${diff_cov_result_json_url}" ]; then
echo "No diff cov result file URL provided."
else
rm -rf "${filename}"
wget -O ${filename} ${diff_cov_result_json_url} || echo "Download cov json file failed, but continuing..."
fi
if [ -f "${filename}" ];then
echo "Diff coverage result:"
if command -v jq >/dev/null 2>&1; then
jq . "${filename}"
else
cat "${filename}"
fi
fi
exit "$COVERAGE_EXIT_CODE"
fi
echo "coverage passed"
exit 0
diff_coverage_report:
needs: run_tests_with_coverage
if: always()
runs-on: ubuntu-latest
steps:
- name: coverage diff file download
shell: bash
env:
diff_cov_file_url: ${{ needs.run_tests_with_coverage.outputs.diff_cov_file_url }}
run: |
if [ -z "${diff_cov_file_url}" ]; then
echo "No diff coverage file URL provided."
exit 0
fi
wget "${diff_cov_file_url}" -O ./diff_coverage.xml || echo "Download cov file failed, but continuing..."
- name: Upload diff coverage report
if: ${{ needs.run_tests_with_coverage.outputs.diff_cov_file_url != null && needs.run_tests_with_coverage.outputs.diff_cov_file_url != '' }}
uses: codecov/codecov-action@v5
with:
files: ./diff_coverage.xml
name: python diff coverage
verbose: true
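
The upload step in this workflow builds public report URLs by stripping the bucket name from `target_path` with the `${var#prefix}` expansion. The sketch below covers only that string handling. The PR number and commit value are made up, `build_report_url` is a hypothetical helper, and nothing here contacts BOS. As far as this diff shows, `compile_arch` does not appear to be set in that step, so the `SM...` suffix would expand to a bare `SM`, and the sketch uses that form.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an upload path to the URL the workflow echoes.
build_report_url() {
  local target_path="$1" subdir="$2" file="$3"
  # Drop the leading bucket name; the bucket is part of the host name instead.
  local stripped="${target_path#paddle-github-action/}"
  echo "https://paddle-github-action.bj.bcebos.com/${stripped}/${subdir}/${file}"
}

# Made-up PR number (1234) and commit id (abc123) for illustration only.
target_path="paddle-github-action/PR/FastDeploy/1234/abc123/SM"
build_report_url "${target_path}" CoverageData diff_coverage.xml
```

The `#` form removes the prefix only when it matches at the start of the value, so a path without the bucket prefix passes through unchanged.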

39
.github/workflows/approve.yml vendored Normal file
View File

@@ -0,0 +1,39 @@
name: Approval
on:
pull_request:
branches:
- develop
- 'release/*'
jobs:
Approval:
name: Approval
if: ${{ github.repository_owner == 'PaddlePaddle' }}
runs-on: ubuntu-latest
env:
PR_ID: ${{ github.event.pull_request.number }}
BRANCH: ${{ github.event.pull_request.base.ref }}
steps:
- name: Checkout base repo
uses: actions/checkout@v4
with:
ref: ${{ github.event.pull_request.base.ref }}
fetch-depth: 1000
- name: Merge PR to test branch
run: |
git fetch origin pull/${PR_ID}/merge
git checkout -b test FETCH_HEAD
git log -n 3 --oneline
git remote add upstream https://github.com/PaddlePaddle/FastDeploy.git
git fetch upstream $BRANCH
- name: Setup python3.10
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Run approval check script
run: |
bash scripts/check_approval.sh

View File

@@ -1,4 +1,4 @@
name: CI
name: CI_GCU
on:
pull_request:
@@ -8,23 +8,20 @@ on:
workflow_dispatch:
concurrency:
group: ${{ github.event.pull_request.number }}
group: ${{ github.event.pull_request.number }}-gcu-ci
cancel-in-progress: true
jobs:
build:
runs-on: [self-hosted, GPU-L20-4Card]
CI_GCU:
runs-on: [self-hosted, GCU-S60-8Card]
steps:
- name: Print current runner name
run: |
echo "Current runner name: ${{ runner.name }}"
# Because the system version is lower than 2.23, the checkout cannot be used.
# - name: Checkout code
# uses: actions/checkout@v4
- name: Code Checkout
env:
docker_image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:fastdeploy-ciuse-cuda126
docker_image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84
run: |
REPO="https://github.com/${{ github.repository }}.git"
FULL_REPO="${{ github.repository }}"
@@ -55,35 +52,38 @@ jobs:
- name: Run CI unittest
env:
docker_image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:fastdeploy-ciuse-cuda126
docker_image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-gcu:topsrider3.5.102-ubuntu20-x86_64-gcc84
run: |
runner_name="${{ runner.name }}"
last_char="${runner_name: -1}"
if [ "${last_char}" = "1" ]; then
gpu_id=2
DEVICES="2,3"
if [[ "$last_char" =~ [0-3] ]]; then
gcu_id="$last_char"
else
gpu_id=0
DEVICES="0,1"
gcu_id="0"
fi
FD_API_PORT=$((9180 + gpu_id * 100))
FD_ENGINE_QUEUE_PORT=$((9150 + gpu_id * 100))
FD_METRICS_PORT=$((9170 + gpu_id * 100))
FD_API_PORT=$((9180 + gcu_id * 100))
FD_ENGINE_QUEUE_PORT=$((9150 + gcu_id * 100))
FD_METRICS_PORT=$((9170 + gcu_id * 100))
PARENT_DIR=$(dirname "$WORKSPACE")
echo "PARENT_DIR:$PARENT_DIR"
docker run --rm --net=host -v $(pwd):/workspace -w /workspace \
-v "/ssd4/GithubActions/gitconfig:/etc/gitconfig:ro" \
-v "/ssd4/GithubActions/ModelData:/ModelData:ro" \
-v "/ssd4/GithubActions/CacheDir:/root/.cache" \
-v "/ssd4/GithubActions/ConfigDir:/root/.config" \
-e "MODEL_PATH=/ModelData" \
echo "Install drivers..."
cd /work/deps
bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
cd -
docker run --rm --network=host --ipc=host -it --privileged \
-v $(pwd):/workspace -w /workspace \
-v "/home:/home" \
-v "/work:/work" \
-e "MODEL_PATH=/work/models" \
-e "http_proxy=$(git config --global --get http.proxy)" \
-e "https_proxy=$(git config --global --get https.proxy)" \
-e "FD_API_PORT=${FD_API_PORT}" \
-e "FD_ENGINE_QUEUE_PORT=${FD_ENGINE_QUEUE_PORT}" \
-e "FD_METRICS_PORT=${FD_METRICS_PORT}" \
--gpus '"device='"${DEVICES}"'"' ${docker_image} /bin/bash -c "
${docker_image} /bin/bash -c "
git config --global --add safe.directory /workspace/FastDeploy
cd FastDeploy
bash scripts/run_ci.sh
bash scripts/run_ci_gcu.sh
"

View File

@@ -13,7 +13,7 @@ concurrency:
jobs:
CI_XPU:
runs-on: [self-hosted, XPU-P800-8Card]
runs-on: [self-hosted, XPU-P800-8Card-release]
steps:
- name: Print current runner name
run: |

View File

@@ -15,7 +15,7 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: 3.x
- run: pip install mkdocs-material mkdocs-get-deps mkdocs-material-extensions mkdocs-multilang
- run: pip install mkdocs-material mkdocs-get-deps mkdocs-material-extensions mkdocs-multilang mkdocs-static-i18n
- name: Deploy to GitHub Pages
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -21,7 +21,7 @@ jobs:
with:
DOCKER_IMAGE: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:cuda126-py310
FASTDEPLOY_ARCHIVE_URL: ${{ needs.clone.outputs.repo_archive_url }}
COMPILE_ARCH: "90"
COMPILE_ARCH: "89,90"
WITH_NIGHTLY_BUILD: "OFF"
FD_VERSION: "0.0.0"
@@ -33,3 +33,33 @@ jobs:
- name: Print wheel path
run: |
echo "The built wheel is located at: ${{ needs.build.outputs.wheel_path }}"
unittest_coverage:
name: Run FastDeploy Unit Tests and Coverage
needs: [clone,build]
uses: ./.github/workflows/_unit_test_coverage.yml
with:
DOCKER_IMAGE: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:fastdeploy-ciuse-cuda126-dailyupdate
FASTDEPLOY_ARCHIVE_URL: ${{ needs.clone.outputs.repo_archive_url }}
FASTDEPLOY_WHEEL_URL: ${{ needs.build.outputs.wheel_path }}
MODEL_CACHE_DIR: "/ssd2/actions-runner/ModelData"
logprob_test:
name: Run FastDeploy LogProb Tests
needs: [build]
uses: ./.github/workflows/_logprob_test_linux.yml
with:
DOCKER_IMAGE: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:fastdeploy-ciuse-cuda126-dailyupdate
PADDLETEST_ARCHIVE_URL: "https://xly-devops.bj.bcebos.com/PaddleTest/PaddleTest.tar.gz"
FASTDEPLOY_WHEEL_URL: ${{ needs.build.outputs.wheel_path }}
MODEL_CACHE_DIR: "/ssd2/actions-runner/ModelData"
pre_ce_test:
name: Extracted partial CE model tasks to run in CI.
needs: [clone,build]
uses: ./.github/workflows/_pre_ce_test.yml
with:
DOCKER_IMAGE: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:fastdeploy-ciuse-cuda126-dailyupdate
FASTDEPLOY_ARCHIVE_URL: ${{ needs.clone.outputs.repo_archive_url }}
FASTDEPLOY_WHEEL_URL: ${{ needs.build.outputs.wheel_path }}
MODEL_CACHE_DIR: "/ssd2/actions-runner/ModelData"

View File

@@ -1,3 +1,4 @@
English | [简体中文](README_CN.md)
<p align="center">
<a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
</p>
@@ -22,11 +23,10 @@
</p>
--------------------------------------------------------------------------------
# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
## News
**[2025-07] 《FastDeploy2.0推理部署实测》专题活动已上线!** 完成文心4.5系列开源模型的推理部署等任务即可获得骨瓷马克杯等FastDeploy2.0官方周边及丰富奖金!🎁 欢迎大家体验反馈~ 📌[报名地址](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[活动详情](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
**[2025-08] 🔥 Released FastDeploy v2.1:** This release introduces a brand-new KV Cache scheduling strategy and expands support for PD separation and CUDA Graph to more models. It also adds enhanced hardware support for platforms such as Kunlun and Hygon, along with comprehensive optimizations that improve the performance of both the service and the inference engine.
**[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live!** Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
@@ -50,14 +50,15 @@
## Installation
FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, and other hardware. For detailed installation instructions:
FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, **Hygon DCUs** and other hardware. For detailed installation instructions:
- [NVIDIA GPU](./docs/get_started/installation/nvidia_gpu.md)
- [Kunlunxin XPU](./docs/get_started/installation/kunlunxin_xpu.md)
- [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
- [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
- [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU, Hygon DCU, and MetaX GPU are currently under development and testing. Stay tuned for updates!
**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
## Get Started
@@ -68,18 +69,19 @@ Learn how to use FastDeploy through our documentation:
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
- [Best Practices](./docs/best_practices/README.md)
## Supported Models
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅(WINT4)| WIP |128K |
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|✅(WINT4)| WIP | 128K |
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | ✅ | ✅ | WIP | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅| 128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | | ✅ | ✅ | | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | | ✅ | ✅ | ❌ | ✅| 128K |
## Advanced Usage

README_CN.md Normal file

@@ -0,0 +1,94 @@
[English](README.md) | 简体中文
<p align="center">
<a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
</p>
<p align="center">
<a href=""><img src="https://img.shields.io/badge/python-3.10-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/os-linux-pink.svg"></a>
<a href="https://github.com/PaddlePaddle/FastDeploy/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/FastDeploy?color=9ea"></a>
<a href="https://github.com/PaddlePaddle/FastDeploy/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/FastDeploy?color=3af"></a>
<a href="https://github.com/PaddlePaddle/FastDeploy/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/FastDeploy?color=9cc"></a>
<a href="https://github.com/PaddlePaddle/FastDeploy/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/FastDeploy?color=ccf"></a>
</p>
<p align="center">
<a href="https://trendshift.io/repositories/4046" target="_blank"><img src="https://trendshift.io/api/badge/repositories/4046" alt="PaddlePaddle%2FFastDeploy | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a></br>
<a href="https://paddlepaddle.github.io/FastDeploy/zh/get_started/installation/nvidia_gpu/"><b> Installation </b></a>
|
<a href="https://paddlepaddle.github.io/FastDeploy/zh/get_started/quick_start"><b> Quick Start </b></a>
|
<a href="https://paddlepaddle.github.io/FastDeploy/zh/supported_models/"><b> Supported Models </b></a>
</p>
--------------------------------------------------------------------------------
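# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs Based on PaddlePaddle
## News
**[2025-08] 🔥 FastDeploy v2.1 released:** A new KV Cache scheduling strategy, PD disaggregation and CUDA Graph support for more models, enhanced support for more hardware such as Kunlunxin and Hygon, and across-the-board performance optimizations for the serving and inference engines.
**[2025-07] The "FastDeploy 2.0 Inference Deployment Hands-On" event is now live!** Complete tasks such as deploying the open-source ERNIE 4.5 series models for inference to win official FastDeploy 2.0 merchandise, including bone-china mugs, plus generous prizes! 🎁 We welcome you to try it out and share your feedback. 📌[Sign up](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)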
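## About
**FastDeploy** is an inference and deployment toolkit for large language models (LLMs) and visual language models (VLMs) based on PaddlePaddle. It provides **production-grade, out-of-the-box deployment solutions** with the following core technical features:
- 🚀 **Load-balanced PD disaggregation**: An industrial-grade solution that supports context caching and dynamic instance role switching, optimizing resource utilization while meeting SLOs and sustaining throughput.
- 🔄 **Unified KV cache transmission**: A lightweight, high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 **OpenAI API server, vLLM compatible**: Single-command deployment, compatible with the [vLLM](https://github.com/vllm-project/vllm/) interface.
- 🧮 **Comprehensive quantization format support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- **Advanced acceleration techniques**: Speculative decoding, Multi-Token Prediction (MTP), and chunked prefill.
- 🖥️ **Multi-hardware support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, and more.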
## Requirements
- OS: Linux
- Python: 3.10 ~ 3.12
## Installation
FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, **Hygon DCUs**, and other hardware. Detailed installation instructions:
- [NVIDIA GPU](./docs/zh/get_started/installation/nvidia_gpu.md)
- [Kunlunxin XPU](./docs/zh/get_started/installation/kunlunxin_xpu.md)
- [Iluvatar CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md)
- [Enflame S60](./docs/zh/get_started/installation/Enflame_gcu.md)
- [Hygon DCU](./docs/zh/get_started/installation/hygon_dcu.md)
**Note:** We are actively expanding hardware support. Additional platforms, including Ascend NPU and MetaX GPU, are currently under development and testing. Stay tuned for updates!
## Get Started
Learn how to use FastDeploy through our documentation:
- [10-Minute Quick Deployment](./docs/zh/get_started/quick_start.md)
- [ERNIE-4.5 Deployment](./docs/zh/get_started/ernie-4.5.md)
- [ERNIE-4.5-VL Deployment](./docs/zh/get_started/ernie-4.5-vl.md)
- [Offline Inference](./docs/zh/offline_inference.md)
- [Online Serving](./docs/zh/online_serving/README.md)
- [Supported Models List](./docs/zh/supported_models.md)
- [Best Practices](./docs/zh/best_practices/README.md)
## Supported Models
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅| 128K |
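## Advanced Usage
- [Quantization](./docs/zh/quantization/README.md)
- [Disaggregated Deployment](./docs/zh/features/disaggregated.md)
- [Speculative Decoding](./docs/zh/features/speculative_decoding.md)
- [Prefix Caching](./docs/zh/features/prefix_caching.md)
- [Chunked Prefill](./docs/zh/features/chunked_prefill.md)
## Acknowledgement
FastDeploy is licensed under the [Apache-2.0 open-source license](./LICENSE). During development, we referenced and borrowed portions of the [vLLM](https://github.com/vllm-project/vllm) code to maintain interface compatibility, for which we express our sincere gratitude.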


@@ -361,8 +361,7 @@ async def benchmark(
if not test_output.success:
raise ValueError(
"Initial test run failed - Please make sure benchmark arguments "
f"are correctly specified. Error: {test_output.error}"
f"Initial test run failed - Please make sure that 1. benchmark arguments are correctly specified and 2. the http_proxy and https_proxy are turned off. Error: {test_output.error}"
)
else:
print("Initial test run completed. Starting main benchmark run...")


@@ -1061,12 +1061,11 @@ void MultiQueryAppendAttention(
if (!is_decoder) {
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
const int num_chunks = div_up(max_dec_len, chunk_size);
const int num_chunks = div_up(max_seq_len, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 1) {
if (num_chunks <= 0) {
auto nosplit_kv_kernel =
multi_query_append_attention_warp1_4_kernel<NV_TYPE,
false,
@@ -1161,8 +1160,8 @@ void MultiQueryAppendAttention(
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_k.data<T>())),
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_v.data<T>())),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(smooth_weight.get().data<T>()))
: nullptr,
@@ -1208,8 +1207,8 @@ void MultiQueryAppendAttention(
seq_lens_encoder.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,
@@ -1226,14 +1225,14 @@ void MultiQueryAppendAttention(
constexpr int blockx = HEAD_DIM / vec_size;
constexpr int blocky = (128 + blockx - 1) / blockx;
dim3 grids_merge(min(sm_count * 4, token_num),
num_heads);
num_heads);
dim3 blocks_merge(blockx, blocky);
merge_multi_chunks_v2_kernel<NV_TYPE,
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
<<<grids_merge, blocks_merge, 0, stream>>>(
reinterpret_cast<NV_TYPE *>(tmp_workspace->ptr()),
static_cast<float *>(tmp_m->ptr()),
@@ -1244,8 +1243,8 @@ void MultiQueryAppendAttention(
batch_id_per_token.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,
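
Reviewer note: the hunks above derive the launch geometry on the host. The chunk count now comes from `max_seq_len` rather than `max_dec_len`, and the merge kernel's block shape comes from `HEAD_DIM / vec_size`. The sketch below reproduces only that arithmetic. It assumes `div_up` is plain ceiling division (its definition is not shown in this diff); `SplitKvGeometry`, `make_geometry`, and the sample value `vec_size = 8` are hypothetical and used purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: div_up is ceiling division; the repo's own definition is not in this diff.
constexpr uint32_t div_up(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Hypothetical holder for the values the host code computes before launching.
struct SplitKvGeometry {
  uint32_t num_chunks;  // grid.y of the attention kernel
  uint32_t blockx;      // merge kernel: threads along the head dimension
  uint32_t blocky;      // merge kernel: rows so that a block has at least 128 threads
};

constexpr SplitKvGeometry make_geometry(uint32_t max_seq_len,
                                        uint32_t chunk_size,
                                        uint32_t head_dim,
                                        uint32_t vec_size) {
  // Mirrors: blockx = HEAD_DIM / vec_size; blocky = (128 + blockx - 1) / blockx;
  return {div_up(max_seq_len, chunk_size),
          head_dim / vec_size,
          (128 + head_dim / vec_size - 1) / (head_dim / vec_size)};
}

// Worked example: 1000 tokens in chunks of 256 need 4 chunks.
static_assert(make_geometry(1000, 256, 128, 8).num_chunks == 4, "ceil(1000/256)");
static_assert(make_geometry(1000, 256, 128, 8).blockx == 16, "128 / 8");
static_assert(make_geometry(1000, 256, 128, 8).blocky == 8, "ceil(128/16)");
```

Under this assumption any non-empty sequence gives `num_chunks >= 1`, so the changed `num_chunks <= 0` condition selects the no-split kernel only when `max_seq_len` is zero.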


@@ -1285,10 +1285,11 @@ void MultiQueryAppendC4Attention(
if (!is_decoder) {
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
const int num_chunks = div_up(max_dec_len, chunk_size);
const int num_chunks = div_up(max_seq_len, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 1) {
if (num_chunks <= 0) {
auto nosplit_kv_kernel =
multi_query_append_attention_c4_warp1_4_kernel<NV_TYPE,
uint8_t,
@@ -1392,15 +1393,15 @@ void MultiQueryAppendC4Attention(
const_cast<uint8_t *>(cache_v.data<uint8_t>()),
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_k_scale.data<T>())),
cache_k_zp ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(cache_k_zp.get().data<T>()))
: nullptr,
const_cast<T *>(cache_k_zp.get().data<T>()))
: nullptr,
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_v_scale.data<T>())),
cache_v_zp ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(cache_v_zp.get().data<T>()))
: nullptr,
const_cast<T *>(cache_v_zp.get().data<T>()))
: nullptr,
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(smooth_weight.get().data<T>()))
: nullptr,
@@ -1445,8 +1446,8 @@ void MultiQueryAppendC4Attention(
seq_lens_encoder.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,
@@ -1463,14 +1464,14 @@ void MultiQueryAppendC4Attention(
constexpr int blockx = HEAD_DIM / vec_size;
constexpr int blocky = (128 + blockx - 1) / blockx;
dim3 grids_merge(min(sm_count * 4, token_num),
num_heads);
num_heads);
dim3 blocks_merge(blockx, blocky);
merge_multi_chunks_v2_kernel<NV_TYPE,
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
<<<grids_merge, blocks_merge, 0, stream>>>(
reinterpret_cast<NV_TYPE *>(tmp_workspace->ptr()),
static_cast<float *>(tmp_m->ptr()),
@@ -1481,8 +1482,8 @@ void MultiQueryAppendC4Attention(
batch_id_per_token.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,


@@ -1254,10 +1254,10 @@ void MultiQueryAppendC8Attention(
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
const int num_chunks = div_up(max_dec_len, chunk_size);
const int num_chunks = div_up(max_seq_len, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
if (num_chunks <= 1) {
if (num_chunks <= 0) {
auto nosplit_kv_kernel =
multi_query_append_attention_c8_warp1_4_kernel<NV_TYPE,
uint8_t,
@@ -1377,8 +1377,8 @@ void MultiQueryAppendC8Attention(
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_k_scale.data<T>())),
reinterpret_cast<NV_TYPE *>(const_cast<T *>(cache_v_scale.data<T>())),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(smooth_weight.get().data<T>()))
: nullptr,
@@ -1418,8 +1418,8 @@ void MultiQueryAppendC8Attention(
seq_lens_encoder.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,
@@ -1436,14 +1436,14 @@ void MultiQueryAppendC8Attention(
constexpr int blockx = HEAD_DIM / vec_size;
constexpr int blocky = (128 + blockx - 1) / blockx;
dim3 grids_merge(min(sm_count * 4, token_num),
num_heads);
num_heads);
dim3 blocks_merge(blockx, blocky);
merge_multi_chunks_v2_kernel<NV_TYPE,
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
vec_size,
blocky,
HEAD_DIM,
OUT_NV_TYPE,
ENABLE_PREFILL>
<<<grids_merge, blocks_merge, 0, stream>>>(
reinterpret_cast<NV_TYPE *>(tmp_workspace->ptr()),
static_cast<float *>(tmp_m->ptr()),
@@ -1454,8 +1454,8 @@ void MultiQueryAppendC8Attention(
batch_id_per_token.data<int>(),
cu_seqlens_q.data<int>(),
shift_bias ? reinterpret_cast<NV_TYPE *>(
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
const_cast<T *>(shift_bias.get().data<T>()))
: nullptr,
smooth_weight ? reinterpret_cast<NV_TYPE *>(const_cast<T *>(
smooth_weight.get().data<T>()))
: nullptr,


@@ -195,22 +195,25 @@ std::vector<paddle::Tensor> GetBlockShapeAndSplitKVBlock(
const paddle::Tensor &seq_lens_encoder,
const paddle::Tensor &seq_lens_decoder,
const paddle::Tensor &seq_lens_this_time,
const int encoder_block_shape_q, const int decoder_block_shape_q,
const int group_size, const int block_size,
const int decoder_step_token_num) {
paddle::Tensor &decoder_batch_ids, // Inplace
paddle::Tensor &decoder_tile_ids_per_batch, // Inplace
paddle::Tensor &decoder_num_blocks_x_cpu, // Inplace, Pinned Memory
paddle::Tensor &max_len_tensor_cpu, // Inplace, Pinned Memory
const int encoder_block_shape_q,
const int decoder_block_shape_q,
const int group_size,
const int block_size,
const int decoder_step_token_num)
{
auto stream = seq_lens_encoder.stream();
int bsz = seq_lens_this_time.shape()[0];
auto max_len_tensor =
GetEmptyTensor({8}, paddle::DataType::INT32, seq_lens_encoder.place());
GetMaxLen(seq_lens_decoder, seq_lens_this_time, seq_lens_encoder,
max_len_tensor, bsz);
// max_len_this_time, max_enc_len_this_time, max_dec_len_this_time,
// max_enc_dec_len_this_time, max_just_dec_len_this_time,
// max_just_dec_merged_len_this_time, max_system_len,
// max_just_dec_len_without_system
auto max_len_cpu = max_len_tensor.copy_to(paddle::CPUPlace(), false);
auto max_len_cpu_ptr = max_len_cpu.data<int>();
paddle::Tensor max_len_tensor_gpu = GetEmptyTensor({max_len_tensor_cpu.shape()[0]}, paddle::DataType::INT32, seq_lens_this_time.place());
GetMaxLen(seq_lens_decoder, seq_lens_this_time, seq_lens_encoder,
max_len_tensor_gpu, bsz);
max_len_tensor_cpu.copy_(max_len_tensor_gpu, max_len_tensor_cpu.place(), false);
auto max_len_cpu_ptr = max_len_tensor_cpu.data<int>();
int max_len_this_time = max_len_cpu_ptr[0];
int max_enc_len_this_time = max_len_cpu_ptr[1];
int max_dec_len_this_time = max_len_cpu_ptr[2];
@@ -222,14 +225,11 @@ std::vector<paddle::Tensor> GetBlockShapeAndSplitKVBlock(
paddle::Tensor encoder_batch_ids;
paddle::Tensor encoder_tile_ids_per_batch;
paddle::Tensor encoder_num_blocks_x_cpu; /*cpu*/
paddle::Tensor encoder_num_blocks_x_cpu; /*cpu*/
paddle::Tensor kv_batch_ids;
paddle::Tensor kv_tile_ids_per_batch;
paddle::Tensor kv_num_blocks_x_cpu; /*cpu*/
paddle::Tensor decoder_batch_ids;
paddle::Tensor decoder_tile_ids_per_batch;
paddle::Tensor decoder_num_blocks_x_cpu; /*cpu*/
paddle::Tensor max_len_kv_cpu; /*cpu*/
paddle::Tensor kv_num_blocks_x_cpu; /*cpu*/
paddle::Tensor max_len_kv_cpu; /*cpu*/
auto max_len_kv =
GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_decoder.place());
@@ -291,92 +291,64 @@ std::vector<paddle::Tensor> GetBlockShapeAndSplitKVBlock(
kv_num_blocks_x_cpu =
GetEmptyTensor({0}, paddle::DataType::INT32, seq_lens_encoder.place());
}
if (max_just_dec_len_this_time > 0) {
const uint32_t decoder_max_tile_size_per_bs_q =
div_up((decoder_step_token_num * group_size), decoder_block_shape_q);
decoder_batch_ids =
GetEmptyTensor({bsz * decoder_max_tile_size_per_bs_q},
paddle::DataType::INT32, seq_lens_encoder.place());
decoder_tile_ids_per_batch =
GetEmptyTensor({bsz * decoder_max_tile_size_per_bs_q},
paddle::DataType::INT32, seq_lens_encoder.place());
if (max_just_dec_len_this_time > 0) {
// Clear buffer
const uint32_t decoder_max_tile_size_per_bs_q = div_up((decoder_step_token_num * group_size), decoder_block_shape_q);
const uint32_t decoder_batch_shape = bsz * decoder_max_tile_size_per_bs_q;
PADDLE_ENFORCE_GPU_SUCCESS(cudaMemsetAsync(decoder_batch_ids.data<int>(), 0, decoder_batch_shape * sizeof(int32_t), stream));
PADDLE_ENFORCE_GPU_SUCCESS(cudaMemsetAsync(decoder_tile_ids_per_batch.data<int>(), 0, decoder_batch_shape * sizeof(int32_t), stream));
PADDLE_ENFORCE_GPU_SUCCESS(cudaMemsetAsync(decoder_num_blocks_x_cpu.data<int>(), 0, sizeof(int32_t), stream));
auto decoder_num_blocks_x =
GetEmptyTensor({1}, paddle::DataType::INT32, seq_lens_encoder.place());
split_q_block<<<1, 32, 0, stream>>>(
seq_lens_this_time.data<int>(), seq_lens_encoder.data<int>(),
decoder_batch_ids.data<int>(), decoder_tile_ids_per_batch.data<int>(),
decoder_num_blocks_x.data<int>(), bsz, decoder_block_shape_q,
seq_lens_this_time.data<int>(),
seq_lens_encoder.data<int>(),
decoder_batch_ids.data<int>(),
decoder_tile_ids_per_batch.data<int>(),
decoder_num_blocks_x.data<int>(),
bsz,
decoder_block_shape_q,
group_size);
decoder_num_blocks_x_cpu =
decoder_num_blocks_x.copy_to(paddle::CPUPlace(), false);
} else {
decoder_batch_ids =
GetEmptyTensor({0}, paddle::DataType::INT32, seq_lens_encoder.place());
decoder_tile_ids_per_batch =
GetEmptyTensor({0}, paddle::DataType::INT32, seq_lens_encoder.place());
decoder_num_blocks_x_cpu =
GetEmptyTensor({0}, paddle::DataType::INT32, paddle::CPUPlace());
decoder_num_blocks_x_cpu.copy_(decoder_num_blocks_x, decoder_num_blocks_x_cpu.place(), false);
}
return {encoder_batch_ids,
encoder_tile_ids_per_batch,
encoder_num_blocks_x_cpu, /*cpu*/
kv_batch_ids,
kv_tile_ids_per_batch,
kv_num_blocks_x_cpu, /*cpu*/
decoder_batch_ids,
decoder_tile_ids_per_batch,
decoder_num_blocks_x_cpu, /*cpu*/
max_len_kv_cpu /*cpu*/,
max_len_cpu};
}
std::vector<paddle::DataType> GetBlockShapeAndSplitKVBlockInferDtype(
const paddle::DataType &seq_lens_encoder_dtype,
const paddle::DataType &seq_lens_decoder_dtype,
const paddle::DataType &seq_lens_this_time_dtype) {
return {
paddle::DataType::INT32, paddle::DataType::INT32, paddle::DataType::INT32,
paddle::DataType::INT32, paddle::DataType::INT32, paddle::DataType::INT32,
paddle::DataType::INT32, paddle::DataType::INT32, paddle::DataType::INT32,
paddle::DataType::INT32, paddle::DataType::INT32};
}
std::vector<std::vector<int64_t>> GetBlockShapeAndSplitKVBlockInferShape(
const std::vector<int64_t> &seq_lens_encoder_shape,
const std::vector<int64_t> &seq_lens_decoder_shape,
const std::vector<int64_t> &seq_lens_this_time_shape) {
std::vector<int64_t> dynamic_shape = {-1};
return {dynamic_shape,
dynamic_shape,
{1},
dynamic_shape,
dynamic_shape,
{1},
dynamic_shape,
dynamic_shape,
{1},
{1},
{8}};
encoder_batch_ids,
encoder_tile_ids_per_batch,
encoder_num_blocks_x_cpu, /*cpu*/
kv_batch_ids,
kv_tile_ids_per_batch,
kv_num_blocks_x_cpu, /*cpu*/
max_len_kv_cpu, /*cpu*/
};
}
PD_BUILD_STATIC_OP(get_block_shape_and_split_kv_block)
.Inputs({"seq_lens_encoder", "seq_lens_decoder", "seq_lens_this_time"})
.Outputs({paddle::Optional("encoder_batch_ids"),
paddle::Optional("encoder_tile_ids_per_batch"),
paddle::Optional("encoder_num_blocks"),
paddle::Optional("kv_batch_ids"),
paddle::Optional("kv_tile_ids_per_batch"),
paddle::Optional("kv_num_blocks"),
paddle::Optional("decoder_batch_ids"),
paddle::Optional("decoder_tile_ids_per_batch"),
paddle::Optional("decoder_num_blocks"),
paddle::Optional("max_len_kv"), "set_max_lengths"})
.Attrs({"encoder_block_shape_q: int", "decoder_block_shape_q: int",
"group_size: int", "block_size: int",
"decoder_step_token_num: int"})
.SetKernelFn(PD_KERNEL(GetBlockShapeAndSplitKVBlock))
.SetInferShapeFn(PD_INFER_SHAPE(GetBlockShapeAndSplitKVBlockInferShape))
.SetInferDtypeFn(PD_INFER_DTYPE(GetBlockShapeAndSplitKVBlockInferDtype));
.Inputs({
"seq_lens_encoder",
"seq_lens_decoder",
"seq_lens_this_time",
"decoder_batch_ids",
"decoder_tile_ids_per_batch",
"decoder_num_blocks_x_cpu",
"max_len_tensor_cpu"
})
.Outputs({
paddle::Optional("encoder_batch_ids"),
paddle::Optional("encoder_tile_ids_per_batch"),
paddle::Optional("encoder_num_blocks_x_cpu"),
paddle::Optional("kv_batch_ids"),
paddle::Optional("kv_tile_ids_per_batch"),
paddle::Optional("kv_num_blocks_x_cpu"),
"max_len_kv_cpu"
})
.Attrs({
"encoder_block_shape_q: int",
"decoder_block_shape_q: int",
"group_size: int",
"block_size: int",
"decoder_step_token_num: int"
})
.SetKernelFn(PD_KERNEL(GetBlockShapeAndSplitKVBlock));
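
Reviewer note: with this change the decoder buffers and the max-length tensor are passed in by the caller and written in place, so the caller needs to know how large they must be. The sketch below restates the sizing arithmetic from the hunk and names the eight slots of the max-length tensor. The enum and helper names are hypothetical. Only indices 0 to 2 are confirmed by the assignments visible in the diff; the order of the remaining slots is taken from the removed comment and should be treated as an assumption, as should the ceiling-division `div_up`.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: div_up is ceiling division; the repo's own definition is not in this diff.
constexpr uint32_t div_up(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Hypothetical names for the 8 int32 slots of max_len_tensor_cpu.
// Slots 0..2 match the assignments in the diff; 3..7 follow the comment order.
enum MaxLenIndex : int {
  kMaxLenThisTime = 0,
  kMaxEncLenThisTime = 1,
  kMaxDecLenThisTime = 2,
  kMaxEncDecLenThisTime = 3,
  kMaxJustDecLenThisTime = 4,
  kMaxJustDecMergedLenThisTime = 5,
  kMaxSystemLen = 6,
  kMaxJustDecLenWithoutSystem = 7,
  kMaxLenCount = 8
};

// Number of int32 elements cleared in decoder_batch_ids and
// decoder_tile_ids_per_batch (decoder_batch_shape in the diff).
constexpr uint32_t decoder_tile_buffer_elems(uint32_t bsz,
                                             uint32_t decoder_step_token_num,
                                             uint32_t group_size,
                                             uint32_t decoder_block_shape_q) {
  return bsz * div_up(decoder_step_token_num * group_size, decoder_block_shape_q);
}

// Worked example: 3 requests, 5 tokens per step, group size 8, tile of 16 rows.
// ceil(40 / 16) = 3 tiles per request, so 9 elements in total.
static_assert(decoder_tile_buffer_elems(3, 5, 8, 16) == 9, "3 * ceil(40/16)");
static_assert(kMaxLenCount == 8, "the removed code allocated {8} elements");
```

A caller that preallocates for the largest expected batch can reuse the same buffers across steps, which appears to be the motivation for moving them from outputs to in-place inputs.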


@@ -586,9 +586,9 @@ __global__ void append_cache_kv_c4(
#pragma unroll
for (uint32_t i = wid * 32 + tid; i < HEAD_DIM; i += 128) {
cache_k_scale_smem[i] = cache_k_scale_now[i];
cache_k_zero_point_smem[i] = cache_k_zp_now[i] - static_cast<T>(136.f);
cache_k_zero_point_smem[i] = cache_k_zp_now[i] + static_cast<T>(136.f);
cache_v_scale_smem[i] = cache_v_scale_now[i];
cache_v_zero_point_smem[i] = cache_v_zp_now[i] - static_cast<T>(136.f);
cache_v_zero_point_smem[i] = cache_v_zp_now[i] + static_cast<T>(136.f);
}
smem_t k_smem(smem);
@@ -640,25 +640,25 @@ __global__ void append_cache_kv_c4(
convert_int4(frag_dq_T + 8, k_frag[2 * i + 1]);
if (row_idx < end_idx) {
k_tile_ptr0[0] = frag_dq_T[0] * cache_k_scale_smem[col_idx] + cache_k_zero_point_smem[col_idx];
k_tile_ptr0[1] = frag_dq_T[1] * cache_k_scale_smem[col_idx + 1] + cache_k_zero_point_smem[col_idx + 1];
k_tile_ptr0[8] = frag_dq_T[2] * cache_k_scale_smem[col_idx + 8] + cache_k_zero_point_smem[col_idx + 8];
k_tile_ptr0[9] = frag_dq_T[3] * cache_k_scale_smem[col_idx + 9] + cache_k_zero_point_smem[col_idx + 9];
k_tile_ptr0[16] = frag_dq_T[8] * cache_k_scale_smem[col_idx + 16] + cache_k_zero_point_smem[col_idx + 16];
k_tile_ptr0[17] = frag_dq_T[9] * cache_k_scale_smem[col_idx + 17] + cache_k_zero_point_smem[col_idx + 17];
k_tile_ptr0[24] = frag_dq_T[10] * cache_k_scale_smem[col_idx + 24] + cache_k_zero_point_smem[col_idx + 24];
k_tile_ptr0[25] = frag_dq_T[11] * cache_k_scale_smem[col_idx + 25] + cache_k_zero_point_smem[col_idx + 25];
k_tile_ptr0[0] = (frag_dq_T[0] - cache_k_zero_point_smem[col_idx]) * cache_k_scale_smem[col_idx];
k_tile_ptr0[1] = (frag_dq_T[1] - cache_k_zero_point_smem[col_idx + 1]) * cache_k_scale_smem[col_idx + 1];
k_tile_ptr0[8] = (frag_dq_T[2] - cache_k_zero_point_smem[col_idx + 8]) * cache_k_scale_smem[col_idx + 8];
k_tile_ptr0[9] = (frag_dq_T[3] - cache_k_zero_point_smem[col_idx + 9]) * cache_k_scale_smem[col_idx + 9];
k_tile_ptr0[16] = (frag_dq_T[8] - cache_k_zero_point_smem[col_idx + 16]) * cache_k_scale_smem[col_idx + 16];
k_tile_ptr0[17] = (frag_dq_T[9] - cache_k_zero_point_smem[col_idx + 17]) * cache_k_scale_smem[col_idx + 17];
k_tile_ptr0[24] = (frag_dq_T[10] - cache_k_zero_point_smem[col_idx + 24]) * cache_k_scale_smem[col_idx + 24];
k_tile_ptr0[25] = (frag_dq_T[11] - cache_k_zero_point_smem[col_idx + 25]) * cache_k_scale_smem[col_idx + 25];
}
if (row_idx + 8 < end_idx) {
k_tile_ptr1[0] = frag_dq_T[4] * cache_k_scale_smem[col_idx] + cache_k_zero_point_smem[col_idx];
k_tile_ptr1[1] = frag_dq_T[5] * cache_k_scale_smem[col_idx + 1] + cache_k_zero_point_smem[col_idx + 1];
k_tile_ptr1[8] = frag_dq_T[6] * cache_k_scale_smem[col_idx + 8] + cache_k_zero_point_smem[col_idx + 8];
k_tile_ptr1[9] = frag_dq_T[7] * cache_k_scale_smem[col_idx + 9] + cache_k_zero_point_smem[col_idx + 9];
k_tile_ptr1[16] = frag_dq_T[12] * cache_k_scale_smem[col_idx + 16] + cache_k_zero_point_smem[col_idx + 16];
k_tile_ptr1[17] = frag_dq_T[13] * cache_k_scale_smem[col_idx + 17] + cache_k_zero_point_smem[col_idx + 17];
k_tile_ptr1[24] = frag_dq_T[14] * cache_k_scale_smem[col_idx + 24] + cache_k_zero_point_smem[col_idx + 24];
k_tile_ptr1[25] = frag_dq_T[15] * cache_k_scale_smem[col_idx + 25] + cache_k_zero_point_smem[col_idx + 25];
k_tile_ptr1[0] = (frag_dq_T[4] - cache_k_zero_point_smem[col_idx]) * cache_k_scale_smem[col_idx];
k_tile_ptr1[1] = (frag_dq_T[5] - cache_k_zero_point_smem[col_idx + 1]) * cache_k_scale_smem[col_idx + 1];
k_tile_ptr1[8] = (frag_dq_T[6] - cache_k_zero_point_smem[col_idx + 8]) * cache_k_scale_smem[col_idx + 8];
k_tile_ptr1[9] = (frag_dq_T[7] - cache_k_zero_point_smem[col_idx + 9]) * cache_k_scale_smem[col_idx + 9];
k_tile_ptr1[16] = (frag_dq_T[12] - cache_k_zero_point_smem[col_idx + 16]) * cache_k_scale_smem[col_idx + 16];
k_tile_ptr1[17] = (frag_dq_T[13] - cache_k_zero_point_smem[col_idx + 17]) * cache_k_scale_smem[col_idx + 17];
k_tile_ptr1[24] = (frag_dq_T[14] - cache_k_zero_point_smem[col_idx + 24]) * cache_k_scale_smem[col_idx + 24];
k_tile_ptr1[25] = (frag_dq_T[15] - cache_k_zero_point_smem[col_idx + 25]) * cache_k_scale_smem[col_idx + 25];
}
col_idx += 32;
}
@@ -711,36 +711,36 @@ __global__ void append_cache_kv_c4(
convert_int4(frag_dq_T, v_frag[2 * i]);
convert_int4(frag_dq_T + 8, v_frag[2 * i + 1]);
if (kv_idx < end_idx) {
v_tile_ptr0[0] = frag_dq_T[0] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[0] = frag_dq_T[4] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[0] = (frag_dq_T[0] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[0] = (frag_dq_T[4] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 1 < end_idx) {
v_tile_ptr0[kv_t_stride] = frag_dq_T[1] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[kv_t_stride] = frag_dq_T[5] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[kv_t_stride] = (frag_dq_T[1] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[kv_t_stride] = (frag_dq_T[5] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 8 < end_idx) {
v_tile_ptr0[8 * kv_t_stride] = frag_dq_T[2] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[8 * kv_t_stride] = frag_dq_T[6] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[8 * kv_t_stride] = (frag_dq_T[2] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[8 * kv_t_stride] = (frag_dq_T[6] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 9 < end_idx) {
v_tile_ptr0[9 * kv_t_stride] = frag_dq_T[3] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[9 * kv_t_stride] = frag_dq_T[7] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[9 * kv_t_stride] = (frag_dq_T[3] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[9 * kv_t_stride] = (frag_dq_T[7] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 16 < end_idx) {
v_tile_ptr0[16 * kv_t_stride] = frag_dq_T[8] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[16 * kv_t_stride] = frag_dq_T[12] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[16 * kv_t_stride] = (frag_dq_T[8] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[16 * kv_t_stride] = (frag_dq_T[12] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 17 < end_idx) {
v_tile_ptr0[17 * kv_t_stride] = frag_dq_T[9] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[17 * kv_t_stride] = frag_dq_T[13] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[17 * kv_t_stride] = (frag_dq_T[9] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[17 * kv_t_stride] = (frag_dq_T[13] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 24 < end_idx) {
v_tile_ptr0[24 * kv_t_stride] = frag_dq_T[10] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[24 * kv_t_stride] = frag_dq_T[14] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[24 * kv_t_stride] = (frag_dq_T[10] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[24 * kv_t_stride] = (frag_dq_T[14] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
if (kv_idx + 25 < end_idx) {
v_tile_ptr0[25 * kv_t_stride] = frag_dq_T[11] * cache_v_scale_smem[dim_idx] + cache_v_zero_point_smem[dim_idx];
v_tile_ptr1[25 * kv_t_stride] = frag_dq_T[15] * cache_v_scale_smem[dim_idx + 8] + cache_v_zero_point_smem[dim_idx + 8];
v_tile_ptr0[25 * kv_t_stride] = (frag_dq_T[11] - cache_v_zero_point_smem[dim_idx]) * cache_v_scale_smem[dim_idx];
v_tile_ptr1[25 * kv_t_stride] = (frag_dq_T[15] - cache_v_zero_point_smem[dim_idx + 8]) * cache_v_scale_smem[dim_idx + 8];
}
kv_idx += 32;
}
@@ -956,6 +956,30 @@ std::vector<paddle::Tensor> GQARopeWriteCacheKernel(
rotary_embs.dims()[2],
head_dim,
stream);
if (token_num < kv_token_num) {
AppendCacheKV<data_t, 128, 64>(
key_cache,
value_cache,
cache_k_dequant_scales.get(),
cache_v_dequant_scales.get(),
cache_k_zp.get(),
cache_v_zp.get(),
seq_lens_this_time,
seq_lens_decoder,
cu_seqlens_k,
block_tables,
cache_batch_ids,
cache_tile_ids,
cache_num_blocks,
max_blocks_per_seq,
kv_num_heads,
cache_quant_type,
&k,
&v,
stream
);
}
// write cache
if (cache_quant_type == "none") {
CascadeAppendWriteCacheKVQKV<data_t>(
@@ -1038,30 +1062,6 @@ std::vector<paddle::Tensor> GQARopeWriteCacheKernel(
}
}
}
if (token_num < kv_token_num) {
AppendCacheKV<data_t, 128, 64>(
key_cache,
value_cache,
cache_k_dequant_scales.get(),
cache_v_dequant_scales.get(),
cache_k_zp.get(),
cache_v_zp.get(),
seq_lens_this_time,
seq_lens_decoder,
cu_seqlens_k,
block_tables,
cache_batch_ids,
cache_tile_ids,
cache_num_blocks,
max_blocks_per_seq,
kv_num_heads,
cache_quant_type,
&k,
&v,
stream
);
}
return {q, k, v, qkv_out};
}
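
Reviewer note: the `append_cache_kv_c4` hunks switch the dequantization expression from `q * scale + zp` to `(q - zp) * scale`. The two forms are not interchangeable for the same stored zero point; they agree only when the additive term equals `-zp * scale`. The host-side sketch below shows both forms with hypothetical function names. It deliberately leaves out the `136.f` offset applied when the zero points are loaded, because the diff does not state what that constant encodes.

```cpp
#include <cassert>

// Form used by the added lines: subtract the zero point, then apply the scale.
constexpr float dequant_sub_then_scale(float q, float zero_point, float scale) {
  return (q - zero_point) * scale;
}

// Form used by the removed lines, kept only for comparison.
constexpr float dequant_scale_then_add(float q, float offset, float scale) {
  return q * scale + offset;
}

// With q = 10, zero_point = 8, scale = 0.5 the new form yields 1.0.
static_assert(dequant_sub_then_scale(10.0f, 8.0f, 0.5f) == 1.0f, "(10 - 8) * 0.5");
// The old form gives the same value only if its offset is -zero_point * scale = -4.
static_assert(dequant_scale_then_add(10.0f, -4.0f, 0.5f) == 1.0f, "10 * 0.5 - 4");
// Feeding the raw zero point into the old form gives a different result.
static_assert(dequant_scale_then_add(10.0f, 8.0f, 0.5f) == 13.0f, "10 * 0.5 + 8");
```

The sample values are chosen so that every intermediate result is exactly representable in `float`, which keeps the equality checks safe.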


@@ -235,8 +235,14 @@ std::vector<paddle::Tensor> GetBlockShapeAndSplitKVBlock(
const paddle::Tensor &seq_lens_encoder,
const paddle::Tensor &seq_lens_decoder,
const paddle::Tensor &seq_lens_this_time,
const int encoder_block_shape_q, const int decoder_block_shape_q,
const int group_size, const int block_size,
paddle::Tensor &decoder_batch_ids, // Inplace
paddle::Tensor &decoder_tile_ids_per_batch, // Inplace
paddle::Tensor &decoder_num_blocks_x_cpu, // Inplace, Pinned Memory
paddle::Tensor &max_len_tensor_cpu, // Inplace, Pinned Memory
const int encoder_block_shape_q,
const int decoder_block_shape_q,
const int group_size,
const int block_size,
const int decoder_step_token_num);
std::vector<paddle::Tensor> GetPaddingOffset(const paddle::Tensor &input_ids,
@@ -266,13 +272,12 @@ void GetStopFlagsMulti(const paddle::Tensor &topk_ids,
const paddle::Tensor &seq_lens,
const paddle::Tensor &end_ids,
const paddle::Tensor &next_tokens,
const paddle::Tensor &pre_ids,
const paddle::Tensor &step_idx,
const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len,
const bool beam_search);
void GetStopFlagsMultiSeqs(
const paddle::Tensor &topk_ids, const paddle::Tensor &pre_ids,
const paddle::Tensor &step_idx, const paddle::Tensor &stop_flags,
const paddle::Tensor &seq_lens, const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len, const paddle::Tensor &end_ids);
void UpdateInputes(const paddle::Tensor &stop_flags,
const paddle::Tensor &not_need_stop, // only on cpu
@@ -954,12 +959,6 @@ PYBIND11_MODULE(fastdeploy_ops, m) {
m.def("set_stop_value_multi_ends", &GetStopFlagsMulti,
"update_inputs function");
/**
* stop_generation_multi_stop_seqs.cu
* set_stop_value_multi_seqs
*/
m.def("set_stop_value_multi_seqs", &GetStopFlagsMultiSeqs,
"update_inputs function");
/**
* update_inputs.cu


@@ -133,10 +133,18 @@ public:
template <typename TypeA, typename Arch>
struct LayoutDetailsB<TypeA, uint2b_t, Arch, typename platform::enable_if<Arch::kMinComputeCapability >= 75>::type>
{
static constexpr int ThreadblockK = 128 * 8 / cutlass::sizeof_bits<TypeA>::value;
using Layout = layout::RowMajor;
static constexpr int ElementsPerAccess = 128 / cutlass::sizeof_bits<TypeA>::value;
using Operator = cutlass::arch::OpMultiplyAdd;
static constexpr int ThreadblockK = 128 * 8 / cutlass::sizeof_bits<TypeA>::value; // 64
private:
static constexpr int ElementsPerCacheLine = 128 * 8 / sizeof_bits<uint2b_t>::value;
static constexpr int ColumnsInterleaved = ElementsPerCacheLine / ThreadblockK; // 8
public:
// using Layout = layout::ColumnMajor;
// static constexpr int ElementsPerAccess = 16; // at least 4-bytes
using Layout = layout::ColumnMajorTileInterleave<ThreadblockK, ColumnsInterleaved>;
static constexpr int ElementsPerAccess = 128 / cutlass::sizeof_bits<uint2b_t>::value; // 64
using Operator = cutlass::arch::OpMultiplyAddDequantizeInterleavedBToA;
};
template <typename TypeA, typename Arch>
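
The constants in the `uint2b_t` specialization of `LayoutDetailsB` above can be checked by hand. The sketch below redoes that arithmetic with plain integers, assuming `TypeA` is a 16-bit type (`half_t` or `bfloat16_t`). The `sketch` namespace and its constant names are illustrative and not part of the commit.

```cpp
#include <cassert>

namespace sketch {

// Assumed element widths: 16-bit activations (half_t / bfloat16_t), 2-bit weights (uint2b_t).
constexpr int kBitsA = 16;
constexpr int kBitsB = 2;

// 128 bytes * 8 bits, the "128 * 8" factor used in the diff.
constexpr int kCacheLineBits = 128 * 8;

// ThreadblockK = 128 * 8 / sizeof_bits<TypeA>
constexpr int kThreadblockK = kCacheLineBits / kBitsA;

// ElementsPerCacheLine = 128 * 8 / sizeof_bits<uint2b_t>
constexpr int kElementsPerCacheLine = kCacheLineBits / kBitsB;

// ColumnsInterleaved = ElementsPerCacheLine / ThreadblockK
constexpr int kColumnsInterleaved = kElementsPerCacheLine / kThreadblockK;

// ElementsPerAccess = 128 / sizeof_bits<uint2b_t>
constexpr int kElementsPerAccess = 128 / kBitsB;

// These match the "// 64", "// 8" and "// 64" comments in the diff.
static_assert(kThreadblockK == 64, "ThreadblockK");
static_assert(kColumnsInterleaved == 8, "ColumnsInterleaved");
static_assert(kElementsPerAccess == 64, "ElementsPerAccess");

}  // namespace sketch
```

Under these assumptions one 128-byte line holds 512 two-bit weights, which is why eight columns of 64 K-elements are interleaved per tile.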

@@ -18,14 +18,12 @@
#include "cutlass_extensions/gemm/threadblock/default_dq_mma_multistage.h"
#include "cutlass_extensions/gemm/threadblock/default_dq_mma_pipelined.h"
#include "cutlass_extensions/gemm/threadblock/default_wint2x_mma.h"
#include "cutlass_extensions/gemm/threadblock/default_mma_bf16.h"
namespace cutlass
{
namespace gemm
{
namespace threadblock
{
namespace cutlass {
namespace gemm {
namespace threadblock {
////////////////////////////////////////////////////////////////////////////////
@@ -378,38 +376,23 @@ template <
struct DefaultMma<cutlass::half_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB, ElementAccumulator,
layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator>
{
static cutlass::arch::CacheOperation::Kind const CacheOpA =
((sizeof_bits<half_t>::value * kAlignmentA) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
static cutlass::arch::CacheOperation::Kind const CacheOpB =
((sizeof_bits<half_t>::value * kAlignmentB) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
private:
using Mma = DefaultWint2xMma<half_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB,
ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape,
WarpShape, InstructionShape, 2, Operator>;
public:
// Define the MmaCore components
using MmaCore =
typename cutlass::gemm::threadblock::DefaultMmaCore<ThreadblockShape, WarpShape, InstructionShape, half_t,
LayoutA, half_t, LayoutB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, 3, Operator,
false, CacheOpA, CacheOpB>;
using MmaCore = typename Mma::MmaCore;
// Define iterators over tiles from the A operand
using ThreadMapA = typename MmaCore::IteratorThreadMapA;
using AccessTypeA = cutlass::Array<half_t, kAlignmentA>;
using IteratorA = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kM, ThreadblockShape::kK>, half_t, LayoutA, 1, ThreadMapA,
AccessTypeA>;
using IteratorA = typename Mma::IteratorA;
// Define iterators over tiles from the B operand
using ThreadMapB = typename MmaCore::IteratorThreadMapB;
using AccessTypeB = cutlass::Array<half_t, kAlignmentB>;
using IteratorB = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kK, ThreadblockShape::kN>, half_t, LayoutB, 0, ThreadMapB,
AccessTypeB>;
using IteratorB = typename Mma::IteratorB;
// Define the threadblock-scoped multistage matrix multiply
using ThreadblockMma = cutlass::gemm::threadblock::Wint2xMmaMultistage<typename MmaCore::Shape, IteratorA,
typename MmaCore::SmemIteratorA, MmaCore::kCacheOpA, IteratorB, typename MmaCore::SmemIteratorB,
MmaCore::kCacheOpB, ElementAccumulator, layout::RowMajor, typename MmaCore::MmaPolicy, 2>;
using ThreadblockMma = typename Mma::ThreadblockMma;
};
template <
@@ -441,38 +424,23 @@ struct DefaultMma<half_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB,
layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, kStages, Operator,
false, SharedMemoryClear>
{
static cutlass::arch::CacheOperation::Kind const CacheOpA =
((sizeof_bits<half_t>::value * kAlignmentA) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
static cutlass::arch::CacheOperation::Kind const CacheOpB =
((sizeof_bits<half_t>::value * kAlignmentB) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
private:
using Mma = DefaultWint2xMma<half_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB,
ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape,
WarpShape, InstructionShape, kStages, Operator, SharedMemoryClear>;
public:
// Define the MmaCore components
using MmaCore =
typename cutlass::gemm::threadblock::DefaultMmaCore<ThreadblockShape, WarpShape, InstructionShape, half_t,
LayoutA, half_t, LayoutB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, kStages, Operator,
false, CacheOpA, CacheOpB>;
using MmaCore = typename Mma::MmaCore;
// Define iterators over tiles from the A operand
using ThreadMapA = typename MmaCore::IteratorThreadMapA;
using AccessTypeA = cutlass::Array<half_t, kAlignmentA>;
using IteratorA = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kM, ThreadblockShape::kK>, half_t, LayoutA, 1, ThreadMapA,
AccessTypeA>;
using IteratorA = typename Mma::IteratorA;
// Define iterators over tiles from the B operand
using ThreadMapB = typename MmaCore::IteratorThreadMapB;
using AccessTypeB = cutlass::Array<half_t, kAlignmentB>;
using IteratorB = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kK, ThreadblockShape::kN>, half_t, LayoutB, 0, ThreadMapB,
AccessTypeB>;
using IteratorB = typename Mma::IteratorB;
// Define the threadblock-scoped multistage matrix multiply
using ThreadblockMma = cutlass::gemm::threadblock::Wint2xMmaMultistage<typename MmaCore::Shape, IteratorA,
typename MmaCore::SmemIteratorA, MmaCore::kCacheOpA, IteratorB, typename MmaCore::SmemIteratorB,
MmaCore::kCacheOpB, ElementAccumulator, layout::RowMajor, typename MmaCore::MmaPolicy, kStages, SharedMemoryClear>;
using ThreadblockMma = typename Mma::ThreadblockMma;
};
} // namespace threadblock

@@ -19,7 +19,7 @@
#include "cutlass/gemm/threadblock/default_mma.h"
#include "cutlass_extensions/gemm/threadblock/default_dq_mma_multistage.h"
#include "cutlass_extensions/gemm/threadblock/default_dq_mma_pipelined.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_mma_multistage.h"
#include "cutlass_extensions/gemm/threadblock/default_wint2x_mma.h"
namespace cutlass {
namespace gemm {
@@ -379,38 +379,23 @@ template <
struct DefaultMma<cutlass::bfloat16_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB, ElementAccumulator,
layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator>
{
static cutlass::arch::CacheOperation::Kind const CacheOpA =
((sizeof_bits<bfloat16_t>::value * kAlignmentA) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
static cutlass::arch::CacheOperation::Kind const CacheOpB =
((sizeof_bits<bfloat16_t>::value * kAlignmentB) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
private:
using Mma = DefaultWint2xMma<bfloat16_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB,
ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape,
WarpShape, InstructionShape, 2, Operator>;
public:
// Define the MmaCore components
using MmaCore =
typename cutlass::gemm::threadblock::DefaultMmaCore<ThreadblockShape, WarpShape, InstructionShape, bfloat16_t,
LayoutA, bfloat16_t, LayoutB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, 3, Operator,
false, CacheOpA, CacheOpB>;
using MmaCore = typename Mma::MmaCore;
// Define iterators over tiles from the A operand
using ThreadMapA = typename MmaCore::IteratorThreadMapA;
using AccessTypeA = cutlass::Array<bfloat16_t, kAlignmentA>;
using IteratorA = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kM, ThreadblockShape::kK>, bfloat16_t, LayoutA, 1, ThreadMapA,
AccessTypeA>;
using IteratorA = typename Mma::IteratorA;
// Define iterators over tiles from the B operand
using ThreadMapB = typename MmaCore::IteratorThreadMapB;
using AccessTypeB = cutlass::Array<bfloat16_t, kAlignmentB>;
using IteratorB = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kK, ThreadblockShape::kN>, bfloat16_t, LayoutB, 0, ThreadMapB,
AccessTypeB>;
using IteratorB = typename Mma::IteratorB;
// Define the threadblock-scoped multistage matrix multiply
using ThreadblockMma = cutlass::gemm::threadblock::Wint2xMmaMultistage<typename MmaCore::Shape, IteratorA,
typename MmaCore::SmemIteratorA, MmaCore::kCacheOpA, IteratorB, typename MmaCore::SmemIteratorB,
MmaCore::kCacheOpB, ElementAccumulator, layout::RowMajor, typename MmaCore::MmaPolicy, 2>;
using ThreadblockMma = typename Mma::ThreadblockMma;
};
template <
@@ -442,38 +427,23 @@ struct DefaultMma<bfloat16_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmen
layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, kStages, Operator,
false, SharedMemoryClear>
{
static cutlass::arch::CacheOperation::Kind const CacheOpA =
((sizeof_bits<bfloat16_t>::value * kAlignmentA) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
static cutlass::arch::CacheOperation::Kind const CacheOpB =
((sizeof_bits<bfloat16_t>::value * kAlignmentB) == 128) ? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
private:
using Mma = DefaultWint2xMma<bfloat16_t, LayoutA, kAlignmentA, uint2b_t, LayoutB, kAlignmentB,
ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape,
WarpShape, InstructionShape, kStages, Operator, SharedMemoryClear>;
public:
// Define the MmaCore components
using MmaCore =
typename cutlass::gemm::threadblock::DefaultMmaCore<ThreadblockShape, WarpShape, InstructionShape, bfloat16_t,
LayoutA, bfloat16_t, LayoutB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, kStages, Operator,
false, CacheOpA, CacheOpB>;
using MmaCore = typename Mma::MmaCore;
// Define iterators over tiles from the A operand
using ThreadMapA = typename MmaCore::IteratorThreadMapA;
using AccessTypeA = cutlass::Array<bfloat16_t, kAlignmentA>;
using IteratorA = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kM, ThreadblockShape::kK>, bfloat16_t, LayoutA, 1, ThreadMapA,
AccessTypeA>;
using IteratorA = typename Mma::IteratorA;
// Define iterators over tiles from the B operand
using ThreadMapB = typename MmaCore::IteratorThreadMapB;
using AccessTypeB = cutlass::Array<bfloat16_t, kAlignmentB>;
using IteratorB = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kK, ThreadblockShape::kN>, bfloat16_t, LayoutB, 0, ThreadMapB,
AccessTypeB>;
using IteratorB = typename Mma::IteratorB;
// Define the threadblock-scoped multistage matrix multiply
using ThreadblockMma = cutlass::gemm::threadblock::Wint2xMmaMultistage<typename MmaCore::Shape, IteratorA,
typename MmaCore::SmemIteratorA, MmaCore::kCacheOpA, IteratorB, typename MmaCore::SmemIteratorB,
MmaCore::kCacheOpB, ElementAccumulator, layout::RowMajor, typename MmaCore::MmaPolicy, kStages, SharedMemoryClear>;
using ThreadblockMma = typename Mma::ThreadblockMma;
};
} // namespace threadblock

@@ -0,0 +1,182 @@
/***************************************************************************************************
* Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
#include "cutlass/gemm/threadblock/default_mma_core_sm80.h"
namespace cutlass {
namespace gemm {
namespace threadblock {
/// Partial specialization:
///
/// A: row-major
/// B: uint2b_t, column-major
/// Operator: tensor op class
///
/// This uses the default warp-level operator given tile sizes
template <
/// Shape of threadblock-scoped matrix multiply operator (concept:
/// GemmShape)
typename Shape_,
/// Shape of warp-level matrix multiply operator (concept: GemmShape)
typename WarpShape_,
/// Shape of one matrix production operation (concept: GemmShape)
typename InstructionShape_,
/// Data type of A operand
typename ElementA_,
/// Data type of accumulator
typename ElementC_,
/// Layout of accumulator
typename LayoutC_,
/// Number of stages
int Stages,
/// Operation performed by MMA
typename Operator_,
/// Cache operation of operand A
cutlass::arch::CacheOperation::Kind CacheOpA,
/// Cache operation of operand B
cutlass::arch::CacheOperation::Kind CacheOpB>
struct DefaultMmaCore<Shape_, WarpShape_, InstructionShape_, ElementA_,
layout::RowMajor, uint2b_t, layout::ColumnMajor,
ElementC_, LayoutC_, arch::OpClassTensorOp, Stages,
Operator_, false, CacheOpA, CacheOpB> {
using Shape = Shape_;
using WarpShape = WarpShape_;
using InstructionShape = InstructionShape_;
using ElementA = ElementA_;
using LayoutA = layout::RowMajor;
using ElementB = uint2b_t;
using LayoutB = layout::ColumnMajor;
using ElementC = ElementC_;
using LayoutC = LayoutC_;
static int const kStages = Stages;
static cutlass::arch::CacheOperation::Kind const kCacheOpA = CacheOpA;
static cutlass::arch::CacheOperation::Kind const kCacheOpB = CacheOpB;
/// Number of warps present
using WarpCount = GemmShape<Shape::kM / WarpShape::kM,
Shape::kN / WarpShape::kN,
Shape::kK / WarpShape::kK>;
// Divisibility requirements
static_assert(
!(Shape::kM % WarpShape::kM) && !(Shape::kN % WarpShape::kN),
"Threadblock-scoped GEMM should be divisible by warp-scoped GEMM size.");
/// Number of threads per warp
static int const kWarpSize = warp::WarpSize<arch::OpClassTensorOp>::value;
/// Size of a threadblock-scoped access
static int const kAccessSizeInBits = 128;
/// Number of threads total
static int const kThreads = WarpCount::kCount * kWarpSize;
/// Number of threads used to load B, capped at the total thread count
static constexpr int kMaxThreadsForB =
(Shape::kK * Shape::kN * sizeof_bits<ElementB>::value) / kAccessSizeInBits;
static constexpr int kThreadsForB =
kMaxThreadsForB > kThreads ? kThreads : kMaxThreadsForB;
/// Default Operator
using Operator = Operator_;
// Warp thread arrangement
static int const kWarpThreadArrangementContiguousA =
Shape::kK / (kAccessSizeInBits / sizeof_bits<ElementA>::value);
static int const kWarpThreadArrangementStridedA =
kWarpSize / kWarpThreadArrangementContiguousA;
static int const kWarpThreadArrangementContiguousB =
Shape::kK / (kAccessSizeInBits / sizeof_bits<ElementB>::value);
static int const kWarpThreadArrangementStridedB =
kWarpSize / kWarpThreadArrangementContiguousB;
//
// Shared memory layouts
//
using SmemLayoutA = layout::RowMajorTensorOpMultiplicandCrosswise<
sizeof_bits<ElementA>::value, Shape::kK>;
// Shared memory layout
using SmemLayoutB = layout::ColumnMajorTensorOpMultiplicandCrosswise<
sizeof_bits<ElementB>::value, Shape::kK>;
//
// Iterators to write to shared memory
//
/// ThreadMap of iterator A
using IteratorThreadMapA = transform::PitchLinearWarpRakedThreadMap<
layout::PitchLinearShape<Shape::kK, Shape::kM>, kThreads,
layout::PitchLinearShape<kWarpThreadArrangementContiguousA,
kWarpThreadArrangementStridedA>,
kAccessSizeInBits / sizeof_bits<ElementA>::value>;
/// Shared memory iterator to A operand
using SmemIteratorA = transform::threadblock::RegularTileAccessIterator<
MatrixShape<Shape::kM, Shape::kK>, ElementA, SmemLayoutA, 0,
IteratorThreadMapA>;
/// ThreadMap of iterator B
using IteratorThreadMapB = transform::PitchLinearWarpRakedThreadMap<
layout::PitchLinearShape<Shape::kK, Shape::kN>, kThreadsForB,
layout::PitchLinearShape<kWarpThreadArrangementContiguousB,
kWarpThreadArrangementStridedB>,
kAccessSizeInBits / sizeof_bits<ElementB>::value>;
/// Shared memory iterator to B operand
using SmemIteratorB = transform::threadblock::RegularTileAccessIterator<
MatrixShape<Shape::kK, Shape::kN>, ElementB, SmemLayoutB, 1,
IteratorThreadMapB>;
//
// Warp-level matrix multiply operator
//
// Define the warp-level tensor op
using MmaTensorOp = typename cutlass::gemm::warp::DefaultMmaTensorOp<
WarpShape, InstructionShape, ElementA, SmemLayoutA, ElementB, SmemLayoutB,
ElementC, LayoutC, Operator, WarpCount::kK>::Type;
/// Policy used to define MmaPipelined
using MmaPolicy = MmaPolicy<MmaTensorOp, MatrixShape<0, 0>,
MatrixShape<0, 0>, WarpCount::kK>;
};
} // namespace threadblock
} // namespace gemm
} // namespace cutlass
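
The `kThreadsForB` cap in the `DefaultMmaCore` specialization above is easiest to see with numbers. This is a minimal sketch of that arithmetic only, using plain integers instead of CUTLASS types. The tile sizes and thread counts in the usage note are hypothetical, and the helper names are not part of the commit.

```cpp
#include <cassert>

namespace sketch {

constexpr int kAccessSizeInBits = 128;  // same value as in the diff
constexpr int kBitsB = 2;               // sizeof_bits<uint2b_t>

// kMaxThreadsForB = (Shape::kK * Shape::kN * sizeof_bits<ElementB>) / kAccessSizeInBits
constexpr int max_threads_for_b(int tb_k, int tb_n) {
  return (tb_k * tb_n * kBitsB) / kAccessSizeInBits;
}

// kThreadsForB = kMaxThreadsForB > kThreads ? kThreads : kMaxThreadsForB
constexpr int threads_for_b(int tb_k, int tb_n, int threads) {
  return max_threads_for_b(tb_k, tb_n) > threads ? threads
                                                 : max_threads_for_b(tb_k, tb_n);
}

// kWarpThreadArrangementContiguousB = Shape::kK / (kAccessSizeInBits / sizeof_bits<ElementB>)
constexpr int warp_contiguous_b(int tb_k) {
  return tb_k / (kAccessSizeInBits / kBitsB);
}

}  // namespace sketch
```

For a hypothetical 64 x 16 (K x N) tile of B and 128 threads, the tile holds only 16 accesses of 128 bits, so 16 threads load B; for a 64 x 256 tile the count is capped at the 128 available threads.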

@@ -0,0 +1,246 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once
#include "cutlass_extensions/arch/mma.h"
#include "cutlass_extensions/gemm/threadblock/default_dq_mma.h"
#include "cutlass_extensions/gemm/threadblock/default_mma_core.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_mma_multistage.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_params_accessor.h"
namespace cutlass {
namespace gemm {
namespace threadblock {
////////////////////////////////////////////////////////////////////////////////
template <typename ThreadblockShape, typename ElementT, int GroupSize>
struct DefaultQuantParamsIterators {
private:
static constexpr int kAlignment = 128 / sizeof_bits<ElementT>::value;
static_assert((ThreadblockShape::kN % kAlignment) == 0, "");
static constexpr int kRows =
(GroupSize == -1) ? 1 : (ThreadblockShape::kK + GroupSize - 1) / GroupSize;
static constexpr int kColumns = ThreadblockShape::kN;
using IteratorThreadMap = transform::PitchLinearStripminedThreadMap<
layout::PitchLinearShape<kColumns, kRows>,
kColumns / kAlignment, kAlignment>;
public:
using Iterator = cutlass::transform::threadblock::PredicatedTileIterator<
MatrixShape<kRows, kColumns>, ElementT, layout::RowMajor, 0,
IteratorThreadMap, kAlignment>;
using SmemIterator = Iterator;
};
template <typename ThreadblockShape, int GroupSize>
struct DefaultQuantParamsIterators<ThreadblockShape, uint4b_t, GroupSize> {
private:
static constexpr int kAlignment = 32 / sizeof_bits<uint4b_t>::value;
static_assert((ThreadblockShape::kN % kAlignment) == 0, "");
static constexpr int kRows =
(GroupSize == -1) ? 1 : (ThreadblockShape::kK + 2 * GroupSize - 1) / (2 * GroupSize);
static constexpr int kColumns =
(GroupSize == -1) ? ThreadblockShape::kN : ThreadblockShape::kN * 2;
using IteratorThreadMap = transform::PitchLinearStripminedThreadMap<
layout::PitchLinearShape<kColumns, kRows>,
kColumns / kAlignment, kAlignment>;
public:
using AccessType = cutlass::Array<uint4b_t, kAlignment>;
using Iterator = cutlass::transform::threadblock::PredicatedTileAccessIterator<
MatrixShape<kRows, kColumns>, uint4b_t, layout::RowMajor,
0, IteratorThreadMap, AccessType>;
using SmemIterator = Iterator;
};
template <
/// Element type for A matrix operand
typename ElementA_,
/// Layout type for A matrix operand
typename LayoutA_,
/// Access granularity of A matrix in units of elements
int kAlignmentA,
/// Element type for B matrix operand
typename ElementB_,
/// Layout type for B matrix operand
typename LayoutB_,
/// Access granularity of B matrix in units of elements
int kAlignmentB,
/// Element type for internal accumulation
typename ElementAccumulator_,
/// Layout type for C and D matrix operands
typename LayoutC_,
/// Operator class tag
typename OperatorClass_,
/// Tag indicating architecture to tune for
typename ArchTag_,
/// Threadblock-level tile size (concept: GemmShape)
typename ThreadblockShape_,
/// Warp-level tile size (concept: GemmShape)
typename WarpShape_,
/// Instruction-level tile size (concept: GemmShape)
typename InstructionShape_,
/// Number of stages used in the pipelined mainloop
int Stages,
/// Operation performed by GEMM
typename Operator_,
/// Use zfill or predicate for out-of-bound cp.async
SharedMemoryClearOption SharedMemoryClear = SharedMemoryClearOption::kNone>
struct DefaultWint2xMma;
////////////////////////////////////////////////////////////////////////////////
template <
/// Type for element A
typename ElementA,
/// Layout type for A matrix operand
typename LayoutA,
/// Access granularity of A matrix in units of elements
int kAlignmentA,
/// Type for element B
typename ElementB,
/// Layout type for B matrix operand
typename LayoutB,
/// Access granularity of B matrix in units of elements
int kAlignmentB,
/// Element type for internal accumulation
typename ElementAccumulator,
/// Operator class tag
typename OperatorClass,
/// Tag indicating architecture to tune for
typename ArchTag,
/// Threadblock-level tile size (concept: GemmShape)
typename ThreadblockShape,
/// Warp-level tile size (concept: GemmShape)
typename WarpShape,
/// Instruction-level tile size (concept: GemmShape)
typename InstructionShape,
/// Stages in GEMM
int kStages,
/// Operator performed by GEMM
typename Operator,
/// Use zfill or predicate for out-of-bound cp.async
SharedMemoryClearOption SharedMemoryClear>
struct DefaultWint2xMma<ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator,
layout::RowMajor, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape,
kStages, Operator, SharedMemoryClear>
{
public:
static_assert(platform::is_same<ElementA, half_t>::value || platform::is_same<ElementA, bfloat16_t>::value,
"Element A must be fp16 or bf16");
static_assert(platform::is_same<ElementB, uint2b_t>::value,
"Element B must be uint2b_t");
static_assert(platform::is_same<Operator, arch::OpMultiplyAddDequantizeInterleavedBToA>::value,
"Mma multistage must dequantize after ldsm");
using ElementSuperScale = ElementA;
using ElementLocalScale = uint4b_t;
using ElementCodeScaleZp = float;
static constexpr int kGroupSize = 64;
static cutlass::arch::CacheOperation::Kind const CacheOpA = ((sizeof_bits<ElementA>::value * kAlignmentA) == 128)
? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
static cutlass::arch::CacheOperation::Kind const CacheOpB = ((sizeof_bits<ElementB>::value * kAlignmentB) == 128)
? cutlass::arch::CacheOperation::Global
: cutlass::arch::CacheOperation::Always;
// Define the MmaCore components
// Mma core does not depend on stages, so pass in at least 3 here so that the mma multistage pieces are created
using MmaCore = typename cutlass::gemm::threadblock::DefaultMmaCore<ThreadblockShape, WarpShape, InstructionShape,
ElementA, LayoutA, ElementB, layout::ColumnMajor, ElementAccumulator, layout::RowMajor, OperatorClass,
std::max(kStages, 3), Operator, false, CacheOpA, CacheOpB>;
// Define iterators over tiles from the A operand
using ThreadMapA = typename MmaCore::IteratorThreadMapA;
using AccessTypeA = cutlass::Array<ElementA, kAlignmentA>;
using IteratorA = cutlass::transform::threadblock::PredicatedTileAccessIterator<
cutlass::MatrixShape<ThreadblockShape::kM, ThreadblockShape::kK>, ElementA, LayoutA, 1, ThreadMapA,
AccessTypeA>;
private:
static constexpr int kColumnsInterleaved = LayoutB::kColumnsInterleaved;
static constexpr int kRowsPerTile = LayoutB::kRowsPerTile;
static_assert(!(MmaCore::Shape::kN % kColumnsInterleaved), "ThreadblockShape must be divisible by kColumnsInterleaved");
static_assert(kRowsPerTile == MmaCore::Shape::kK, "");
using ThreadMapB = typename MmaCore::IteratorThreadMapB;
using WarpArrangement = typename ThreadMapB::Detail::WarpThreadArrangement;
static_assert(!(WarpArrangement::kStrided % kColumnsInterleaved), "");
using IteratorShapeB = MatrixShape<
MmaCore::Shape::kK * kColumnsInterleaved, MmaCore::Shape::kN / kColumnsInterleaved>;
using InterleavedThreadMapB = transform::PitchLinearWarpRakedThreadMap<
layout::PitchLinearShape<IteratorShapeB::kRow, IteratorShapeB::kColumn>,
ThreadMapB::kThreads,
layout::PitchLinearShape<WarpArrangement::kContiguous * kColumnsInterleaved,
WarpArrangement::kStrided / kColumnsInterleaved>,
MmaCore::kAccessSizeInBits / sizeof_bits<ElementB>::value>;
public:
// Define iterators over tiles from the B operand
using AccessTypeB = cutlass::Array<ElementB, kAlignmentB>;
using IteratorB = cutlass::transform::threadblock::PredicatedTileAccessIterator<
IteratorShapeB, ElementB, layout::ColumnMajor, 0, InterleavedThreadMapB,
AccessTypeB>;
private:
// Define iterators over tiles from extra quant params for B operand
using IteratorSuperScale = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementSuperScale, -1>::Iterator;
using SmemIteratorSuperScale = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementSuperScale, -1>::SmemIterator;
using IteratorLocalScale = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementLocalScale, kGroupSize>::Iterator;
using SmemIteratorLocalScale = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementLocalScale, kGroupSize>::SmemIterator;
using IteratorCodeScaleZp = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementCodeScaleZp, -1>::Iterator;
using SmemIteratorCodeScaleZp = typename DefaultQuantParamsIterators<
ThreadblockShape, ElementCodeScaleZp, -1>::SmemIterator;
public:
using QuantParamsAccessor = Wint2ParamsAccessor<
ElementA, ThreadblockShape, IteratorSuperScale, SmemIteratorSuperScale,
IteratorLocalScale, SmemIteratorLocalScale,
IteratorCodeScaleZp, SmemIteratorCodeScaleZp, kStages, kGroupSize>;
// Define the threadblock-scoped multistage matrix multiply
using ThreadblockMma = cutlass::gemm::threadblock::Wint2xMmaMultistage<
typename MmaCore::Shape,
IteratorA, typename MmaCore::SmemIteratorA, MmaCore::kCacheOpA,
IteratorB, typename MmaCore::SmemIteratorB, MmaCore::kCacheOpB,
ElementAccumulator, layout::RowMajor, typename MmaCore::MmaPolicy,
kStages, QuantParamsAccessor, SharedMemoryClear>;
};
} // namespace threadblock
} // namespace gemm
} // namespace cutlass
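
The tile shapes that `DefaultQuantParamsIterators` derives for the quantization parameters can be reproduced with integer arithmetic. The sketch below mirrors the `kRows`, `kColumns` and `kAlignment` expressions from the primary template and the `uint4b_t` specialization above. The `ParamTile` struct, the helper names and the threadblock sizes in the usage note are assumptions made for illustration.

```cpp
#include <cassert>

namespace sketch {

struct ParamTile {
  int rows;
  int columns;
  int alignment;  // elements per access
};

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Primary template: kAlignment = 128 / sizeof_bits<ElementT>,
// kRows = 1 for GroupSize == -1, else ceil(kK / GroupSize), kColumns = kN.
constexpr ParamTile generic_tile(int tb_k, int tb_n, int element_bits, int group_size) {
  return ParamTile{group_size == -1 ? 1 : ceil_div(tb_k, group_size),
                   tb_n,
                   128 / element_bits};
}

// uint4b_t specialization: kAlignment = 32 / 4,
// kRows = ceil(kK / (2 * GroupSize)), kColumns = 2 * kN (unless GroupSize == -1).
constexpr ParamTile uint4_tile(int tb_k, int tb_n, int group_size) {
  return ParamTile{group_size == -1 ? 1 : ceil_div(tb_k, 2 * group_size),
                   group_size == -1 ? tb_n : tb_n * 2,
                   32 / 4};
}

}  // namespace sketch
```

With a hypothetical 64 x 128 (K x N) threadblock tile and the `kGroupSize = 64` from the diff, the local-scale tile comes out as 1 x 256 `uint4b_t` values, while the fp16 super-scale tile is 1 x 128 with 8 elements per access.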

@@ -63,8 +63,8 @@ template <
typename Policy_,
/// Number of stages,
int Stages,
/// Used for partial specialization
typename Enable = bool>
/// Size of extra quantized params
typename QuantParamsShape>
class Wint2xMmaBase {
public:
///< Size of the Gemm problem - concept: gemm::GemmShape<>
@@ -93,6 +93,14 @@ public:
static int const kWarpGemmIterations =
(WarpGemm::kK / Operator::Policy::MmaShape::kK);
/// Number of warp-level GEMM operations per load for B
static constexpr int kWarpGemmIterationsPerLoadForB =
Operator::IteratorB::InstructionShape::kRow / Operator::InstructionShape::kK;
static_assert(!(kWarpGemmIterations % kWarpGemmIterationsPerLoadForB), "");
static constexpr int kWarpLoadIterationsForB =
kWarpGemmIterations / kWarpGemmIterationsPerLoadForB;
/// Number of stages
static int const kStages = Stages;
@@ -104,8 +112,6 @@ public:
using TensorRefB =
TensorRef<typename Operator::ElementB, typename Operator::LayoutB>;
// using TensorRefZippedB = TensorRef<uint8_t, typename Operator::LayoutB>;
static_assert(kWarpGemmIterations > 1,
"The pipelined structure requires at least two warp-level "
"GEMM operations.");
@@ -130,20 +136,11 @@ public:
Shape::kK * kStages + Policy::SmemPaddingA::kColumn>;
/// Shape of the B matrix operand in shared memory
using ShapeB = MatrixShape<Shape::kK + Policy::SmemPaddingB::kRow,
using ShapeB = MatrixShape<Shape::kK * kStages + Policy::SmemPaddingB::kRow,
Shape::kN + Policy::SmemPaddingB::kColumn>;
// w uint8; local_scale uint8;
constexpr static int kZippedRowsPerStages =
Shape::kK / 4 + (Shape::kK + 127) / 128;
// code_scale float; code_zp float; super_scale ElementB
constexpr static int kColumnWiseParamsRows = 2 * sizeof(float) +
sizeof_bits<typename Operator::ElementB>::value / 8;
using ZippedShapeB = MatrixShape<kColumnWiseParamsRows + kZippedRowsPerStages * kStages, Shape::kN>;
using NopaddingShapeB = MatrixShape<Shape::kK, Shape::kN>;
/// Shape of all quant params in shared memory
using QuantParamsShapeB = QuantParamsShape;
public:
//
@@ -156,12 +153,8 @@ public:
/// Buffer for B operand
AlignedBuffer<typename Operator::ElementB, ShapeB::kCount> operand_B;
/// Buffer for quanted B operand
AlignedBuffer<uint8_t, ZippedShapeB::kCount> operand_zipped_B;
/// Buffer for unzip B operand
AlignedBuffer<typename Operator::ElementB, NopaddingShapeB::kCount>
operand_unzip_B;
/// Buffer for extra quant params of B operand
AlignedBuffer<uint8_t, QuantParamsShapeB::kCount> operand_quant_params_B;
public:
//
@@ -191,14 +184,6 @@ public:
TensorRefB operand_B_ref() {
return TensorRefB{operand_B.data(), LayoutB()};
}
CUTLASS_HOST_DEVICE
uint8_t *operand_zipped_B_ptr() { return operand_zipped_B.data(); }
CUTLASS_HOST_DEVICE
typename Operator::ElementB *operand_unzip_B_ptr() {
return operand_unzip_B.data();
}
};
protected:
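
The change to `ShapeB` in `Wint2xMmaBase` multiplies the K extent by `kStages`, so shared memory for B now holds every pipeline stage of the 2-bit weights. The sketch below estimates the resulting buffer size and the `kWarpLoadIterationsForB` quotient. It assumes zero shared-memory padding and that a sub-byte buffer occupies `count * bits / 8` bytes; all numeric inputs in the usage note are hypothetical, and the helper names are not from the commit.

```cpp
#include <cassert>

namespace sketch {

constexpr int kBitsB = 2;  // sizeof_bits<uint2b_t>

// ShapeB::kCount = (Shape::kK * kStages + pad_row) * (Shape::kN + pad_col)
constexpr int shape_b_count(int tb_k, int tb_n, int stages,
                            int pad_row = 0, int pad_col = 0) {
  return (tb_k * stages + pad_row) * (tb_n + pad_col);
}

// Assumed storage: 2-bit elements packed densely, rounded up to whole bytes.
constexpr int shape_b_bytes(int tb_k, int tb_n, int stages) {
  return (shape_b_count(tb_k, tb_n, stages) * kBitsB + 7) / 8;
}

// kWarpLoadIterationsForB = kWarpGemmIterations / kWarpGemmIterationsPerLoadForB
constexpr int warp_load_iterations_for_b(int warp_gemm_iterations,
                                         int iterations_per_load) {
  return warp_gemm_iterations / iterations_per_load;
}

}  // namespace sketch
```

For a hypothetical 64 x 128 (K x N) tile with 4 stages this gives 32768 two-bit elements, or 8192 bytes, compared with 8192 elements for the single-stage shape that the diff replaces.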

@@ -45,7 +45,8 @@
#include "cutlass_extensions/arch/memory_copy_sm80.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_mma_base.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_tile_dequanter.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_params_accessor.h"
#include "cutlass_extensions/gemm/warp/mma_tensorop_wint2x_dequantizer.h"
/////////////////////////////////////////////////////////////////////////////////////////////////
@@ -86,15 +87,15 @@ template <
typename Policy_,
/// Number of stages,
int Stages,
/// Accessor for extra quantized params
typename QuantParamsAccessor_,
/// Use zfill or predicate for out-of-bound cp.async
SharedMemoryClearOption SharedMemoryClear = SharedMemoryClearOption::kNone,
/// Used for partial specialization
typename Enable = bool>
SharedMemoryClearOption SharedMemoryClear = SharedMemoryClearOption::kNone>
class Wint2xMmaMultistage :
public Wint2xMmaBase<Shape_, Policy_, Stages> {
public Wint2xMmaBase<Shape_, Policy_, Stages, typename QuantParamsAccessor_::QuantParamsShape> {
public:
///< Base class
using Base = Wint2xMmaBase<Shape_, Policy_, Stages>;
using Base = Wint2xMmaBase<Shape_, Policy_, Stages, typename QuantParamsAccessor_::QuantParamsShape>;
///< Size of the Gemm problem - concept: gemm::GemmShape<>
using Shape = Shape_;
///< Iterates over tiles of A operand in global memory
@@ -107,8 +108,11 @@ public:
using LayoutC = LayoutC_;
///< Policy describing tuning details
using Policy = Policy_;
/// Accessor for extra quantized params
using QuantParamsAccessor = QuantParamsAccessor_;
using QuantArguments = typename QuantParamsAccessor::Arguments;
using ZippedShapeB = typename Base::SharedStorage::ZippedShapeB;
static constexpr int kInterleave = IteratorB::Shape::kRow / Shape::kK;
using SmemIteratorA = SmemIteratorA_;
using SmemIteratorB = SmemIteratorB_;
@@ -129,6 +133,18 @@ public:
/// Minimum architecture is Sm80 to support cp.async
using ArchTag = arch::Sm80;
//using LayoutScale = typename QuantParamsAccessor::IteratorSuperScale::Layout;
using LayoutScale = layout::RowMajor;
using WarpTransformedFragmentB = typename Operator::TransformedFragmentB;
using WarpDequantizer =
warp::MmaTensorOpWin2xDequantizer<Operator,
typename Base::WarpGemm,
Operand::kB,
typename WarpTransformedFragmentB::Element,
LayoutScale,
QuantParamsAccessor::kGroupSize>;
static_assert(sizeof(WarpDequantizer) > 0, "WarpDequantizer template instantiation failed");
/// Complex transform on A operand
static ComplexTransform const kTransformA = Operator::kTransformA;
@@ -174,18 +190,37 @@ public:
using WarpTransformedFragmentA = typename Operator::TransformedFragmentA;
using WarpTransformedFragmentB = typename Operator::TransformedFragmentB;
using FragmentSuperScale = typename WarpDequantizer::FragmentSuperScale;
using FragmentCodeScaleZp = typename WarpDequantizer::FragmentCodeScaleZp;
using FragmentLocalScale = typename WarpDequantizer::FragmentLocalScale;
/// Temporary accumulator to facilitate staged-accumulation
FragmentC tmp_accum_;
/// Pair of A fragments used to overlap shared memory loads and math instructions
WarpLoadedFragmentA warp_loaded_frag_A_[2];
WarpTransformedFragmentA warp_transformed_frag_A_[2];
WarpTransformedFragmentA warp_frag_A_[2];
/// Pair of B fragments used to overlap shared memory loads and math instructions
WarpLoadedFragmentB warp_loaded_frag_B_[2];
WarpTransformedFragmentB warp_transformed_frag_B_[2];
WarpLoadedFragmentB warp_loaded_frag_B_;
WarpTransformedFragmentB warp_frag_B_[2];
/// channel-wise quant params
FragmentCodeScaleZp warp_frag_code_scale_;
FragmentCodeScaleZp warp_frag_code_zp_;
FragmentSuperScale warp_frag_super_scale_;
/// group-wise quant params
FragmentLocalScale warp_frag_local_scale_;
};
using ElementA = typename IteratorA::Element;
using ElementB = typename IteratorB::Element;
using LayoutDetailsForB = kernel::LayoutDetailsB<ElementA, ElementB, ArchTag>;
static constexpr bool IsTileInterleaveLayout =
layout::IsColumnMajorTileInterleave<typename LayoutDetailsForB::Layout>::value;
static_assert(!IsTileInterleaveLayout || (IsTileInterleaveLayout && (Shape::kK == LayoutDetailsForB::ThreadblockK)),
"Layout K must match threadblockK");
private:
@@ -202,17 +237,18 @@ public:
/// Iterator to write threadblock-scoped tile of B operand to shared memory
SmemIteratorB smem_iterator_B_;
/// Accessor for extra quant params for B
QuantParamsAccessor quant_params_accessor_B_;
// Wint2 unzip operator
WarpDequantizer warp_dequantizer_;
/// Shared memory write stage index
int smem_write_stage_idx_;
/// Shared memory read stage index
int smem_read_stage_idx_;
uint8_t* column_wise_smem_ptr_B_;
uint8_t* smem_zipped_ptr_B_;
int smem_zipped_bytes_per_stage_B_;
public:
/// Construct from tensor references
@@ -226,10 +262,15 @@ public:
int warp_idx,
///< ID of each thread within a warp
int lane_idx
):
Base(shared_storage, thread_idx, warp_idx, lane_idx),
) : Base(shared_storage, thread_idx, warp_idx, lane_idx),
smem_iterator_A_(shared_storage.operand_A_ref(), thread_idx),
smem_iterator_B_(shared_storage.operand_B_ref(), thread_idx),
quant_params_accessor_B_(shared_storage.operand_quant_params_B.data(), thread_idx, warp_idx, lane_idx),
warp_dequantizer_(quant_params_accessor_B_.super_scale_ref(),
quant_params_accessor_B_.local_scale_ref(),
quant_params_accessor_B_.code_scale_ref(),
quant_params_accessor_B_.code_zp_ref(),
(warp_idx % (Base::WarpCount::kM * Base::WarpCount::kN)) / Base::WarpCount::kM, lane_idx),
smem_write_stage_idx_(0),
smem_read_stage_idx_(0)
{
@@ -250,11 +291,6 @@ public:
{warp_idx_m, Base::kWarpGemmIterations * warp_idx_k});
this->warp_tile_iterator_B_.add_tile_offset(
{Base::kWarpGemmIterations * warp_idx_k, warp_idx_n});
column_wise_smem_ptr_B_ = shared_storage.operand_zipped_B_ptr();
smem_zipped_ptr_B_ = column_wise_smem_ptr_B_ + Base::SharedStorage::kColumnWiseParamsRows * ZippedShapeB::kColumn;
smem_zipped_bytes_per_stage_B_ = Base::SharedStorage::kZippedRowsPerStages * ZippedShapeB::kColumn;
}
/// Advance shared memory read-iterators to the next stage
@@ -266,28 +302,22 @@ public:
if (smem_read_stage_idx_ == Base::kStages) {
// Wrap back around to the 'start' of the circular buffer in shared memory
this->warp_tile_iterator_A_.add_tile_offset({0, -Base::kStages * Policy::kPartitionsK * Base::kWarpGemmIterations});
// this->warp_tile_iterator_B_.add_tile_offset({-Base::kStages * Policy::kPartitionsK * Base::kWarpGemmIterations, 0});
this->warp_tile_iterator_B_.add_tile_offset({-Base::kStages * Policy::kPartitionsK * Base::kWarpLoadIterationsForB, 0});
smem_read_stage_idx_ = 0;
}
this->warp_tile_iterator_B_.add_tile_offset({-Policy::kPartitionsK * Base::kWarpGemmIterations, 0});
}
/// Advance global memory read-iterators and shared memory write-iterators to the stage
template <typename TileDequanterB>
CUTLASS_DEVICE
void advance_smem_write_stage(
IteratorA &iterator_A,
IteratorB &iterator_B,
TileDequanterB &tile_dequanter_B)
void advance_smem_write_stage(IteratorA &iterator_A, IteratorB &iterator_B)
{
// Advance global iterators
iterator_A.add_tile_offset({0, 1});
//iterator_B.add_tile_offset({1, 0});
tile_dequanter_B.AddTileOffset({1, 0});
iterator_B.add_tile_offset({1, 0});
// Advance shared iterators
smem_iterator_A_.add_tile_offset({0, 1});
//smem_iterator_B_.add_tile_offset({1, 0});
smem_iterator_B_.add_tile_offset({1, 0});
// Increment shared memory write stage index
++smem_write_stage_idx_;
@@ -295,7 +325,7 @@ public:
if (smem_write_stage_idx_ == Base::kStages) {
// Wrap back around to the 'start' of the circular buffer in shared memory
smem_iterator_A_.add_tile_offset({0, -Base::kStages});
//smem_iterator_B_.add_tile_offset({-Base::kStages, 0});
smem_iterator_B_.add_tile_offset({-Base::kStages, 0});
smem_write_stage_idx_ = 0;
}
}
@@ -338,9 +368,14 @@ public:
}
}
template <bool GlobalToSharedB>
CUTLASS_DEVICE
void copy_tiles_and_advance_B(IteratorB &iterator_B, int group_start_B = 0) {
if constexpr (SharedMemoryClear == SharedMemoryClearOption::kZfill) {
if (threadIdx.x >= IteratorB::ThreadMap::kThreads) {
return;
}
}
iterator_B.set_iteration_index(group_start_B *
IteratorB::kAccessesPerVector);
this->smem_iterator_B_.set_iteration_index(group_start_B);
@@ -360,13 +395,14 @@ public:
CUTLASS_PRAGMA_UNROLL
for (int v = 0; v < IteratorB::kAccessesPerVector; ++v) {
auto gmem_ptr = iterator_B.get();
bool is_valid = (threadIdx.x < IteratorB::ThreadMap::kThreads) ? iterator_B.valid() : false;
if (SharedMemoryClear == SharedMemoryClearOption::kZfill) {
cutlass::arch::copy_zfill<kSrcBytes, kCacheOpB, GlobalToSharedB>(
dst_ptr + v, gmem_ptr, iterator_B.valid());
cutlass::arch::cp_async_zfill<kSrcBytes, kCacheOpB>(
dst_ptr + v, gmem_ptr, is_valid);
} else {
cutlass::arch::copy<kSrcBytes, kCacheOpB, GlobalToSharedB>(
dst_ptr + v, gmem_ptr, iterator_B.valid());
cutlass::arch::cp_async<kSrcBytes, kCacheOpB>(
dst_ptr + v, gmem_ptr, is_valid);
}
++iterator_B;
@@ -375,7 +411,6 @@ public:
++this->smem_iterator_B_;
}
}
__syncthreads();
}
CUTLASS_DEVICE
@@ -399,8 +434,6 @@ public:
IteratorA::ThreadMap::kElementsPerAccess /
IteratorA::kAccessesPerVector / 8;
int src_bytes = (iterator_A.valid() ? kSrcBytes : 0);
cutlass::arch::cp_async_zfill<kSrcBytes, kCacheOpA>(
dst_ptr + v, iterator_A.get(), iterator_A.valid());
@@ -411,9 +444,12 @@ public:
}
}
template <bool GlobalToSharedB, bool InitStage>
CUTLASS_DEVICE
void copy_tiles_and_advance_per_stage_B(IteratorB &iterator_B) {
if (threadIdx.x >= IteratorB::ThreadMap::kThreads) {
return;
}
iterator_B.set_iteration_index(0);
this->smem_iterator_B_.set_iteration_index(0);
@@ -433,35 +469,23 @@ public:
IteratorB::ThreadMap::kElementsPerAccess /
IteratorB::kAccessesPerVector / 8;
if (InitStage) {
cutlass::arch::copy_zfill<kSrcBytes, kCacheOpB, GlobalToSharedB>(
dst_ptr + v, iterator_B.get(), iterator_B.valid());
} else {
if (SharedMemoryClear == SharedMemoryClearOption::kZfill) {
cutlass::arch::copy_zfill<kSrcBytes, kCacheOpB, GlobalToSharedB>(
dst_ptr + v, gmem_ptr, iterator_B.valid());
} else {
cutlass::arch::copy<kSrcBytes, kCacheOpB, GlobalToSharedB>(
dst_ptr + v, gmem_ptr, iterator_B.valid());
}
}
cutlass::arch::cp_async_zfill<kSrcBytes, kCacheOpB>(
dst_ptr + v, iterator_B.get(), iterator_B.valid());
++iterator_B;
}
++this->smem_iterator_B_;
}
__syncthreads();
}
/// GEMM prologue. Bootstrap the global->shared memory pipeline by fetching
/// the global fragments needed by the first kStages-1 threadblock mainloop iterations
template <typename TileDequanterB>
CUTLASS_DEVICE
void prologue(
IteratorA &iterator_A, ///< [in|out] iterator over A operand in global memory
IteratorB &iterator_B, ///< [in|out] iterator over B operand in global memory
TileDequanterB &tile_dequanter_B,
QuantArguments &mma_quant_args, ///< iterators for extra quant params for B
int &gemm_k_iterations) ///< [in|out] number of threadblock mainloop iterations remaining
{
// Issue several complete stages
@@ -476,11 +500,18 @@ public:
copy_tiles_and_advance_per_stage_A(iterator_A);
// Async copy zipped B to shared memory.
tile_dequanter_B.Load(smem_zipped_ptr_B_ + (stage % Base::kStages) * smem_zipped_bytes_per_stage_B_,
column_wise_smem_ptr_B_, stage);
copy_tiles_and_advance_per_stage_B(iterator_B);
// Async copy other quantized params to shared memory, local_scale, code_scale, code_zp, super_scale.
if (stage == 0) {
quant_params_accessor_B_.copy_tiles_and_advance_per_stage<true>(mma_quant_args, stage);
} else {
quant_params_accessor_B_.copy_tiles_and_advance_per_stage<false>(mma_quant_args, stage);
}
// Move to the next write stage
advance_smem_write_stage(iterator_A, iterator_B, tile_dequanter_B);
advance_smem_write_stage(iterator_A, iterator_B);
quant_params_accessor_B_.advance_smem_write_stage(mma_quant_args);
// Defines the boundary of a stage of cp.async.
cutlass::arch::cp_async_fence();
@@ -510,6 +541,10 @@ public:
++last_smem_iterator_A;
}
if (threadIdx.x >= IteratorB::ThreadMap::kThreads) {
return;
}
/// Iterator to write threadblock-scoped tile of B operand to shared memory
SmemIteratorB last_smem_iterator_B(this->smem_iterator_B_);
typename IteratorB::AccessType zero_B;
@@ -542,57 +577,57 @@ public:
}
/// Perform a threadblock mainloop iteration of matrix multiply-accumulate
template <typename TileDequanterB>
CUTLASS_DEVICE
void mac_loop_iter(
PipeState &pipe_state, ///< [in|out] loop-carried pipeline state
FragmentC &accum, ///< [in|out] destination accumulator tile
IteratorA &iterator_A, ///< [in|out] iterator over A operand in global memory
IteratorB &iterator_B, ///< [in|out] iterator over B operand in global memory
TileDequanterB &tile_dequanter_B, ///< [in|out] tile dequantizer for B operand
int &gemm_k_iterations, ///< [in|out] number of threadblock mainloop iterations remaining
QuantArguments &mma_quant_args, ///< iterators for extra quant params for B
int &gemm_k_iterations, ///< [in|out] number of threadblock mainloop iterations remaining
int stage)
{
const int mma_stage = stage - Base::kStages + 1;
// Unroll the warp-level MMA tiles of a threadblock's mainloop iteration
CUTLASS_PRAGMA_UNROLL
for (int warp_mma_k = 0; warp_mma_k < Base::kWarpGemmIterations; ++warp_mma_k) {
// CUTLASS_TRACE_DEVICE(" [MMa] stage=%d, warp_mma_k=%d", stage, warp_mma_k);
int warp_k_compute_offset_B = warp_mma_k % Base::kWarpGemmIterationsPerLoadForB;
if (warp_k_compute_offset_B == Base::kWarpGemmIterationsPerLoadForB - 1) {
// Load the next warp-tile's B fragment from shared memory
this->warp_tile_iterator_B_.set_kgroup_index(((warp_mma_k + 1) % Base::kWarpGemmIterations) / Base::kWarpLoadIterationsForB);
this->warp_tile_iterator_B_.load(pipe_state.warp_loaded_frag_B_);
++this->warp_tile_iterator_B_;
}
// load next-tile of group-wise local_scale from shared memory
if (warp_mma_k == Base::kWarpGemmIterations - 1) {
warp_dequantizer_.load(pipe_state.warp_frag_local_scale_);
}
// Load the next warp-tile's A fragment from shared memory
this->warp_tile_iterator_A_.set_kgroup_index((warp_mma_k + 1) % Base::kWarpGemmIterations);
this->warp_tile_iterator_A_.load(pipe_state.warp_loaded_frag_A_[(warp_mma_k + 1) % 2]);
this->warp_tile_iterator_A_.load(pipe_state.warp_frag_A_[(warp_mma_k + 1) % 2]);
++this->warp_tile_iterator_A_;
if (warp_mma_k + 1 == Base::kWarpGemmIterations) {
// Unpack and dequant the first stage of B.
int unpack_stage = stage - Base::kStages + 2;
tile_dequanter_B.UnpackAndDequant(smem_zipped_ptr_B_ + (unpack_stage % Base::kStages) * smem_zipped_bytes_per_stage_B_,
column_wise_smem_ptr_B_, unpack_stage);
// Copy dequantized data to shared memory used by mma core.
copy_tiles_and_advance_per_stage_B<false, false>(iterator_B);
}
// Load the next warp-tile's B fragment from shared memory
this->warp_tile_iterator_B_.set_kgroup_index((warp_mma_k + 1) % Base::kWarpGemmIterations);
this->warp_tile_iterator_B_.load(pipe_state.warp_loaded_frag_B_[(warp_mma_k + 1) % 2]);
++this->warp_tile_iterator_B_;
// Except for the first warp-tile, all warp-tiles convert their incoming shared memory fragments as necessary
if (warp_mma_k > 0) {
warp_mma_.transform(
pipe_state.warp_transformed_frag_A_[warp_mma_k % 2],
pipe_state.warp_transformed_frag_B_[warp_mma_k % 2],
pipe_state.warp_loaded_frag_A_[warp_mma_k % 2],
pipe_state.warp_loaded_frag_B_[warp_mma_k % 2]);
}
// dequantizes next warp-tile
warp_dequantizer_.dequantize(pipe_state.warp_frag_local_scale_,
pipe_state.warp_frag_code_scale_,
pipe_state.warp_frag_code_zp_,
pipe_state.warp_frag_super_scale_,
pipe_state.warp_loaded_frag_B_,
pipe_state.warp_frag_B_[(warp_mma_k + 1) % 2],
((warp_mma_k == Base::kWarpGemmIterations - 1) ? (mma_stage + 1) : mma_stage) * Shape::kK,
(warp_mma_k + 1) % Base::kWarpGemmIterationsPerLoadForB);
// Execute the current warp-tile of MMA operations
if (Detail::kStagedAccumulation) {
if constexpr (Detail::kStagedAccumulation) {
warp_mma_(
pipe_state.tmp_accum_,
pipe_state.warp_transformed_frag_A_[warp_mma_k % 2],
pipe_state.warp_transformed_frag_B_[warp_mma_k % 2],
pipe_state.warp_frag_A_[warp_mma_k % 2],
pipe_state.warp_frag_B_[warp_mma_k % 2],
pipe_state.tmp_accum_
);
@@ -604,22 +639,22 @@ public:
} else {
warp_mma_(
accum,
pipe_state.warp_transformed_frag_A_[warp_mma_k % 2],
pipe_state.warp_transformed_frag_B_[warp_mma_k % 2],
accum
);
pipe_state.warp_frag_A_[warp_mma_k % 2],
pipe_state.warp_frag_B_[warp_mma_k % 2],
accum);
}
// Except for the last warp-tile, all warp-tiles issue their share of
// global->shared fragment copies
if (warp_mma_k < Base::kWarpGemmIterations - 1) {
int group_start_iteration_A = warp_mma_k * Detail::kAccessesPerGroupA;
int group_start_iteration_B = warp_mma_k * Detail::kAccessesPerGroupB;
copy_tiles_and_advance_A(iterator_A, group_start_iteration_A);
copy_tiles_and_advance_B(iterator_B, group_start_iteration_B);
if (warp_mma_k == 0) {
tile_dequanter_B.Load(smem_zipped_ptr_B_ + (stage % Base::kStages) * smem_zipped_bytes_per_stage_B_,
column_wise_smem_ptr_B_, stage);
quant_params_accessor_B_.copy_tiles_and_advance_per_stage<false>(mma_quant_args, stage);
}
}
@@ -628,9 +663,15 @@ public:
// - moves to the next global fetch stage
if (warp_mma_k + 2 == Base::kWarpGemmIterations) {
// Performs the last warp-tile's share of global->shared fragment copies
int group_start_iteration_A = (warp_mma_k + 1) * Detail::kAccessesPerGroupA;
if constexpr (Detail::AsyncCopyIterationsPerStageA >= Base::kWarpGemmIterations) {
int group_start_iteration_A = (warp_mma_k + 1) * Detail::kAccessesPerGroupA;
copy_tiles_and_advance_A(iterator_A, group_start_iteration_A);
}
copy_tiles_and_advance_A(iterator_A, group_start_iteration_A);
if constexpr (Detail::AsyncCopyIterationsPerStageB >= Base::kWarpGemmIterations) {
int group_start_iteration_B = (warp_mma_k + 1) * Detail::kAccessesPerGroupB;
copy_tiles_and_advance_B(iterator_B, group_start_iteration_B);
}
// Inserts a memory fence between stages of cp.async instructions.
cutlass::arch::cp_async_fence();
@@ -639,69 +680,66 @@ public:
gmem_wait();
// Move to the next global fetch stage
advance_smem_write_stage(iterator_A, iterator_B, tile_dequanter_B);
advance_smem_write_stage(iterator_A, iterator_B);
quant_params_accessor_B_.advance_smem_write_stage(mma_quant_args);
advance_smem_read_stage();
int byte_offset = quant_params_accessor_B_.advance_smem_read_stage();
warp_dequantizer_.add_pointer_offset(byte_offset);
// Disable global fetching when done with global fetch iterations
--gemm_k_iterations;
iterator_A.clear_mask(gemm_k_iterations == 0);
iterator_B.clear_mask(gemm_k_iterations == (-Base::kStages + 1));
}
// The last warp-tile also converts the shared memory fragments used by
// the first warp-tile of the next iteration, if necessary (so we can
// immediately start issuing MMA instructions at the top of the loop )
if (warp_mma_k + 1 == Base::kWarpGemmIterations) {
warp_mma_.transform(
pipe_state.warp_transformed_frag_A_[(warp_mma_k + 1) % 2],
pipe_state.warp_transformed_frag_B_[(warp_mma_k + 1) % 2],
pipe_state.warp_loaded_frag_A_[(warp_mma_k + 1) % 2],
pipe_state.warp_loaded_frag_B_[(warp_mma_k + 1) % 2]);
iterator_B.clear_mask(gemm_k_iterations == 0);
quant_params_accessor_B_.clear_mask(mma_quant_args, gemm_k_iterations == 0);
}
}
}
/// Perform the specified number of threadblock mainloop iterations of matrix
/// multiply-accumulate. Assumes prologue has been initiated.
template <typename TileDequanterB>
CUTLASS_DEVICE
void gemm_iters(
int gemm_k_iterations, ///< number of threadblock mainloop iterations
FragmentC &accum, ///< [in|out] accumulator tile
IteratorA &iterator_A, ///< [in|out] iterator over A operand in global memory
IteratorB &iterator_B,
TileDequanterB &tile_dequanter_B) ///< [in|out] iterator over B operand in global memory
IteratorB &iterator_B, ///< [in|out] iterator over B operand in global memory
QuantArguments &mma_quant_args)
{
PipeState pipe_state;
// Unpack and dequant the first stage of B.
tile_dequanter_B.UnpackAndDequant(smem_zipped_ptr_B_, column_wise_smem_ptr_B_, 0);
// Disable global fetching if done with global fetch iterations
iterator_A.clear_mask(gemm_k_iterations == 0);
iterator_B.clear_mask(gemm_k_iterations == (-Base::kStages + 1));
// Load first warp-tile's A fragment from shared memory
this->warp_tile_iterator_A_.set_kgroup_index(0);
this->warp_tile_iterator_A_.load(pipe_state.warp_loaded_frag_A_[0]);
++this->warp_tile_iterator_A_;
// Copy dequantized data to shared memory used by mma core.
copy_tiles_and_advance_per_stage_B<false, true>(iterator_B);
iterator_B.clear_mask(gemm_k_iterations == 0);
quant_params_accessor_B_.clear_mask(mma_quant_args, gemm_k_iterations == 0);
// Load first warp-tile's B fragment from shared memory
this->warp_tile_iterator_B_.set_kgroup_index(0);
this->warp_tile_iterator_B_.load(pipe_state.warp_loaded_frag_B_[0]);
this->warp_tile_iterator_B_.load(pipe_state.warp_loaded_frag_B_);
++this->warp_tile_iterator_B_;
// Transform, if necessary, the first warp-tile's shared memory fragments
warp_mma_.transform(
pipe_state.warp_transformed_frag_A_[0],
pipe_state.warp_transformed_frag_B_[0],
pipe_state.warp_loaded_frag_A_[0],
pipe_state.warp_loaded_frag_B_[0]);
warp_dequantizer_.load(pipe_state.warp_frag_code_scale_,
pipe_state.warp_frag_code_zp_,
pipe_state.warp_frag_super_scale_);
if (Detail::kStagedAccumulation) {
warp_dequantizer_.load(pipe_state.warp_frag_local_scale_);
// Load first warp-tile's A fragment from shared memory
this->warp_tile_iterator_A_.set_kgroup_index(0);
this->warp_tile_iterator_A_.load(pipe_state.warp_frag_A_[0]);
++this->warp_tile_iterator_A_;
// Dequantize B into registers
warp_dequantizer_.dequantize(pipe_state.warp_frag_local_scale_,
pipe_state.warp_frag_code_scale_,
pipe_state.warp_frag_code_zp_,
pipe_state.warp_frag_super_scale_,
pipe_state.warp_loaded_frag_B_,
pipe_state.warp_frag_B_[0],
0,
0);
if constexpr (Detail::kStagedAccumulation) {
pipe_state.tmp_accum_.clear();
}
@@ -715,13 +753,13 @@ public:
accum,
iterator_A,
iterator_B,
tile_dequanter_B,
mma_quant_args,
gemm_k_iterations,
stage);
stage += 1;
}
if (Detail::kStagedAccumulation) {
if constexpr (Detail::kStagedAccumulation) {
plus<FragmentC> plus_accum;
accum = plus_accum(accum, pipe_state.tmp_accum_);
}
@@ -761,14 +799,12 @@ public:
else
{
this->warp_tile_iterator_A_.add_tile_offset({0, ((Base::kStages - 2) * kStageIters)});
//this->warp_tile_iterator_B_.add_tile_offset({((Base::kStages - 2) * kStageIters), 0});
this->warp_tile_iterator_B_.add_tile_offset({(-2 * kStageIters), 0});
this->warp_tile_iterator_B_.add_tile_offset({((Base::kStages - 2) * kStageIters), 0});
}
smem_read_stage_idx_ = smem_write_stage_idx_;
}
/// Perform a threadblock-scoped matrix multiply-accumulate, pre-load B to shared memory.
template <typename TileDequanterB>
CUTLASS_DEVICE
void operator()(
///< problem size of GEMM
@@ -779,13 +815,13 @@ public:
IteratorA iterator_A,
///< iterator over B operand in global memory
IteratorB iterator_B,
///< pre-load and dequantize B to shared memory
TileDequanterB tile_dequanter_B,
///< iterators for extra quant params for B
QuantArguments mma_quant_args,
///< initial value of accumulator
FragmentC const &src_accum) {
// Prologue (start fetching iterations of global fragments into shared memory)
prologue(iterator_A, iterator_B, tile_dequanter_B, gemm_k_iterations);
prologue(iterator_A, iterator_B, mma_quant_args, gemm_k_iterations);
// Wait until we have at least one completed global fetch stage
gmem_wait();
@@ -794,7 +830,7 @@ public:
accum = src_accum;
// Perform the MAC-iterations
gemm_iters(gemm_k_iterations, accum, iterator_A, iterator_B, tile_dequanter_B);
gemm_iters(gemm_k_iterations, accum, iterator_A, iterator_B, mma_quant_args);
}
};
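
The reworked `mac_loop_iter` above no longer loads a B fragment on every warp-level step: one packed fragment is fetched every `Base::kWarpGemmIterationsPerLoadForB` steps and dequantized piecewise in registers. The following host-side sketch models only that cadence. `LoadSchedule` and both helper functions are hypothetical names introduced for illustration, and the values used with them are assumptions rather than a real kernel configuration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for Base::kWarpGemmIterations and
// Base::kWarpGemmIterationsPerLoadForB; the real values come from Wint2xMmaBase.
struct LoadSchedule {
    int warp_gemm_iterations;
    int iterations_per_load_for_b;
};

// For each warp_mma_k, report whether the mainloop would fetch a new packed
// B fragment from shared memory (same condition as in mac_loop_iter).
std::vector<bool> b_load_steps(const LoadSchedule &s) {
    std::vector<bool> loads(s.warp_gemm_iterations, false);
    for (int k = 0; k < s.warp_gemm_iterations; ++k) {
        int offset = k % s.iterations_per_load_for_b;
        loads[k] = (offset == s.iterations_per_load_for_b - 1);
    }
    return loads;
}

// Sub-tile index passed to the dequantizer for the next warp tile,
// i.e. the last argument of warp_dequantizer_.dequantize(...).
int next_dequant_offset(const LoadSchedule &s, int warp_mma_k) {
    return (warp_mma_k + 1) % s.iterations_per_load_for_b;
}
```

With an assumed schedule of 4 warp iterations and 2 iterations per load, a new B fragment is fetched at `warp_mma_k` 1 and 3, which is exactly when the next dequantization offset wraps back to 0.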


@@ -0,0 +1,315 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include "cutlass/arch/memory_sm80.h"
#include "cutlass/cutlass.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/matrix_shape.h"
#include "cutlass/trace.h"
namespace cutlass {
namespace gemm {
namespace threadblock {
template <
/// Original data type
typename T,
/// Size of the Gemm problem - concept: gemm::GemmShape<>
typename Shape_,
/// Iterators over super scales in global memory
typename IteratorSuperScale_,
/// Iterators over super scales in shared memory
typename SmemIteratorSuperScale_,
/// Iterators over local scales in global memory
typename IteratorLocalScale_,
/// Iterators over local scales in shared memory
typename SmemIteratorLocalScale_,
/// Iterators over code scales and zps in global memory
typename IteratorCodeScaleZp_,
/// Iterators over code scales and zps in shared memory
typename SmemIteratorCodeScaleZp_,
/// Number of stages,
int Stages_,
/// Group size for quantization
int GroupSize_>
class Wint2ParamsAccessor {
public:
static_assert(platform::is_same<T, half_t>::value || platform::is_same<T, bfloat16_t>::value,
"T must be fp16 or bf16");
using ElementType = T;
using Shape = Shape_;
using IteratorSuperScale = IteratorSuperScale_;
using SmemIteratorSuperScale = SmemIteratorSuperScale_;
using IteratorLocalScale = IteratorLocalScale_;
using SmemIteratorLocalScale = SmemIteratorLocalScale_;
using IteratorCodeScaleZp = IteratorCodeScaleZp_;
using SmemIteratorCodeScaleZp = SmemIteratorCodeScaleZp_;
constexpr static int kStages = Stages_;
constexpr static int kGroupSize = GroupSize_;
using ElementSuperScale = typename IteratorSuperScale::Element;
using LayoutSuperScale = typename IteratorSuperScale::Layout;
/// local_scale is stored as uint4 and applied group-wise
using ElementLocalScale = typename IteratorLocalScale::Element;
using LayoutLocalScale = typename IteratorLocalScale::Layout;
static_assert(platform::is_same<ElementLocalScale, uint4b_t>::value,
"local_scale's type must be uint4b_t.");
using ElementCodeScaleZp = typename IteratorCodeScaleZp::Element;
using LayoutCodeScaleZp = typename IteratorCodeScaleZp::Layout;
/// 2 uint4b_t values are stored in a single uint8_t
constexpr static int kStagesPerLocalScaleLoad = 2 * kGroupSize / Shape::kK;
constexpr static int kLocalScaleRows =
IteratorLocalScale::Shape::kRow * IteratorLocalScale::Shape::kColumn * sizeof_bits<ElementLocalScale>::value / 8 / Shape::kN;
using SmemElement = uint8_t;
constexpr static int kSmemRows =
kLocalScaleRows * kStages + sizeof(ElementSuperScale) + sizeof(ElementCodeScaleZp) * 2;
constexpr static int kSmemColumns = Shape::kN;
using QuantParamsShape = MatrixShape<kSmemRows, kSmemColumns>;
constexpr static int kSuperScaleSmemOffset = 0;
constexpr static int kCodeScaleSmemOffset = kSmemColumns * sizeof(ElementSuperScale);
constexpr static int kCodeZpSmemOffset = kCodeScaleSmemOffset + kSmemColumns * sizeof(ElementCodeScaleZp);
constexpr static int kLocalScaleSmemOffset = kCodeZpSmemOffset + kSmemColumns * sizeof(ElementCodeScaleZp);
/// TensorRef type for loading element from a tensor
using SuperTensorRef = cutlass::TensorRef<ElementSuperScale, LayoutSuperScale>;
using LocalTensorRef = cutlass::TensorRef<ElementLocalScale, LayoutLocalScale>;
using CodeTensorRef = cutlass::TensorRef<ElementCodeScaleZp, LayoutCodeScaleZp>;
struct Arguments {
IteratorSuperScale iterator_super_scale;
IteratorLocalScale iterator_local_scale;
IteratorCodeScaleZp iterator_code_scale;
IteratorCodeScaleZp iterator_code_zp;
int local_scale_pointer_offset;
CUTLASS_DEVICE
Arguments(IteratorSuperScale iterator_super_scale,
IteratorLocalScale iterator_local_scale,
IteratorCodeScaleZp iterator_code_scale,
IteratorCodeScaleZp iterator_code_zp,
int local_scale_pointer_offset)
: iterator_super_scale(iterator_super_scale),
iterator_local_scale(iterator_local_scale),
iterator_code_scale(iterator_code_scale),
iterator_code_zp(iterator_code_zp),
local_scale_pointer_offset(local_scale_pointer_offset) {}
};
private:
//
// Data members
//
/// Begin address of shared memory
uint8_t* smem_pointer_;
/// Iterator to write threadblock-scoped tile of super scale operand to shared memory
SmemIteratorSuperScale smem_iterator_super_scale_;
/// Iterator to write threadblock-scoped tile of local scale operand to shared memory
SmemIteratorLocalScale smem_iterator_local_scale_;
/// Iterator to write threadblock-scoped tile of code scale operand to shared memory
SmemIteratorCodeScaleZp smem_iterator_code_scale_;
/// Iterator to write threadblock-scoped tile of code zp operand to shared memory
SmemIteratorCodeScaleZp smem_iterator_code_zp_;
/// Shared memory write stage index
int smem_write_stage_idx_;
/// Shared memory read stage index
int smem_read_stage_idx_;
CUTLASS_DEVICE
ElementSuperScale* get_super_scale_smem_ptr() {
return reinterpret_cast<ElementSuperScale*>(smem_pointer_ + kSuperScaleSmemOffset);
}
CUTLASS_DEVICE
ElementLocalScale* get_local_scale_smem_ptr() {
return reinterpret_cast<ElementLocalScale*>(smem_pointer_ + kLocalScaleSmemOffset);
}
CUTLASS_DEVICE
ElementCodeScaleZp* get_code_scale_smem_ptr() {
return reinterpret_cast<ElementCodeScaleZp*>(smem_pointer_ + kCodeScaleSmemOffset);
}
CUTLASS_DEVICE
ElementCodeScaleZp* get_code_zp_smem_ptr() {
return reinterpret_cast<ElementCodeScaleZp*>(smem_pointer_ + kCodeZpSmemOffset);
}
public:
/// Construct from tensor references
CUTLASS_DEVICE
Wint2ParamsAccessor(
///< pointer to shared memory
uint8_t* smem_pointer,
///< ID within the threadblock
int thread_idx,
///< ID of warp
int warp_idx,
///< ID of each thread within a warp
int lane_idx)
: smem_pointer_(smem_pointer),
smem_iterator_super_scale_(LayoutSuperScale(IteratorSuperScale::Shape::kColumn),
get_super_scale_smem_ptr(), {1, IteratorSuperScale::Shape::kColumn}, thread_idx),
smem_iterator_local_scale_(LayoutLocalScale(IteratorLocalScale::Shape::kColumn),
get_local_scale_smem_ptr(), {1, IteratorLocalScale::Shape::kColumn}, thread_idx),
smem_iterator_code_scale_(LayoutCodeScaleZp(IteratorCodeScaleZp::Shape::kColumn),
get_code_scale_smem_ptr(), {1, IteratorCodeScaleZp::Shape::kColumn}, thread_idx),
smem_iterator_code_zp_(LayoutCodeScaleZp(IteratorCodeScaleZp::Shape::kColumn),
get_code_zp_smem_ptr(), {1, IteratorCodeScaleZp::Shape::kColumn}, thread_idx),
smem_write_stage_idx_(0),
smem_read_stage_idx_(0) {}
CUTLASS_DEVICE
SuperTensorRef super_scale_ref() {
return {get_super_scale_smem_ptr(), LayoutSuperScale(IteratorSuperScale::Shape::kColumn)};
}
CUTLASS_DEVICE
LocalTensorRef local_scale_ref() {
return {get_local_scale_smem_ptr(), LayoutLocalScale(IteratorLocalScale::Shape::kColumn)};
}
CUTLASS_DEVICE
CodeTensorRef code_scale_ref() {
return {get_code_scale_smem_ptr(), LayoutCodeScaleZp(IteratorCodeScaleZp::Shape::kColumn)};
}
CUTLASS_DEVICE
CodeTensorRef code_zp_ref() {
return {get_code_zp_smem_ptr(), LayoutCodeScaleZp(IteratorCodeScaleZp::Shape::kColumn)};
}
template <bool IsFirstStage>
CUTLASS_DEVICE
void copy_tiles_and_advance_per_stage(Arguments &quant_args, int stage) {
if constexpr (IsFirstStage) {
// Load channel-wise super_scale to shared memory, which only needs to be done once.
typename IteratorSuperScale::Fragment tb_frag_super_scale;
tb_frag_super_scale.clear();
quant_args.iterator_super_scale.load(tb_frag_super_scale);
this->smem_iterator_super_scale_.store(tb_frag_super_scale);
// Load channel-wise code_scale to shared memory, which only needs to be done once.
typename IteratorCodeScaleZp::Fragment tb_frag_code_scale;
tb_frag_code_scale.clear();
quant_args.iterator_code_scale.load(tb_frag_code_scale);
this->smem_iterator_code_scale_.store(tb_frag_code_scale);
// Load channel-wise code_zp to shared memory, which only needs to be done once.
typename IteratorCodeScaleZp::Fragment tb_frag_code_zp;
tb_frag_code_zp.clear();
quant_args.iterator_code_zp.load(tb_frag_code_zp);
this->smem_iterator_code_zp_.store(tb_frag_code_zp);
}
if ((stage % kStagesPerLocalScaleLoad) == 0) {
// Load group-wise local_scale to shared memory once every kStagesPerLocalScaleLoad stages.
// Two uint4b_t local_scale values are packed into one uint8_t, so a single load covers 2 * kGroupSize / Shape::kK stages.
using AccessType = typename IteratorLocalScale::AccessType;
cutlass::arch::CacheOperation::Kind const kCacheOp = (sizeof_bits<AccessType>::value == 128)
? cutlass::arch::CacheOperation::Global : cutlass::arch::CacheOperation::Always;
quant_args.iterator_local_scale.set_iteration_index(0);
this->smem_iterator_local_scale_.set_iteration_index(0);
// Async Copy for local_scale
CUTLASS_PRAGMA_UNROLL
for (int j = 0; j < IteratorLocalScale::ThreadMap::Iterations::kCount; ++j) {
AccessType *dst_ptr =
reinterpret_cast<AccessType *>(this->smem_iterator_local_scale_.get());
CUTLASS_PRAGMA_UNROLL
for (int v = 0; v < IteratorLocalScale::kAccessesPerVector; ++v) {
auto gmem_ptr = quant_args.iterator_local_scale.get();
int const kSrcBytes =
sizeof_bits<typename IteratorLocalScale::Element>::value *
IteratorLocalScale::ThreadMap::kElementsPerAccess /
IteratorLocalScale::kAccessesPerVector / 8;
cutlass::arch::cp_async<kSrcBytes, kCacheOp>(
dst_ptr + v, gmem_ptr, quant_args.iterator_local_scale.valid());
}
++quant_args.iterator_local_scale;
}
++this->smem_iterator_local_scale_;
}
}
CUTLASS_DEVICE
void advance_smem_write_stage(Arguments &quant_args) {
if (smem_write_stage_idx_ % kStagesPerLocalScaleLoad == 0) {
// Advance global iterators
quant_args.iterator_local_scale.add_pointer_offset(quant_args.local_scale_pointer_offset);
// Advance shared iterators
int smem_pointer_offset = IteratorLocalScale::Shape::kRow * IteratorLocalScale::Shape::kColumn;
smem_iterator_local_scale_.add_pointer_offset(smem_pointer_offset);
}
// Increment shared memory write stage index
++smem_write_stage_idx_;
if (smem_write_stage_idx_ == kStagesPerLocalScaleLoad * kStages) {
// Wrap back around to the 'start' of the circular buffer in shared memory
int pointer_offset = - kStages * IteratorLocalScale::Shape::kRow * IteratorLocalScale::Shape::kColumn;
smem_iterator_local_scale_.add_pointer_offset(pointer_offset);
smem_write_stage_idx_ = 0;
}
}
CUTLASS_DEVICE
int advance_smem_read_stage() {
int byte_offset = 0;
++smem_read_stage_idx_;
if (smem_read_stage_idx_ % kStagesPerLocalScaleLoad == 0) {
byte_offset = kLocalScaleRows * kSmemColumns;
}
if (smem_read_stage_idx_ == kStagesPerLocalScaleLoad * kStages) {
smem_read_stage_idx_ = 0;
byte_offset = - (kStages - 1) * kLocalScaleRows * kSmemColumns;
}
return byte_offset;
}
CUTLASS_DEVICE
void clear_mask(Arguments &quant_args, bool cond) {
quant_args.iterator_local_scale.clear_mask(cond);
}
};
} // namespace threadblock
} // namespace gemm
} // namespace cutlass
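
Reviewer note: the wrap-around in advance_smem_read_stage() is easier to check on the host. The sketch below is a plain C++ model of that function, not the device code. ReadStageModel and net_offset_over_cycle are hypothetical names, and tile_elems stands in for kLocalScaleRows * kSmemColumns. It shows that the offsets returned over one full cycle of kStagesPerLocalScaleLoad * kStages calls sum to zero, so the read pointer ends up back at the first tile.

```cpp
#include <cassert>

// Host-side model of advance_smem_read_stage(); all names are illustrative.
// stages_per_load mirrors kStagesPerLocalScaleLoad, stages mirrors kStages,
// tile_elems stands in for kLocalScaleRows * kSmemColumns.
struct ReadStageModel {
  int stages_per_load;
  int stages;
  int tile_elems;
  int read_stage_idx;

  ReadStageModel(int stages_per_load_, int stages_, int tile_elems_)
      : stages_per_load(stages_per_load_),
        stages(stages_),
        tile_elems(tile_elems_),
        read_stage_idx(0) {}

  // Returns the offset to add to the shared-memory read pointer.
  int advance() {
    int offset = 0;
    ++read_stage_idx;
    if (read_stage_idx % stages_per_load == 0) {
      offset = tile_elems;  // step to the next local_scale tile
    }
    if (read_stage_idx == stages_per_load * stages) {
      read_stage_idx = 0;
      offset = -(stages - 1) * tile_elems;  // wrap back to the first tile
    }
    return offset;
  }
};

// Sums the offsets returned over one full cycle of the circular buffer.
inline int net_offset_over_cycle(ReadStageModel model) {
  assert(model.read_stage_idx == 0);
  int total = 0;
  for (int i = 0; i < model.stages_per_load * model.stages; ++i) {
    total += model.advance();
  }
  return total;
}
```

With stages_per_load = 2 and stages = 4, the pointer moves forward by one tile on every second call and jumps back by three tiles on the eighth call.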


@@ -1,130 +0,0 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include "cutlass/gemm_coord.h"
#include "cutlass/trace.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_unzip.h"
namespace cutlass {
namespace gemm {
namespace threadblock {
template <typename ElementT, typename ScaleElementT, int Rows, int Columns,
int Stages, int NumThreads, WintQuantMethod Method>
struct TileDequanter {
using WeightQuantTraits = WintQuantTraits<ElementT, Method>;
using MmaElementT = typename WeightQuantTraits::MmaWeightType;
using QuantArguments = typename WeightQuantTraits::Arguments;
using UnzipAndDequantFunctor =
UnzipAndDequantFunctor<MmaElementT, Method, Rows, Columns, NumThreads>;
static constexpr bool kUseSharedMemory = true;
static constexpr int kRows = Rows;
static constexpr int kColumns = Columns;
static constexpr int kStages = Stages;
MmaElementT *out_smem_ptr{nullptr};
char *pointer{nullptr};
int64_t ldm{0};
cutlass::MatrixCoord tb_offset;
cutlass::MatrixCoord extent;
ScaleElementT *super_scale_ptr{nullptr};
cutlass::MatrixCoord tb_offset_scale;
QuantArguments quant_args;
int64_t block_start_rows[kStages];
bool need_preload{true};
UnzipAndDequantFunctor unzip_functor;
CUTLASS_DEVICE
TileDequanter(MmaElementT *out_smem_ptr, char *pointer, int64_t ldm,
const cutlass::MatrixCoord &extent,
const cutlass::MatrixCoord &tb_offset,
ScaleElementT *super_scale_ptr,
const cutlass::MatrixCoord &tb_offset_scale,
const QuantArguments &quant_args)
: out_smem_ptr(out_smem_ptr), pointer(pointer), ldm(ldm), extent(extent),
tb_offset(tb_offset), super_scale_ptr(super_scale_ptr),
tb_offset_scale(tb_offset_scale), quant_args(quant_args) {}
CUTLASS_DEVICE
MmaElementT *GetOutPtr() { return out_smem_ptr; }
CUTLASS_DEVICE
void AddTileOffset(const cutlass::MatrixCoord &tile_offset) {
tb_offset.row() += tile_offset.row() * kRows;
tb_offset.column() += tile_offset.column() * kColumns;
tb_offset_scale.column() += tile_offset.column() * kColumns;
}
CUTLASS_DEVICE
void Load(uint8_t *zipped_smem_ptr, uint8_t *column_wise_smem_ptr, int stage) {
int zipped_row = WeightQuantTraits::CaclPackedDim(tb_offset.row());
if (tb_offset.row() >= extent.row() ||
tb_offset.column() >= extent.column()) {
return;
}
block_start_rows[stage % kStages] = tb_offset.row();
using ZippedT = typename WeightQuantTraits::WeightType;
ZippedT *in_ptr = reinterpret_cast<ZippedT *>(pointer) + zipped_row * ldm +
tb_offset.column();
ScaleElementT *scale_ptr = super_scale_ptr + tb_offset_scale.column();
if constexpr (Method == WintQuantMethod::kWeightOnlyInt2) {
const uint8_t *local_scale_ptr = quant_args.local_scale_ptr +
(tb_offset.row() / 128) * ldm +
tb_offset_scale.column();
const float *code_scale_ptr =
quant_args.code_scale_ptr + tb_offset_scale.column();
const float *code_zp_ptr =
quant_args.code_zp_ptr + tb_offset_scale.column();
typename UnzipAndDequantFunctor::Arguments args(zipped_smem_ptr, column_wise_smem_ptr);
unzip_functor.LoadAsync(in_ptr, local_scale_ptr, code_scale_ptr, code_zp_ptr,
scale_ptr, &args, ldm, need_preload);
need_preload = false;
} else {
// CUTLASS_TRACE_DEVICE("Not Supported!");
}
}
CUTLASS_DEVICE
void UnpackAndDequant(uint8_t *zipped_smem_ptr, uint8_t *column_wise_smem_ptr, int stage) {
int64_t block_start_row = block_start_rows[stage % kStages];
if (block_start_row >= extent.row()) {
return;
}
if constexpr (Method == WintQuantMethod::kWeightOnlyInt2) {
typename UnzipAndDequantFunctor::Arguments args(zipped_smem_ptr, column_wise_smem_ptr);
unzip_functor.ComputeVectorized(args, out_smem_ptr, block_start_row);
} else {
// CUTLASS_TRACE_DEVICE("Not Supported!");
}
}
};
} // namespace threadblock
} // namespace gemm
} // namespace cutlass


@@ -41,12 +41,9 @@
#include "cutlass_extensions/arch/mma.h"
#include "cutlass_extensions/gemm/warp/mma_tensorop_compute_B_with_f16.h"
namespace cutlass
{
namespace gemm
{
namespace warp
{
namespace cutlass {
namespace gemm {
namespace warp {
/////////////////////////////////////////////////////////////////////////////////////////////////
@@ -81,7 +78,7 @@ private:
// Shape for computing the FP16s
using ComputeInstructionShape = InstructionShape_;
// Chosen so we get K=16 for int8 and K=32 for int4.
// Chosen so we get K=16 for int8, K=32 for int4, K=64 for int2.
static constexpr int LoadInstructionK = 128 / sizeof_bits<ElementB>::value;
// Shape for loading the narrow data type from shared memory
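
Reviewer note: the K values quoted in the updated comment follow directly from the expression 128 / sizeof_bits<ElementB>::value. A minimal host-side check, where load_instruction_k is a hypothetical helper that takes the element width in bits:

```cpp
#include <cassert>

// Mirrors LoadInstructionK = 128 / sizeof_bits<ElementB>::value.
// element_bits is the width of the quantized B element in bits.
constexpr int load_instruction_k(int element_bits) {
  return 128 / element_bits;
}

// Compile-time checks for the three cases named in the comment.
static_assert(load_instruction_k(8) == 16, "int8 gives K=16");
static_assert(load_instruction_k(4) == 32, "int4 gives K=32");
static_assert(load_instruction_k(2) == 64, "int2 gives K=64");
```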


@@ -58,15 +58,12 @@
/////////////////////////////////////////////////////////////////////////////////////////////////
namespace cutlass
{
namespace gemm
{
namespace warp
{
namespace cutlass {
namespace gemm {
namespace warp {
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Structure to compute the matrix product targeting CUDA cores and SIMT math instructions.
/// Structure to compute the matrix product targeting Tensor Cores, for the case when A is floating point and B is quantized integer.
template <
/// Size of the Gemm problem - concept: gemm::GemmShape<>
typename Shape_,
@@ -297,6 +294,235 @@ public:
}
};
/////////////////////////////////////////////////////////////////////////////////////////////////
/// Structure to compute the matrix product targeting Tensor Cores, for the case when A is floating point and B is quantized integer.
/// Specialization for B of uint2b_t.
template <
/// Size of the Gemm problem - concept: gemm::GemmShape<>
typename Shape_,
/// Data type of A elements
typename ElementA_,
/// Layout of A matrix (concept: MatrixLayout)
typename LayoutA_,
/// Layout of B matrix (concept: MatrixLayout)
typename LayoutB_,
/// Element type of C matrix
typename ElementC_,
/// Layout of C matrix (concept: MatrixLayout)
typename LayoutC_,
/// Policy describing warp-level MmaTensorOp (concept: MmaTensorOp policy)
typename Policy_,
/// Instruction shape to override shared memory iterators with
typename SharedMemoryInstructionShape_,
/// Number of partitions along K dimension
int PartitionsK_,
/// Store the accumulators in row major or column major. Row major is used
/// when output layout is interleaved.
bool AccumulatorsInRowMajor>
class MmaTensorOpComputeBWithF16<
Shape_,
ElementA_,
LayoutA_,
uint2b_t,
LayoutB_,
ElementC_,
LayoutC_,
Policy_,
SharedMemoryInstructionShape_,
PartitionsK_,
AccumulatorsInRowMajor>
{
public:
/// Shape of warp-level matrix operation (concept: GemmShape)
using Shape = Shape_;
/// Data type of multiplicand A
using ElementA = ElementA_;
/// Layout of multiplicand A
using LayoutA = LayoutA_;
/// Data type of multiplicand B
using ElementB = uint2b_t;
/// Layout of multiplicand B
using LayoutB = LayoutB_;
/// Data type of accumulator matrix C
using ElementC = ElementC_;
/// Layout of accumulator matrix C
using LayoutC = LayoutC_;
/// Policy describing warp-level MmaTensorOp (concept: MmaTensorOp policy)
using Policy = Policy_;
/// Underlying matrix multiply operator (concept: arch::Mma)
using ArchMmaOperator = typename Policy::Operator;
/// Indicates math operator
using MathOperator = typename ArchMmaOperator::Operator;
/// Architecture tag from underlying instruction
using ArchTag = typename ArchMmaOperator::ArchTag;
static_assert((platform::is_same<typename ArchMmaOperator::ElementA, half_t>::value
&& platform::is_same<typename ArchMmaOperator::ElementB, half_t>::value)
|| (platform::is_same<typename ArchMmaOperator::ElementA, bfloat16_t>::value
&& platform::is_same<typename ArchMmaOperator::ElementB, bfloat16_t>::value
&& ArchTag::kMinComputeCapability >= 80),
"MmaTensorOpCvtBToA only supports underlying HMMA/QMMA");
static_assert(platform::is_same<ElementA, half_t>::value
|| (platform::is_same<ElementA, bfloat16_t>::value && ArchTag::kMinComputeCapability >= 80),
"MmaTensorOpCvtBToA only supports Fp16 A or Bf16 A on Ampere+");
/// Indicates class of matrix operator
using OperatorClass = arch::OpClassTensorOp;
/// Shape of underlying instruction
using InstructionShape = typename ArchMmaOperator::Shape;
/// Instruction shape to override shared memory iterators with
using SharedMemoryInstructionShape = SharedMemoryInstructionShape_;
static_assert(
SharedMemoryInstructionShape::kM == InstructionShape::kM, "M dimension of compute instruction must match load");
static_assert(
SharedMemoryInstructionShape::kN == InstructionShape::kN, "N dimension of compute instruction must match load");
static constexpr int kExpansionFactor = SharedMemoryInstructionShape::kK / InstructionShape::kK;
static_assert(!(Shape::kK % SharedMemoryInstructionShape::kK), "");
/// Complex transform on A operand
static ComplexTransform const kTransformA = ComplexTransform::kNone;
/// Complex transform on B operand
static ComplexTransform const kTransformB = ComplexTransform::kNone;
/// Number of threads participating in warp-level matrix product
static int const kThreadCount = 32;
/// Number of partitions along K dimension
static int const kPartitionsK = PartitionsK_;
public:
/// Iterates over the A operand in memory
using IteratorA
= MmaTensorOpMultiplicandTileIterator<MatrixShape<Shape::kM, Shape::kK>, Operand::kA, ElementA, LayoutA,
MatrixShape<InstructionShape::kM, InstructionShape::kK>, Policy::OpDelta::kRow, kThreadCount, kPartitionsK>;
/// Storage for A tile
using FragmentA = typename IteratorA::Fragment;
/// Storage for transformed A tile
using TransformedFragmentA = Array<typename ArchMmaOperator::ElementA, FragmentA::kElements>;
/// Iterates over the B operand in memory
using IteratorB = MmaTensorOpMultiplicandTileIterator<MatrixShape<Shape::kK, Shape::kN>, Operand::kB, ElementB,
LayoutB, MatrixShape<SharedMemoryInstructionShape::kK, InstructionShape::kN>, Policy::OpDelta::kRow,
kThreadCount, kPartitionsK>;
/// Storage for B tile
using FragmentB = typename IteratorB::Fragment;
/// Storage for transformed B tile
using TransformedFragmentB =
Array<typename ArchMmaOperator::ElementB, FragmentB::kElements / kExpansionFactor>;
/// Iterates over the C operand in memory
using IteratorC = MmaTensorOpAccumulatorTileIterator<MatrixShape<Shape::kM, Shape::kN>, ElementC, LayoutC,
typename ArchMmaOperator::Shape, typename Policy::OpDelta>;
/// Storage for C tile
using FragmentC = typename IteratorC::Fragment;
/// Number of mma operations performed
using MmaIterations = MatrixShape<(Shape::kM + ArchMmaOperator::Shape::kM - 1) / ArchMmaOperator::Shape::kM,
(Shape::kN + ArchMmaOperator::Shape::kN - 1) / ArchMmaOperator::Shape::kN>;
public:
/// Underlying matrix multiply operator (concept: arch::Mma)
ArchMmaOperator mma;
public:
//
// Methods
//
/// Ctor
CUTLASS_DEVICE
MmaTensorOpComputeBWithF16() {}
/// Performs a warp-level matrix multiply-accumulate operation
CUTLASS_DEVICE
void operator()(FragmentC& D, TransformedFragmentA const& A, TransformedFragmentB const& B, FragmentC const& C) const
{
using MmaOperandA = typename ArchMmaOperator::FragmentA;
using MmaOperandB = typename ArchMmaOperator::FragmentB;
using MmaOperandC = typename ArchMmaOperator::FragmentC;
D = C;
MmaOperandA const* ptr_A = reinterpret_cast<MmaOperandA const*>(&A);
MmaOperandB const* ptr_B = reinterpret_cast<MmaOperandB const*>(&B);
MmaOperandC* ptr_D = reinterpret_cast<MmaOperandC*>(&D);
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)
// Serpentine visitation order maximizing reuse of Rb
CUTLASS_PRAGMA_UNROLL
for (int n = 0; n < MmaIterations::kColumn; ++n)
{
CUTLASS_PRAGMA_UNROLL
for (int m = 0; m < MmaIterations::kRow; ++m)
{
int m_serpentine = ((n % 2) ? (MmaIterations::kRow - 1 - m) : m);
if (AccumulatorsInRowMajor)
{ // matrix B is reordered
mma(ptr_D[n + m_serpentine * MmaIterations::kColumn], ptr_A[m_serpentine], ptr_B[n],
ptr_D[n + m_serpentine * MmaIterations::kColumn]);
}
else
{
mma(ptr_D[m_serpentine + n * MmaIterations::kRow], ptr_A[m_serpentine], ptr_B[n],
ptr_D[m_serpentine + n * MmaIterations::kRow]);
}
}
}
#elif defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
// Serpentine visitation order maximizing reuse of Ra
CUTLASS_PRAGMA_UNROLL
for (int m = 0; m < MmaIterations::kRow; ++m)
{
CUTLASS_PRAGMA_UNROLL
for (int n = 0; n < MmaIterations::kColumn; ++n)
{
int n_serpentine = ((m % 2) ? (MmaIterations::kColumn - 1 - n) : n);
if (AccumulatorsInRowMajor)
{ // matrix B is reordered
mma(ptr_D[n_serpentine + m * MmaIterations::kColumn], ptr_A[m], ptr_B[n_serpentine],
ptr_D[n_serpentine + m * MmaIterations::kColumn]);
}
else
{
mma(ptr_D[m + n_serpentine * MmaIterations::kRow], ptr_A[m], ptr_B[n_serpentine],
ptr_D[m + n_serpentine * MmaIterations::kRow]);
}
}
}
#else
assert(0);
#endif
}
};
/////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace warp
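
Reviewer note: the serpentine visitation order in operator() can be sketched on the host. The helper below, serpentine_order, is a hypothetical name. It models only the __CUDA_ARCH__ >= 800 branch, where rows are walked in order and the column direction flips on odd rows so that consecutive mma calls share the same A fragment. It does not model the accumulator indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the (m, n) pairs in the order the >= SM80 branch visits them.
// rows and cols stand in for MmaIterations::kRow and MmaIterations::kColumn.
inline std::vector<std::pair<int, int>> serpentine_order(int rows, int cols) {
  assert(rows > 0 && cols > 0);
  std::vector<std::pair<int, int>> order;
  order.reserve(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
  for (int m = 0; m < rows; ++m) {
    for (int n = 0; n < cols; ++n) {
      // Same expression as n_serpentine in the kernel.
      int n_serpentine = (m % 2) ? (cols - 1 - n) : n;
      order.emplace_back(m, n_serpentine);
    }
  }
  return order;
}
```

For a 2 x 3 iteration space this visits row 0 left to right and row 1 right to left, so the column index does not change when moving from one row to the next.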


@@ -0,0 +1,442 @@
/***************************************************************************************************
* Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights
*reserved. SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice,
*this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
*ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
*LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
*CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
*SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
*INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
*CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
*ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
*POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines iterators used by warp-level matrix multiply operations
targeting Tensor Cores.
*/
#pragma once
#include "cutlass/cutlass.h"
#include "cutlass/array.h"
#include "cutlass/matrix_shape.h"
#include "cutlass/numeric_types.h"
#include "cutlass/tensor_ref.h"
#include "cutlass/arch/arch.h"
#include "cutlass/arch/memory_sm75.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/layout/pitch_linear.h"
#include "cutlass/layout/tensor.h"
#include "cutlass/functional.h"
#include "cutlass/platform/platform.h"
#include "cutlass_extensions/interleaved_numeric_conversion.h"
namespace cutlass {
namespace gemm {
namespace warp {
namespace detail {
template <typename T>
struct DataTypeTraits;
template <>
struct DataTypeTraits<bfloat16_t> {
using Type = __nv_bfloat16;
using DualType = __nv_bfloat162;
};
template <>
struct DataTypeTraits<half_t> {
using Type = __half;
using DualType = __half2;
};
template <typename T, int N, typename Enable = void>
struct LocalScaleConverter {
using FragmentSource = Array<uint8_t, N>;
using FragmentResult = Array<T, N>;
CUTLASS_DEVICE
static void Apply(FragmentSource const& local_scale_frag,
FragmentResult const& super_scale_frag,
FragmentResult& scale_frag,
int shift_bit) {
constexpr uint32_t kLocalScaleMask = 0xf;
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < N; ++i) {
int32_t shifted_value = (static_cast<int32_t>(local_scale_frag[i]) >> shift_bit) & kLocalScaleMask;
scale_frag[i] = static_cast<T>(shifted_value) * super_scale_frag[i];
}
}
};
template <int N>
struct LocalScaleConverter<half_t, N, typename platform::enable_if<N % 4 == 0>::type> {
using FragmentSource = Array<uint8_t, N>;
using FragmentResult = Array<half_t, N>;
CUTLASS_DEVICE
static void Apply(FragmentSource const& local_scale_frag,
FragmentResult const& super_scale_frag,
FragmentResult& scale_frag,
int shift_bit) {
constexpr uint32_t immLut = (0xf0 & 0xcc) | 0xaa;
constexpr uint32_t MASK = 0x000f000f;
// 2^10 = 1024
constexpr uint32_t I4s_TO_FP16s_MAGIC_NUM = 0x64006400;
// -2^10 = -1024
constexpr uint32_t FP16_BIAS = 0xE400E400;
// 1.0
constexpr uint32_t FP16_ONE = 0x3C003C00;
__half2* scale_ptr = reinterpret_cast<__half2 *>(&scale_frag);
__half2 const* super_scale_ptr = reinterpret_cast<__half2 const*>(&super_scale_frag);
uint32_t const* local_scale_ptr = reinterpret_cast<uint32_t const*>(&local_scale_frag);
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < N / 4; ++i) {
int i4s = local_scale_ptr[i] >> shift_bit;
// unpack: 0, 1
int32_t low = __byte_perm(i4s, i4s, 0xF1F0);
int32_t unpack0 = lop3<immLut>(low, MASK, I4s_TO_FP16s_MAGIC_NUM);
// unpack: 2, 3
int32_t high = __byte_perm(i4s, i4s, 0xF3F2);
int32_t unpack1 = lop3<immLut>(high, MASK, I4s_TO_FP16s_MAGIC_NUM);
__half2 scale0 = __hfma2(*reinterpret_cast<__half2*>(&unpack0),
*reinterpret_cast<const __half2*>(&FP16_ONE),
*reinterpret_cast<const __half2*>(&FP16_BIAS));
__half2 scale1 = __hfma2(*reinterpret_cast<__half2*>(&unpack1),
*reinterpret_cast<const __half2*>(&FP16_ONE),
*reinterpret_cast<const __half2*>(&FP16_BIAS));
scale_ptr[2 * i] = __hmul2(scale0, super_scale_ptr[2 * i]);
scale_ptr[2 * i + 1] = __hmul2(scale1, super_scale_ptr[2 * i + 1]);
}
}
};
template <int N>
struct LocalScaleConverter<bfloat16_t, N, typename platform::enable_if<N % 4 == 0>::type> {
using FragmentSource = Array<uint8_t, N>;
using FragmentResult = Array<bfloat16_t, N>;
CUTLASS_DEVICE
static void Apply(FragmentSource const& local_scale_frag,
FragmentResult const& super_scale_frag,
FragmentResult& scale_frag,
int shift_bit) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) && defined(ENABLE_BF16))
constexpr uint32_t immLut = (0xF0 & 0xCC) | 0xAA;
constexpr uint32_t MASK = 0x000F000F;
constexpr uint32_t I4s_TO_BF16s_MAGIC_NUM = 0x43004300;
constexpr uint32_t BF16_BIAS = 0xC300C300;
constexpr uint32_t BF16_ONE = 0x3F803F80;
__nv_bfloat162* scale_ptr = reinterpret_cast<__nv_bfloat162 *>(&scale_frag);
__nv_bfloat162 const* super_scale_ptr = reinterpret_cast<__nv_bfloat162 const*>(&super_scale_frag);
uint32_t const* local_scale_ptr = reinterpret_cast<uint32_t const*>(&local_scale_frag);
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < N / 4; ++i) {
int i4s = local_scale_ptr[i] >> shift_bit;
// unpack: 0, 1
int32_t low = __byte_perm(i4s, i4s, 0xF1F0);
int32_t unpack0 = lop3<immLut>(low, MASK, I4s_TO_BF16s_MAGIC_NUM);
// unpack: 2, 3
int32_t high = __byte_perm(i4s, i4s, 0xF3F2);
int32_t unpack1 = lop3<immLut>(high, MASK, I4s_TO_BF16s_MAGIC_NUM);
nv_bfloat162 scale0 = __hfma2(*reinterpret_cast<nv_bfloat162*>(&unpack0),
*reinterpret_cast<const nv_bfloat162*>(&BF16_ONE),
*reinterpret_cast<const nv_bfloat162*>(&BF16_BIAS));
nv_bfloat162 scale1 = __hfma2(*reinterpret_cast<nv_bfloat162*>(&unpack1),
*reinterpret_cast<const nv_bfloat162*>(&BF16_ONE),
*reinterpret_cast<const nv_bfloat162*>(&BF16_BIAS));
scale_ptr[2 * i] = __hmul2(scale0, super_scale_ptr[2 * i]);
scale_ptr[2 * i + 1] = __hmul2(scale1, super_scale_ptr[2 * i + 1]);
}
#else
// Slow path not implemented here on purpose. If we need to do HMMA on older arch, scale conversion should
// happen before scales are stored to shared memory and we should use the fp16 dequantizer. This will avoid
// numerous conversion instructions in GEMM main loop.
arch::device_breakpoint();
#endif
}
};
} // namespace detail
////////////////////////////////////////////////////////////////////////////////
template <
/// Matrix multiply operator
typename MmaOperator_,
/// Size of the matrix to load (concept: MatrixShape)
typename Shape_,
/// Operand identity
Operand Operand,
/// Data type of Scale elements
typename ElementOperand_,
/// Layout of operand
typename Layout_,
/// Group size for quantization
int GroupSize_,
///
typename Enable = void>
class MmaTensorOpWin2xDequantizer {
//static_assert(false, "Not Supported!");
};
////////////////////////////////////////////////////////////////////////////////
// Specialization for operand B with row-major scales (fp16 or bf16 operands)
template <
/// Underlying matrix multiply operator (concept: MmaTensorOp)
typename MmaOperator_,
/// Shape of the warp level matrix multiply (concept: GemmShape)
typename Shape_,
/// Data type of Scale elements
typename ElementOperand_,
/// Group size for quantization
int GroupSize_>
class MmaTensorOpWin2xDequantizer<
MmaOperator_,
Shape_,
Operand::kB,
ElementOperand_,
layout::RowMajor,
GroupSize_>
//typename platform::enable_if<MmaOperator_::ArchTag::kMinComputeCapability >= 80
// && platform::is_same<typename MmaOperator_::ArchMmaOperator::LayoutB, layout::ColumnMajor>::value>::type>
{
public:
static_assert(platform::is_same<ElementOperand_, half_t>::value || platform::is_same<ElementOperand_, bfloat16_t>::value,
"T must be fp16 or bf16");
/// Mma Operator
using MmaOperator = MmaOperator_;
// The architecture-specific mma operator being used
using ArchMmaOperator = typename MmaOperator::ArchMmaOperator;
// Mma Instruction Shape
using InstructionShape = typename ArchMmaOperator::Shape;
/// Warp mma shape
using Shape = Shape_;
/// Type of mma operand
using ElementOperand = ElementOperand_;
/// Layout of the scales in shared memory
using Layout = layout::RowMajor;
/// Group size for quantization
static constexpr int kGroupSize = GroupSize_;
/// Type of input
using ElementB = typename MmaOperator::FragmentB::Element;
static_assert(platform::is_same<ElementB, uint2b_t>::value, "ElementB must be uint2b_t");
/// Type of the scales
using ElementLocalScale = uint4b_t;
using ElementSuperScale = ElementOperand;
using ElementCodeScaleZp = float;
// Fragment to hold scale data to apply to B before mma
// We need 1 fp16 per matrix iteration in the N dimension
static constexpr int kWarpIterationsAlongN = MmaOperator::MmaIterations::kColumn;
// Use uint8_t to store two 4-bit local scales
using FragmentLocalScale = Array<uint8_t, kWarpIterationsAlongN>;
using FragmentSuperScale = Array<ElementSuperScale, kWarpIterationsAlongN>;
using FragmentCodeScaleZp = Array<ElementCodeScaleZp, kWarpIterationsAlongN>;
/// Fragment to hold B data before Mma
using FragmentInput = Array<ElementB, MmaOperator::FragmentB::kElements>;
// This is the ratio of the load instruction vs the compute instruction.
static constexpr int kExpansionFactor = MmaOperator::IteratorB::InstructionShape::kRow / InstructionShape::kK;
static constexpr int kNumPacks = sizeof_bits<uint8_t>::value / sizeof_bits<ElementB>::value;
static constexpr int kUnpackFactor = MmaOperator::FragmentB::kElements / (kWarpIterationsAlongN * kNumPacks);
static constexpr int kUnpackInterval = kExpansionFactor / kUnpackFactor;
/// Unpack 4 uint2b_t values compressed in a uint8_t to floating-point values.
using Uint2Converter = FastInterleavedAndBiasedNumericArrayConverter<
ElementOperand, ElementB, MmaOperator::FragmentB::kElements / kUnpackFactor>;
using FragmentInputUnpack = typename Uint2Converter::result_type;
/// Fragment to hold internal scales before Mma
using FragmentScale = Array<ElementOperand, FragmentLocalScale::kElements>;
/// Fragment of dequantized B
using FragmentOutput = Array<ElementOperand, MmaOperator::FragmentB::kElements / kExpansionFactor>;
/// TensorRef type for loading element from a tensor
using SuperTensorRef = cutlass::TensorRef<ElementSuperScale, Layout>;
using LocalTensorRef = cutlass::TensorRef<ElementLocalScale, Layout>;
using CodeTensorRef = cutlass::TensorRef<ElementCodeScaleZp, Layout>;
private:
//
// Data members
//
uint8_t* pointer_local_scale_;
ElementCodeScaleZp* pointer_code_scale_;
ElementCodeScaleZp* pointer_code_zp_;
ElementSuperScale* pointer_super_scale_;
//FragmentInputUnpack unpacked_frag_;
FragmentScale scale_frag_;
public:
CUTLASS_DEVICE
MmaTensorOpWin2xDequantizer(SuperTensorRef smem_super_scale,
LocalTensorRef smem_local_scale,
CodeTensorRef smem_code_scale,
CodeTensorRef smem_code_zp,
int warp_idx_n,
int lane_idx) {
int warp_offset = warp_idx_n * Shape::kN;
int quad = lane_idx / 4;
int thread_offset = warp_offset + quad;
pointer_super_scale_ = smem_super_scale.data() + thread_offset;
pointer_code_scale_ = smem_code_scale.data() + thread_offset;
pointer_code_zp_ = smem_code_zp.data() + thread_offset;
pointer_local_scale_ = reinterpret_cast<uint8_t *>(smem_local_scale.data()) + thread_offset;
}
/// Channel-wise params, need to load just once
CUTLASS_DEVICE
void load(FragmentCodeScaleZp& code_scale_frag,
FragmentCodeScaleZp& code_zp_frag,
FragmentSuperScale& super_scale_frag) {
CUTLASS_PRAGMA_UNROLL
for (int mma_n_iter = 0; mma_n_iter < kWarpIterationsAlongN; ++mma_n_iter) {
super_scale_frag[mma_n_iter] = pointer_super_scale_[mma_n_iter * InstructionShape::kN]; // bank conflict
code_scale_frag[mma_n_iter] = pointer_code_scale_[mma_n_iter * InstructionShape::kN];
code_zp_frag[mma_n_iter] = pointer_code_zp_[mma_n_iter * InstructionShape::kN];
}
}
/// Group-wise params, need to load multiple times
CUTLASS_DEVICE
void load(FragmentLocalScale& local_scale_frag) {
CUTLASS_PRAGMA_UNROLL
for (int mma_n_iter = 0; mma_n_iter < kWarpIterationsAlongN; ++mma_n_iter) {
local_scale_frag[mma_n_iter] = pointer_local_scale_[mma_n_iter * InstructionShape::kN]; // bank conflict
}
}
CUTLASS_DEVICE
void dequantize(const FragmentLocalScale& local_scale_frag,
const FragmentCodeScaleZp& code_scale_frag,
const FragmentCodeScaleZp& code_zp_frag,
const FragmentSuperScale& super_scale_frag,
const FragmentInput& input_frag,
FragmentOutput& output_frag,
int tb_offset_k,
int warp_k_compute_offset) {
if constexpr (kUnpackInterval != 1) {
// not supported yet
arch::device_breakpoint();
}
typename Uint2Converter::source_type source_frag;
int in_offset = warp_k_compute_offset * kUnpackInterval;
uint8_t const* ptr_input = reinterpret_cast<uint8_t const*>(&input_frag);
uint8_t* ptr_source = reinterpret_cast<uint8_t *>(&source_frag);
CUTLASS_PRAGMA_UNROLL
for (int mma_n_iter = 0; mma_n_iter < kWarpIterationsAlongN; ++mma_n_iter) {
ptr_source[mma_n_iter] = ptr_input[mma_n_iter * kUnpackFactor + in_offset];
}
FragmentInputUnpack unpacked_frag = Uint2Converter::convert(source_frag, code_scale_frag, code_zp_frag);
// dequantize local_scale
if (warp_k_compute_offset == 0) {
using LocalScaleConverter = detail::LocalScaleConverter<ElementOperand, FragmentLocalScale::kElements>;
// special for TileRows = 64
int local_scale_shift = (((tb_offset_k / kGroupSize) + 1) & 1) * 4;
LocalScaleConverter::Apply(local_scale_frag, super_scale_frag, scale_frag_, local_scale_shift);
}
// unscale
// After applying LOP3 optimizations for performance, the B operand requires data rearrangement.
// reorder: [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]
const int kWarpIterationsAlongK = FragmentOutput::kElements / kWarpIterationsAlongN;
using Type = typename detail::DataTypeTraits<ElementOperand>::Type;
using DualType = typename detail::DataTypeTraits<ElementOperand>::DualType;
Type* output_ptr = reinterpret_cast<Type *>(&output_frag);
DualType const* unpacked_ptr = reinterpret_cast<DualType const*>(&unpacked_frag);
DualType const* scale_ptr = reinterpret_cast<DualType const*>(&scale_frag_);
CUTLASS_PRAGMA_UNROLL
for (int mma_n_iter = 0; mma_n_iter < kWarpIterationsAlongN; mma_n_iter += 2) {
int mapped_idx_base = (mma_n_iter / 2) * kWarpIterationsAlongK;
DualType scalex2 = scale_ptr[mma_n_iter / 2];
CUTLASS_PRAGMA_UNROLL
for (int mma_k_iter = 0; mma_k_iter < kWarpIterationsAlongK; ++mma_k_iter) {
DualType unpacked_valuex2 = unpacked_ptr[mapped_idx_base + mma_k_iter];
DualType scaled_value = __hmul2(unpacked_valuex2, scalex2);
output_ptr[mma_n_iter * kWarpIterationsAlongK + mma_k_iter] = scaled_value.x;
output_ptr[(mma_n_iter + 1) * kWarpIterationsAlongK + mma_k_iter] = scaled_value.y;
}
}
}
/// Add an offset to pointer in units of elements.
/// Only needed for group-wise params.
CUTLASS_DEVICE
void add_pointer_offset(int64_t const& offset) {
pointer_local_scale_ += offset;
}
};
////////////////////////////////////////////////////////////////////////////////
} // namespace warp
} // namespace gemm
} // namespace cutlass
////////////////////////////////////////////////////////////////////////////////
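
Reviewer note: the local_scale decoding can be checked on the host. The sketch below models three pieces of the code above in plain C++: the nibble selection from dequantize(), the generic LocalScaleConverter::Apply path, and the fp16 magic-number trick (OR the 4-bit value into 0x6400, then add -1024). All function names are hypothetical, float stands in for the half/bfloat16 types, and decode_normal_fp16 is a small illustrative decoder, not a CUDA intrinsic.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Mirrors local_scale_shift in dequantize(): even groups read the high
// nibble (shift 4), odd groups read the low nibble (shift 0).
inline int local_scale_shift(int tb_offset_k, int group_size) {
  assert(group_size > 0);
  return (((tb_offset_k / group_size) + 1) & 1) * 4;
}

// Mirrors the generic LocalScaleConverter::Apply for a single element.
inline float apply_local_scale(std::uint8_t packed, float super_scale, int shift_bit) {
  int nibble = (static_cast<int>(packed) >> shift_bit) & 0xf;
  return static_cast<float>(nibble) * super_scale;
}

// Decodes a normal, finite IEEE fp16 bit pattern (illustrative helper).
inline float decode_normal_fp16(std::uint16_t bits) {
  int exponent = (bits >> 10) & 0x1f;
  int mantissa = bits & 0x3ff;
  assert(exponent != 0 && exponent != 0x1f);
  float value = std::ldexp(1.0f + static_cast<float>(mantissa) / 1024.0f, exponent - 15);
  return (bits & 0x8000) ? -value : value;
}

// Models the magic-number path: 0x6400 | v encodes 1024 + v in fp16,
// and adding 0xE400 (-1024) recovers v.
inline float nibble_via_magic(int nibble) {
  assert(nibble >= 0 && nibble <= 0xf);
  std::uint16_t bits = static_cast<std::uint16_t>(0x6400 | nibble);
  return decode_normal_fp16(bits) + decode_normal_fp16(0xE400);
}
```

For example, with a packed byte 0xA3 and super_scale 0.5, shift 4 selects the value 10 and shift 0 selects the value 3.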


@@ -39,18 +39,25 @@
#include "cutlass/array.h"
#include "cutlass/half.h"
#include "cutlass/numeric_types.h"
#include "cutlass/trace.h"
namespace cutlass
{
namespace cutlass {
template <int lut>
__device__ inline int lop3(int a, int b, int c) {
int res;
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
: "=r"(res)
: "r"(a), "r"(b), "r"(c), "n"(lut));
return res;
}
// This converter is meant to be used with data interleaved in a 32-bit register where the even elements are in the low
// bits and the odd elements are in the high bits of the register. In addition, it assumes elements were originally
// signed and had a bias of 2**(b-1) added (where b is the number of bits in the type) to make all numbers unsigned.
// This converter will uninterleave the data and subtract the bias while converting to the result type.
template <typename T, typename S, int N>
struct FastInterleavedAndBiasedNumericArrayConverter
{
};
struct FastInterleavedAndBiasedNumericArrayConverter;
template <>
struct FastInterleavedAndBiasedNumericArrayConverter<half_t, uint8_t, 4>
@@ -440,6 +447,329 @@ struct FastInterleavedAndBiasedNumericArrayConverter<bfloat16_t, uint4b_t, N>
}
};
template <>
struct FastInterleavedAndBiasedNumericArrayConverter<half_t, uint2b_t, 16>
{
using result_type = Array<half_t, 16>;
using source_type = Array<uint2b_t, 16>;
using ScaleComputeT = float;
using code_type = Array<ScaleComputeT, 4>;
CUTLASS_DEVICE
static result_type convert(source_type const& source, ScaleComputeT code_scale, ScaleComputeT code_zp)
{
uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);
// 2^23 = 8388608
static constexpr uint32_t FP32_BASE = 0x4B000000;
float fp32_intermediates[4];
uint32_t* fp32_intermediates_casted = reinterpret_cast<uint32_t*>(fp32_intermediates);
fp32_intermediates_casted[0] = __byte_perm(i8s, FP32_BASE, 0x7650);
fp32_intermediates_casted[1] = __byte_perm(i8s, FP32_BASE, 0x7651);
fp32_intermediates_casted[2] = __byte_perm(i8s, FP32_BASE, 0x7652);
fp32_intermediates_casted[3] = __byte_perm(i8s, FP32_BASE, 0x7653);
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[0]) : "r"(fp32_intermediates_casted[0]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[1]) : "r"(fp32_intermediates_casted[1]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[2]) : "r"(fp32_intermediates_casted[2]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[3]) : "r"(fp32_intermediates_casted[3]), "r"(FP32_BASE));
int32_t decode_value[4];
ScaleComputeT new_code_zp = code_zp + 0.5f;
decode_value[0] = __float2int_rd(fmaf(fp32_intermediates[0], code_scale, new_code_zp));
decode_value[1] = __float2int_rd(fmaf(fp32_intermediates[1], code_scale, new_code_zp));
decode_value[2] = __float2int_rd(fmaf(fp32_intermediates[2], code_scale, new_code_zp));
decode_value[3] = __float2int_rd(fmaf(fp32_intermediates[3], code_scale, new_code_zp));
return convert_impl(decode_value);
}
CUTLASS_DEVICE
static result_type convert(source_type const& source, code_type const& code_scale, code_type const& code_zp)
{
uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);
// 2^23 = 8388608
static constexpr uint32_t FP32_BASE = 0x4B000000;
float fp32_intermediates[4];
uint32_t* fp32_intermediates_casted = reinterpret_cast<uint32_t*>(fp32_intermediates);
fp32_intermediates_casted[0] = __byte_perm(i8s, FP32_BASE, 0x7650);
fp32_intermediates_casted[1] = __byte_perm(i8s, FP32_BASE, 0x7651);
fp32_intermediates_casted[2] = __byte_perm(i8s, FP32_BASE, 0x7652);
fp32_intermediates_casted[3] = __byte_perm(i8s, FP32_BASE, 0x7653);
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[0]) : "r"(fp32_intermediates_casted[0]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[1]) : "r"(fp32_intermediates_casted[1]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[2]) : "r"(fp32_intermediates_casted[2]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[3]) : "r"(fp32_intermediates_casted[3]), "r"(FP32_BASE));
int32_t decode_value[4];
decode_value[0] = __float2int_rd(fmaf(fp32_intermediates[0], code_scale[0], code_zp[0] + 0.5f));
decode_value[1] = __float2int_rd(fmaf(fp32_intermediates[1], code_scale[1], code_zp[1] + 0.5f));
decode_value[2] = __float2int_rd(fmaf(fp32_intermediates[2], code_scale[2], code_zp[2] + 0.5f));
decode_value[3] = __float2int_rd(fmaf(fp32_intermediates[3], code_scale[3], code_zp[3] + 0.5f));
return convert_impl(decode_value);
}
CUTLASS_DEVICE
static result_type convert_impl(int32_t* decode_value)
{
result_type result;
static constexpr uint32_t immLut = (0xF0 & 0xCC) | 0xAA;
static constexpr uint32_t MASK = 0x003F003F;
// 2^10 = 1024
static constexpr uint32_t EX = 0x64006400;
uint32_t* h = reinterpret_cast<uint32_t*>(&result);
int32_t q0 = __byte_perm(decode_value[0], decode_value[1], 0x5410);
int32_t q1 = __byte_perm(decode_value[2], decode_value[3], 0x5410);
h[0] = lop3<immLut>(q0 >> 9, MASK, EX);
h[1] = lop3<immLut>(q0 >> 6, MASK, EX);
h[2] = lop3<immLut>(q0 >> 3, MASK, EX);
h[3] = lop3<immLut>(q0, MASK, EX);
h[4] = lop3<immLut>(q1 >> 9, MASK, EX);
h[5] = lop3<immLut>(q1 >> 6, MASK, EX);
h[6] = lop3<immLut>(q1 >> 3, MASK, EX);
h[7] = lop3<immLut>(q1, MASK, EX);
// 1024 + 32 = 1056
static constexpr uint32_t SUB = 0x64206420;
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[0]) : "r"(h[0]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[1]) : "r"(h[1]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[2]) : "r"(h[2]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[3]) : "r"(h[3]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[4]) : "r"(h[4]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[5]) : "r"(h[5]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[6]) : "r"(h[6]), "r"(SUB));
asm volatile("sub.f16x2 %0, %1, %2;\n" : "=r"(h[7]) : "r"(h[7]), "r"(SUB));
return result;
}
CUTLASS_DEVICE
result_type operator()(source_type const& s, ScaleComputeT code_scale, ScaleComputeT code_zp)
{
return convert(s, code_scale, code_zp);
}
};
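The `convert` overloads above turn each packed byte into a float by splicing it into the bit pattern of 2^23 and then subtracting 2^23. The following is a minimal host-side sketch of that step, assuming plain C++ stands in for `__byte_perm` and `sub.f32`; the helper names `splice_byte_into_base`, `bits_to_float`, and `byte_to_float` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side stand-in for __byte_perm(packed, FP32_BASE, 0x7650 + k):
// byte k of `packed` becomes the lowest byte of the result, while the
// upper three bytes are taken from FP32_BASE.
inline uint32_t splice_byte_into_base(uint32_t packed, int k) {
    constexpr uint32_t kFp32Base = 0x4B000000u;  // bit pattern of 2^23 = 8388608.0f
    const uint32_t byte = (packed >> (8 * k)) & 0xFFu;
    return (kFp32Base & 0xFFFFFF00u) | byte;
}

// Reinterprets a 32-bit pattern as an IEEE-754 float.
inline float bits_to_float(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Returns byte k of `packed` as a float without an int-to-float conversion.
inline float byte_to_float(uint32_t packed, int k) {
    // At 2^23 one mantissa step equals 1.0, so the spliced pattern is exactly
    // 2^23 + byte; subtracting 2^23 leaves the byte value.
    return bits_to_float(splice_byte_into_base(packed, k)) - bits_to_float(0x4B000000u);
}
```

For example, `byte_to_float(0xFF030201u, 1)` yields `2.0f`, because byte 1 of the packed word is `0x02`.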
template <>
struct FastInterleavedAndBiasedNumericArrayConverter<bfloat16_t, uint2b_t, 16>
{
using result_type = Array<bfloat16_t, 16>;
using source_type = Array<uint2b_t, 16>;
using ScaleComputeT = float;
using code_type = Array<ScaleComputeT, 4>;
CUTLASS_DEVICE
static result_type convert(source_type const& source, ScaleComputeT code_scale, ScaleComputeT code_zp)
{
uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);
// 2^23 = 8388608
static constexpr uint32_t FP32_BASE = 0x4B000000;
float fp32_intermediates[4];
uint32_t* fp32_intermediates_casted = reinterpret_cast<uint32_t*>(fp32_intermediates);
fp32_intermediates_casted[0] = __byte_perm(i8s, FP32_BASE, 0x7650);
fp32_intermediates_casted[1] = __byte_perm(i8s, FP32_BASE, 0x7651);
fp32_intermediates_casted[2] = __byte_perm(i8s, FP32_BASE, 0x7652);
fp32_intermediates_casted[3] = __byte_perm(i8s, FP32_BASE, 0x7653);
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[0]) : "r"(fp32_intermediates_casted[0]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[1]) : "r"(fp32_intermediates_casted[1]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[2]) : "r"(fp32_intermediates_casted[2]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[3]) : "r"(fp32_intermediates_casted[3]), "r"(FP32_BASE));
int32_t decode_value[4];
ScaleComputeT new_code_zp = code_zp + 0.5f;
decode_value[0] = __float2int_rd(fmaf(fp32_intermediates[0], code_scale, new_code_zp));
decode_value[1] = __float2int_rd(fmaf(fp32_intermediates[1], code_scale, new_code_zp));
decode_value[2] = __float2int_rd(fmaf(fp32_intermediates[2], code_scale, new_code_zp));
decode_value[3] = __float2int_rd(fmaf(fp32_intermediates[3], code_scale, new_code_zp));
return convert_impl(decode_value);
}
CUTLASS_DEVICE
static result_type convert(source_type const& source, code_type const& code_scale, code_type const& code_zp)
{
uint32_t const i8s = reinterpret_cast<uint32_t const&>(source);
// 2^23 = 8388608
static constexpr uint32_t FP32_BASE = 0x4B000000;
float fp32_intermediates[4];
uint32_t* fp32_intermediates_casted = reinterpret_cast<uint32_t*>(fp32_intermediates);
fp32_intermediates_casted[0] = __byte_perm(i8s, FP32_BASE, 0x7650);
fp32_intermediates_casted[1] = __byte_perm(i8s, FP32_BASE, 0x7651);
fp32_intermediates_casted[2] = __byte_perm(i8s, FP32_BASE, 0x7652);
fp32_intermediates_casted[3] = __byte_perm(i8s, FP32_BASE, 0x7653);
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[0]) : "r"(fp32_intermediates_casted[0]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[1]) : "r"(fp32_intermediates_casted[1]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[2]) : "r"(fp32_intermediates_casted[2]), "r"(FP32_BASE));
asm volatile("sub.f32 %0, %1, %2;\n" : "=r"(fp32_intermediates_casted[3]) : "r"(fp32_intermediates_casted[3]), "r"(FP32_BASE));
int32_t decode_value[4];
decode_value[0] = __float2int_rd(fmaf(fp32_intermediates[0], code_scale[0], code_zp[0] + 0.5f));
decode_value[1] = __float2int_rd(fmaf(fp32_intermediates[1], code_scale[1], code_zp[1] + 0.5f));
decode_value[2] = __float2int_rd(fmaf(fp32_intermediates[2], code_scale[2], code_zp[2] + 0.5f));
decode_value[3] = __float2int_rd(fmaf(fp32_intermediates[3], code_scale[3], code_zp[3] + 0.5f));
return convert_impl(decode_value);
}
CUTLASS_DEVICE
static result_type convert_impl(int32_t* decode_value)
{
result_type result;
static constexpr uint32_t immLut = (0xF0 & 0xCC) | 0xAA;
static constexpr uint32_t MASK = 0x003F003F;
// 2^7 = 128
static constexpr uint32_t EX = 0x43004300;
uint32_t* h = reinterpret_cast<uint32_t*>(&result);
int32_t q0 = __byte_perm(decode_value[0], decode_value[1], 0x5410);
int32_t q1 = __byte_perm(decode_value[2], decode_value[3], 0x5410);
h[0] = lop3<immLut>(q0 >> 9, MASK, EX);
h[1] = lop3<immLut>(q0 >> 6, MASK, EX);
h[2] = lop3<immLut>(q0 >> 3, MASK, EX);
h[3] = lop3<immLut>(q0, MASK, EX);
h[4] = lop3<immLut>(q1 >> 9, MASK, EX);
h[5] = lop3<immLut>(q1 >> 6, MASK, EX);
h[6] = lop3<immLut>(q1 >> 3, MASK, EX);
h[7] = lop3<immLut>(q1, MASK, EX);
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(ENABLE_BF16))
// 128 + 32 = 160
static constexpr uint32_t SUB = 0x43204320;
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[0]) : "r"(h[0]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[1]) : "r"(h[1]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[2]) : "r"(h[2]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[3]) : "r"(h[3]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[4]) : "r"(h[4]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[5]) : "r"(h[5]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[6]) : "r"(h[6]), "r"(SUB));
asm volatile("sub.bf16x2 %0, %1, %2;\n" : "=r"(h[7]) : "r"(h[7]), "r"(SUB));
#else
// 1.0
static constexpr uint32_t MUL = 0x3F803F80;
// -160
static constexpr uint32_t ADD = 0xC320C320;
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[0]) : "r"(h[0]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[1]) : "r"(h[1]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[2]) : "r"(h[2]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[3]) : "r"(h[3]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[4]) : "r"(h[4]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[5]) : "r"(h[5]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[6]) : "r"(h[6]), "r"(MUL), "r"(ADD));
asm volatile("fma.rn.bf16x2 %0, %1, %2, %3;\n" : "=r"(h[7]) : "r"(h[7]), "r"(MUL), "r"(ADD));
#endif
return result;
}
CUTLASS_DEVICE
result_type operator()(source_type const& s, ScaleComputeT code_scale, ScaleComputeT code_zp)
{
return convert(s, code_scale, code_zp);
}
};
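The bf16 `convert_impl` above extracts a 6-bit field with `lop3` (the `immLut` encodes `(a & b) | c`), ORs it into the bf16 pattern of 128, and then subtracts 160 to remove both the base and the bias of 32. A host-side sketch of one 16-bit lane follows, assuming plain integer operations stand in for `lop3.b32` and `sub.bf16x2`; the helpers `bf16_bits_to_float`, `extract_lane`, and `decode_field` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bf16 is the upper 16 bits of an fp32 value, so widening is a 16-bit shift.
inline float bf16_bits_to_float(uint16_t bits) {
    const uint32_t wide = static_cast<uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &wide, sizeof(f));
    return f;
}

// One 16-bit lane of lop3<(a & b) | c>(code >> shift, 0x003F003F, 0x43004300):
// keep the 6-bit field at `shift` and OR it into the bf16 pattern of 128.
inline uint16_t extract_lane(uint16_t code, int shift) {
    return static_cast<uint16_t>(((code >> shift) & 0x3Fu) | 0x4300u);
}

// Decodes the 6-bit field at `shift` into a value in [-32, 31].
inline float decode_field(uint16_t code, int shift) {
    // 0x4300 is 128.0 in bf16 and one mantissa step there is 1.0, so the lane
    // holds 128 + field. 0x4320 is 160.0 = 128 + 32, i.e. base plus bias.
    return bf16_bits_to_float(extract_lane(code, shift)) - bf16_bits_to_float(0x4320u);
}
```

The `#else` branch of the patch reaches the same result with `fma.rn.bf16x2`, multiplying by 1.0 (`0x3F80`) and adding -160 (`0xC320`).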
template <typename T, int N>
struct FastInterleavedAndBiasedNumericArrayConverter<T, uint2b_t, N>
{
static_assert(platform::is_same<T, half_t>::value || platform::is_same<T, bfloat16_t>::value,
"T must be fp16 or bf16");
static constexpr int kVecWidth = 16;
static_assert(!(N % kVecWidth), "N must be multiple of 16.");
using result_type = Array<T, N>;
using source_type = Array<uint2b_t, N>;
using code_type = Array<float, N / kVecWidth>;
CUTLASS_DEVICE
static result_type convert(source_type const& source, code_type const& code_scale, code_type const& code_zp)
{
using scalar_result_type = typename result_type::Element;
using scalar_source_type = typename source_type::Element;
FastInterleavedAndBiasedNumericArrayConverter<scalar_result_type, scalar_source_type, kVecWidth>
convert_vector_;
result_type result;
using vec_result = Array<scalar_result_type, kVecWidth>;
using vec_source = Array<scalar_source_type, kVecWidth>;
vec_result* result_ptr = reinterpret_cast<vec_result*>(&result);
vec_source const* source_ptr = reinterpret_cast<vec_source const*>(&source);
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < N / kVecWidth; ++i)
{
result_ptr[i] = convert_vector_(source_ptr[i], code_scale[i], code_zp[i]);
}
return result;
}
CUTLASS_DEVICE
static result_type convert(source_type const& source, Array<float, N / 4> const& code_scale, Array<float, N / 4> const& code_zp)
{
using scalar_result_type = typename result_type::Element;
using scalar_source_type = typename source_type::Element;
using Converter = FastInterleavedAndBiasedNumericArrayConverter<scalar_result_type, scalar_source_type, kVecWidth>;
result_type result;
using vec_result = typename Converter::result_type;
using vec_source = typename Converter::source_type;
using vec_code = typename Converter::code_type;
vec_result* result_ptr = reinterpret_cast<vec_result*>(&result);
vec_source const* source_ptr = reinterpret_cast<vec_source const*>(&source);
vec_code const* code_scale_ptr = reinterpret_cast<vec_code const*>(&code_scale);
vec_code const* code_zp_ptr = reinterpret_cast<vec_code const*>(&code_zp);
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < N / kVecWidth; ++i)
{
result_ptr[i] = Converter::convert(source_ptr[i], code_scale_ptr[i], code_zp_ptr[i]);
}
return result;
}
CUTLASS_DEVICE
result_type operator()(source_type const& s, code_type const& code_scale, code_type const& code_zp)
{
return convert(s, code_scale, code_zp);
}
};
/////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass

@@ -125,10 +125,13 @@ struct WintQuantTraits<ElementT, WintQuantMethod::kWeightOnlyInt2> {
static constexpr int32_t kNumPackedValues = 4;
static constexpr int32_t kPackedSize = 16;
using LocalScaleType = uint4b_t;
using CodeScaleZpType = float;
struct Arguments {
const uint8_t *local_scale_ptr; // quantized to 4 bits
const float *code_scale_ptr;
const float *code_zp_ptr;
uint8_t *local_scale_ptr; // quantized to 4 bits
float *code_scale_ptr;
float *code_zp_ptr;
};
CUTLASS_DEVICE

@@ -43,7 +43,6 @@
#include "cutlass/trace.h"
#include "cutlass_extensions/gemm/kernel/gemm_moe_problem_visitor.h"
#include "cutlass_extensions/gemm/threadblock/wint2x_tile_dequanter.h"
#include "cutlass_extensions/tile_interleaved_layout.h"
/////////////////////////////////////////////////////////////////////////////////////////////////
@@ -775,17 +774,54 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
template <WintQuantMethod QuantMethod, typename dummy>
struct KernelRunner<QuantMethod, true, dummy> {
using WeightQuantTraits = WintQuantTraits<ElementA, QuantMethod>;
using QuantArguments = typename WeightQuantTraits::Arguments;
using MmaQuantArguments = typename Mma::QuantParamsAccessor::Arguments;
CUTLASS_DEVICE
static QuantArguments get_quant_args(Params const& params, int32_t problem_idx, const int64_t gemm_k, const int64_t gemm_n) {
QuantArguments quant_args;
if constexpr (QuantMethod == WintQuantMethod::kWeightOnlyInt2) {
quant_args.local_scale_ptr = params.local_scale + problem_idx * gemm_k * gemm_n / 128;
quant_args.code_scale_ptr = params.code_scale + problem_idx * gemm_n;
quant_args.code_zp_ptr = params.code_zp + problem_idx * gemm_n;
}
return quant_args;
static MmaQuantArguments prepare_quant_args(
Params const& params, cutlass::gemm::GemmCoord const& threadblock_offset,
int64_t problem_idx, const int32_t gemm_k, const int32_t gemm_n, const int thread_idx) {
// Starting threadblock offset for the scales: same column index as C, row index fixed at 0.
cutlass::MatrixCoord tb_offset_scale{0, threadblock_offset.n()};
cutlass::MatrixCoord tb_offset_local_scale{0, threadblock_offset.n() * 2};
ElementScale* weight_scale_ptr = params.weight_scales + problem_idx * gemm_n;
typename Mma::QuantParamsAccessor::IteratorSuperScale iterator_super_scale(
Mma::QuantParamsAccessor::LayoutSuperScale(gemm_n),
weight_scale_ptr,
{1, gemm_n},
thread_idx,
tb_offset_scale);
int local_scale_pointer_offset = ((ThreadblockShape::kK + 127) / 128) * (gemm_n * 2);
int64_t offset_in_bytes = problem_idx * gemm_k * gemm_n / 128;
uint4b_t *local_scale_ptr = reinterpret_cast<uint4b_t *>(params.local_scale + offset_in_bytes);
typename Mma::QuantParamsAccessor::IteratorLocalScale iterator_local_scale(
Mma::QuantParamsAccessor::LayoutLocalScale(gemm_n * 2),
local_scale_ptr,
{(gemm_k + 127) / 128, gemm_n * 2},
thread_idx,
tb_offset_local_scale);
float* code_scale_ptr = params.code_scale + problem_idx * gemm_n;
typename Mma::QuantParamsAccessor::IteratorCodeScaleZp iterator_code_scale(
Mma::QuantParamsAccessor::LayoutCodeScaleZp(gemm_n),
code_scale_ptr,
{1, gemm_n},
thread_idx,
tb_offset_scale);
float* code_zp_ptr = params.code_zp + problem_idx * gemm_n;
typename Mma::QuantParamsAccessor::IteratorCodeScaleZp iterator_code_zp(
Mma::QuantParamsAccessor::LayoutCodeScaleZp(gemm_n),
code_zp_ptr,
{1, gemm_n},
thread_idx,
tb_offset_scale);
MmaQuantArguments mma_quant_args(
iterator_super_scale, iterator_local_scale, iterator_code_scale, iterator_code_zp, local_scale_pointer_offset);
return mma_quant_args;
}
CUTLASS_DEVICE
@@ -814,9 +850,6 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
kInterleave >= 1,
"B must be row major/col major OR col major interleaved.");
// LayoutB should be RowMajor
using TileDequanterB = cutlass::gemm::threadblock::TileDequanter<ElementA, ElementScale, ThreadblockShape::kK, ThreadblockShape::kN, kStages, kThreadCount, QuantMethod>;
//
// Problem visitor.
//
@@ -843,12 +876,6 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
int(cta_idx % grid_shape.n()) * Mma::Shape::kN, // NOLINT
0);
// Starting address of weight_scale for the current problem.
ElementScale* weight_scale_ptr =
params.weight_scales ? params.weight_scales + problem_idx * problem_size.n() : nullptr;
// Starting threadblock offset for the scales: same column index as C, row index fixed at 0.
cutlass::MatrixCoord tb_offset_scale{0, threadblock_offset.n()};
// Load element pointers. Exchange pointers and strides if working on
// the transpose
int64_t rows_to_jump = 0;
@@ -866,42 +893,20 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
// Compute initial location in logical coordinates
// Starting threadblock offset of A, which shares its row index with C.
cutlass::MatrixCoord tb_offset_A{
threadblock_offset.m(),
0,
};
cutlass::MatrixCoord tb_offset_A{threadblock_offset.m(), 0};
// Starting address of B for the current problem_idx; there are num_experts problems in total.
char* byte_ptr_B = ((char*)params.ptr_B) + // NOLINT
problem_idx * bytes_per_expert_matrix; // NOLINT
ElementB* ptr_B = reinterpret_cast<ElementB*>(byte_ptr_B);
typename LayoutB::LongIndex ldm_B =
platform::is_same<layout::RowMajor, LayoutB>::value
? gemm_n
: gemm_k * kInterleave;
typename LayoutB::LongIndex ldm_B_shared = TileDequanterB::kColumns;
// Starting threadblock offset of B, which shares its column index with C.
cutlass::MatrixCoord tb_offset_B{0,
threadblock_offset.n() / kInterleave};
cutlass::MatrixCoord tb_offset_B{0, threadblock_offset.n() / kInterleave};
cutlass::MatrixCoord extent_B{problem_size.k() * kInterleave, problem_size.n() / kInterleave};
cutlass::MatrixCoord extent_B_shared{TileDequanterB::kRows, TileDequanterB::kColumns};
MmaElementB* smem_unzip_B_ptr = nullptr;
if constexpr (QuantMethod == WintQuantMethod::kWeightOnlyInt2) {
smem_unzip_B_ptr = shared_storage.main_loop.operand_unzip_B_ptr();
}
QuantArguments quant_args = get_quant_args(params, problem_idx, gemm_k, gemm_n);
TileDequanterB tile_dequanter_B(smem_unzip_B_ptr,
byte_ptr_B,
ldm_B,
extent_B,
tb_offset_B,
weight_scale_ptr,
tb_offset_scale,
quant_args);
MmaElementB* ptr_B = tile_dequanter_B.GetOutPtr();
// Compute position within threadblock
int thread_idx = threadIdx.x;
@@ -914,20 +919,21 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
tb_offset_A);
typename Mma::IteratorB iterator_B(
LayoutB(TileDequanterB::kUseSharedMemory ? ldm_B_shared : ldm_B),
LayoutB(ldm_B),
ptr_B,
TileDequanterB::kUseSharedMemory ? extent_B_shared : extent_B,
extent_B,
thread_idx,
TileDequanterB::kUseSharedMemory ? cutlass::make_Coord(0, 0) : tb_offset_B);
tb_offset_B);
MmaQuantArguments mma_quant_args = prepare_quant_args(
params, threadblock_offset, problem_idx, gemm_k, gemm_n, thread_idx);
typename Mma::FragmentC accumulators;
accumulators.clear();
// Broadcast the warp_id computed by lane 0 to ensure dependent code
// is compiled as warp-uniform.
int warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0);
int lane_idx = threadIdx.x % 32;
//
@@ -950,7 +956,7 @@ struct Wint2xMoeFCGemm : public MoeFCGemm<Mma_, Epilogue_, ThreadblockSwizzle_,
accumulators,
iterator_A,
iterator_B,
tile_dequanter_B,
mma_quant_args,
accumulators);
//

@@ -205,7 +205,7 @@ void generic_moe_gemm_kernelLauncher(const T* A,
threadblock_count,
epilogue_op,
reinterpret_cast<const ElementType*>(A),
reinterpret_cast<const CutlassMmaWeightType*>(B),
reinterpret_cast<const CutlassMmaKernelType*>(B),
reinterpret_cast<const ElementType*>(weight_scales),
reinterpret_cast<const ElementType*>(biases),
reinterpret_cast<ElementType*>(C),

@@ -223,14 +223,11 @@ public:
static Status can_implement(Arguments const &args)
{
CUTLASS_TRACE_HOST("W4A8MoeGemmUniversalBase::can_implement()");
// printf("--1\n");
// Initialize static kernel and device properties, if necessary.
Status result = init_device_props();
// printf("--1-2\n");
if (result != Status::kSuccess) {
return result;
}
// printf("--2\n");
dim3 grid = get_grid_shape(args);
// printf("--grid:%d, %d, %d\n", grid.x, grid.y, grid.z);
if (!(grid.y <= std::numeric_limits<uint16_t>::max() &&
@@ -238,7 +235,6 @@ public:
{
return Status::kErrorInvalidProblem;
}
// printf("--3\n");
return GemmKernel::can_implement(args);
}
@@ -285,18 +281,50 @@ public:
}
/// Returns the maximum number of active thread blocks per multiprocessor
static int maximum_active_blocks()
static int maximum_active_blocks(int smem_capacity = -1)
{
CUTLASS_TRACE_HOST("W4A8MoeGemmUniversalBase::maximum_active_blocks()");
// Initialize static device properties, if necessary
if (init_device_props() != Status::kSuccess) {
int smem_size = int(sizeof(typename GemmKernel_::SharedStorage));
CUTLASS_TRACE_HOST(" smem_size: " << smem_size << " bytes");
cudaError_t result;
if (smem_size > (48 << 10)) {
result = cudaFuncSetAttribute(Kernel2<GemmKernel_>,
cudaFuncAttributeMaxDynamicSharedMemorySize,
smem_size);
if (result != cudaSuccess) {
// Call cudaGetLastError() to clear the error bit
result = cudaGetLastError();
CUTLASS_TRACE_HOST(
" cudaFuncSetAttribute() returned error "
<< cudaGetErrorString(result));
return -1;
}
}
int max_active_blocks = -1;
result = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks,
Kernel2<GemmKernel_>,
GemmKernel_::kThreadCount,
smem_size);
if (result != cudaSuccess) {
// Call cudaGetLastError() to clear the error bit
result = cudaGetLastError();
CUTLASS_TRACE_HOST(
" cudaOccupancyMaxActiveBlocksPerMultiprocessor() returned error "
<< cudaGetErrorString(result));
return -1;
}
CUTLASS_TRACE_HOST(" max_active_blocks: " << sm_occupancy_);
return sm_occupancy_;
CUTLASS_TRACE_HOST(" max_active_blocks: " << max_active_blocks);
return max_active_blocks;
}
@@ -341,8 +369,7 @@ public:
// Configure grid and block dimensions
dim3 block(GemmKernel::kThreadCount, 1, 1);
// dim3 grid = params_.get_grid_dims();
dim3 grid(216, 1, 1);
dim3 grid(params_.threadblock_count, 1, 1);
// Launch kernel
CUTLASS_TRACE_HOST(" "

@@ -21,12 +21,12 @@ rm -rf up_gate_proj_7168_8192.log
rm -rf down_proj_8192_3584.log
num_experts=8
for tokens_per_expert in 12
for tokens_per_expert in 1 2 4 8 16 20 24 28 32 36 48 64 96 128 160 192 224 256 384 512 768 1024 2048 3072 4096 8192
do
wait
CUDA_VISIBLE_DEVICES=2 ./w4a8_moe_gemm_test ${num_experts} ${up_gate_proj_n} ${up_gate_proj_k} ${tokens_per_expert} 1 0 >> up_gate_proj_${up_gate_proj_n}_${up_gate_proj_k}.log 2>&1 &
# CUDA_VISIBLE_DEVICES=3 ./w4a8_moe_gemm_test ${num_experts} ${down_proj_n} ${down_proj_k} ${tokens_per_expert} 1 0 >> down_proj_${down_proj_n}_${down_proj_k}.log 2>&1 &
CUDA_VISIBLE_DEVICES=2 ./w4a8_moe_gemm_test ${num_experts} ${ffn1_n} ${ffn1_k} ${tokens_per_expert} 0 1 >> ffn1_${ffn1_n}_${ffn1_k}.log 2>&1 &
CUDA_VISIBLE_DEVICES=3 ./w4a8_moe_gemm_test ${num_experts} ${ffn2_n} ${ffn2_k} ${tokens_per_expert} 0 1 >> ffn2_${ffn2_n}_${ffn2_k}.log 2>&1 &
done
wait
echo "#### finish ####"

@@ -996,7 +996,6 @@ int main(int argc, char *argv[]) {
CutlassTileConfig::CtaShape64x256x64_WarpShape64x64x64,
CutlassTileConfig::CtaShape32x512x64_WarpShape32x128x64,
CutlassTileConfig::CtaShape128x128x64_WarpShape128x32x64,
CutlassTileConfig::CtaShape32x512x64_WarpShape32x128x64,
};
std::vector<SplitKStyle> all_split_k_style{SplitKStyle::NO_SPLIT_K};

View File

@@ -0,0 +1,60 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/extension.h"
std::vector<paddle::Tensor> GetImgBoundaries(const paddle::Tensor& task_input_ids,
const paddle::Tensor& grid_thw,
const int64_t image_patch_id) {
// All tensors are on the CPU
auto input_ids_ptr = task_input_ids.data<int64_t>();
int64_t seq_lens_origin = task_input_ids.numel();
auto grid_thw_ptr = grid_thw.data<int64_t>();
int token_times = 4;
int token_idx = 0;
int image_idx = 0;
std::vector<int> img_boundaries, img_nums;
img_boundaries.emplace_back(0);
img_nums.emplace_back(0);
while (token_idx < seq_lens_origin) {
if (input_ids_ptr[token_idx] != image_patch_id) {
do {
token_idx++;
} while (token_idx < seq_lens_origin && input_ids_ptr[token_idx] != image_patch_id);
} else {
int cur_image_token_len = (grid_thw_ptr[image_idx * 3 + 1] * grid_thw_ptr[image_idx * 3 + 2]) / token_times;
image_idx++;
token_idx += cur_image_token_len;
}
img_boundaries.emplace_back(token_idx);
img_nums.emplace_back(image_idx);
}
int64_t num_img_boundaries = static_cast<int64_t>(img_boundaries.size());
auto out = paddle::full({2, num_img_boundaries}, 0, paddle::DataType::INT64, paddle::CPUPlace());
for (int i = 0; i < num_img_boundaries; i++) {
out.data<int64_t>()[i] = img_boundaries[i];
out.data<int64_t>()[num_img_boundaries + i] = img_nums[i];
}
return {out};
}
PD_BUILD_OP(get_img_boundaries)
.Inputs({"task_input_ids", "grid_thw"})
.Attrs({"image_patch_id: int64_t"})
.Outputs({"img_boundaries"})
.SetKernelFn(PD_KERNEL(GetImgBoundaries));
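To make the scanning logic of `GetImgBoundaries` easier to follow, here is a standalone sketch that mirrors the loop with `std::vector` in place of `paddle::Tensor`. The `ImgBoundaries` struct and the lowercase function name are assumptions made for this illustration; the real op returns a single `{2, num_img_boundaries}` tensor instead.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result holder: row 0 and row 1 of the op's output tensor.
struct ImgBoundaries {
    std::vector<int64_t> boundaries;    // token index at which each segment ends
    std::vector<int64_t> image_counts;  // number of images consumed so far
};

inline ImgBoundaries get_img_boundaries(const std::vector<int64_t>& input_ids,
                                        const std::vector<int64_t>& grid_thw,
                                        int64_t image_patch_id) {
    const int64_t seq_len = static_cast<int64_t>(input_ids.size());
    const int64_t token_times = 4;  // same constant as in the op
    int64_t token_idx = 0;
    int64_t image_idx = 0;

    ImgBoundaries out;
    out.boundaries.push_back(0);
    out.image_counts.push_back(0);

    while (token_idx < seq_len) {
        if (input_ids[token_idx] != image_patch_id) {
            // Text segment: advance to the next image patch token or the end.
            do {
                ++token_idx;
            } while (token_idx < seq_len && input_ids[token_idx] != image_patch_id);
        } else {
            // Image segment: its length is (h * w) / token_times for this image.
            token_idx += grid_thw[image_idx * 3 + 1] * grid_thw[image_idx * 3 + 2] / token_times;
            ++image_idx;
        }
        out.boundaries.push_back(token_idx);
        out.image_counts.push_back(image_idx);
    }
    return out;
}
```

With `input_ids = {1, 2, 9, 9, 9, 9, 3}`, `grid_thw = {1, 4, 4}`, and `image_patch_id = 9`, the image spans 4 * 4 / 4 = 4 tokens, so the boundaries are `{0, 2, 6, 7}` and the image counts are `{0, 0, 1, 1}`.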

@@ -665,10 +665,139 @@ void moe_fast_hardamard_kernel(const T *x,
}
}
template <typename T, typename OutT, int kThreads, int kNBytes, int VecSize, int N,
int kNChunks, int kSmeSize, int kRounds, int kChunksPerSmemSize, bool UseDiagonalBlockMatrix = false>
__global__ __launch_bounds__(kThreads)
void masked_moe_fast_hardamard_kernel(const T *x,
const int64_t *recv_expert_count,
const T *shift,
const T *smooth,
const float* quant_scales,
const int quant_round_type,
const float quant_max_bound,
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
OutT *out) {
using vec_t = typename BytesToType<sizeof(T) * VecSize>::Type;
constexpr int kLogVecSize = cilog2(VecSize);
constexpr int kLogWarpSize = cilog2(32);
constexpr int kWarpSize = 32;
constexpr int kNWarps = kThreads / kWarpSize;
constexpr int kLogNWarps = cilog2(kNWarps);
constexpr int kLogNChunks = cilog2(kNChunks);
extern __shared__ char smem_[];
vec_t *smem_exchange = reinterpret_cast<vec_t *>(smem_);
for (int token_id = blockIdx.x; token_id < token_num; token_id += gridDim.x) {
const auto token_idx_in_expert = token_id % num_max_tokens_per_expert;
const auto expert_id = token_id / num_max_tokens_per_expert;
if (token_idx_in_expert >= recv_expert_count[expert_id]) {
auto next_expert_start_idx = (expert_id + 1) * num_max_tokens_per_expert;
auto num_iters_to_next_expert = (next_expert_start_idx - token_id - 1) / gridDim.x;
token_id += num_iters_to_next_expert * gridDim.x;
continue;
}
const T *x_now = x + token_id * dim;
OutT *out_now = out + token_id * dim;
T init_value = static_cast<T>(0.f);
T x_vals[kNChunks][VecSize] = {init_value};
load_input<kNChunks, VecSize, UseDiagonalBlockMatrix, T>(x_now, x_vals, dim);
#ifdef DEBUG_HARDAMARD
if (blockIdx.x == 0 && threadIdx.x == 0) {
for (int i = 0; i < 1; ++i) {
printf("chunk_id0: %d\n", i);
for (int j = 0; j < VecSize; ++j) {
printf("%f ", (float)x_vals[i][j]);
}
printf("\n");
}
}
__syncthreads();
#endif
hadamard_mult_thread<kLogVecSize, kNChunks>(x_vals);
#ifdef DEBUG_HARDAMARD
if (blockIdx.x == 0 && threadIdx.x == 0) {
for (int i = 0; i < 1; ++i) {
printf("chunk_id1: %d, kLogVecSize: %d\n", i, kLogVecSize);
for (int j = 0; j < VecSize; ++j) {
printf("%f ", (float)x_vals[i][j]);
}
printf("\n");
}
}
__syncthreads();
#endif
hadamard_mult_warp<kLogWarpSize, 0, kNChunks, VecSize>(x_vals);
#ifdef DEBUG_HARDAMARD
if (blockIdx.x == 0 && threadIdx.x == 0) {
for (int i = 0; i < 1; ++i) {
printf("chunk_id2: %d\n", i);
for (int j = 0; j < VecSize; ++j) {
printf("%f ", (float)x_vals[i][j]);
}
printf("\n");
}
}
__syncthreads();
#endif
if constexpr (kNWarps > 1) {
// First let NWARPS consecutive threads fetch the data held by the other warps
exchange_smem_pre<kNChunks, kChunksPerSmemSize, VecSize, kWarpSize, kNWarps, true, vec_t>(x_vals, smem_exchange);
// Compute across warps
hadamard_mult_warp<kLogNWarps, 0, kNChunks, VecSize>(x_vals);
// Then swap the data back
exchange_smem_pre<kNChunks, kChunksPerSmemSize, VecSize, kWarpSize, kNWarps, false, vec_t>(x_vals, smem_exchange);
}
if constexpr (kNChunks > 1) {
if constexpr (kNChunks == 28) {
hadamard_mult_thread_28_transpose<T, VecSize>(x_vals);
} else if constexpr (kNChunks == 36) {
hadamard_mult_thread_36_transpose<T, VecSize>(x_vals);
} else {
constexpr int kLogNChunks = cilog2(kNChunks);
static_assert(1 << kLogNChunks == kNChunks, "kNChunks must be a power of 2");
hadamard_mult_thread_transpose<kLogNChunks, VecSize>(x_vals);
}
}
if (quant_scales) {
float quant_scale = quant_scales[expert_id];
if (shift) {
smooth_quant_store_output<kNChunks, VecSize, UseDiagonalBlockMatrix, T, OutT>(
out_now,
shift,
smooth,
x_vals,
quant_scale,
quant_round_type,
quant_max_bound,
quant_min_bound,
dim);
} else {
quant_store_output<kNChunks, VecSize, UseDiagonalBlockMatrix, T, OutT>(
out_now,
x_vals,
quant_scale,
quant_round_type,
quant_max_bound,
quant_min_bound,
dim);
}
} else {
store_output<kNChunks, VecSize, UseDiagonalBlockMatrix, T>(out_now, x_vals, dim);
}
}
}
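The grid-stride loop in `masked_moe_fast_hardamard_kernel` skips padded slots by jumping ahead to the first stride-aligned token at or beyond the next expert's start. The host-side sketch below emulates which token ids a single block would process. It assumes `token_num` equals the number of experts times `num_max_tokens_per_expert`, which the patch does not state, and `tokens_for_block` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Token ids that block `block_id` of a `grid_dim`-block grid would process.
inline std::vector<int64_t> tokens_for_block(int64_t block_id,
                                             int64_t grid_dim,
                                             const std::vector<int64_t>& recv_expert_count,
                                             int64_t max_tokens_per_expert) {
    // Assumption: every expert owns a fixed-size slab of max_tokens_per_expert slots.
    const int64_t token_num =
        static_cast<int64_t>(recv_expert_count.size()) * max_tokens_per_expert;
    std::vector<int64_t> visited;
    for (int64_t token_id = block_id; token_id < token_num; token_id += grid_dim) {
        const int64_t idx_in_expert = token_id % max_tokens_per_expert;
        const int64_t expert_id = token_id / max_tokens_per_expert;
        if (idx_in_expert >= recv_expert_count[expert_id]) {
            // Padded slot: jump to the last stride-aligned id before the next
            // expert's slab, so the loop increment lands at or beyond its start.
            const int64_t next_start = (expert_id + 1) * max_tokens_per_expert;
            token_id += (next_start - token_id - 1) / grid_dim * grid_dim;
            continue;
        }
        visited.push_back(token_id);
    }
    return visited;
}
```

With `recv_expert_count = {2, 0, 3}`, 8 slots per expert, and a grid of 3 blocks, the valid tokens are 0, 1, 16, 17, and 18; block 0 handles `{0, 18}`, block 1 handles `{1, 16}`, and block 2 handles `{17}`, so every valid token is processed exactly once.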
template <typename T, typename OutT, int kLogN, int VecSize, int kNChunks, int kThreads, bool UseDiagonalBlockMatrix>
void MoeFastHardamardImplWrapper(const T *x,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const T *shift,
const T *smooth,
const float* quant_scales,
@@ -677,6 +806,8 @@ void MoeFastHardamardImplWrapper(const T *x,
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
OutT* out,
cudaStream_t stream) {
using nv_type = typename nv_type_traits<T>::type;
@@ -696,34 +827,61 @@ void MoeFastHardamardImplWrapper(const T *x,
int sm_count;
int act_blocks_per_sm;
cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev_id);
auto kernel = moe_fast_hardamard_kernel<nv_type, out_type, kThreads, kNBytes, VecSize, N, kNChunks, kSmemSize, kRounds, kChunksPerSmemSize, UseDiagonalBlockMatrix>;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&act_blocks_per_sm, kernel, kThreads, kSmemSize);
const int num_blocks_per_wave = sm_count * act_blocks_per_sm;
dim3 grid;
grid.x = min(static_cast<int64_t>(num_blocks_per_wave), token_num);
if constexpr (UseDiagonalBlockMatrix) {
grid.y = ceil(dim / (kThreads * VecSize));
if (used_in_ep_low_latency) {
auto masked_kernel = masked_moe_fast_hardamard_kernel<nv_type, out_type, kThreads, kNBytes, VecSize, N, kNChunks, kSmemSize, kRounds, kChunksPerSmemSize, UseDiagonalBlockMatrix>;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&act_blocks_per_sm, masked_kernel, kThreads, kSmemSize);
const int num_blocks_per_wave = sm_count * act_blocks_per_sm;
dim3 grid;
grid.x = min(static_cast<int64_t>(num_blocks_per_wave), token_num);
if constexpr (UseDiagonalBlockMatrix) {
grid.y = ceil(dim / (kThreads * VecSize));
}
masked_kernel<<<grid, kThreads, kSmemSize, stream>>>(
reinterpret_cast<const nv_type*>(x),
recv_expert_count,
reinterpret_cast<const nv_type*>(shift),
reinterpret_cast<const nv_type*>(smooth),
quant_scales,
quant_round_type,
quant_max_bound,
quant_min_bound,
token_num,
dim,
num_max_tokens_per_expert,
reinterpret_cast<out_type*>(out)
);
} else {
auto kernel = moe_fast_hardamard_kernel<nv_type, out_type, kThreads, kNBytes, VecSize, N, kNChunks, kSmemSize, kRounds, kChunksPerSmemSize, UseDiagonalBlockMatrix>;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&act_blocks_per_sm, kernel, kThreads, kSmemSize);
const int num_blocks_per_wave = sm_count * act_blocks_per_sm;
dim3 grid;
grid.x = min(static_cast<int64_t>(num_blocks_per_wave), token_num);
if constexpr (UseDiagonalBlockMatrix) {
grid.y = ceil(dim / (kThreads * VecSize));
}
kernel<<<grid, kThreads, kSmemSize, stream>>>(
reinterpret_cast<const nv_type*>(x),
expert_idx_per_token,
reinterpret_cast<const nv_type*>(shift),
reinterpret_cast<const nv_type*>(smooth),
quant_scales,
quant_round_type,
quant_max_bound,
quant_min_bound,
token_num,
dim,
reinterpret_cast<out_type*>(out)
);
}
kernel<<<grid, kThreads, kSmemSize, stream>>>(
reinterpret_cast<const nv_type*>(x),
expert_idx_per_token,
reinterpret_cast<const nv_type*>(shift),
reinterpret_cast<const nv_type*>(smooth),
quant_scales,
quant_round_type,
quant_max_bound,
quant_min_bound,
token_num,
dim,
reinterpret_cast<out_type*>(out)
);
CUDA_CHECK(cudaDeviceSynchronize());
}
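The grid sizing above evaluates `ceil(dim / (kThreads * VecSize))` on integer operands, so the quotient is truncated before `ceil` sees it. The two agree whenever `dim` is an exact multiple of `kThreads * VecSize`. For other sizes, an integer ceiling division may be what is wanted. A minimal sketch, assuming hypothetical helper names `ceil_div` and `grid_y_for` that do not appear in the source:

```cpp
#include <cstdint>

// Hypothetical helper (not part of the source): rounds the quotient up
// using integer arithmetic only. Assumes a >= 0 and b > 0.
constexpr int64_t ceil_div(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Hypothetical illustration: number of blocks along y needed to cover
// `dim` elements when each block handles threads * vec_size elements.
constexpr int64_t grid_y_for(int64_t dim, int threads, int vec_size) {
  return ceil_div(dim, static_cast<int64_t>(threads) * vec_size);
}
```

Whether a partially filled trailing block is valid for this kernel depends on its bounds handling, which is not shown in this hunk.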
template <typename T, typename OutT>
void MoeFastHardamardWrapper(const T *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const T *shift,
const T *smooth,
const float* quant_scales,
@@ -732,6 +890,8 @@ void MoeFastHardamardWrapper(const T *x_data,
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
OutT* out,
cudaStream_t &stream) {
bool FLAGS_hardamard_use_diagonal_block_matrix = true;
@@ -749,6 +909,7 @@ void MoeFastHardamardWrapper(const T *x_data,
MoeFastHardamardImplWrapper<T, OutT, kLogN, VEC_SIZE, kNChunks, kThreads, true>(
x_data,
expert_idx_per_token,
recv_expert_count,
shift,
smooth,
quant_scales,
@@ -757,6 +918,8 @@ void MoeFastHardamardWrapper(const T *x_data,
quant_min_bound,
token_num,
dim,
num_max_tokens_per_expert,
used_in_ep_low_latency,
out,
stream);
})});
@@ -770,6 +933,7 @@ void MoeFastHardamardWrapper(const T *x_data,
MoeFastHardamardImplWrapper<T, OutT, kLogN, VecSize, kNChunks, kThreads, false>(
x_data,
expert_idx_per_token,
recv_expert_count,
shift,
smooth,
quant_scales,
@@ -778,6 +942,8 @@ void MoeFastHardamardWrapper(const T *x_data,
quant_min_bound,
token_num,
dim,
num_max_tokens_per_expert,
used_in_ep_low_latency,
out,
stream);
});
@@ -790,6 +956,7 @@ void MoeFastHardamardWrapper(const T *x_data,
MoeFastHardamardImplWrapper<T, OutT, kLogN, VecSize, kNChunks, kThreads, false>(
x_data,
expert_idx_per_token,
recv_expert_count,
shift,
smooth,
quant_scales,
@@ -798,6 +965,8 @@ void MoeFastHardamardWrapper(const T *x_data,
quant_min_bound,
token_num,
dim,
num_max_tokens_per_expert,
used_in_ep_low_latency,
out,
stream);
});
@@ -810,6 +979,7 @@ void MoeFastHardamardWrapper(const T *x_data,
MoeFastHardamardImplWrapper<T, OutT, kLogN, VecSize, kNChunks, kThreads, false>(
x_data,
expert_idx_per_token,
recv_expert_count,
shift,
smooth,
quant_scales,
@@ -818,6 +988,8 @@ void MoeFastHardamardWrapper(const T *x_data,
quant_min_bound,
token_num,
dim,
num_max_tokens_per_expert,
used_in_ep_low_latency,
out,
stream);
});
@@ -828,6 +1000,7 @@ void MoeFastHardamardWrapper(const T *x_data,
template void MoeFastHardamardWrapper<phi::dtype::float16, phi::dtype::float16>(
const phi::dtype::float16 *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const phi::dtype::float16 *shift,
const phi::dtype::float16 *smooth,
const float* quant_scales,
@@ -836,6 +1009,8 @@ template void MoeFastHardamardWrapper<phi::dtype::float16, phi::dtype::float16>(
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
phi::dtype::float16 *out,
cudaStream_t &stream
);
@@ -843,6 +1018,7 @@ template void MoeFastHardamardWrapper<phi::dtype::float16, phi::dtype::float16>(
template void MoeFastHardamardWrapper<phi::dtype::float16, int8_t>(
const phi::dtype::float16 *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const phi::dtype::float16 *shift,
const phi::dtype::float16 *smooth,
const float* quant_scales,
@@ -851,6 +1027,8 @@ template void MoeFastHardamardWrapper<phi::dtype::float16, int8_t>(
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
int8_t *out,
cudaStream_t &stream
);
@@ -858,6 +1036,7 @@ template void MoeFastHardamardWrapper<phi::dtype::float16, int8_t>(
template void MoeFastHardamardWrapper<phi::dtype::bfloat16, phi::dtype::bfloat16>(
const phi::dtype::bfloat16 *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const phi::dtype::bfloat16 *shift,
const phi::dtype::bfloat16 *smooth,
const float* quant_scales,
@@ -866,6 +1045,8 @@ template void MoeFastHardamardWrapper<phi::dtype::bfloat16, phi::dtype::bfloat16
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
phi::dtype::bfloat16 *out,
cudaStream_t &stream
);
@@ -873,6 +1054,7 @@ template void MoeFastHardamardWrapper<phi::dtype::bfloat16, phi::dtype::bfloat16
template void MoeFastHardamardWrapper<phi::dtype::bfloat16, int8_t>(
const phi::dtype::bfloat16 *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const phi::dtype::bfloat16 *shift,
const phi::dtype::bfloat16 *smooth,
const float* quant_scales,
@@ -881,6 +1063,8 @@ template void MoeFastHardamardWrapper<phi::dtype::bfloat16, int8_t>(
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
int8_t *out,
cudaStream_t &stream
);

@@ -21,6 +21,7 @@
template <typename T, typename OutT>
void MoeFastHardamardWrapper(const T *x_data,
const int64_t *expert_idx_per_token,
const int64_t *recv_expert_count,
const T *shift,
const T *smooth,
const float* quant_scales,
@@ -29,5 +30,7 @@ void MoeFastHardamardWrapper(const T *x_data,
const float quant_min_bound,
const int64_t token_num,
const int64_t dim,
const int num_max_tokens_per_expert,
bool used_in_ep_low_latency,
OutT* out,
cudaStream_t &stream);

@@ -240,6 +240,7 @@ void MoeFFNKernel(const paddle::Tensor& permute_input,
MoeFastHardamardWrapper<data_t, int8_t>(
act_out_tensor.data<data_t>(),
expert_idx_per_token ? expert_idx_per_token.get().data<int64_t>() : nullptr,
const_cast<int64_t*>(tokens_expert_prefix_sum.data<int64_t>()),
down_proj_shift, // down_proj_shift->data<T>(),
down_proj_smooth, // down_proj_smooth->data<T>(),
down_proj_in_scale ? const_cast<paddle::Tensor*>(down_proj_in_scale.get_ptr())->data<float>() : nullptr,
@@ -248,6 +249,8 @@ void MoeFFNKernel(const paddle::Tensor& permute_input,
-127.0,
expanded_active_expert_rows,
inter_size / 2,
num_max_tokens_per_expert,
used_in_ep_low_latency,
reinterpret_cast<int8_t *>(int8_act_out->ptr()),
stream
);

@@ -49,12 +49,13 @@ void WeightOnlyMoeFFNKernel(const paddle::Tensor& permute_input,
typename WeightOnlyTraits::Arguments up_gate_proj_quant_args;
typename WeightOnlyTraits::Arguments down_proj_quant_args;
if constexpr (QuantMethod == cutlass::WintQuantMethod::kWeightOnlyInt2) {
up_gate_proj_quant_args.local_scale_ptr = up_gate_proj_local_scale->data<uint8_t>();
up_gate_proj_quant_args.code_scale_ptr = up_gate_proj_code_scale->data<float>();
up_gate_proj_quant_args.code_zp_ptr = up_gate_proj_code_zp->data<float>();
down_proj_quant_args.local_scale_ptr = down_proj_local_scale->data<uint8_t>();
down_proj_quant_args.code_scale_ptr = down_proj_code_scale->data<float>();
down_proj_quant_args.code_zp_ptr = down_proj_code_zp->data<float>();
up_gate_proj_quant_args.local_scale_ptr = const_cast<uint8_t*>(up_gate_proj_local_scale->data<uint8_t>());
up_gate_proj_quant_args.code_scale_ptr = const_cast<float*>(up_gate_proj_code_scale->data<float>());
up_gate_proj_quant_args.code_zp_ptr = const_cast<float*>(up_gate_proj_code_zp->data<float>());
down_proj_quant_args.local_scale_ptr = const_cast<uint8_t*>(down_proj_local_scale->data<uint8_t>());
down_proj_quant_args.code_scale_ptr = const_cast<float*>(down_proj_code_scale->data<float>());
down_proj_quant_args.code_zp_ptr = const_cast<float*>(down_proj_code_zp->data<float>());
}
auto moe_gemm_runner = MoeGemmRunner<NvType, WeightOnlyTraits>();

@@ -180,7 +180,7 @@ void token_penalty_multi_scores_kernel(
int64_t token_num = shape[0];
int64_t length = shape[1];
int64_t length_id = pre_ids.shape()[1];
int64_t length_bad_words = bad_tokens.shape()[0];
int64_t length_bad_words = bad_tokens.shape()[1];
int64_t end_length = eos_token_id.shape()[0];

@@ -30,30 +30,62 @@ __global__ void set_value_by_flags(bool *stop_flags,
const int *seq_lens,
const int bs,
const int end_length,
const int64_t *pre_ids,
const int pre_ids_len,
const int64_t *step_idx,
const int64_t *stop_seqs,
const int *stop_seqs_len,
const int stop_seqs_bs,
const int stop_seqs_max_len,
bool beam_search,
bool prefill_one_step_stop) {
int tid = threadIdx.x;
if (tid < bs) {
if (prefill_one_step_stop) {
stop_flags[tid] = true;
if (seq_lens[tid] == 0) {
topk_ids[tid] = -1;
}
next_tokens[tid] = topk_ids[tid];
} else {
if (stop_flags[tid]) {
if (seq_lens[tid] == 0) {
topk_ids[tid] = -1;
} else {
topk_ids[tid] = end_ids[0];
next_tokens[tid] = end_ids[0];
int bid = blockIdx.x;
if (tid >= stop_seqs_bs) return;
if (bid < bs) {
if(tid == 0){
if (prefill_one_step_stop) {
stop_flags[bid] = true;
if (seq_lens[bid] == 0) {
topk_ids[bid] = -1;
}
next_tokens[bid] = topk_ids[bid];
} else {
next_tokens[tid] = topk_ids[tid];
if (stop_flags[bid]) {
if (seq_lens[bid] == 0) {
topk_ids[bid] = -1;
} else {
topk_ids[bid] = end_ids[0];
next_tokens[bid] = end_ids[0];
}
} else {
next_tokens[bid] = topk_ids[bid];
}
}
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
stop_flags[bid] = true;
}
}
if (!beam_search && is_in_end(topk_ids[tid], end_ids, end_length)) {
stop_flags[tid] = true;
// handle stop_seqs
const int stop_seq_len = (stop_seqs_len + bid * stop_seqs_bs)[tid];
if (stop_seq_len <= 0) return;
const int64_t *stop_seq_now = stop_seqs + bid * stop_seqs_bs + tid * stop_seqs_max_len;
const int64_t *pre_ids_now = pre_ids + bid * pre_ids_len;
const int64_t step_idx_now = step_idx[bid];
bool is_end = true;
int count = 1;
for (int i = stop_seq_len - 1; i >= 0; --i) {
if ((step_idx_now - count) < 0 ||
pre_ids_now[step_idx_now - count++] != stop_seq_now[i]) {
is_end = false;
break;
}
}
if (is_end) {
next_tokens[bid] = end_ids[0];
stop_flags[bid] = true;
topk_ids[bid] = end_ids[0];
}
}
}
@@ -63,6 +95,10 @@ void GetStopFlagsMulti(const paddle::Tensor &topk_ids,
const paddle::Tensor &seq_lens,
const paddle::Tensor &end_ids,
const paddle::Tensor &next_tokens,
const paddle::Tensor &pre_ids,
const paddle::Tensor &step_idx,
const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len,
const bool beam_search) {
PD_CHECK(topk_ids.dtype() == paddle::DataType::INT64);
PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL);
@@ -83,8 +119,10 @@ void GetStopFlagsMulti(const paddle::Tensor &topk_ids,
std::vector<int64_t> shape = topk_ids.shape();
int64_t bs_now = shape[0];
int64_t end_length = end_ids.shape()[0];
int block_size = (bs_now + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
set_value_by_flags<<<1, block_size, 0, cu_stream>>>(
int stop_seqs_bs = stop_seqs.shape()[1];
int stop_seqs_max_len = stop_seqs.shape()[2];
int block_size = (stop_seqs_bs + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
set_value_by_flags<<<bs_now, block_size, 0, cu_stream>>>(
const_cast<bool *>(stop_flags.data<bool>()),
const_cast<int64_t *>(topk_ids.data<int64_t>()),
const_cast<int64_t *>(next_tokens.data<int64_t>()),
@@ -92,12 +130,19 @@ void GetStopFlagsMulti(const paddle::Tensor &topk_ids,
seq_lens.data<int>(),
bs_now,
end_length,
pre_ids.data<int64_t>(),
pre_ids.shape()[1],
step_idx.data<int64_t>(),
stop_seqs.data<int64_t>(),
stop_seqs_len.data<int>(),
stop_seqs_bs,
stop_seqs_max_len,
beam_search,
prefill_one_step_stop);
}
PD_BUILD_STATIC_OP(set_stop_value_multi_ends)
.Inputs({"topk_ids", "stop_flags", "seq_lens", "end_ids", "next_tokens"})
.Inputs({"topk_ids", "stop_flags", "seq_lens", "end_ids", "next_tokens", "pre_ids", "step_idx", "stop_seqs", "stop_seqs_len"})
.Attrs({"beam_search: bool"})
.Outputs({"topk_ids_out", "stop_flags_out", "next_tokens_out"})
.SetInplaceMap({{"topk_ids", "topk_ids_out"},
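The stop-sequence loop in `set_value_by_flags` walks one stop sequence from its last token to its first and compares it against the tokens just before `step_idx` in `pre_ids`. A host-side sketch of that comparison for a single request and a single stop sequence follows. The function name `ends_with_stop_seq` and the use of `std::vector` are assumptions made for illustration, and the upper bounds guard is an addition that the kernel does not have:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not part of the source).
// Returns true when the stop_seq.size() tokens that end just before
// position step_idx in pre_ids are equal to stop_seq.
bool ends_with_stop_seq(const std::vector<int64_t>& pre_ids,
                        int64_t step_idx,
                        const std::vector<int64_t>& stop_seq) {
  // Mirrors the kernel's early return for stop_seq_len <= 0.
  if (stop_seq.empty()) return false;
  int64_t count = 1;
  for (int64_t i = static_cast<int64_t>(stop_seq.size()) - 1; i >= 0; --i) {
    const int64_t pos = step_idx - count;
    // pos < 0 mirrors the kernel check; the upper bound is a host-side
    // safety guard added for this sketch only.
    if (pos < 0 || pos >= static_cast<int64_t>(pre_ids.size())) return false;
    if (pre_ids[pos] != stop_seq[i]) return false;
    ++count;
  }
  return true;
}
```

In the kernel, a match sets `next_tokens[bid]` and `topk_ids[bid]` to `end_ids[0]` and raises `stop_flags[bid]`; the sketch only reproduces the matching predicate.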

@@ -1,133 +0,0 @@
// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include "paddle/extension.h"
#include "helper.h"
#ifndef PD_BUILD_STATIC_OP
#define PD_BUILD_STATIC_OP(name) PD_BUILD_OP(static_op_##name)
#endif
__global__ void set_value_by_stop_seqs(bool *stop_flags,
int64_t *topk_ids,
const int64_t *pre_ids,
const int64_t *step_idx,
const int64_t *stop_seqs,
const int *stop_seqs_len,
const int *seq_lens,
const int64_t *end_ids,
const int bs,
const int stop_seqs_bs,
const int stop_seqs_max_len,
const int pre_ids_len) {
const int bid = blockIdx.x;
const int tid = threadIdx.x;
if (tid >= stop_seqs_bs) return;
const int stop_seq_len = stop_seqs_len[tid];
if (stop_seq_len <= 0) return;
const int64_t *stop_seq_now = stop_seqs + tid * stop_seqs_max_len;
const int64_t *pre_ids_now = pre_ids + bid * pre_ids_len;
const int64_t step_idx_now = step_idx[bid];
if (bid < bs) {
if (stop_flags[bid]) { // length limit exceeded: set the current position to 2
topk_ids[bid] = end_ids[0];
if (seq_lens[bid] == 0) { // already terminated: set the current position to -1
topk_ids[bid] = -1;
}
return;
}
bool is_end = true;
int count = 1;
if (topk_ids[bid] == end_ids[0]) {
if (tid == 0) {
stop_flags[bid] = true;
}
return;
}
for (int i = stop_seq_len - 1; i >= 0; --i) {
if ((step_idx_now - count) < 0 ||
pre_ids_now[step_idx_now - count++] != stop_seq_now[i]) {
is_end = false;
break;
}
}
if (is_end) {
topk_ids[bid] = end_ids[0];
stop_flags[bid] = true;
}
}
}
void GetStopFlagsMultiSeqs(const paddle::Tensor &topk_ids,
const paddle::Tensor &pre_ids,
const paddle::Tensor &step_idx,
const paddle::Tensor &stop_flags,
const paddle::Tensor &seq_lens,
const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len,
const paddle::Tensor &end_ids) {
PD_CHECK(topk_ids.dtype() == paddle::DataType::INT64);
PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL);
#ifdef PADDLE_WITH_CUSTOM_DEVICE
auto dev_ctx = static_cast<const phi::CustomContext*>(paddle::experimental::DeviceContextPool::Instance().Get(topk_ids.place()));
auto cu_stream = dev_ctx->stream();
#else
auto cu_stream = topk_ids.stream();
#endif
std::vector<int64_t> shape = topk_ids.shape();
std::vector<int64_t> stop_seqs_shape = stop_seqs.shape();
int bs_now = shape[0];
int stop_seqs_bs = stop_seqs_shape[0];
int stop_seqs_max_len = stop_seqs_shape[1];
int pre_ids_len = pre_ids.shape()[1];
int block_size = (stop_seqs_bs + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
set_value_by_stop_seqs<<<bs_now, block_size, 0, cu_stream>>>(
const_cast<bool *>(stop_flags.data<bool>()),
const_cast<int64_t *>(topk_ids.data<int64_t>()),
pre_ids.data<int64_t>(),
step_idx.data<int64_t>(),
stop_seqs.data<int64_t>(),
stop_seqs_len.data<int>(),
seq_lens.data<int>(),
end_ids.data<int64_t>(),
bs_now,
stop_seqs_bs,
stop_seqs_max_len,
pre_ids_len);
}
PD_BUILD_STATIC_OP(set_stop_value_multi_seqs)
.Inputs({"topk_ids",
"pre_ids",
"step_idx",
"stop_flags",
"seq_lens",
"stop_seqs",
"stop_seqs_len",
"end_ids"})
.Outputs({"topk_ids_out", "stop_flags_out"})
.SetInplaceMap({{"topk_ids", "topk_ids_out"},
{"stop_flags", "stop_flags_out"}})
.SetKernelFn(PD_KERNEL(GetStopFlagsMultiSeqs));

@@ -171,7 +171,7 @@ void token_penalty_multi_scores_kernel(const paddle::Tensor &pre_ids,
int64_t vocab_size = shape[1];
int64_t max_dec_len = pre_ids.shape()[1];
int64_t bad_words_len = bad_tokens.shape()[0];
int64_t bad_words_len = bad_tokens.shape()[1];
int64_t eos_len = eos_token_id.shape()[0];
int64_t max_model_len = prompt_ids.shape()[1];

@@ -256,11 +256,11 @@ elif paddle.is_compiled_with_cuda():
"gpu_ops/gather_idx.cu",
"gpu_ops/get_output_ep.cc",
"gpu_ops/get_mm_split_fuse.cc",
"gpu_ops/get_img_boundaries.cc",
"gpu_ops/token_penalty_multi_scores.cu",
"gpu_ops/token_penalty_only_once.cu",
"gpu_ops/stop_generation.cu",
"gpu_ops/stop_generation_multi_ends.cu",
"gpu_ops/stop_generation_multi_stop_seqs.cu",
"gpu_ops/set_flags.cu",
"gpu_ops/update_inputs_v1.cu",
"gpu_ops/recover_decode_task.cu",
@@ -529,7 +529,6 @@ elif paddle.is_compiled_with_custom_device("iluvatar_gpu"):
sources=[
"gpu_ops/get_padding_offset.cu",
"gpu_ops/set_value_by_flags.cu",
"gpu_ops/stop_generation_multi_stop_seqs.cu",
"gpu_ops/rebuild_padding.cu",
"gpu_ops/update_inputs.cu",
"gpu_ops/stop_generation_multi_ends.cu",

@@ -0,0 +1,68 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <paddle/phi/backends/xpu/xpu_context.h>
#include "paddle/extension.h"
#include "paddle/phi/core/enforce.h"
#include "xpu/plugin.h"
void RecoverDecodeTask(const paddle::Tensor &stop_flags,
const paddle::Tensor &seq_lens_this_time,
const paddle::Tensor &seq_lens_encoder,
const paddle::Tensor &seq_lens_decoder,
const paddle::Tensor &step_seq_lens_decoder,
const paddle::Tensor &block_tables,
const paddle::Tensor &is_block_step,
const int block_size) {
phi::XPUPlace place(phi::backends::xpu::GetXPUCurrentDeviceId());
auto dev_ctx =
paddle::experimental::DeviceContextPool::Instance().Get(place);
auto xpu_ctx = static_cast<const phi::XPUContext *>(dev_ctx);
const int bsz = seq_lens_this_time.shape()[0];
const int block_num_per_seq = block_tables.shape()[1];
int r = baidu::xpu::api::plugin::recover_decode_task(
xpu_ctx->x_context(),
const_cast<bool *>(stop_flags.data<bool>()),
const_cast<int *>(seq_lens_this_time.data<int>()),
const_cast<int *>(seq_lens_encoder.data<int>()),
const_cast<int *>(seq_lens_decoder.data<int>()),
const_cast<int *>(step_seq_lens_decoder.data<int>()),
const_cast<int *>(block_tables.data<int>()),
const_cast<bool *>(is_block_step.data<bool>()),
bsz,
block_num_per_seq,
block_size);
PD_CHECK(r == 0, "baidu::xpu::api::plugin::recover_decode_task failed.");
}
PD_BUILD_OP(recover_decode_task)
.Inputs({"stop_flags",
"seq_lens_this_time",
"seq_lens_encoder",
"seq_lens_decoder",
"step_seq_lens_decoder",
"block_tables",
"is_block_step"})
.Attrs({"block_size: int"})
.Outputs({"seq_lens_this_time_out",
"seq_lens_encoder_out",
"seq_lens_decoder_out",
"stop_flags_out",
"is_block_step_out"})
.SetInplaceMap({{"seq_lens_this_time", "seq_lens_this_time_out"},
{"seq_lens_encoder", "seq_lens_encoder_out"},
{"seq_lens_decoder", "seq_lens_decoder_out"},
{"stop_flags", "stop_flags_out"},
{"is_block_step", "is_block_step_out"}})
.SetKernelFn(PD_KERNEL(RecoverDecodeTask));

@@ -0,0 +1,105 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <paddle/phi/backends/xpu/xpu_context.h>
#include "paddle/extension.h"
#include "paddle/phi/core/enforce.h"
#include "xpu/plugin.h"
void UpdateInputesV1(const paddle::Tensor &stop_flags,
const paddle::Tensor &not_need_stop, // only on cpu
const paddle::Tensor &seq_lens_this_time,
const paddle::Tensor &seq_lens_encoder,
const paddle::Tensor &seq_lens_decoder,
const paddle::Tensor &step_seq_lens_decoder,
const paddle::Tensor &prompt_lens,
const paddle::Tensor &topk_ids,
const paddle::Tensor &input_ids,
const paddle::Tensor &block_tables,
const paddle::Tensor &stop_nums,
const paddle::Tensor &next_tokens,
const paddle::Tensor &is_block_step,
const int block_size) {
phi::XPUPlace place(phi::backends::xpu::GetXPUCurrentDeviceId());
auto dev_ctx =
paddle::experimental::DeviceContextPool::Instance().Get(place);
auto xpu_ctx = static_cast<const phi::XPUContext *>(dev_ctx);
const int max_bsz = stop_flags.shape()[0];
const int now_bsz = seq_lens_this_time.shape()[0];
// std::cout << "now_bsz: " << now_bsz << std::endl;
const int input_ids_stride = input_ids.shape()[1];
const int block_num_per_seq = block_tables.shape()[1];
auto not_need_stop_gpu = not_need_stop.copy_to(stop_flags.place(), false);
int r = baidu::xpu::api::plugin::update_inputs_v1(
xpu_ctx->x_context(),
const_cast<bool *>(not_need_stop_gpu.data<bool>()),
const_cast<int *>(seq_lens_this_time.data<int>()),
const_cast<int *>(seq_lens_encoder.data<int>()),
const_cast<int *>(seq_lens_decoder.data<int>()),
const_cast<int *>(step_seq_lens_decoder.data<int>()),
const_cast<int64_t *>(prompt_lens.data<int64_t>()),
const_cast<int64_t *>(topk_ids.data<int64_t>()),
const_cast<int64_t *>(input_ids.data<int64_t>()),
const_cast<int *>(block_tables.data<int>()),
stop_nums.data<int64_t>(),
const_cast<bool *>(stop_flags.data<bool>()),
const_cast<bool *>(is_block_step.data<bool>()),
next_tokens.data<int64_t>(),
now_bsz,
max_bsz,
input_ids_stride,
block_num_per_seq,
block_size);
PD_CHECK(r == 0, "baidu::xpu::api::plugin::update_inputs_kernel_v1 failed.");
auto not_need_stop_cpu =
not_need_stop_gpu.copy_to(not_need_stop.place(), false);
bool *not_need_stop_data = const_cast<bool *>(not_need_stop.data<bool>());
not_need_stop_data[0] = not_need_stop_cpu.data<bool>()[0];
}
PD_BUILD_OP(update_inputs_v1)
.Inputs({"stop_flags",
"not_need_stop",
"seq_lens_this_time",
"seq_lens_encoder",
"seq_lens_decoder",
"step_seq_lens_decoder",
"prompt_lens",
"topk_ids",
"input_ids",
"block_tables",
"stop_nums",
"next_tokens",
"is_block_step"})
.Attrs({"block_size: int"})
.Outputs({"not_need_stop_out",
"seq_lens_this_time_out",
"seq_lens_encoder_out",
"seq_lens_decoder_out",
"step_seq_lens_decoder_out",
"topk_ids_out",
"input_ids_out",
"stop_flags_out",
"is_block_step_out"})
.SetInplaceMap({{"not_need_stop", "not_need_stop_out"},
{"seq_lens_this_time", "seq_lens_this_time_out"},
{"seq_lens_encoder", "seq_lens_encoder_out"},
{"seq_lens_decoder", "seq_lens_decoder_out"},
{"topk_ids", "topk_ids_out"},
{"input_ids", "input_ids_out"},
{"stop_flags", "stop_flags_out"},
{"step_seq_lens_decoder", "step_seq_lens_decoder_out"},
{"is_block_step", "is_block_step_out"}})
.SetKernelFn(PD_KERNEL(UpdateInputesV1));

@@ -86,6 +86,39 @@ recover_block(Context *ctx,
const int block_num_per_seq, const int length,
const int pre_id_length);
DLL_EXPORT int
recover_decode_task(Context *ctx, bool *stop_flags,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int *block_tables,
bool *is_block_step,
const int bsz,
const int block_num_per_seq,
const int block_size);
DLL_EXPORT int
update_inputs_v1(Context *ctx, bool *not_need_stop,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int64_t *prompt_lens,
int64_t *topk_ids,
int64_t *input_ids,
int *block_tables,
const int64_t *stop_nums,
bool *stop_flags,
bool *is_block_step,
const int64_t *next_tokens,
const int bsz,
const int max_bsz,
const int input_ids_stride,
const int block_num_per_seq,
const int block_size);
template <typename TX, typename TY>
DLL_EXPORT int
eb_adjust_batch(Context *ctx, const TX *x, TY *y,

@@ -0,0 +1,41 @@
#include "xpu/kernel/cluster.h"
#include "xpu/kernel/cluster_partition.h"
#include "xpu/kernel/cluster_primitive.h"
namespace xpu3 {
namespace plugin {
__global__ void recover_decode_task(bool *stop_flags,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int *block_tables,
bool *is_block_step,
const int bsz,
const int block_num_per_seq,
const int block_size) {
int cid = core_id();
int ncores = core_num();
int clusterid = cluster_id();
int nclusters = cluster_num();
int thread_idx = clusterid * ncores + cid;
int nthreads = nclusters * ncores;
// if (clusterid != 0) return;
for (; thread_idx < bsz; thread_idx += nthreads) {
if(is_block_step[thread_idx] == true) {
// int *block_table_now = block_tables + thread_idx * block_num_per_seq;
if (block_tables[thread_idx * block_num_per_seq + step_seq_lens_decoder[thread_idx] / block_size] != -1) {
// can be recovered for decoding
is_block_step[thread_idx] = false;
seq_lens_this_time[thread_idx]= 1;
stop_flags[thread_idx] = false;
seq_lens_encoder[thread_idx] = 0;
seq_lens_decoder[thread_idx] = step_seq_lens_decoder[thread_idx];
}
}
}
}
} // namespace plugin
} // namespace xpu3
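For checking the XPU kernel above on the host, its per-request logic can be written as a plain loop. This is a sketch only: the name `recover_decode_task_ref` and the `std::vector` containers are assumptions for illustration, and the grid-stride distribution across clusters and cores is replaced by a sequential loop.

```cpp
#include <vector>

// Hypothetical host-side reference (not part of the source).
// A request that was paused for lack of KV-cache blocks (is_block_step)
// is resumed once the block covering its saved decoder length exists.
void recover_decode_task_ref(std::vector<bool>& stop_flags,
                             std::vector<int>& seq_lens_this_time,
                             std::vector<int>& seq_lens_encoder,
                             std::vector<int>& seq_lens_decoder,
                             const std::vector<int>& step_seq_lens_decoder,
                             const std::vector<int>& block_tables,
                             std::vector<bool>& is_block_step,
                             int bsz,
                             int block_num_per_seq,
                             int block_size) {
  for (int i = 0; i < bsz; ++i) {
    if (!is_block_step[i]) continue;
    const int block_idx = step_seq_lens_decoder[i] / block_size;
    if (block_tables[i * block_num_per_seq + block_idx] != -1) {
      // The block is allocated, so the request can resume decoding.
      is_block_step[i] = false;
      seq_lens_this_time[i] = 1;
      stop_flags[i] = false;
      seq_lens_encoder[i] = 0;
      seq_lens_decoder[i] = step_seq_lens_decoder[i];
    }
  }
}
```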

@@ -0,0 +1,131 @@
#include "xpu/kernel/cluster.h"
#include "xpu/kernel/cluster_partition.h"
#include "xpu/kernel/cluster_primitive.h"
// #include <stdio.h>
// using namespace std;
#include "xpu/kernel/xtdk_io.h"
#include "xpu/kernel/xtdk.h"
namespace xpu3 {
namespace plugin {
__global__ void update_inputs_v1(bool *not_need_stop,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int64_t *prompt_lens,
int64_t *topk_ids,
int64_t *input_ids,
int *block_tables,
const int64_t *stop_nums,
bool *stop_flags,
bool *is_block_step,
const int64_t *next_tokens,
const int bsz,
const int max_bsz,
const int input_ids_stride,
const int block_num_per_seq,
const int block_size) {
// std::cout << "seq_lens_this_time " << seq_lens_this_time[0] << std::endl;
int cid = core_id();
int ncores = core_num();
int clusterid = cluster_id();
int nclusters = cluster_num();
int thread_idx = clusterid * ncores + cid;
if (clusterid != 0) return;
const int max_bs = 1024;
__shared__ bool stop_flags_sm[max_bs];
__shared__ int stop_flags_int_sm[max_bs];
if(cid == 0){
GM2SM(stop_flags, stop_flags_sm, sizeof(bool) * bsz);
}
sync_all();
for(int i = cid; i < bsz; i+= ncores){
if(i < bsz){
stop_flags_sm[i] = stop_flags[i];
stop_flags_int_sm[i] = static_cast<int64_t>(stop_flags_sm[i]);
}else{
stop_flags_sm[i] = true;
stop_flags_int_sm[i] = 1;
}
if(i<bsz){
int seq_len_this_time_update = 0;
int seq_len_decoder_update = 0;
int seq_lens_encoder_update = 0;
if(stop_flags_sm[i]){
LM2GM(&seq_len_this_time_update, seq_lens_this_time + i, sizeof(int));
LM2GM(&seq_len_decoder_update, seq_lens_decoder + i, sizeof(int));
LM2GM(&seq_lens_encoder_update, seq_lens_encoder + i, sizeof(int));
}else{
GM2LM(seq_lens_this_time+i, &seq_len_this_time_update, sizeof(int));
GM2LM(seq_lens_decoder+i, &seq_len_decoder_update, sizeof(int));
GM2LM(seq_lens_encoder+i, &seq_lens_encoder_update, sizeof(int));
int sum_of_seq_lens_this_time_and_seq_lens_decoder = seq_len_this_time_update + seq_len_decoder_update;
int prompt_lens_update = 0;
GM2LM(prompt_lens+i, &prompt_lens_update, sizeof(int64_t));
// decoding
if(sum_of_seq_lens_this_time_and_seq_lens_decoder >= prompt_lens_update){
seq_len_decoder_update = seq_len_this_time_update + seq_len_decoder_update;
LM2GM(&seq_len_decoder_update, seq_lens_decoder+i, sizeof(int));
seq_len_this_time_update = 1;
LM2GM(&seq_len_this_time_update, seq_lens_this_time + i, sizeof(int));
seq_lens_encoder_update = 0;
LM2GM(&seq_lens_encoder_update, seq_lens_encoder + i, sizeof(int));
int64_t input_ids_update;
GM2LM(next_tokens + i, &input_ids_update, sizeof(int64_t));
LM2GM(&input_ids_update, input_ids + i * input_ids_stride, sizeof(int64_t));
// check whether there are too few blocks for the next decode step
if(seq_len_this_time_update != 0 && block_tables[i * block_num_per_seq + seq_len_decoder_update/block_size] == -1){
is_block_step[i] = true;
seq_len_this_time_update = 0;
LM2GM(&seq_len_this_time_update, seq_lens_this_time + i, sizeof(int));
stop_flags_sm[i] = true;
SM2GM(stop_flags_sm+i, stop_flags+i, sizeof(bool));
LM2GM(&seq_len_decoder_update, step_seq_lens_decoder+i, sizeof(int));
seq_len_decoder_update = 0;
LM2GM(&seq_len_decoder_update, seq_lens_decoder + i, sizeof(int));
seq_len_decoder_update = 0;
LM2GM(&seq_len_decoder_update, seq_lens_decoder + i, sizeof(int));
stop_flags_int_sm[i] = 1;
}
}else{
stop_flags_sm[i] = true;
SM2GM(stop_flags_sm+i, stop_flags+i, sizeof(bool));
seq_len_this_time_update = 0;
LM2GM(&seq_len_this_time_update, seq_lens_this_time + i, sizeof(int));
seq_len_decoder_update = 0;
seq_lens_encoder_update = 0;
LM2GM(&seq_len_decoder_update, seq_lens_decoder + i, sizeof(int));
LM2GM(&seq_lens_encoder_update, seq_lens_encoder + i, sizeof(int));
int64_t topk_ids_update = -1;
LM2GM(&topk_ids_update, topk_ids + i, sizeof(int64_t));
stop_flags_int_sm[i] = 1;
}
}
}
}
sync_all();
sync_cluster();
int stop_sum = 0;
if (cid == 0) {
for (int i = 0; i < max_bsz; i++) {
stop_sum += stop_flags_int_sm[i];
}
// printf("stop_sum : %d\n", stop_sum);
int64_t stop_num;
GM2LM(stop_nums, &stop_num, sizeof(int64_t));
bool not_need_stop_update = stop_sum < static_cast<int>(stop_num);
mfence_lm();
LM2GM(&not_need_stop_update, not_need_stop, sizeof(bool));
}
}
} // namespace plugin
} // namespace xpu3
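The final reduction in `update_inputs_v1` sums the stop flags over `max_bsz` slots and compares the total with `stop_nums`. A host-side sketch of that step is given below. It assumes that slots at or beyond `bsz` are meant to count as stopped, which is what the `else` branch in the kernel suggests but is not confirmed by the source. The name `not_need_stop_ref` is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not part of the source).
// stop_flags holds the flags after the per-request update; slots in
// [bsz, max_bsz) are treated as stopped (an assumption, see above).
bool not_need_stop_ref(const std::vector<bool>& stop_flags,
                       int bsz,
                       int max_bsz,
                       int64_t stop_num) {
  int stop_sum = 0;
  for (int i = 0; i < max_bsz; ++i) {
    const bool stopped = (i < bsz) ? stop_flags[i] : true;
    stop_sum += stopped ? 1 : 0;
  }
  // Generation continues while fewer than stop_num slots have stopped.
  return stop_sum < static_cast<int>(stop_num);
}
```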

@@ -0,0 +1,107 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "xpu/plugin.h"
#include "xpu/refactor/impl_public/wrapper_check.h"
#include <algorithm>
#include <numeric>
namespace xpu3 {
namespace plugin {
__attribute__((global)) void
recover_decode_task(bool *stop_flags,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int *block_tables,
bool *is_block_step,
const int bsz,
const int block_num_per_seq,
const int block_size);
} // namespace plugin
} // namespace xpu3
namespace baidu {
namespace xpu {
namespace api {
namespace plugin {
static int xpu3_wrapper(Context *ctx, bool *stop_flags,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int *block_tables,
bool *is_block_step,
const int bsz,
const int block_num_per_seq,
const int block_size) {
using XPU_INT64 = typename XPUIndexType<int64_t>::type;
auto recover_decode_task = xpu3::plugin::recover_decode_task;
recover_decode_task<<<ctx->ncluster(), 64, ctx->xpu_stream>>>(
stop_flags,
seq_lens_this_time,
seq_lens_encoder,
seq_lens_decoder,
step_seq_lens_decoder,
block_tables,
is_block_step,
bsz,
block_num_per_seq,
block_size);
return api::SUCCESS;
}
int recover_decode_task(Context *ctx, bool *stop_flags,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int *block_tables,
bool *is_block_step,
const int bsz,
const int block_num_per_seq,
const int block_size) {
WRAPPER_CHECK_CTX(ctx);
WRAPPER_DUMP_FUNCTION_T1(ctx, "recover_decode_task", int);
WRAPPER_DUMP_PARAM5(ctx, stop_flags, seq_lens_this_time,
seq_lens_encoder, seq_lens_decoder, step_seq_lens_decoder);
WRAPPER_DUMP_PARAM2(ctx, block_tables, is_block_step);
WRAPPER_DUMP_PARAM3(ctx, bsz, block_num_per_seq, block_size);
WRAPPER_DUMP(ctx);
if (ctx->dev().type() == api::kCPU) {
assert(false);
}
if (ctx->dev().type() == api::kXPU2 || ctx->dev().type() == api::kXPU3) {
return xpu3_wrapper(ctx, stop_flags,
seq_lens_this_time,
seq_lens_encoder,
seq_lens_decoder,
step_seq_lens_decoder,
block_tables,
is_block_step,
bsz,
block_num_per_seq,
block_size);
}
WRAPPER_UNIMPLEMENTED(ctx);
}
} // namespace plugin
} // namespace api
} // namespace xpu
} // namespace baidu

@@ -0,0 +1,149 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "xpu/plugin.h"
#include "xpu/refactor/impl_public/wrapper_check.h"
#include <algorithm>
#include <numeric>
namespace xpu3 {
namespace plugin {
__attribute__((global)) void
update_inputs_v1(bool *not_need_stop,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int64_t *prompt_lens,
int64_t *topk_ids,
int64_t *input_ids,
int *block_tables,
const int64_t *stop_nums,
bool *stop_flags,
bool *is_block_step,
const int64_t *next_tokens,
const int bsz,
const int max_bsz,
const int input_ids_stride,
const int block_num_per_seq,
const int block_size);
} // namespace plugin
} // namespace xpu3
namespace baidu {
namespace xpu {
namespace api {
namespace plugin {
static int xpu3_wrapper(Context *ctx, bool *not_need_stop,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int64_t *prompt_lens,
int64_t *topk_ids,
int64_t *input_ids,
int *block_tables,
const int64_t *stop_nums,
bool *stop_flags,
bool *is_block_step,
const int64_t *next_tokens,
const int bsz,
const int max_bsz,
const int input_ids_stride,
const int block_num_per_seq,
const int block_size) {
using XPU_INT64 = typename XPUIndexType<int64_t>::type;
auto update_inputs_v1 = xpu3::plugin::update_inputs_v1;
// The kernel performs a reduction internally, so only 1 cluster can be used
update_inputs_v1<<<1, 64, ctx->xpu_stream>>>(
not_need_stop,
seq_lens_this_time,
seq_lens_encoder,
seq_lens_decoder,
step_seq_lens_decoder,
reinterpret_cast<XPU_INT64 *>(prompt_lens),
reinterpret_cast<XPU_INT64 *>(topk_ids),
reinterpret_cast<XPU_INT64 *>(input_ids),
block_tables,
reinterpret_cast<const XPU_INT64 *>(stop_nums),
stop_flags,
is_block_step,
reinterpret_cast<const XPU_INT64 *>(next_tokens),
bsz,
max_bsz,
input_ids_stride,
block_num_per_seq,
block_size);
return api::SUCCESS;
}
int update_inputs_v1(Context *ctx, bool *not_need_stop,
int *seq_lens_this_time,
int *seq_lens_encoder,
int *seq_lens_decoder,
int *step_seq_lens_decoder,
int64_t *prompt_lens,
int64_t *topk_ids,
int64_t *input_ids,
int *block_tables,
const int64_t *stop_nums,
bool *stop_flags,
bool *is_block_step,
const int64_t *next_tokens,
const int bsz,
const int max_bsz,
const int input_ids_stride,
const int block_num_per_seq,
const int block_size) {
WRAPPER_CHECK_CTX(ctx);
WRAPPER_DUMP_FUNCTION_T1(ctx, "update_inputs_v1", int);
WRAPPER_DUMP_PARAM5(ctx, not_need_stop, seq_lens_this_time,
seq_lens_encoder, seq_lens_decoder, step_seq_lens_decoder);
WRAPPER_DUMP_PARAM5(ctx, prompt_lens, topk_ids, input_ids, block_tables, stop_nums);
WRAPPER_DUMP_PARAM3(ctx, stop_flags, is_block_step, next_tokens);
WRAPPER_DUMP_PARAM5(ctx, bsz, max_bsz, input_ids_stride, block_num_per_seq, block_size);
WRAPPER_DUMP(ctx);
if (ctx->dev().type() == api::kCPU) {
assert(false);
}
if (ctx->dev().type() == api::kXPU2 || ctx->dev().type() == api::kXPU3) {
return xpu3_wrapper(ctx, not_need_stop,
seq_lens_this_time,
seq_lens_encoder,
seq_lens_decoder,
step_seq_lens_decoder,
prompt_lens,
topk_ids,
input_ids,
block_tables,
stop_nums,
stop_flags,
is_block_step,
next_tokens,
bsz,
max_bsz,
input_ids_stride,
block_num_per_seq,
block_size);
}
WRAPPER_UNIMPLEMENTED(ctx);
}
} // namespace plugin
} // namespace api
} // namespace xpu
} // namespace baidu

@@ -144,6 +144,8 @@ def xpu_setup_ops():
"./ops/get_token_penalty_multi_scores.cc",
"./ops/get_padding_offset.cc",
"./ops/update_inputs.cc",
"./ops/recover_decode_task.cc",
"./ops/update_inputs_v1.cc",
"./ops/get_output.cc",
"./ops/step.cc",
"./ops/get_infer_param.cc",

@@ -1,10 +1,10 @@
FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
ARG PADDLE_VERSION=3.1.1
ARG FD_VERSION=2.1.0
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /workspace
RUN rm -rf /workspace/FastDeploy
COPY . /workspace/FastDeploy
RUN echo "ulimit -u unlimited" >> /root/.bashrc
RUN echo "ulimit -n 65536" >> /root/.bashrc
@@ -13,10 +13,10 @@ RUN echo "ulimit -n 65536" >> /root/.bashrc
RUN python -m pip uninstall paddlepaddle-gpu fastdeploy-gpu -y
# install paddlepaddle
RUN python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
RUN python -m pip install --no-cache-dir paddlepaddle-gpu==${PADDLE_VERSION} -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
# build and install FastDeploy
RUN cd FastDeploy && bash build.sh 1 python false [80,90] && python -m pip install --no-cache-dir dist/* && rm -rf /workspace/FastDeploy
RUN python -m pip install --no-cache-dir fastdeploy-gpu==${FD_VERSION} -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
ENV http_proxy=""
ENV https_proxy=""

@@ -1,4 +1,6 @@
FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-base-gcc12.3-xpu-xft20250402-v1.1
ARG PADDLE_VERSION=3.1.0
ARG FD_VERSION=2.0.0
WORKDIR /workspace
@@ -14,23 +16,16 @@ RUN apt-get update && apt-get install -y libibverbs-dev librdmacm-dev cmake pybi
# uninstall existing package
RUN python -m pip uninstall paddlepaddle-gpu paddlepaddle-xpu -y
# install paddlepaddle
RUN python -m pip install --no-cache-dir --progress-bar off paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
# install paddlepaddle-xpu
RUN python -m pip install --no-cache-dir --progress-bar off paddlepaddle-xpu==${PADDLE_VERSION} -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
COPY . /workspace/FastDeploy
RUN python -m pip install --no-cache-dir fastdeploy-xpu==${FD_VERSION} -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# get xtdk and xvllm and xre
RUN mkdir -p /workspace/deps && cd /workspace/deps && \
wget https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.21/xre-Linux-x86_64-5.0.21.21.tar.gz && \
tar -zxf xre-Linux-x86_64-5.0.21.21.tar.gz && mv xre-Linux-x86_64-5.0.21.21 xre && \
cd /workspace/FastDeploy && bash custom_ops/xpu_ops/src/download_dependencies.sh stable
tar -zxf xre-Linux-x86_64-5.0.21.21.tar.gz && mv xre-Linux-x86_64-5.0.21.21 xre
ENV PATH=/workspace/deps/xre/bin:$PATH
ENV CLANG_PATH=/workspace/FastDeploy/custom_ops/xpu_ops/src/third_party/xtdk
ENV XVLLM_PATH=/workspace/FastDeploy/custom_ops/xpu_ops/src/third_party/xvllm
# build and install FastDeploy
RUN cd /workspace/FastDeploy && bash build.sh && python -m pip install --no-cache-dir dist/* && rm -rf /workspace/FastDeploy
ENV http_proxy=""
ENV https_proxy=""

@@ -0,0 +1,94 @@
# ERNIE-4.5-0.3B
## 1. Environment Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following hardware for each quantization is as follows:
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 1 | 1 | / |
|A10 24GB| 1 | 1 | / |
**Tips:**
1. To change the number of GPUs used for deployment, set `--tensor-parallel-size` in the startup command, for example `--tensor-parallel-size 2`.
2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
### 1.2 Install FastDeploy
- Installation: For details, see [FastDeploy Installation](../get_started/installation/README.md).
- Model download: For details, see [Supported Models](../supported_models.md). **Note that FastDeploy requires the models with the `Paddle` suffix.**
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
- `--quantization`: The quantization strategy used by the model. Different strategies trade off performance and accuracy differently. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper).
- `--max-model-len`: The maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).
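
Once the service is running, it can be queried over HTTP. The sketch below only builds the request. It assumes the server exposes an OpenAI-style `/v1/chat/completions` route and listens on `127.0.0.1:8000`; the port is not specified in the command above, so adjust it to your `--port` setting. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed address; nothing is sent until send(req) is called.
req = build_chat_request("http://127.0.0.1:8000", "baidu/ERNIE-4.5-0.3B-Paddle", "Hello")
print(req.full_url)
```

Call `send(req)` only after the server has finished loading the model.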
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate the average input length, average output length, and maximum context length of your workload.
- Set `--max-model-len` according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, setting it to 32768 is recommended.
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
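
The `--max-model-len` example above (1000 input tokens plus 30000 output tokens, rounded to 32768) can be reproduced with a small helper. Rounding up to the next power of two is an assumption that happens to match the example; it is not a documented FastDeploy rule.

```python
def recommend_max_model_len(max_input_len, max_output_len):
    """Round input + output length up to the next power of two."""
    needed = max_input_len + max_output_len
    length = 1
    while length < needed:
        length *= 2
    return length


print(recommend_max_model_len(1000, 30000))  # 32768
```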
#### 2.2.2 Prefix Caching
**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for multiple requests that share the same prefix. For details, see [prefix-cache](../features/prefix_caching.md).
**How to enable:**
Add the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine's actual capacity.
```
--enable-prefix-caching
--swap-space 50
```
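
To make the idea concrete, here is a toy sketch of block-level prefix matching. It is not FastDeploy's implementation: the block size, the hashing scheme, and the function names are all assumptions chosen for illustration. Each block key hashes the block together with everything before it, so a key hit implies that the whole prefix matches.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode("utf-8") + b";")
        keys.append(running.copy().hexdigest())
    return keys


def count_cached_blocks(cache, token_ids):
    """Count leading blocks already in `cache`, then register the new ones."""
    hits = 0
    matching = True
    for key in block_keys(token_ids):
        if matching and key in cache:
            hits += 1
        else:
            matching = False
            cache.add(key)
    return hits


cache = set()
system_prompt = list(range(8))  # shared 8-token prefix, i.e. 2 blocks
print(count_cached_blocks(cache, system_prompt + [100, 101, 102, 103]))  # 0
print(count_cached_blocks(cache, system_prompt + [200, 201, 202, 203]))  # 2
```

The second request reuses the two blocks of the shared prefix and only has to compute its final block.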
#### 2.2.3 Chunked Prefill
**Idea:** Chunked Prefill splits a prefill request into small sub-chunks and executes them in batches mixed with decode requests. This better balances the compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory usage of any single prefill step, which lowers peak memory usage and helps avoid out-of-memory errors. For details, see [Chunked Prefill](../features/chunked_prefill.md).
**How to enable:** Add the following lines to the startup parameters
```
--enable-chunked-prefill
```
#### 2.2.4 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It achieves efficient execution and optimization of GPU tasks by capturing CUDA operation sequences into a graph structure. The core idea of CUDAGraph is to encapsulate a series of GPU computing and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, reducing kernel startup latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
```
Notes:
1. Usually no additional parameters are needed, but CUDAGraph adds some memory overhead, which may require tuning when memory is limited. For the relevant configuration parameters, see [GraphOptimizationBackend](../features/graph_optimization.md).
2. When CUDAGraph is enabled with multi-GPU tensor parallelism (TP>1), `--enable-custom-all-reduce` must also be specified.
3. When CUDAGraph is enabled, `max-model-len > 32768` is not currently supported.
#### 2.2.5 Rejection Sampling
**Idea:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting and speeds up sampling. The improvement is significant for small models.
**How to enable:**
Add the following environment variables before starting
```
export FD_SAMPLING_CLASS=rejection
```
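
As an illustration of the general technique only (this is a textbook sketch, not the sampler that `FD_SAMPLING_CLASS=rejection` selects), the following draws an index with probability proportional to its weight. It uses a uniform proposal and an accept/reject step, so no sorting is involved. The function name and the example weights are assumptions.

```python
import random


def rejection_sample(weights, rng):
    """Draw an index with probability proportional to weights[i]."""
    bound = max(weights)  # envelope constant, found in one pass without sorting
    while True:
        candidate = rng.randrange(len(weights))        # uniform proposal
        if rng.random() * bound < weights[candidate]:  # accept with prob w / bound
            return candidate


rng = random.Random(0)
weights = [0.1, 0.2, 0.7]
counts = [0] * len(weights)
for _ in range(20000):
    counts[rejection_sample(weights, rng)] += 1
# Empirical frequencies should be close to the weights.
print([round(c / 20000, 2) for c in counts])
```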
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

@@ -0,0 +1,150 @@
# ERNIE-4.5-21B-A3B
## 1. Environment Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the following hardware for each quantization is as follows:
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 2 | 1 | / |
|A10 24GB| 2 | 1 | / |
**Tips:**
1. To change the number of GPUs used for deployment, set `--tensor-parallel-size` in the startup command, for example `--tensor-parallel-size 2`.
2. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
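
For a rough estimate on unlisted hardware, the weight footprint can be approximated from the parameter count and the bits per weight. The sketch below is a lower bound only: it ignores the KV cache, activations, and runtime overhead, and it assumes that the "21B" in the model name is the total parameter count.

```python
BITS_PER_WEIGHT = {"wint8": 8, "wint4": 4}


def weight_memory_gb(num_params, quantization):
    """Approximate weight storage in GB (10^9 bytes) for a quantization type."""
    bits = BITS_PER_WEIGHT[quantization]
    return num_params * bits / 8 / 1e9


for quant in ("wint8", "wint4"):
    # 21e9 parameters is an assumption taken from the model name.
    print(quant, weight_memory_gb(21e9, quant))
```

Under these assumptions WINT8 weights take about 21 GB, which leaves little room on a single 24 GB card and is consistent with the table listing 2 GPUs for that case.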
### 1.2 Install FastDeploy and prepare the model
- Installation: For details, see [FastDeploy Installation](../get_started/installation/README.md).
- Model download: For details, see [Supported Models](../supported_models.md). **Note that FastDeploy requires the models with the `Paddle` suffix.**
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
- `--quantization`: The quantization strategy used by the model. Different strategies trade off performance and accuracy differently. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper).
- `--max-model-len`: The maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine's actual capacity.
```
--enable-prefix-caching
--swap-space 50
```
#### 2.2.3 Chunked Prefill
**Idea:** This strategy is adopted to split the prefill stage request into small-scale sub-chunks, and execute them in batches mixed with the decode request. This can better balance the computation-intensive (Prefill) and memory-intensive (Decode) operations, optimize GPU resource utilization, reduce the computational workload and memory usage of a single Prefill, thereby reducing the peak memory usage and avoiding the problem of insufficient memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Add the following lines to the startup parameters
```
--enable-chunked-prefill
```
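
The splitting step can be pictured as cutting the prompt into fixed-size token ranges that are scheduled one per step alongside decode requests. This is a simplified sketch; the chunk size of 384 and the function name are illustrative assumptions, and the real scheduler decides the chunk boundaries.

```python
def split_prefill(prompt_len, chunk_size):
    """Return (start, end) token ranges that cover the prompt in order."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


# A 1000-token prompt with a hypothetical 384-token chunk budget.
print(split_prefill(1000, 384))  # [(0, 384), (384, 768), (768, 1000)]
```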
#### 2.2.4 MTP (Multi-Token Prediction)
**Idea:**
MTP predicts multiple tokens at once, which reduces the number of decoding steps and significantly speeds up generation, while certain strategies maintain generation quality. For details, see [Speculative Decoding](../features/speculative_decoding.md).
**How to enable:**
Add the following lines to the startup parameters
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
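
Because the value of `--speculative-config` is a JSON string, quoting mistakes are easy to make when it is assembled by hand. A small helper like the one below can generate the argument; the helper name and the example path `/models/mtp` are assumptions, while the keys are the ones shown above.

```python
import json
import shlex


def speculative_config_arg(mtp_model_path, num_speculative_tokens=1):
    """Return the --speculative-config argument with shell-safe quoting."""
    config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
        "model": mtp_model_path,
    }
    return "--speculative-config " + shlex.quote(json.dumps(config))


# "/models/mtp" is a placeholder for the real MTP model path.
print(speculative_config_arg("/models/mtp"))
```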
#### 2.2.5 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It achieves efficient execution and optimization of GPU tasks by capturing CUDA operation sequences into a graph structure. The core idea of CUDAGraph is to encapsulate a series of GPU computing and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, reducing kernel startup latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
```
Notes:
1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time.
3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
#### 2.2.6 Rejection Sampling
**Idea:**
Rejection sampling is to generate samples from a proposal distribution that is easy to sample, avoiding explicit sorting to increase the sampling speed, which has a significant improvement on small-sized models.
**How to enable:**
Add the following environment variables before starting
```
export FD_SAMPLING_CLASS=rejection
```
#### 2.2.7 Disaggregated Deployment
**Idea:** In certain scenarios, deploying Prefill and Decode separately can improve hardware utilization, increase throughput, and reduce end-to-end latency.
**How to enable:** Take a single machine with 8 GPUs running 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` is required to specify the role of each node, and the GPUs and logs of the two nodes are isolated through the environment variables `CUDA_VISIBLE_DEVICES` and `FD_LOG_DIR`.
```
# prefill
export CUDA_VISIBLE_DEVICES=0,1,2,3
export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
--max-model-len 131072 \
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill"
```
```
# decode
export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
--splitwise-role "decode"
```
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

@@ -0,0 +1,143 @@
# ERNIE-4.5-300B-A47B
## 1. Environment Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the following hardware for each quantization is as follows:
| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 |
|-----|-----|-----|-----|-----|-----|
|H800 80GB| 8 | 4 | 8 | 2 | 4 |
|A800 80GB| 8 | 4 | / | 2 | 4 |
**Tips:**
1. To change the number of GPUs used for deployment, set `--tensor-parallel-size` in the startup command, for example `--tensor-parallel-size 4`.
2. Since only a 4-GPU quantization scale is provided, the W4A8 model must be deployed on 4 GPUs.
3. For hardware not listed in the table, you can estimate whether it can be deployed based on the GPU memory.
### 1.2 Install FastDeploy
- Installation: For details, see [FastDeploy Installation](../get_started/installation/README.md).
- Model download: For details, see [Supported Models](../supported_models.md). **Note that FastDeploy requires the models with the `Paddle` suffix.**
## 2. How to Use
### 2.1 Basic: Launching the Service
Start the service with the following command:
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
- `--quantization`: The quantization strategy used by the model. Different strategies trade off performance and accuracy differently. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper).
- `--max-model-len`: The maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
For more parameters and their default settings, see [FastDeploy Parameter Documentation](../parameters.md).
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768
- **Enable the service management global block**
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated calculations by caching the intermediate calculation results of the input sequence (KV Cache), thereby speeding up the response speed of multiple requests with the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md)
**How to enable:**
Add the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine's actual capacity.
```
--enable-prefix-caching
--swap-space 50
```
#### 2.2.3 Chunked Prefill
**Idea:** This strategy is adopted to split the prefill stage request into small-scale sub-chunks, and execute them in batches mixed with the decode request. This can better balance the computation-intensive (Prefill) and memory-intensive (Decode) operations, optimize GPU resource utilization, reduce the computational workload and memory usage of a single Prefill, thereby reducing the peak memory usage and avoiding the problem of insufficient memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md)
**How to enable:** Add the following lines to the startup parameters
```
--enable-chunked-prefill
```
#### 2.2.4 MTP (Multi-Token Prediction)
**Idea:**
MTP predicts multiple tokens at once, which reduces the number of decoding steps and significantly speeds up generation, while certain strategies maintain generation quality. For details, see [Speculative Decoding](../features/speculative_decoding.md).
**How to enable:**
Add the following lines to the startup parameters
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
#### 2.2.5 W4A8C8 Quantization
**Idea:**
Quantization compresses the model, reduces GPU memory usage, and speeds up inference. To achieve better inference results, per-channel symmetric 4-bit quantization is used for the MoE weights, static per-tensor symmetric 8-bit quantization is used for activations, and static per-channel symmetric 8-bit quantization is used for the KV Cache.
**How to enable:**
Just specify the corresponding model name in the startup command, `baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle`
```
--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle
```
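
For readers unfamiliar with the terminology, the sketch below shows what per-channel symmetric quantization means in general: each channel (here, each row) gets its own scale derived from its largest absolute value, and values are rounded to signed integer codes. It is a generic illustration with made-up weights, not the calibration procedure used to produce the W4A8C8 checkpoint.

```python
def quantize_symmetric(values, num_bits):
    """Quantize one channel to signed integer codes with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [code * scale for code in codes]


# Hypothetical 2-channel weight matrix; each row gets its own scale.
weight = [[0.7, -0.35, 0.1], [0.02, 0.04, -0.07]]
for row in weight:
    codes, scale = quantize_symmetric(row, num_bits=4)
    print(codes, scale, dequantize(codes, scale))
```

Using one scale per channel keeps the rounding error of the small-valued second row from being dominated by the larger values in the first row.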
#### 2.2.6 Rejection Sampling
**Idea:**
Rejection sampling is to generate samples from a proposal distribution that is easy to sample, avoiding explicit sorting to increase the sampling speed, which has a significant improvement on small-sized models.
**How to enable:**
Add the following environment variables before starting
```
export FD_SAMPLING_CLASS=rejection
```
#### 2.2.7 Disaggregated Deployment
**Idea:** In certain scenarios, deploying Prefill and Decode separately can improve hardware utilization, increase throughput, and reduce end-to-end latency.
**How to enable:** Take a single machine with 8 GPUs running 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment, `--splitwise-role` is required to specify the role of each node, and the GPUs and logs of the two nodes are isolated through the environment variables `CUDA_VISIBLE_DEVICES` and `FD_LOG_DIR`.
```
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```
```
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note that innode-prefill-ports must be set to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```
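
The two commands above are coupled through their port settings, so a quick consistency check before launching can save a failed start. The sketch below encodes two conditions: listening ports on one machine must not collide, and the decode node's `--innode-prefill-ports` must equal the prefill node's `--engine-worker-queue-port`. The dictionary layout and function name are assumptions for illustration.

```python
def check_disaggregated_ports(prefill, decode):
    """Return a list of problems found in a single-machine 1P1D port layout."""
    problems = []
    own_keys = ("port", "metrics_port", "engine_worker_queue_port", "cache_queue_port")
    used = [cfg[key] for cfg in (prefill, decode) for key in own_keys]
    if len(set(used)) != len(used):
        problems.append("listening ports must be unique on a single machine")
    if decode["innode_prefill_ports"] != prefill["engine_worker_queue_port"]:
        problems.append("innode-prefill-ports must match the prefill engine-worker-queue-port")
    return problems


# Values copied from the prefill and decode commands above.
prefill = {"port": 8180, "metrics_port": 8181,
           "engine_worker_queue_port": 8182, "cache_queue_port": 8183}
decode = {"port": 8184, "metrics_port": 8185,
          "engine_worker_queue_port": 8186, "cache_queue_port": 8187,
          "innode_prefill_ports": 8182}
print(check_disaggregated_ports(prefill, decode))  # []
```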
#### 2.2.8 CUDAGraph
**Idea:**
CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It achieves efficient execution and optimization of GPU tasks by capturing CUDA operation sequences into a graph structure. The core idea of CUDAGraph is to encapsulate a series of GPU computing and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, reducing kernel startup latency, and improving overall computing performance.
**How to enable:**
Add the following lines to the startup parameters
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions
2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time.
3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
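
Notes 2 and 3 can be turned into a small pre-launch check. The function below is a sketch that encodes only those two documented constraints; its name and signature are assumptions.

```python
def cudagraph_flag_problems(use_cudagraph, tensor_parallel_size,
                            max_model_len, enable_custom_all_reduce):
    """Return the documented CUDAGraph constraints that a configuration violates."""
    problems = []
    if not use_cudagraph:
        return problems
    if tensor_parallel_size > 1 and not enable_custom_all_reduce:
        problems.append("TP>1 with CUDAGraph requires --enable-custom-all-reduce")
    if max_model_len > 32768:
        problems.append("max-model-len > 32768 is not supported with CUDAGraph")
    return problems


print(cudagraph_flag_problems(True, 4, 32768, True))    # []
print(cudagraph_flag_problems(True, 4, 131072, False))  # both problems reported
```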
## FAQ
If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).

@@ -0,0 +1,134 @@
# ERNIE-4.5-VL-28B-A3B-Paddle
## 1. Environment Preparation
### 1.1 Support Status
The minimum number of cards required for deployment on the following hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |
### 1.2 Install Fastdeploy
For the installation process, see [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md).
> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format. Make sure to download models whose names end with the `-Paddle` suffix.
> - Specifying the model name triggers an automatic download. If the model has already been downloaded, you can pass the absolute path of its download location instead.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 256 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
The examples above are configurations that run stably while delivering relatively good performance. If you have further requirements for precision or performance, continue reading below.
### 2.2 Advanced: How to Achieve Better Performance
#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context Length**
- **Parameters** `--max-model-len`
- **Description** Controls the maximum context length that the model can process.
- **Recommendation** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
> **Maximum sequence count**
- **Parameters** `--max-num-seqs`
- **Description** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
- **Recommendation** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
> **Multi-image and multi-video input**
- **Parameters** `--limit-mm-per-prompt`
- **Description** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images and videos per request so that resources are used efficiently.
- **Recommendation** We recommend limiting images and videos to **100 each** per prompt to balance performance and memory usage.
> **Available GPU memory ratio during initialization**
- **Parameters** `--gpu-memory-utilization`
- **Description** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
- **Recommendation** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
#### 2.2.2 Chunked Prefill
- **Parameters** `--enable-chunked-prefill`
- **Description** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:
`--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 384.
#### 2.2.3 **Quantization precision**
- **Parameters** `--quantization`
- **Supported precision types**
- WINT4 (Suitable for most users)
- WINT8
- BFLOAT16 (When the `--quantization` parameter is not set, BFLOAT16 is used by default.)
- **Recommendation**
- Unless you have extremely stringent precision requirements, we strongly recommend using WINT4 quantization. This will significantly reduce memory consumption and increase throughput.
- If slightly higher precision is required, you may try WINT8.
- Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
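As a rough illustration of why WINT4 is the usual recommendation, the sketch below estimates weight memory only. It assumes about 0.5, 1, and 2 bytes per parameter for WINT4, WINT8, and BFLOAT16 respectively, takes the 28B parameter count from the model name, and ignores quantization scales, any unquantized layers, activations, and the KV cache, so real usage will be higher. The function name is hypothetical.

```python
# Approximate bytes per parameter for each weight precision.
# Assumption: quantization scales and unquantized layers are ignored.
BYTES_PER_PARAM = {"wint4": 0.5, "wint8": 1.0, "bfloat16": 2.0}


def estimate_weight_gib(num_params: float, precision: str) -> float:
    """Return the approximate weight footprint in GiB (weights only)."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gib = estimate_weight_gib(28e9, precision)
        print(f"{precision:>8}: ~{gib:.1f} GiB of weights")
```

Under these assumptions the BFLOAT16 weights alone are about four times larger than the WINT4 weights, which is the main reason the single-GPU example uses `--quantization wint4`.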
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling** `FD_SAMPLING_CLASS=rejection`
- **Description** Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting, speeds up sampling, and can improve inference performance.
- **Recommendation** This is a relatively aggressive optimization that affects the results, and its impact is still being validated comprehensively. Consider enabling it if you have high performance requirements and can accept potential changes in results.
> **Attention Hyperparameter** `FLAGS_max_partition_size=1024`
- **Description** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
- **Recommendation** This setting will be replaced by an automatic adjustment mechanism in the future. Consider enabling it if you have high performance requirements.
## 3. FAQ
**Note:** Deploying multimodal services requires adding the `--enable-mm` parameter to the configuration.
### 3.1 Out of Memory
If the service reports "Out of Memory" during startup, try the following solutions:
1. Ensure no other processes are occupying GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce context length and maximum sequence count as needed;
4. Increase the number of GPU cards for deployment (e.g., 2 or 4 cards) by modifying the parameter `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
If the service starts normally but later reports insufficient memory, try:
1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`;
2. Increase the number of deployment cards (parameter adjustment as above).


@@ -0,0 +1,110 @@
# ERNIE-4.5-VL-424B-A47B-Paddle
## 1. Environment Preparation
### 1.1 Support Status
The minimum number of cards required for deployment on the following hardware is as follows:
| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
|:----------:|:----------:|:------:| :------:|
| H20 [144G] | 8 | 8 | 8 |
| A100 [80G] | 8 | 8 | - |
| H800 [80G] | 8 | 8 | - |
### 1.2 Install FastDeploy
For the installation process, see [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md).
> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format. Make sure to download models whose names end with the `-Paddle` suffix.
> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 16 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.8 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
The example above is a configuration that runs stably while delivering relatively good performance. If you have further requirements for precision or performance, continue reading below.
### 2.2 Advanced: How to Achieve Better Performance
#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
> **Context Length**
- **Parameters** `--max-model-len`
- **Description** Controls the maximum context length that the model can process.
- **Recommendation** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
> **Maximum sequence count**
- **Parameters** `--max-num-seqs`
- **Description** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
- **Recommendation** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
> **Multi-image and multi-video input**
- **Parameters** `--limit-mm-per-prompt`
- **Description** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images and videos per request so that resources are used efficiently.
- **Recommendation** We recommend limiting images and videos to **100 each** per prompt to balance performance and memory usage.
> **Available GPU memory ratio during initialization**
- **Parameters** `--gpu-memory-utilization`
- **Description** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
- **Recommendation** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
#### 2.2.2 Chunked Prefill
- **Parameters** `--enable-chunked-prefill`
- **Description** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:
`--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 384.
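To make the chunk size concrete, the following sketch counts how many prefill chunks a prompt would be split into, assuming each chunk holds at most `--max-num-batched-tokens` tokens as described above. The helper name is hypothetical, and the real scheduler may combine prefill chunks with decode tokens, so treat this as arithmetic only.

```python
import math


def num_prefill_chunks(prompt_tokens: int, max_num_batched_tokens: int = 384) -> int:
    """Number of chunks needed when each chunk holds at most max_num_batched_tokens."""
    if prompt_tokens <= 0:
        return 0
    return math.ceil(prompt_tokens / max_num_batched_tokens)


if __name__ == "__main__":
    for n in (300, 384, 32768, 131072):
        print(n, "tokens ->", num_prefill_chunks(n), "chunks")
```

A smaller chunk size lowers the peak memory of each prefill step but increases the number of steps needed for a long prompt.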
#### 2.2.3 **Quantization precision**
- **Parameters** `--quantization`
- **Supported precision types**
- wint4 (Suitable for most users)
- wint8
- bfloat16 (When the `--quantization` parameter is not set, bfloat16 is used by default.)
- **Recommendation**
- Unless you have extremely stringent precision requirements, we strongly recommend using wint4 quantization. This will significantly reduce memory consumption and increase throughput.
- If slightly higher precision is required, you may try wint8.
- Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
#### 2.2.4 **Adjustable environment variables**
> **Rejection sampling** `FD_SAMPLING_CLASS=rejection`
- **Description** Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting, speeds up sampling, and can improve inference performance.
- **Recommendation** This is a relatively aggressive optimization that affects the results, and its impact is still being validated comprehensively. Consider enabling it if you have high performance requirements and can accept potential changes in results.
> **Attention Hyperparameter** `FLAGS_max_partition_size=1024`
- **Description** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
- **Recommendation** This setting will be replaced by an automatic adjustment mechanism in the future. Consider enabling it if you have high performance requirements.
## 3. FAQ
**Note:** Deploying multimodal services requires adding the `--enable-mm` parameter to the configuration.
### 3.1 Out of Memory
If the service reports "Out of Memory" during startup, try the following solutions:
1. Ensure no other processes are occupying GPU memory;
2. Use wint4/wint8 quantization and enable chunked prefill;
3. Reduce context length and maximum sequence count as needed.
If the service starts normally but later reports insufficient memory, try:
1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`.


@@ -0,0 +1,37 @@
# FAQ
## 1. CUDA out of memory
1. When starting the service:
- Check the minimum number of deployment GPUs required for the model and quantization method. If it is not met, increase the number of deployment GPUs.
- If CUDAGraph is enabled, try to reserve more GPU memory for CUDAGraph by lowering `gpu_memory_utilization`, or reduce the GPU memory usage of CUDAGraph by reducing `max_num_seqs` and setting `cudagraph_capture_sizes`.
2. During service operation:
- Check whether the log contains information similar to the following. If so, it is usually caused by insufficient output blocks, and you need to reduce `kv-cache-ratio`.
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
It is recommended to enable global block management by the service. To do so, set the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
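A quick way to check a log for the pattern shown above is to count lines where blocks are requested while the free list cannot supply them. The sketch below assumes the log format from the excerpt (`need_block_len: N free_list_len: M`); the function name is hypothetical, and reading the lines from an actual log file is left to the caller.

```python
import re

# Matches lines such as "need_block_len: 1 free_list_len: 0".
BLOCK_LINE = re.compile(r"need_block_len:\s*(\d+)\s+free_list_len:\s*(\d+)")


def count_block_shortages(lines):
    """Count log lines where more blocks are needed than are free."""
    shortages = 0
    for line in lines:
        match = BLOCK_LINE.search(line)
        if match and int(match.group(1)) > int(match.group(2)):
            shortages += 1
    return shortages


if __name__ == "__main__":
    sample = [
        "need_block_len: 1 free_list_len: 0",
        "step max_id: 2 max_num: 133 encoder block len: 24",
        "recover seq_id: 2 free_list_len: 144 used_list_len: 134",
        "need_block_len: 1 free_list_len: 0",
    ]
    print(count_block_shortages(sample))
```

A non-zero count suggests the block shortage described in this section is occurring.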
## 2. Poor model performance
1. First, check whether the output length meets expectations and whether the problem is caused by an excessive decoding length. If the output is long, check whether the log contains information similar to the following. If so, it is usually caused by insufficient output blocks, and you need to reduce `kv-cache-ratio`.
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
It is also recommended to enable global block management by the service. To do so, set the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
2. Check whether the number of KVCache blocks allocated by automatic profiling is as expected. Automatic profiling can be affected by fluctuations in GPU memory and may allocate fewer blocks than expected; in that case, manually set the `num_gpu_blocks_override` parameter to increase the number of KVCache blocks.


@@ -0,0 +1,7 @@
# Optimal Deployment
- [ERNIE-4.5-0.3B-Paddle](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

docs/features/early_stop.md

@@ -0,0 +1,122 @@
# Early Stopping
Early stopping terminates the model's token generation ahead of time. Specifically, it applies a strategy to determine whether the token sequence generated so far meets the early stopping criteria; if so, generation is terminated. FastDeploy currently supports the repetition strategy and stop sequences.
## 1. Repetition Strategy
* The repetition strategy determines whether to trigger the early stopping function by checking the number of times a high-probability token is generated.
* Specifically, if the probability of generating a token for a batch exceeds a user-set probability threshold for a specified number of consecutive times, token generation for that batch is terminated prematurely.
### Usage Instructions
When starting the service, add the early stopping function startup option.
* Online inference startup example:
* Using default hyperparameters: `--enable-early-stop`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-early-stop
```
* Using custom hyperparameters: `--early-stop-config`
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--early-stop-config '{"enable_early_stop":true, "window_size": 1000, "threshold": 0.9}'
```
* Offline inference example:
* Using default hyperparameters: `enable_early_stop`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* Using custom hyperparameters: `early_stop_config`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
early_stop_config = {"enable_early_stop":True, "window_size":1000, "threshold":0.9}
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
### Parameter Description
* `enable_early_stop`: (bool) Whether to enable early stopping. Default: False.
* `strategy`: (str) The early stopping strategy. Currently, only the repetition strategy is supported. Default: "repetition".
* `window_size`: (int) The upper limit on the number of consecutive high-probability tokens in the repetition strategy. Early stopping is triggered when this limit is exceeded. Default: 3000.
* `threshold`: (float) The high-probability threshold in the repetition strategy. Default: 0.99.
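The interaction between `window_size` and `threshold` can be illustrated with a small sketch. It is not FastDeploy's implementation: the function name is hypothetical, it works on a plain list of per-step token probabilities for a single sequence, and it assumes (following the parameter description above) that the stop fires once the run of consecutive high-probability tokens exceeds `window_size`. Whether the real boundary is inclusive is not confirmed here.

```python
def first_stop_step(probs, window_size=3000, threshold=0.99):
    """Return the step index at which early stopping would fire, or None.

    probs: probability of the generated token at each decoding step.
    """
    run = 0  # length of the current run of high-probability tokens
    for step, prob in enumerate(probs):
        run = run + 1 if prob > threshold else 0
        if run > window_size:
            return step
    return None


if __name__ == "__main__":
    # One low-probability token resets the run, so stopping fires later.
    probs = [0.95, 0.95, 0.95, 0.5, 0.95, 0.95, 0.95, 0.95]
    print(first_stop_step(probs, window_size=3, threshold=0.9))
```

Lowering `window_size` or `threshold` makes the strategy fire sooner, at the cost of a higher chance of cutting off legitimate output.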
## 2. Stop Sequence
* The Stop Sequence strategy determines whether to trigger early stopping by checking whether the generated token sequence contains a user-specified stop sequence.
* Specifically, if the token sequence generated by a batch contains a user-specified stop sequence, token generation for that batch is terminated prematurely.
### Usage Instructions
Before starting the service, set the following environment variables:
```
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
FD_MAX_STOP_SEQS_NUM (Maximum number of stop sequences, default is 5)
```
Send requests with the `stop` parameter, which can be a `str` or a `List[str]`.
* Online serving: set the `stop` parameter in the request
```python
# create a chat request with the "stop" parameter
import openai

ip = "0.0.0.0"
service_http_port = "8233"
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": '今天天气真好'},  # "The weather is really nice today"
    ],
    temperature=1.0,
    top_p=0,
    stream=False,
    stop=["明天", "出去走走"],  # "tomorrow", "go out for a walk"
)
```
* Offline LLM: set the `stop` parameter in `SamplingParams`
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "ERNIE-4.5-21B-A3B-Paddle"
# stop sequence: "go out for a walk"
sampling_params = SamplingParams(temperature=1, top_p=0, stop=["出去走走"])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
# prompt: "The weather is really nice today"
output = llm.chat(messages=[{"role": "user", "content": "今天天气真好"}], use_tqdm=True, sampling_params=sampling_params)
print(output)
```


@@ -0,0 +1,112 @@
# Graph optimization technology in FastDeploy
FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:
+ **CUDA Graph**: A mechanism that launches multiple GPU operations with a single CPU operation, reducing overhead and improving performance.
+ **DynamicGraph to StaticGraph**: Converts dynamic graphs to static graphs, then optimizes the computation graph and improves execution efficiency using global graph structure information.
+ **CINN Neural Network Compiler**: Performs IR conversion, Kernel fusion, Kernel generation, and other computation-graph compilation optimizations on top of static graphs to achieve comprehensive optimization.
Any dynamic behavior, such as data-dependent control flow, Host-Device synchronization, changes in the address or shape of model inputs, or dynamic Kernel execution configuration, will cause CUDAGraph Capture/Replay to fail. LLM inference involves dynamic input lengths, dynamic Batch Size, flexible Attention implementations, and multi-device communication, which makes CUDAGraph difficult to apply.
Mainstream open-source solutions implement CUDA Graph on top of static graphs, which requires a deep technology stack. FastDeploy not only supports combining static graphs, the neural network compiler, and CUDAGraph, but also supports applying CUDAGraph directly to dynamic graphs, which has lower development costs but has to handle more complex dynamic behavior.
The design architecture of FastDeploy's `GraphOptimizationBackend` is shown below. **Some functions are still under development, so it is recommended to read the usage restrictions in Section 1 carefully.**
![](./images/GraphOptBackendArch.svg)
## 1. GraphOptimizationBackend Current usage restrictions
### 1.1 Multi-device scenarios require Custom all-reduce
In CUDAGraph multi-device inference tasks, you need to use the Custom all-reduce operator to perform multi-card all-reduce.
Before version 2.2, neither CUDAGraph nor the Custom all-reduce operator was enabled by default. You need to add `--enable-custom-all-reduce` to the startup command to enable it manually.
### 1.2 Dynamic execution configuration caused by `FLAGS_max_partition_size`
The `FLAGS_max_partition_size` environment variable controls the `gridDim` execution configuration of the Kernel in CascadeAppend Attention, and a dynamic execution configuration will cause CUDAGraph execution to fail.
[PR#3223](https://github.com/PaddlePaddle/FastDeploy/pull/3223) fixed this issue, but it still exists in release versions before 2.2.
**Problem self-checking method:**
+ Calculate `div_up(max_model_len, max_partition_size)` from the value of `FLAGS_max_partition_size` (default 32K) and `max_model_len` in the startup parameters. If the result is greater than `1`, CUDAGraph cannot run; if it is equal to `1`, it runs normally.
**Solution:**
1. Adjust the values of `FLAGS_max_partition_size` and `max_model_len` so that the dynamic execution configuration is not triggered.
2. Disable CUDAGraph.
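The self-check above can be carried out with a few lines of Python. This is a minimal sketch, assuming the 32K default means 32768 and that `div_up` is ordinary ceiling division; the function names are illustrative and not part of FastDeploy.

```python
def div_up(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def triggers_dynamic_config(max_model_len: int, max_partition_size: int = 32768) -> bool:
    """True when div_up(max_model_len, max_partition_size) is greater than 1,
    which is the case the self-check flags."""
    return div_up(max_model_len, max_partition_size) > 1


if __name__ == "__main__":
    for max_model_len, partition in ((32768, 32768), (131072, 32768), (32768, 1024)):
        flagged = triggers_dynamic_config(max_model_len, partition)
        print(f"max_model_len={max_model_len} max_partition_size={partition} flagged={flagged}")
```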
## 2. GraphOptimizationBackend related configuration parameters
Currently, only the following parameters can be configured by users:
+ `use_cudagraph` : bool = False
+ `graph_optimization_config` : Dict[str, Any]
  + `graph_opt_level`: int = 0
  + `use_cudagraph`: bool = False
  + `cudagraph_capture_sizes` : List[int] = None
CudaGraph can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Setting it through both methods at the same time may cause conflicts.
The `graph_opt_level` parameter within `--graph-optimization-config` configures the graph optimization level, with the following options:
+ `0`: Use the dynamic compute graph (default).
+ `1`: Use the static compute graph. During the initialization phase, the Paddle API is used to convert the dynamic graph into a static graph.
+ `2`: On top of the static compute graph, use Paddle's compiler (CINN, Compiler Infrastructure for Neural Networks) to compile and optimize.
In general, static graphs have lower Kernel Launch overhead than dynamic graphs, so using static graphs is recommended.
For adapted models, FastDeploy's CudaGraph *supports both dynamic and static graphs*.
When CudaGraph is enabled with the default configuration, the list of Batch Sizes that CudaGraph needs to capture is set automatically based on the `max_num_seqs` parameter. The list is generated as follows:
1. Generate a candidate list of Batch Sizes in the range [1, 1024].
```
# Batch Size [1, 2, 4, 8, 16, ... 120, 128]
candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
# Batch Size (128, 144, ... 240, 256]
candidate_capture_sizes += [16 * i for i in range(9, 17)]
# Batch Size (256, 288, ... 992, 1024]
candidate_capture_sizes += [32 * i for i in range(17, 33)]
```
2. Crop the candidate list based on the user-set `max_num_seqs` to obtain a CudaGraph capture list in the range [1, `max_num_seqs`].
Users can also customize the list of batch sizes captured by CudaGraph through the `cudagraph_capture_sizes` parameter in `--graph-optimization-config`:
```
--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}'
```
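The default capture-list generation described above can be sketched end to end as follows. The candidate list is copied verbatim from the listing in step 1; the cropping rule (keep every candidate not larger than `max_num_seqs`) and the function name are assumptions made for illustration, and the actual implementation may treat values between candidates differently.

```python
def default_capture_sizes(max_num_seqs: int) -> list:
    """Sketch of the default CudaGraph capture list for a given max_num_seqs."""
    # Step 1: candidate batch sizes, copied from the listing above.
    candidates = [1, 2, 4] + [8 * i for i in range(1, 17)]
    candidates += [16 * i for i in range(9, 17)]
    candidates += [32 * i for i in range(17, 33)]
    # Step 2 (assumed rule): keep only candidates within [1, max_num_seqs].
    return [size for size in candidates if size <= max_num_seqs]


if __name__ == "__main__":
    print(default_capture_sizes(32))
    print(len(default_capture_sizes(256)), "graphs captured for max_num_seqs=256")
```

Each captured batch size costs extra memory, so a smaller `max_num_seqs` or a shorter custom `cudagraph_capture_sizes` list reduces the number of graphs held in memory.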
### 2.1 CudaGraph related parameters
Using CudaGraph incurs some additional memory overhead, which falls into two categories in FastDeploy:
+ Additional input Buffer overhead.
+ CudaGraph uses a dedicated memory pool, so it holds some intermediate activation memory that is isolated from the main framework.
During initialization, FastDeploy first uses the `gpu_memory_utilization` parameter to calculate the memory available for `KVCache`, and after `KVCache` is initialized it uses the remaining memory to initialize CudaGraph. Because CudaGraph is not enabled by default at present, enabling it with otherwise default startup parameters may cause `Out of memory` errors. In that case, try the following solutions:
+ Lower the `gpu_memory_utilization` value to reserve more memory for CudaGraph.
+ Lower `max_num_seqs` to decrease the maximum concurrency.
+ Customize the list of batch sizes that CudaGraph captures through `cudagraph_capture_sizes` in `graph_optimization_config` to reduce the number of captured graphs.

Before using CudaGraph, make sure the loaded model is properly decorated with `@support_graph_optimization`:
```python
# 1. import decorator
from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...

# 2. add decorator
@support_graph_optimization
class Ernie4_5_Model(nn.Layer):  # Note decorator is added to nn.Layer subclass
    ...

# 3. modify parameter passing in ModelForCasualLM subclass's self.model()
class Ernie4_5_MoeForCausalLM(ModelForCasualLM):
    ...
    def forward(
        self,
        ids_remove_padding: paddle.Tensor,
        forward_meta: ForwardMeta,
    ):
        hidden_states = self.model(ids_remove_padding=ids_remove_padding,  # specify parameter name when passing
                                   forward_meta=forward_meta)
        return hidden_states

from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...

@support_graph_optimization
class Ernie45TModel(nn.Layer):  # Note decorator is added to nn.Layer subclass
    ...
```



@@ -64,6 +64,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--port 8801 \
--metrics-port 8802 \
--engine-worker-queue-port 8803 \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--scheduler-name global \
--scheduler-ttl 900 \
--scheduler-host "127.0.0.1" \
@@ -71,7 +72,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-db 0 \
--scheduler-password "" \
--scheduler-topic "default" \
--scheduler-min-load_score 3 \
--scheduler-min-load-score 3 \
--scheduler-load-shards-num 1
```


@@ -0,0 +1,71 @@
# Multi-Node Deployment
## Overview
Multi-node deployment addresses scenarios where a single machine's GPU memory is insufficient to support deployment of large models by enabling tensor parallelism across multiple machines.
## Environment Preparation
#### Network Requirements
1. All nodes must be within the same local network
2. Ensure bidirectional connectivity between all nodes (test using `ping` and `nc -zv`)
#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)
## Tensor Parallel Deployment
### Recommended Launch Method
We recommend using mpirun for one-command startup without manually starting each node.
### Usage Instructions
1. Execute the same command on all machines
2. The IP order in the `ips` parameter determines the node startup sequence
3. The first IP will be designated as the master node
4. Ensure all nodes can resolve each other's hostnames
* Online inference startup example:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```
* Offline startup example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
    output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
    print(output)
```
* Notes:
- Only the master node can receive completion requests
- Always send requests to the master node (the first IP in the ips list)
- The master node will distribute workloads across all nodes
### Parameter Description
#### `ips` Parameter
- **Type**: `string`
- **Format**: Comma-separated IPv4 addresses
- **Description**: Specifies the IP addresses of all nodes in the deployment group
- **Required**: Only for multi-node deployments
- **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`
#### `tensor_parallel_size` Parameter
- **Type**: `integer`
- **Description**: Total number of GPUs across all nodes
- **Required**: Yes
- **Example**: For 2 nodes with 8 GPUs each, set to 16
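The relationship between the two parameters can be sketched as a small helper that derives both values from a node list. The function name and the assumption that every node contributes the same number of GPUs are illustrative only.

```python
def multi_node_args(node_ips, gpus_per_node):
    """Build the `ips` string and the total `tensor_parallel_size`.

    node_ips: list of IPv4 strings; the first entry is the master node.
    gpus_per_node: number of GPUs used on each node (assumed identical).
    """
    if not node_ips:
        raise ValueError("at least one node IP is required")
    return {
        "ips": ",".join(node_ips),
        "tensor_parallel_size": len(node_ips) * gpus_per_node,
    }


if __name__ == "__main__":
    print(multi_node_args(["192.168.1.101", "192.168.1.102"], gpus_per_node=8))
```

The resulting keys match the `ips` and `tensor_parallel_size` arguments used in the offline example above.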

docs/features/plugins.md

@@ -0,0 +1,99 @@
# FastDeploy Plugin Mechanism Documentation
FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
## How Plugins Work
Plugins are essentially registration functions that are automatically called when FastDeploy starts. The system uses the `load_plugins_by_group` function to ensure that all processes (including child processes in distributed training scenarios) have loaded the required plugins before official operations begin.
## Plugin Discovery Mechanism
FastDeploy uses Python's `entry_points` mechanism to discover and load plugins. Developers need to register their plugins in the specified entry point group in their project.
### Example: Creating a Plugin
#### 1. Write the Plugin Logic
Assuming you have a custom model class `MyModelForCasualLM` and a pretrained class `MyPretrainedModel`, you can write the following registration function:
```python
# File: fd_add_dummy_model/__init__.py or fd_add_dummy_model/register.py
from fastdeploy.model_registry import ModelRegistry
from my_custom_model import MyModelForCasualLM, MyPretrainedModel
from fastdeploy.config import ErnieArchitectures
def register():
    if "MyModelForCasualLM" not in ModelRegistry.get_supported_archs():
        if MyModelForCasualLM.name().startswith("Ernie"):
            ErnieArchitectures.register_ernie_model_arch(MyModelForCasualLM)
        ModelRegistry.register_model_class(MyModelForCasualLM)
        ModelRegistry.register_pretrained_model(MyPretrainedModel)
```
Assuming you have a custom model_runner class `MyModelRunner`, you can write the following registration function:
```python
# File: fd_add_dummy_model_runner/__init__.py
from .my_model_runner import MyModelRunner
def get_runner():
    return MyModelRunner
```
#### 2. Register Plugin in `setup.py`
```python
# setup.py
from setuptools import setup
setup(
    name="fastdeploy-plugins",
    version="0.1",
    packages=["fd_add_dummy_model", "fd_add_dummy_model_runner"],
    entry_points={
        "fastdeploy.model_register_plugins": [
            "fd_add_dummy_model = fd_add_dummy_model:register",
        ],
        "fastdeploy.model_runner_plugins": [
            # get_runner is defined in the fd_add_dummy_model_runner package
            "model_runner = fd_add_dummy_model_runner:get_runner"
        ],
    },
)
```
## Plugin Structure
Plugins consist of three components:
| Component | Description |
|-----------|-------------|
| **Plugin Group** | The functional group to which the plugin belongs, for example:<br> - `fastdeploy.model_register_plugins`: for model registration<br> - `fastdeploy.model_runner_plugins`: for model runner registration<br> Users can customize groups as needed. |
| **Plugin Name** | The unique identifier for each plugin (e.g., `fd_add_dummy_model`). The `FD_PLUGINS` environment variable uses this name to control whether the plugin is loaded. |
| **Plugin Value** | Format is `module_name:function_name`, pointing to the entry function that executes the registration logic. |
## Controlling Plugin Loading Behavior
By default, FastDeploy loads all registered plugins. To load only specific plugins, you can set the environment variable:
```bash
export FD_PLUGINS=fastdeploy-plugins
```
Multiple plugin names can be separated by commas:
```bash
export FD_PLUGINS=plugin_a,plugin_b
```
## Reference Example
Please refer to the example plugin implementation in the project directory:
```
./test/plugins/
```
It contains a complete plugin structure and `setup.py` configuration example.
## Summary
Through the plugin mechanism, users can easily add custom models or functional modules to FastDeploy without modifying the core source code. This not only enhances system extensibility but also facilitates third-party developers in extending functionality.
For further plugin development, please refer to the `model_registry` and `plugin_loader` modules in the FastDeploy source code.


@@ -8,14 +8,14 @@ Reasoning models return an additional `reasoning_content` field in their output,
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
The reasoning model requires a specified parser to extract reasoning content. The reasoning mode can be disabled by setting the `enable_thinking=False` parameter.
The reasoning model requires a specified parser to extract reasoning content. The reasoning mode can be disabled by setting the `"enable_thinking": false` parameter.
Interfaces that support toggling the reasoning mode:
1. `/v1/chat/completions` requests in OpenAI services.
2. `/v1/chat/completions` requests in the OpenAI Python client.
3. `llm.chat` requests in Offline interfaces.
For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `metadata={"reasoning_max_tokens": 1024}` to the request.
For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `"reasoning_max_tokens": 1024` to the request.
### Quick Start
When launching the model service, specify the parser name using the `--reasoning-parser` argument.
@@ -43,7 +43,8 @@ curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
{"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
]}
],
"metadata": {"enable_thinking": true}
"chat_template_kwargs":{"enable_thinking": true},
"reasoning_max_tokens": 1024
}'
```
@@ -68,7 +69,10 @@ chat_response = client.chat.completions.create(
],
model="vl",
stream=True,
metadata={"enable_thinking": True}
extra_body={
"chat_template_kwargs":{"enable_thinking": True},
"reasoning_max_tokens": 1024
}
)
for chunk in chat_response:
if chunk.choices[0].delta is not None:

docs/features/sampling.md (new file)

@@ -0,0 +1,225 @@
# Sampling Strategies
Sampling strategies are used to determine how to select the next token from the output probability distribution of a model. FastDeploy currently supports multiple sampling strategies including Top-p, Top-k_Top-p, and Min-p Sampling.
1. Top-p Sampling
* Top-p sampling truncates the cumulative probability distribution, considering only the smallest set of most likely tokens whose cumulative probability reaches a specified threshold p.
* It dynamically selects the number of tokens considered, ensuring diversity in the results while avoiding unlikely tokens.
2. Top-k_Top-p Sampling
* First keeps the top-k tokens, then renormalizes their probabilities, and finally performs top-p sampling within that set.
* By limiting the initial selection range (top-k) and then accumulating probabilities within it (top-p), it improves the quality and coherence of the generated text.
3. Min-p Sampling
* Min-p sampling calculates `pivot=max_prob * min_p`, then retains only tokens with probabilities greater than the `pivot` (setting others to zero) for subsequent sampling.
* It filters out tokens with relatively low probabilities, sampling only from high-probability tokens to improve generation quality.
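The three strategies above can be sketched as plain filters over a probability list. This is an illustrative, pure-Python version under stated assumptions, not the kernels FastDeploy runs: the function names are hypothetical, `top_k <= 0` is treated here as "no limit", and the min-p comparison uses "greater than or equal to" the pivot.

```python
import random


def renormalize(probs):
    """Scale the surviving probabilities so they sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]


def min_p_filter(probs, min_p):
    """Zero out tokens whose probability falls below pivot = max_prob * min_p."""
    pivot = max(probs) * min_p
    return renormalize([p if p >= pivot else 0.0 for p in probs])


def top_k_filter(probs, k):
    """Keep the k most likely tokens (assumption: k <= 0 means no limit)."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])


def sample(probs, top_p=1.0, top_k=0, min_p=0.0, rng=None):
    """Apply min-p, then top-k, then top-p, and draw one token index."""
    if rng is None:
        rng = random.Random()
    filtered = top_p_filter(top_k_filter(min_p_filter(probs, min_p), top_k), top_p)
    return rng.choices(range(len(filtered)), weights=filtered)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.7))
print(sample(probs, top_p=0.8, top_k=3, min_p=0.1, rng=random.Random(0)))
```

Because `top_p_filter` always keeps at least the most likely token, `top_p=0.0` reduces to greedy selection in this sketch.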
## Usage Instructions
During deployment, you can choose the sampling algorithm by setting the environment variable `FD_SAMPLING_CLASS`. Available values are `base`, `base_non_truncated`, `air`, or `rejection`.
**Algorithms Supporting Only Top-p Sampling**
* `base` (default): Directly normalizes using the `top_p` value, favoring tokens with greater probabilities.
* `base_non_truncated`: Strictly follows the Top-p sampling logic, first selecting the smallest set that reaches the cumulative probability of `top_p`, then normalizing these selected elements.
* `air`: This algorithm is inspired by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and supports Top-p sampling.
**Algorithms Supporting Top-p and Top-k_Top-p Sampling**
* `rejection`: This algorithm is inspired by [flashinfer](https://github.com/flashinfer-ai/flashinfer) and allows flexible settings for `top_k` and `top_p` parameters for Top-p or Top-k_Top-p sampling.
## Configuration Method
### Top-p Sampling
1. During deployment, set the environment variable to select the sampling algorithm (the default is `base`):
```bash
export FD_SAMPLING_CLASS=rejection # base, base_non_truncated, or air
```
2. When sending a request, specify the following parameters:
* Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "How old are you"}
],
"top_p": 0.8
}'
```
* Example request with Python:
```python
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
### Top-k_Top-p Sampling
1. During deployment, set the environment variable to select the rejection sampling algorithm:
```bash
export FD_SAMPLING_CLASS=rejection
```
2. When sending a request, specify the following parameters:
* Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "How old are you"}
],
"top_p": 0.8,
"top_k": 20
}'
```
* Example request with Python:
```python
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8,
    extra_body={"top_k": 20}
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
### Min-p Sampling
If you want to use min-p sampling before top-p or top-k_top-p sampling, specify the following parameters when sending a request:
* Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "How old are you"}
],
"min_p": 0.1,
"top_p": 0.8,
"top_k": 20
}'
```
* Example request with Python:
```python
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    stream=True,
    top_p=0.8,
    extra_body={"top_k": 20, "min_p": 0.1}
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
With the above configurations, you can flexibly choose and use the appropriate sampling strategy according to the needs of specific generation tasks.
## Parameter Description
`top_p`: The truncation threshold on the cumulative probability distribution; only the most likely tokens whose cumulative probability reaches this threshold are considered. It is a float in the range [0.0, 1.0]. When top_p=1.0, all tokens are considered; when top_p=0.0, it degenerates into greedy search.
`top_k`: The number of tokens with the highest sampling probability, limiting the sampling range to the top k tokens. It is an int type, with a range of [0, vocab_size].
`min_p`: Low probability filtering threshold, considering only the token set with probability greater than or equal to (`max_prob*min_p`). It is a float type, with a range of [0.0, 1.0].
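The ranges above can be checked on the client before a request is sent. The following is a hedged sketch: `validate_sampling_params` and `build_request_kwargs` are hypothetical helpers, not part of FastDeploy or the OpenAI client, and the server may apply its own, different validation. The second helper mirrors the Python examples above, where `top_p` is a regular argument and `top_k` and `min_p` travel in `extra_body`.

```python
def validate_sampling_params(top_p=None, top_k=None, min_p=None, vocab_size=None):
    """Return a list of human-readable problems; an empty list means all values are in range."""
    problems = []
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append("top_p must be in [0.0, 1.0]")
    if top_k is not None:
        if not isinstance(top_k, int) or top_k < 0:
            problems.append("top_k must be a non-negative int")
        elif vocab_size is not None and top_k > vocab_size:
            problems.append("top_k must not exceed vocab_size")
    if min_p is not None and not 0.0 <= min_p <= 1.0:
        problems.append("min_p must be in [0.0, 1.0]")
    return problems


def build_request_kwargs(top_p=None, top_k=None, min_p=None):
    """Split the parameters into standard arguments and extra_body fields."""
    kwargs, extra = {}, {}
    if top_p is not None:
        kwargs["top_p"] = top_p
    if top_k is not None:
        extra["top_k"] = top_k
    if min_p is not None:
        extra["min_p"] = min_p
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


print(validate_sampling_params(top_p=1.5, top_k=-1))
print(build_request_kwargs(top_p=0.8, top_k=20, min_p=0.1))
```

The returned dictionary could be unpacked into `client.chat.completions.create(**kwargs)` alongside `model` and `messages`.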
# Bad Words
The `bad_words` parameter prevents the model from generating specific words during inference. It is commonly used for safety control, content filtering, and behavioral constraints on the model.
## Usage Instructions
Include the `bad_words` parameter in the request:
* Example request with curl:
```bash
curl -X POST "http://0.0.0.0:9222/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "How old are you"}
],
"bad_words": ["age", "I"]
}'
```
* Example request with Python:
```python
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
    ],
    extra_body={"bad_words": ["you", "me"]},
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
## Parameter Description
`bad_words`: List of forbidden words. Type: list of str. Each word must be a single token.
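Since each entry must map to a single token, it can help to screen a word list before sending it. The sketch below assumes a tokenizer is available as an `encode` callable that returns a list of token IDs; `partition_bad_words` and the toy `toy_encode` function are hypothetical and only stand in for the real tokenizer of the deployed model, whose splitting behavior will differ.

```python
def partition_bad_words(words, encode):
    """Split words into those that encode to exactly one token and those that do not."""
    accepted, rejected = [], []
    for word in words:
        target = accepted if len(encode(word)) == 1 else rejected
        target.append(word)
    return accepted, rejected


# Toy vocabulary for illustration only: known words are one token,
# anything else falls back to one token per character.
TOY_VOCAB = {"age": 1, "I": 2, "you": 3, "me": 4}


def toy_encode(text):
    if text in TOY_VOCAB:
        return [TOY_VOCAB[text]]
    return [1000 + ord(ch) for ch in text]


accepted, rejected = partition_bad_words(["age", "I", "birthday"], toy_encode)
print(accepted, rejected)
```

With a real tokenizer, whether a word is a single token often depends on details such as a leading space, so the result of such a check is specific to the model being served.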


@@ -23,6 +23,7 @@ Execute the following command to start the service. For parameter configurations
>💡 **Note**: Since the model parameter size is 424B-A47B, on an 80G * 8 GPU machine, specify ```--quantization wint4``` (wint8 is also supported).
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
@@ -31,7 +32,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl
@@ -113,7 +113,7 @@ curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
{"type": "text", "text": "From which era does the artifact in the image originate?"}
]}
],
"metadata": {"enable_thinking": false}
"chat_template_kwargs":{"enable_thinking": false}
}'
```


@@ -1,6 +1,7 @@
# Deploy ERNIE-4.5-300B-A47B Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
- GPU Driver >= 535
- CUDA >= 12.3
- CUDNN >= 9.5
@@ -20,6 +21,7 @@ Specify `--model baidu/ERNIE-4.5-300B-A47B-Paddle` during deployment to automati
Execute the following command to start the service. For configuration details, refer to the [Parameter Guide](../parameters.md):
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \


@@ -1,12 +1,12 @@
# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on iluvatar machine
The current version of the software is only a demonstration of Iluvatar CoreX combined with the FastDeploy inference framework for large models. There may be issues when running the latest ERNIE4.5 model, and we will make fixes and performance optimizations in the future. Subsequent versions will provide customers with a more stable version.
The current version of the software is only a demonstration of Iluvatar CoreX combined with the FastDeploy inference framework for large models. Running the latest ERNIE4.5 300B model on the GSM8K dataset takes about 6.3 hours.
## Machine Preparation
First, you need to prepare a machine with the following configurations:
Running the ERNIE4.5 300B model uses `TP=16`, so you first need to prepare a machine with the following configuration:
| CPU | Memory | Card | Hard Disk|
| :---: | :---: | :---: | :---: |
| x86 | 1TB| 8xBI150| 1TB|
| x86 | 1TB| 16xBI150| 1TB|
Currently, the entire model needs to be loaded into the host memory, which requires more than 600GB of host memory. This issue will be optimized in subsequent versions.
@@ -32,7 +32,7 @@ docker exec -it paddle_infer bash
```bash
pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
pip3 install fastdeploy_iluvatar_gpu -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip3 install fastdeploy_iluvatar_gpu==2.1.0.dev0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
## Prepare the inference demo script
@@ -46,6 +46,7 @@ script list below:
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 run_demo.py
```
@@ -64,7 +65,7 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
# load the model
llm = LLM(model="/home/paddle/ernie-4_5-21b-a3b-bf16-paddle", tensor_parallel_size=4, max_model_len=8192, static_decode_blocks=0, quantization='wint8')
llm = LLM(model="/home/paddle/ernie-4_5-21b-a3b-bf16-paddle", tensor_parallel_size=4, max_model_len=8192, static_decode_blocks=0, block_size=16, quantization='wint8')
# Perform batch inference
outputs = llm.generate(prompts, sampling_params)
@@ -118,3 +119,281 @@ Now, let's break down each step:
**Step 3: Drawing the
The largest ocean is the Pacific Ocean, covering an area of approximately … [3], The first scientific expeditions to determine the ocean's depth were the Challenger expedition (1872–1876) and the U.S. Navy Hydrographic Office survey (1877–1879). The oceanic crust is thin and irregular, consisting of upward moving magma from the mantle below, and cooling and solidifying on the surface. The shallowest parts of the ocean are called the continental shelves. Large tides are caused mainly by the alignment of the Sun, Moon, and Earth during new or full moons. The origin of the word "ocean" is not clear. The first global oceanic topography survey was completed by the Challenger expedition (1872–1876). [57] The sound speed in the ocean is primarily a function of water temperature and salinity, and varies with depth. The deep-ocean floor is mostly flat and devoid of life, with the exception of seamounts and various underwater volcanic features, including seamounts and hydrothermal vents. [73] Today, the five ocean
```
## Run ernie4.5 300B model with the GSM8K dataset
1. Download GSM8K dataset
```bash
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
2. Prepare `bench_gsm8k.py`
```python
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Metric evaluation for FastDeploy + ERNIE-4.5-Turbo """
# adapted from https://github.com/sgl-project/sglang/blob/main/benchmark/gsm8k/bench_other.py
import argparse
import ast
import json
import re
import time
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import requests
from tqdm import tqdm
INVALID = -9999999


def call_generate(prompt, **kwargs):
    """
    Generates response based on the input prompt.

    Args:
        prompt (str): The input prompt text.
        **kwargs: Keyword arguments, including server IP address and port number.

    Returns:
        str: The response generated based on the prompt.
    """
    url = f"http://{kwargs['ip']}:{kwargs['port']}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "messages": [
            {
                "role": "user",
                "content": prompt,
            }
        ],
        "temperature": 0.6,
        "max_tokens": 2047,
        "top_p": 0.95,
        "do_sample": True,
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    out = response.json()
    return out["choices"][0]["message"]["content"]


def get_one_example(lines, i, include_answer):
    """
    Retrieves a question-answer example from the given list of text lines.

    Args:
        lines (list of dict): A list of question-answer pairs.
        i (int): The index of the question-answer pair to retrieve from lines.
        include_answer (bool): Whether to include the answer in the returned string.

    Returns:
        str: A formatted question-answer string in the format "Question: <question>\nAnswer: <answer>".
    """
    ret = "Question: " + lines[i]["question"] + "\nAnswer:"
    if include_answer:
        ret += " " + lines[i]["answer"]
    return ret


def get_few_shot_examples(lines, k):
    """
    Selects k examples from the given list of text lines and concatenates them into a single string.

    Args:
        lines (list): A list containing text lines.
        k (int): The number of examples to select.

    Returns:
        str: A string composed of k examples, separated by two newline characters.
    """
    ret = ""
    for i in range(k):
        ret += get_one_example(lines, i, True) + "\n\n"
    return ret


def get_answer_value(answer_str):
    """
    Extracts the last integer from an answer string and returns it.

    Args:
        answer_str (str): The string containing the answer.

    Returns:
        The extracted numerical value; returns the INVALID sentinel if extraction fails.
    """
    answer_str = answer_str.replace(",", "")
    numbers = re.findall(r"\d+", answer_str)
    if len(numbers) < 1:
        return INVALID
    try:
        return ast.literal_eval(numbers[-1])
    except SyntaxError:
        return INVALID


def read_jsonl(filename: str):
    """
    Reads a JSONL file.

    Args:
        filename (str): Path to the JSONL file.

    Yields:
        dict: A dictionary object corresponding to each line in the JSONL file.
    """
    with open(filename) as fin:
        for line in fin:
            if line.startswith("#"):
                continue
            yield json.loads(line)


def main(args):
    """
    Process inputs and generate answers by calling the model in parallel using a thread pool.

    Args:
        args (argparse.Namespace):
            - num_questions (int): Number of questions to process.
            - num_shots (int): Number of few-shot learning examples.
            - ip (str): IP address of the model service.
            - port (int): Port number of the model service.
            - parallel (int): Number of questions to process in parallel.
            - data_path (str): Path to the GSM8K JSONL file.
            - result_file (str): File path to store the results.

    Returns:
        None
    """
    # Read data (use the --data-path argument instead of a hard-coded file name)
    filename = args.data_path
    lines = list(read_jsonl(filename))

    # Construct prompts
    num_questions = args.num_questions
    num_shots = args.num_shots
    few_shot_examples = get_few_shot_examples(lines, num_shots)

    questions = []
    labels = []
    for i in range(len(lines[:num_questions])):
        questions.append(get_one_example(lines, i, False))
        labels.append(get_answer_value(lines[i]["answer"]))
    assert all(l != INVALID for l in labels)

    states = [None] * len(labels)

    # Use thread pool
    def get_one_answer(i):
        answer = call_generate(
            prompt=few_shot_examples + questions[i],
            # stop=["Question", "Assistant:", "<|separator|>"],
            ip=args.ip,
            port=args.port,
        )
        states[i] = answer

    tic = time.time()
    if args.parallel == 1:
        for i in tqdm(range(len(questions))):
            get_one_answer(i)
    else:
        with ThreadPoolExecutor(args.parallel) as executor:
            list(
                tqdm(
                    executor.map(get_one_answer, list(range(len(questions)))),
                    total=len(questions),
                )
            )

    latency = time.time() - tic
    preds = []
    for i in range(len(states)):
        preds.append(get_answer_value(states[i]))

    # Compute accuracy
    acc = np.mean(np.array(preds) == np.array(labels))
    invalid = np.mean(np.array(preds) == INVALID)

    # Print results
    print(f"Accuracy: {acc:.3f}")
    print(f"Invalid: {invalid:.3f}")
    print(f"Latency: {latency:.3f} s")

    with open(args.result_file, "a") as fout:
        value = {
            "task": "gsm8k",
            "backend": "paddlepaddle",
            "num_gpus": 1,
            "latency": round(latency, 3),
            "accuracy": round(acc, 3),
            "num_requests": args.num_questions,
            "other": {
                "num_questions": args.num_questions,
                "parallel": args.parallel,
            },
        }
        fout.write(json.dumps(value) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", type=str, default="127.0.0.1")
    parser.add_argument("--port", type=str, default="8188")
    parser.add_argument("--num-shots", type=int, default=10)
    parser.add_argument("--data-path", type=str, default="test.jsonl")
    parser.add_argument("--num-questions", type=int, default=1319)
    parser.add_argument("--result-file", type=str, default="result.jsonl")
    parser.add_argument("--parallel", type=int, default=1)
    args = parser.parse_args()
    main(args)
```
3. Prepare `run_bench.sh`
```bash
#!/bin/bash
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
export FD_SAMPLING_CLASS=rejection
python3 -m fastdeploy.entrypoints.openai.api_server --model "/home/paddle/ernie-45t" --port 8188 --tensor-parallel-size 16 --block-size 16 --static-decode-blocks 0 --quantization wint8
```
4. Running the Script
First, open a terminal and run:
```bash
./run_bench.sh
```
After the service is ready, open another terminal and run:
```bash
python3 -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 8
```
It takes about 6.3 hours to run the GSM8K dataset.
```
Accuracy: 0.964
Invalid: 0.000
Latency: 22918.186 s
```
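To make the reported numbers easier to interpret, the sketch below restates how the benchmark script scores a response (the last integer in the text, with thousands separators removed) and converts the total latency into hours and wall-clock seconds per question. It is a simplified, standalone restatement for illustration: `extract_last_integer` and `summarize` are hypothetical names, and the sample strings are made up rather than taken from a real run.

```python
import ast
import re

INVALID = -9999999


def extract_last_integer(text):
    """Return the last run of digits in `text` as an int, or INVALID if none parses."""
    numbers = re.findall(r"\d+", text.replace(",", ""))
    if not numbers:
        return INVALID
    try:
        return ast.literal_eval(numbers[-1])
    except SyntaxError:
        # For example, digits with leading zeros such as "007" are not a valid literal.
        return INVALID


def summarize(preds, labels, latency_s):
    """Compute accuracy, invalid rate, and latency in hours and per question."""
    total = len(labels)
    return {
        "accuracy": sum(p == l for p, l in zip(preds, labels)) / total,
        "invalid": sum(p == INVALID for p in preds) / total,
        "hours": latency_s / 3600,
        "seconds_per_question": latency_s / total,
    }


responses = ["She earns 9 * 2 = $18 per day.\n#### 18", "I am not sure."]
preds = [extract_last_integer(r) for r in responses]
print(summarize(preds, labels=[18, 3], latency_s=7200.0))

# The latency reported above, converted to hours.
print(round(22918.186 / 3600, 2))
```

Note that the pattern `\d+` ignores signs and decimal points, so a response ending in `3.5` is scored as `5`; this is a property of the extraction rule rather than of the model.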


@@ -5,8 +5,8 @@
- OS: Linux
- Python: 3.10
- XPU Model: P800
- XPU Driver Version: ≥ 5.0.21.10
- XPU Firmware Version: ≥ 1.31
- XPU Driver Version: ≥ 5.0.21.26
- XPU Firmware Version: ≥ 1.48
Verified platform:
- CPU: INTEL(R) XEON(R) PLATINUM 8563C / Hygon C86-4G 7490 64-core Processor
@@ -15,8 +15,8 @@ Verified platform:
- OS: CentOS release 7.6 (Final)
- Python: 3.10
- XPU Model: P800 (OAM Edition)
- XPU Driver Version: 5.0.21.10
- XPU Firmware Version: 1.31
- XPU Driver Version: 5.0.21.26
- XPU Firmware Version: 1.48
**Note:** Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.
@@ -25,9 +25,9 @@ Verified platform:
```bash
mkdir Work
cd Work
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0
docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0 \
/bin/bash
docker exec -it fastdeploy-xpu /bin/bash
```
@@ -37,7 +37,7 @@ docker exec -it fastdeploy-xpu /bin/bash
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
@@ -49,7 +49,7 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
### Install FastDeploy (**Do NOT install via PyPI source**)
```bash
python -m pip install fastdeploy-xpu==2.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
python -m pip install fastdeploy-xpu==2.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
Alternatively, you can install the latest version of FastDeploy (Not recommended)
@@ -63,7 +63,7 @@ python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/pa
### Install PaddlePaddle
```bash
python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
```
Alternatively, you can install the latest version of PaddlePaddle (Not recommended)


@@ -13,14 +13,14 @@ The following installation methods are available when your environment meets the
**Notice**: The pre-built image only supports SM80/90 GPUs (e.g., H800/A800). If you are deploying on SM86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.
```shell
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
```
## 2. Pre-built Pip Installation
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:


@@ -16,6 +16,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \


@@ -19,6 +19,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -26,8 +27,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--reasoning-parser ernie-45-vl \
--enable-mm
--reasoning-parser ernie-45-vl
```
> 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
@@ -74,7 +74,7 @@ curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
{"type": "text", "text": "What era does this artifact belong to?"}
]}
],
"metadata": {"enable_thinking": false}
"chat_template_kwargs":{"enable_thinking": false}
}'
```
@@ -96,7 +96,7 @@ response = client.chat.completions.create(
{"type": "text", "text": "What era does this artifact belong to?"},
]},
],
metadata={"enable_thinking": False},
extra_body={"enable_thinking": False},
stream=True,
)
for chunk in response:


@@ -13,12 +13,12 @@
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅(WINT4)| WIP |128K |
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|✅(WINT4)| WIP | 128K |
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|| WIP | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | | ✅|128K |
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K |
## Documentation


@@ -39,7 +39,7 @@ Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and output struc
```python
from fastdeploy.entrypoints.llm import LLM
# Load the model
llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.chat(
messages=[
@@ -127,7 +127,7 @@ for message in messages:
})
sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
"prompt": prompt,
"multimodal_data": {
@@ -183,6 +183,7 @@ For ```LLM``` configuration, refer to [Parameter Documentation](parameters.md).
* min_p(float): Minimum probability relative to the maximum probability for a token to be considered (>0 filters low-probability tokens to improve quality)
* max_tokens(int): Maximum generated tokens (input + output)
* min_tokens(int): Minimum forced generation length
* bad_words(list[str]): Prohibited words
### 2.5 fastdeploy.engine.request.RequestOutput


@@ -21,9 +21,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
For more usage methods of the command line during service deployment, refer to [Parameter Descriptions](../parameters.md).
## Sending User Requests
## Chat Completion API
FastDeploy provides a Chat Completion API that is compatible with the OpenAI protocol, allowing user requests to be sent directly using OpenAI's request method.
The FastDeploy interface is compatible with the OpenAI protocol, allowing user requests to be sent directly using OpenAI's request method.
### Sending User Requests
Here is an example of sending a user request using the curl command:
@@ -73,53 +74,327 @@ print('\n')
For a description of the OpenAI protocol, refer to the document [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create).
## Parameter Differences
### Request Parameter Differences
The differences in request parameters between FastDeploy and the OpenAI protocol are as follows. Other request parameters will be ignored:
- `prompt` (supported only in the `v1/completions` interface)
- `messages` (supported only in the `v1/chat/completions` interface)
- `logprobs`: Optional[bool] = False (supported only in the `v1/chat/completions` interface)
- `top_logprobs`: Optional[int] = None (supported only in the `v1/chat/completions` interface. An integer between 0 and 20; `logprobs` must be set to true if this parameter is used)
- `frequency_penalty`: Optional[float] = 0.0
- `max_tokens`: Optional[int] = 16
- `presence_penalty`: Optional[float] = 0.0
- `stream`: Optional[bool] = False
- `stream_options`: Optional[StreamOptions] = None
- `temperature`: Optional[float] = None
- `top_p`: Optional[float] = None
- `metadata`: Optional[dict] = None (supported only in `v1/chat/completions` for configuring additional parameters, e.g., `metadata={"enable_thinking": True}`)
- `min_tokens`: Optional[int] = 1 (minimum number of tokens generated)
- `reasoning_max_tokens`: Optional[int] = None (maximum number of tokens for reasoning content, defaults to the same as `max_tokens`)
- `enable_thinking`: Optional[bool] = True (whether to enable reasoning for models that support deep thinking)
- `repetition_penalty`: Optional[float] = None (coefficient for directly penalizing repeated token generation (>1 penalizes repetition, <1 encourages repetition))
> Note: For multimodal models, the reasoning chain is enabled by default and can produce overly long outputs, so `max_tokens` can be set to the model's maximum output length, or the default value can be used.
### Return Field Differences
The additional return fields added by FastDeploy are as follows:
- `arrival_time`: Returns the cumulative time taken for all tokens
- `reasoning_content`: The returned result of the reasoning chain
### Compatible OpenAI Parameters
```python
messages: Union[List[Any], List[int]]
# List of input messages, which can be text messages (`List[Any]`, typically `List[dict]`) or token ID lists (`List[int]`).
tools: Optional[List[ChatCompletionToolsParam]] = None
# List of tool call configurations, used for enabling function calling (Function Calling) or tool usage (e.g., ReAct framework).
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient, penalizing tokens in proportion to how often they have already been generated (positive values suppress repetition, negative values encourage it, default `None` disables).
logprobs: Optional[bool] = False
# Whether to return the log probabilities of each generated token, used for debugging or analysis.
top_logprobs: Optional[int] = 0
# Returns the top `top_logprobs` tokens and their log probabilities for each generated position (default `0` means no return).
max_tokens: Optional[int] = Field(
    default=None,
    deprecated="max_tokens is deprecated in favor of the max_completion_tokens field",
)
# Deprecated: Maximum number of tokens to generate (recommended to use `max_completion_tokens` instead).
max_completion_tokens: Optional[int] = None
# Maximum number of tokens to generate (recommended alternative to `max_tokens`), no default limit (restricted by the model's context window).
presence_penalty: Optional[float] = None
# Presence penalty coefficient, penalizing tokens that have already appeared at least once, which nudges the model toward new topics (positive values suppress reuse, negative values encourage it, default `None` disables).
stream: Optional[bool] = False
# Whether to enable streaming output (return results token by token), default `False` (returns complete results at once).
stream_options: Optional[StreamOptions] = None
# Additional configurations for streaming output (such as chunk size, timeout, etc.), refer to the specific definition of `StreamOptions`.
temperature: Optional[float] = None
# Temperature coefficient, controlling generation randomness (`0.0` for deterministic generation, `>1.0` for more randomness, default `None` uses model default).
top_p: Optional[float] = None
# Nucleus sampling threshold, sampling only from the smallest set of most probable tokens whose cumulative probability reaches `top_p` (default `None` disables).
response_format: Optional[AnyResponseFormat] = None
# Specifies the output format (such as JSON, XML, etc.), requires passing a predefined format configuration object.
user: Optional[str] = None
# User identifier, used for tracking or distinguishing requests from different users (default `None` does not pass).
metadata: Optional[dict] = None
# Additional metadata, used for passing custom information (such as request ID, debug markers, etc.).
```
### Additional Parameters Added by FastDeploy
> Note:
When sending requests using curl, the following parameters can be used directly;
When sending requests using openai.Client, these parameters need to be placed in the `extra_body` parameter, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking":True}, "include_stop_str_in_output": True}`.
The following sampling parameters are supported.
```python
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability, relative to the probability of the most likely token, for a token to be considered; tokens below it are filtered out (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
```
The following extra parameters are supported:
```python
chat_template_kwargs: Optional[dict] = None
# Additional parameters passed to the chat template, used for customizing dialogue formats (default None).
reasoning_max_tokens: Optional[int] = None
# Maximum number of tokens to generate during reasoning (e.g., CoT, chain of thought) (default None means using global max_tokens).
structural_tag: Optional[str] = None
# Structural tag, used to mark specific structures of generated content (such as JSON, XML, etc., default None).
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
max_streaming_response_tokens: Optional[int] = None
# Maximum number of tokens returned at a time during streaming output (default None means no limit).
disable_chat_template: Optional[bool] = False
# Whether to disable chat template rendering, using raw input directly (default False means template is enabled).
```
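As a minimal sketch of how the parameters above fit into a request body, the snippet below builds a JSON payload for `v1/chat/completions` the way a raw HTTP request (for example via curl) would carry it, with the FastDeploy-specific fields placed at the top level. The helper name `build_chat_payload`, the message text, and the parameter values are illustrative assumptions; nothing is sent over the network.

```python
import json


def build_chat_payload(user_text, **fastdeploy_params):
    """Build a chat-completions request body (hypothetical helper, for illustration only)."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    # In a raw HTTP body, FastDeploy-specific fields sit next to the OpenAI ones.
    payload.update(fastdeploy_params)
    return payload


payload = build_chat_payload(
    "Hello",
    top_k=20,
    include_stop_str_in_output=True,
    chat_template_kwargs={"enable_thinking": True},
)
body = json.dumps(payload)  # string that could be passed to `curl -d`
print(body)
```

With `openai.Client`, the same FastDeploy-specific fields would go into `extra_body` instead, as described in the note above.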
### Differences in Return Fields
Additional return fields added by FastDeploy:
- `arrival_time`: Cumulative time consumed for all tokens
- `reasoning_content`: Return results of the chain of thought
- `prompt_token_ids`: List of token IDs for the input sequence
- `completion_token_ids`: List of token IDs for the output sequence
Overview of return parameters:
```python
ChatCompletionResponse:
    id: str
    object: str = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseChoice]
    usage: UsageInfo
ChatCompletionResponseChoice:
    index: int
    message: ChatMessage
    logprobs: Optional[LogProbs] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls", "recover_stop"]]
ChatMessage:
    role: str
    content: str
    reasoning_content: Optional[str] = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None

# Fields returned for streaming responses
ChatCompletionStreamResponse:
    id: str
    object: str = "chat.completion.chunk"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionResponseStreamChoice]
    usage: Optional[UsageInfo] = None
ChatCompletionResponseStreamChoice:
    index: int
    delta: DeltaMessage
    logprobs: Optional[LogProbs] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None
    arrival_time: Optional[float] = None
DeltaMessage:
    role: Optional[str] = None
    content: Optional[str] = None
    token_ids: Optional[List[int]] = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    reasoning_content: Optional[str] = None
```
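To illustrate how the return fields might be consumed, here is a small sketch that separates `reasoning_content` from `content` in a non-streaming response. The response dictionary is hand-written to follow the field overview above; its values are made up, and `split_answer` is a hypothetical helper rather than part of FastDeploy.

```python
# Hand-written example shaped like ChatCompletionResponse; all values are made up.
response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "created": 0,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "4",
                "reasoning_content": "2 + 2 equals 4.",
                "prompt_token_ids": None,
                "completion_token_ids": None,
            },
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {},  # UsageInfo fields are not listed in this section
}


def split_answer(resp):
    """Return (reasoning, answer) for the first choice; reasoning may be None."""
    message = resp["choices"][0]["message"]
    return message.get("reasoning_content"), message["content"]


reasoning, answer = split_answer(response)
print("reasoning:", reasoning)
print("answer:", answer)
```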
## Completion API
The Completion API is mainly intended for continuation scenarios: users supply their own fully constructed context and expect the model to output only the continuation. No additional `prompt` text is concatenated during inference.
### Sending User Requests
Here is an example of sending a user request using the curl command:
```bash
curl -X POST "http://0.0.0.0:8188/v1/completions" \
-H "Content-Type: application/json" \
-d '{
  "prompt": "The following is a 500-word travelogue and appreciation of Wenxin Park in Shenzhen"
}'
```
Here is an example of sending a user request using a Python script:
```python
import openai
host = "0.0.0.0"
port = "8170"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
response = client.completions.create(
    model="default",
    prompt="The following is a 500-word travelogue and appreciation of Wenxin Park in Shenzhen",
    stream=False,
)
print(response.choices[0].text)
```
For an explanation of the OpenAI protocol, refer to the [OpenAI Completion API](https://platform.openai.com/docs/api-reference/completions/create).
### Compatible OpenAI Parameters
```python
model: Optional[str] = "default"
# Specifies the model name or version to use, defaulting to `"default"` (which may point to the base model).
prompt: Union[List[int], List[List[int]], str, List[str]]
# Input prompt, supporting multiple formats:
# - `str`: Plain text prompt (e.g., `"Hello, how are you?"`).
# - `List[str]`: Multiple text segments (e.g., `["User:", "Hello!", "Assistant:", "Hi!"]`).
# - `List[int]`: Directly passes a list of token IDs (e.g., `[123, 456]`).
# - `List[List[int]]`: List of multiple token ID lists (e.g., `[[123], [456, 789]]`).
best_of: Optional[int] = None
# Generates `best_of` candidate results and returns the highest-scoring one (requires `n=1`).
frequency_penalty: Optional[float] = None
# Frequency penalty coefficient, penalizing tokens in proportion to how often they have already been generated (positive values suppress repetition, negative values encourage it).
logprobs: Optional[int] = None
# Returns the log probabilities of each generated token, can specify the number of candidates to return.
max_tokens: Optional[int] = None
# Maximum number of tokens to generate (including input and output), no default limit (restricted by the model's context window).
presence_penalty: Optional[float] = None
# Presence penalty coefficient, penalizing tokens that have already appeared at least once, which nudges the model toward new topics (positive values suppress reuse, negative values encourage it).
```
### Additional Parameters Added by FastDeploy
> Note:
When sending requests using curl, the following parameters can be used directly;
When sending requests using openai.Client, these parameters need to be placed in the `extra_body` parameter, e.g. `extra_body={"chat_template_kwargs": {"enable_thinking":True}, "include_stop_str_in_output": True}`.
The following sampling parameters are supported.
```python
top_k: Optional[int] = None
# Limits the consideration to the top K tokens with the highest probability at each generation step, used to control randomness (default None means no limit).
min_p: Optional[float] = None
# Minimum probability, relative to the probability of the most likely token, for a token to be considered; tokens below it are filtered out (default None means disabled).
min_tokens: Optional[int] = None
# Forces a minimum number of tokens to be generated, avoiding premature truncation (default None means no limit).
include_stop_str_in_output: Optional[bool] = False
# Whether to include the stop string content in the output (default False, meaning output is truncated when a stop string is encountered).
bad_words: Optional[List[str]] = None
# List of forbidden words (e.g., sensitive words) that the model should avoid generating (default None means no restriction).
repetition_penalty: Optional[float] = None
# Repetition penalty coefficient, reducing the probability of repeating already generated tokens (`>1.0` suppresses repetition, `<1.0` encourages repetition, default None means disabled).
```
The following extra parameters are supported:
```python
guided_json: Optional[Union[str, dict, BaseModel]] = None
# Guides the generation of content conforming to JSON structure, can be a JSON string, dictionary, or Pydantic model (default None).
guided_regex: Optional[str] = None
# Guides the generation of content conforming to regular expression rules (default None means no restriction).
guided_choice: Optional[List[str]] = None
# Guides the generation of content selected from a specified candidate list (default None means no restriction).
guided_grammar: Optional[str] = None
# Guides the generation of content conforming to grammar rules (such as BNF) (default None means no restriction).
return_token_ids: Optional[bool] = None
# Whether to return the token IDs of the generation results instead of text (default None means return text).
prompt_token_ids: Optional[List[int]] = None
# Directly passes the token ID list of the prompt, skipping the text encoding step (default None means using text input).
max_streaming_response_tokens: Optional[int] = None
# Maximum number of tokens returned at a time during streaming output (default None means no limit).
```
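The note above says that, with `openai.Client`, FastDeploy-specific parameters belong in `extra_body`. A minimal sketch of that split follows. The parameter names are copied from the lists in this section, while `split_for_openai_client` and the sample values are illustrative assumptions.

```python
# Parameter names taken from the FastDeploy-specific lists above.
FASTDEPLOY_ONLY = {
    "top_k", "min_p", "min_tokens", "include_stop_str_in_output", "bad_words",
    "repetition_penalty", "guided_json", "guided_regex", "guided_choice",
    "guided_grammar", "return_token_ids", "prompt_token_ids",
    "max_streaming_response_tokens",
}


def split_for_openai_client(params):
    """Move FastDeploy-only keys under `extra_body` (hypothetical helper)."""
    standard = {k: v for k, v in params.items() if k not in FASTDEPLOY_ONLY}
    extra_body = {k: v for k, v in params.items() if k in FASTDEPLOY_ONLY}
    if extra_body:
        standard["extra_body"] = extra_body
    return standard


kwargs = split_for_openai_client({
    "model": "default",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "top_k": 20,
    "guided_choice": ["yes", "no"],
})
print(kwargs)
```

The resulting dictionary could then be passed as `client.completions.create(**kwargs)`.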
### Overview of Return Parameters
```python
CompletionResponse:
    id: str
    object: str = "text_completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[CompletionResponseChoice]
    usage: UsageInfo
CompletionResponseChoice:
    index: int
    text: str
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    arrival_time: Optional[float] = None
    logprobs: Optional[int] = None
    reasoning_content: Optional[str] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]]

# Fields returned for streaming responses
CompletionStreamResponse:
    id: str
    object: str = "text_completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[CompletionResponseStreamChoice]
    usage: Optional[UsageInfo] = None
CompletionResponseStreamChoice:
    index: int
    text: str
    arrival_time: Optional[float] = None
    prompt_token_ids: Optional[List[int]] = None
    completion_token_ids: Optional[List[int]] = None
    logprobs: Optional[float] = None
    reasoning_content: Optional[str] = None
    finish_reason: Optional[Literal["stop", "length", "tool_calls"]] = None
```
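For streaming responses, a client typically concatenates the `text` field of successive chunks. The sketch below assumes the stream uses the OpenAI-style server-sent-events framing (`data: <json>` lines terminated by `data: [DONE]`); that framing, the helper `collect_stream_text`, and the sample chunks are assumptions for illustration and are not taken from this document.

```python
import json


def collect_stream_text(sse_lines, index=0):
    """Concatenate `text` from CompletionStreamResponse chunks for one choice index."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        for choice in chunk["choices"]:
            if choice["index"] == index:
                parts.append(choice["text"])
    return "".join(parts)


# Hand-written sample chunks; the values are made up.
sample = [
    'data: {"id": "cmpl-example", "object": "text_completion", "created": 0, '
    '"model": "default", "choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"id": "cmpl-example", "object": "text_completion", "created": 0, '
    '"model": "default", "choices": [{"index": 0, "text": ", world", "finish_reason": "stop"}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```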

View File

@@ -8,6 +8,8 @@ When using FastDeploy to deploy models (including offline inference and service
|:--------------|:----|:-----------|
| ```port``` | `int` | Only required for service deployment, HTTP service port number, default: 8000 |
| ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: 8001 |
| ```max_waiting_time``` | `int` | Only required for service deployment, maximum wait time for establishing a connection upon service request. Default: -1 (indicates no wait time limit).|
| ```max_concurrency``` | `int` | Only required for service deployment, the actual number of connections established by the service, default 512 |
| ```engine_worker_queue_port``` | `int` | FastDeploy internal engine communication port, default: 8002 |
| ```cache_queue_port``` | `int` | FastDeploy internal KVCache process communication port, default: 8003 |
| ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
@@ -19,7 +21,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to model path |
| ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, enabled by default when automatically calculating KV Cache |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
| ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |
@@ -33,8 +35,8 @@ When using FastDeploy to deploy models (including offline inference and service
| ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
| ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
| ```use_cudagraph``` | `bool` | Whether to use cuda graph, default: False |
| ```graph_optimization_config``` | `str` | Parameters related to graph optimization can be configured, with default values of '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
| ```use_cudagraph``` | `bool` | Whether to use cuda graph, default: False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. In multi-card scenarios, Custom all-reduce must be enabled at the same time. |
| ```graph_optimization_config``` | `dict[str]` | Configures parameters related to computation graph optimization, default: '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }'. For details, see [graph_optimization.md](./features/graph_optimization.md) |
| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
| ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
| ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
@@ -44,6 +46,8 @@ When using FastDeploy to deploy models (including offline inference and service
| ```dynamic_load_weight``` | `int` | Whether to enable dynamic weight loading, default: 0 |
| ```enable_expert_parallel``` | `bool` | Whether to enable expert parallel |
| ```enable_logprob``` | `bool` | Whether to return log probabilities of the output tokens. If true, the log probability of each output token is returned in the content of message. If logprob is not used, this parameter can be omitted at startup |
| ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output. |
| ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository. The code format within these parsers must adhere to the format used in the code repository. |
## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?
@@ -68,86 +72,3 @@ In actual inference, it's difficult for users to know how to properly configure
When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the batch token count in prefill phase (limiting single prefill token count), thus introducing `max_num_partial_prefills` parameter specifically to limit concurrently processed partial batches.
To optimize scheduling priority for short requests, new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in single prefill batch, the latter defines the token threshold for long requests. The system will prioritize batch space for short requests, thereby reducing short request latency in mixed workload scenarios while maintaining stable throughput.
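As a rough illustration of the long-request threshold, the sketch below computes the documented default of `long_prefill_token_threshold` (`max_model_len*0.04`) and classifies a request by its prompt length. Truncating to an integer and the helper names are assumptions; the actual scheduler logic may differ.

```python
def default_long_prefill_threshold(max_model_len):
    # Documented default: max_model_len * 0.04 (truncation to int is an assumption).
    return int(max_model_len * 0.04)


def is_long_request(num_prompt_tokens, max_model_len, threshold=None):
    """A request counts as long when its token count exceeds the threshold."""
    if threshold is None:
        threshold = default_long_prefill_threshold(max_model_len)
    return num_prompt_tokens > threshold


print(default_long_prefill_threshold(32768))
print(is_long_request(2000, 32768), is_long_request(512, 32768))
```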
## 4. GraphOptimizationBackend related configuration parameters
Currently, only user configuration of the following parameters is supported:
- `use_cudagraph` : bool = False
- `graph_optimization_config` : Dict[str, Any]
- `graph_opt_level`: int = 0
- `use_cudagraph`: bool = False
- `cudagraph_capture_sizes` : List[int] = None
CudaGraph can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Setting it through both methods at the same time may cause conflicts.
The `graph_opt_level` parameter within `--graph-optimization-config` is used to configure the graph optimization level, with the following available options:
- `0`: Use dynamic compute graph (default)
- `1`: Use static compute graph; during the initialization phase, the Paddle API is used to convert the dynamic graph into a static graph
- `2`: Based on the static compute graph, use Paddle's compiler (CINN, Compiler Infrastructure for Neural Networks) to compile and optimize
In general, static graphs have lower Kernel Launch overhead than dynamic graphs, and it is recommended to use static graphs.
For adapted models, FastDeploy's CudaGraph *can support both dynamic and static graphs* simultaneously.
When CudaGraph is enabled in the default configuration, the list of batch sizes that CudaGraph needs to capture is set automatically based on the `max_num_seqs` parameter. The logic for generating this list is as follows:
1. Generate a candidate list of batch sizes in the range [1, 1024].
```
# Batch Size [1, 2, 4, 8, 16, ... 120, 128]
candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
# Batch Size (128, 144, ... 240, 256]
candidate_capture_sizes += [16 * i for i in range(9, 17)]
# Batch Size [544, 576, ... 992, 1024]
candidate_capture_sizes += [32 * i for i in range(17, 33)]
```
2. Crop the candidate list based on the user-set `max_num_seqs` to obtain a CudaGraph capture list in the range [1, `max_num_seqs`].
Users can also customize the batch size list that needs to be captured by CudaGraph through the parameter `cudagraph_capture_sizes` in `--graph-optimization-config`:
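The two steps above can be sketched as a single function. The candidate list reuses the expressions from step 1 as written; the cropping rule (keep every candidate not larger than `max_num_seqs`) and the function name are assumptions about how the crop is applied.

```python
def cudagraph_capture_sizes(max_num_seqs):
    """Generate candidate batch sizes, then crop them to max_num_seqs."""
    candidates = [1, 2, 4] + [8 * i for i in range(1, 17)]
    candidates += [16 * i for i in range(9, 17)]
    candidates += [32 * i for i in range(17, 33)]
    # Assumed cropping rule: drop every candidate larger than max_num_seqs.
    return [size for size in candidates if size <= max_num_seqs]


print(cudagraph_capture_sizes(64))
# [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]
```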
```
--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}'
```
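Because the option value is a JSON string, it can be convenient to generate it programmatically so that quoting stays correct. A minimal sketch using only the Python standard library follows; the chosen values are illustrative.

```python
import json
import shlex

# Illustrative settings; the keys are the ones listed in this section.
config = {"use_cudagraph": True, "cudagraph_capture_sizes": [1, 3, 5, 7, 9]}

# json.dumps emits JSON booleans (true/false); shlex.quote makes the string shell-safe.
arg = "--graph-optimization-config " + shlex.quote(json.dumps(config))
print(arg)
```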
### CudaGraph related parameters
Using CudaGraph incurs some additional memory overhead, divided into two categories in FastDeploy:
- Additional input Buffer overhead
- CudaGraph uses a dedicated memory pool, so it holds some intermediate activation memory that is isolated from the main framework
During initialization, FastDeploy first uses the `gpu_memory_utilization` parameter to calculate the memory available for `KVCache`; after `KVCache` is initialized, the remaining memory is used to initialize CudaGraph. Since CudaGraph is not enabled by default, enabling it with otherwise default startup parameters may cause `Out of memory` errors. The following solutions can be tried:
- Lower `gpu_memory_utilization` value, reserve more memory for CudaGraph.
- Lower `max_num_seqs` to decrease the maximum concurrency.
- Customize the batch size list that CudaGraph needs to capture through `graph_optimization_config`, and reduce the number of captured graphs by using `cudagraph_capture_sizes`
- Before use, ensure that the loaded model is properly decorated with ```@support_graph_optimization```.
```python
# 1. import decorator
from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...
# 2. add decorator
@support_graph_optimization
class Ernie4_5_Model(nn.Layer): # Note decorator is added to nn.Layer subclass
    ...
# 3. modify parameter passing in ModelForCasualLM subclass's self.model()
class Ernie4_5_MoeForCausalLM(ModelForCasualLM):
    ...
    def forward(
        self,
        ids_remove_padding: paddle.Tensor,
        forward_meta: ForwardMeta,
    ):
        hidden_states = self.model(ids_remove_padding=ids_remove_padding, # specify parameter name when passing
                                   forward_meta=forward_meta)
        return hidden_states
from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...
@support_graph_optimization
class Ernie45TModel(nn.Layer): # Note decorator is added to nn.Layer subclass
    ...
```
- When ```use_cudagraph``` is enabled, only single-GPU inference is currently supported, i.e. ```tensor_parallel_size``` must be set to 1.
- When ```use_cudagraph``` is enabled, ```enable_prefix_caching``` and ```enable_chunked_prefill``` cannot be enabled.

View File

@@ -38,7 +38,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
# Whether to use HuggingFace tokenizer (0 or 1)
"FD_USE_HF_TOKENIZER":
lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),
lambda: bool(int(os.getenv("FD_USE_HF_TOKENIZER", 0))),
# ZMQ send high-water mark (HWM) during initialization
"FD_ZMQ_SNDHWM":

View File

@@ -2,17 +2,17 @@
|Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Minimum Version Required|
|-|-|-|-|-|-|
|ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-300B-A47B|32K|WINT4|4 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-300B-A47B|128K|WINT4|8 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-21B-A3B|32K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-21B-A3B|128K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-300B-A47B|32K|WINT4|4 (Recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|>=2.0.0|
|ERNIE-4.5-300B-A47B|128K|WINT4|8 (Recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|32K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|128K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-0.3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-0.3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="x" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-0.3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-0.3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-0.3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
@@ -89,4 +89,4 @@ for chunk in response:
print('\n')
```
For detailed OpenAI protocol specifications, see [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [OpenAI Protocol-Compatible API Server](../../online_serving/README.md).
For detailed OpenAI protocol specifications, see [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [OpenAI Protocol-Compatible API Server](../online_serving/README.md).

View File

@@ -0,0 +1,94 @@
# ERNIE-4.5-0.3B
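## 1. Environment Preparation
### 1.1 Hardware Support
The minimum number of cards required to deploy ERNIE-4.5-0.3B at each quantization precision on the following hardware is: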
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 1 | 1 | / |
|A10 24GB| 1 | 1 | / |
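**Notes:**
1. To change the number of cards used for deployment, specify `--tensor-parallel-size 1` in the launch command.
2. For hardware not listed in the table, estimate from the available GPU memory whether deployment is feasible.
### 1.2 Install FastDeploy
- For installation, see [FastDeploy Installation](../get_started/installation/README.md).
- For model download, see the [supported model list](../supported_models.md). **Note that deployment with FastDeploy requires models with the Paddle suffix.**
## 2. Usage
### 2.1 Basics: Starting the Service
Start the service with the following command: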
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
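Where:
- `--quantization`: the quantization strategy used by the model. Performance and accuracy vary with the strategy. Options: `wint8` / `wint4` / `block_wise_fp8` (requires the Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the deployed service. A larger value allows a longer context but also uses more GPU memory, which may reduce concurrency.
For more parameters and their defaults, see the [FastDeploy parameter reference](../parameters.md).
### 2.2 Advanced: Getting Better Performance
#### 2.2.1 Assess the Workload and Set Parameters Accordingly
Estimate the average input length, average output length, and maximum context length for your workload. For example, with an average input length of 1000 and an output length of 30000, a setting of 32768 is recommended.
- Set `max-model-len` according to the maximum context length.
- **Enable service-managed global blocks**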
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
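As a rough illustration of the sizing rule in this section, the sketch below adds the expected input and output lengths and rounds the total up to the next power of two. The rounding is an assumption made here to reproduce the 32768 from the example, and `suggest_max_model_len` is a hypothetical helper, not part of FastDeploy.

```python
def suggest_max_model_len(max_input_len: int, max_output_len: int) -> int:
    """Round the total context length up to the next power of two (assumed convention)."""
    total = max_input_len + max_output_len
    if total <= 0:
        raise ValueError("context length must be positive")
    # (total - 1).bit_length() is the exponent of the smallest power of two >= total
    return 1 << (total - 1).bit_length()


# Average input of 1000 tokens plus output of 30000 tokens, as in the example
print(suggest_max_model_len(1000, 30000))
```

Any value at or above the real maximum context length works. A power of two is only a convenient choice, and a larger value costs more GPU memory.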
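#### 2.2.2 Prefix Caching
**How it works:** Prefix Caching caches the intermediate results (KV Cache) computed for input sequences so that they are not recomputed, which speeds up responses for requests that share the same prefix. See [prefix-cache](../features/prefix_caching.md) for details.
**How to enable:**
Add the following two lines to the launch arguments. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine.
```
--enable-prefix-caching
--swap-space 50
```
#### 2.2.3 Chunked Prefill
**How it works:** Prefill requests are split into smaller chunks and batched together with Decode requests. This balances compute-bound Prefill against memory-bound Decode, improves GPU utilization, and reduces the compute and GPU memory used by a single Prefill step, which lowers peak GPU memory and helps avoid out-of-memory errors. See [Chunked Prefill](../features/chunked_prefill.md) for details.
**How to enable:** add the following to the launch arguments:
```
--enable-chunked-prefill
```
#### 2.2.4 CUDAGraph
**How it works:**
CUDAGraph is a GPU acceleration technique from NVIDIA that captures a sequence of CUDA operations as a graph so that GPU work can be executed and optimized efficiently. The core idea is to package a series of GPU compute and memory operations into a graph that can be replayed, which reduces CPU-GPU communication overhead, lowers kernel launch latency, and improves overall performance.
**How to enable:**
Add the following to the launch command:
```
--use-cudagraph
```
Notes:
1. Usually no other parameters are needed. CUDAGraph does add some GPU memory overhead, so memory-constrained setups may need tuning. For details, see the configuration parameters described in [GraphOptimizationBackend](../features/graph_optimization.md).
2. With CUDAGraph enabled, multi-card inference with TP>1 also requires `--enable-custom-all-reduce`.
3. With CUDAGraph enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.5 Rejection Sampling
**How it works:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting and speeds up sampling. The gain is most noticeable for small models.
**How to enable:**
Set the following environment variable before launch:
```
export FD_SAMPLING_CLASS=rejection
```
## 3. FAQ
If you run into problems, see the [FAQ](./FAQ.md).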

View File

@@ -0,0 +1,150 @@
# ERNIE-4.5-21B-A3B
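## 1. Environment Preparation
### 1.1 Hardware Support
The minimum number of cards required to deploy ERNIE-4.5-21B-A3B at each quantization precision on the following hardware is: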
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
|H20 96GB| 1 | 1 | 1 |
|L20 48GB| 1 | 1 | 1 |
|A30 40GB| 2 | 1 | / |
|A10 24GB| 2 | 1 | / |
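**Notes:**
1. To change the number of cards used for deployment, specify `--tensor-parallel-size 2` in the launch command.
2. For hardware not listed in the table, estimate from the available GPU memory whether deployment is feasible.
### 1.2 Install FastDeploy
- For installation, see [FastDeploy Installation](../get_started/installation/README.md).
- For model download, see the [supported model list](../supported_models.md). **Note that deployment with FastDeploy requires models with the Paddle suffix.**
## 2. Usage
### 2.1 Basics: Starting the Service
Start the service with the following command: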
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--tensor-parallel-size 1 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
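Where:
- `--quantization`: the quantization strategy used by the model. Performance and accuracy vary with the strategy. Options: `wint8` / `wint4` / `block_wise_fp8` (requires the Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the deployed service. A larger value allows a longer context but also uses more GPU memory, which may reduce concurrency.
For more parameters and their defaults, see the [FastDeploy parameter reference](../parameters.md).
### 2.2 Advanced: Getting Better Performance
#### 2.2.1 Assess the Workload and Set Parameters Accordingly
Estimate the average input length, average output length, and maximum context length for your workload. For example, with an average input length of 1000 and an output length of 30000, a setting of 32768 is recommended.
- Set `max-model-len` according to the maximum context length.
- **Enable service-managed global blocks**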
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
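#### 2.2.2 Prefix Caching
**How it works:** Prefix Caching caches the intermediate results (KV Cache) computed for input sequences so that they are not recomputed, which speeds up responses for requests that share the same prefix. See [prefix-cache](../features/prefix_caching.md) for details.
**How to enable:**
Add the following two lines to the launch arguments. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine.
```
--enable-prefix-caching
--swap-space 50
```
#### 2.2.3 Chunked Prefill
**How it works:** Prefill requests are split into smaller chunks and batched together with Decode requests. This balances compute-bound Prefill against memory-bound Decode, improves GPU utilization, and reduces the compute and GPU memory used by a single Prefill step, which lowers peak GPU memory and helps avoid out-of-memory errors. See [Chunked Prefill](../features/chunked_prefill.md) for details.
**How to enable:** add the following to the launch arguments:
```
--enable-chunked-prefill
```
#### 2.2.4 MTP (Multi-Token Prediction)
**How it works:**
MTP predicts several tokens at once, which reduces the number of decoding steps and significantly speeds up generation, while certain strategies preserve generation quality. See [speculative decoding](../features/speculative_decoding.md) for details.
**How to enable:**
Add the following to the launch arguments:
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
#### 2.2.5 CUDAGraph
**How it works:**
CUDAGraph is a GPU acceleration technique from NVIDIA that captures a sequence of CUDA operations as a graph so that GPU work can be executed and optimized efficiently. The core idea is to package a series of GPU compute and memory operations into a graph that can be replayed, which reduces CPU-GPU communication overhead, lowers kernel launch latency, and improves overall performance.
**How to enable:**
Add the following to the launch command:
```
--use-cudagraph
```
Notes:
1. Usually no other parameters are needed. CUDAGraph does add some GPU memory overhead, so memory-constrained setups may need tuning. For details, see the configuration parameters described in [GraphOptimizationBackend](../features/graph_optimization.md).
2. With CUDAGraph enabled, multi-card inference with TP>1 also requires `--enable-custom-all-reduce`.
3. With CUDAGraph enabled, `max-model-len > 32768` is not yet supported.
#### 2.2.6 Rejection Sampling
**How it works:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting and speeds up sampling. The gain is most noticeable for small models.
**How to enable:**
Set the following environment variable before launch:
```
export FD_SAMPLING_CLASS=rejection
```
#### 2.2.7 Disaggregated Deployment
**How it works:** Disaggregated deployment runs Prefill and Decode as separate deployments. In some scenarios this improves hardware utilization, raises throughput, and lowers end-to-end latency. See the disaggregated deployment documentation for details.
**How to enable:** Take a single machine with 8 GPUs running 1P1D (4 GPUs each) as an example. Compared with the default mixed deployment, you need `--splitwise-role` to specify each node's role, and the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES` to keep the GPUs and logs of the two nodes separate.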
```
# prefill
export CUDA_VISIBLE_DEVICES=0,1,2,3
export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
--max-model-len 131072 \
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill"
```
```
# decode
export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A3B-Paddle \
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
--gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
--splitwise-role "decode"
```
## 3. FAQ
If you run into problems, see the [FAQ](./FAQ.md).

View File

@@ -0,0 +1,144 @@
# ERNIE-4.5-300B-A47B
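## 1. Environment Preparation
### 1.1 Hardware Support
The minimum number of cards required to deploy ERNIE-4.5-300B-A47B at each quantization precision on the following hardware is: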
| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 |
|-----|-----|-----|-----|-----|-----|
|H800 80GB| 8 | 4 | 8 | 2 | 4 |
|A800 80GB| 8 | 4 | / | 2 | 4 |
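**Notes:**
1. To change the number of cards used for deployment, specify `--tensor-parallel-size 4` in the launch command.
2. Because quantization scales are provided only for the 4-card setup, the W4A8 model must be deployed on 4 cards.
3. For hardware not listed in the table, estimate from the available GPU memory whether deployment is feasible.
### 1.2 Install FastDeploy
- For installation, see [FastDeploy Installation](../get_started/installation/README.md).
- For model download, see the [supported model list](../supported_models.md). **Note that deployment with FastDeploy requires models with the Paddle suffix.**
## 2. Usage
### 2.1 Basics: Starting the Service
Start the service with the following command: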
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--tensor-parallel-size 8 \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 128
```
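Where:
- `--quantization`: the quantization strategy used by the model. Performance and accuracy vary with the strategy. Options: `wint8` / `wint4` / `block_wise_fp8` (requires the Hopper architecture).
- `--max-model-len`: the maximum number of tokens supported by the deployed service. A larger value allows a longer context but also uses more GPU memory, which may reduce concurrency.
For more parameters and their defaults, see the [FastDeploy parameter reference](../parameters.md).
### 2.2 Advanced: Getting Better Performance
#### 2.2.1 Assess the Workload and Set Parameters Accordingly
Estimate the average input length, average output length, and maximum context length for your workload.
- Set `max-model-len` according to the maximum context length. For example, with an average input length of 1000 and an output length of 30000, a setting of 32768 is recommended.
- **Enable service-managed global blocks**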
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
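#### 2.2.2 Prefix Caching
**How it works:** Prefix Caching caches the intermediate results (KV Cache) computed for input sequences so that they are not recomputed, which speeds up responses for requests that share the same prefix. See [prefix-cache](../features/prefix_caching.md) for details.
**How to enable:**
Add the following two lines to the launch arguments. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is in GB and should be adjusted to the machine.
```
--enable-prefix-caching
--swap-space 50
```
#### 2.2.3 Chunked Prefill
**How it works:** Prefill requests are split into smaller chunks and batched together with Decode requests. This balances compute-bound Prefill against memory-bound Decode, improves GPU utilization, and reduces the compute and GPU memory used by a single Prefill step, which lowers peak GPU memory and helps avoid out-of-memory errors. See [Chunked Prefill](../features/chunked_prefill.md) for details.
**How to enable:** add the following to the launch arguments:
```
--enable-chunked-prefill
```
#### 2.2.4 MTP (Multi-Token Prediction)
**How it works:**
MTP predicts several tokens at once, which reduces the number of decoding steps and significantly speeds up generation, while certain strategies preserve generation quality. See [speculative decoding](../features/speculative_decoding.md) for details.
**How to enable:**
Add the following to the launch arguments:
```
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
#### 2.2.5 W4A8C8 Quantization
**How it works:**
Quantization compresses the model, reduces GPU memory usage, and speeds up inference. The MoE weights use per-channel symmetric 4-bit quantization, activations use static per-tensor symmetric 8-bit quantization, and the KVCache uses static per-channel symmetric 8-bit quantization, which together give better inference performance.
**How to enable:**
Specify the corresponding model name, `baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle`, in the launch command:
```
--model baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle
```
#### 2.2.6 Rejection Sampling
**How it works:**
Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting and speeds up sampling. The gain is most noticeable for small models.
**How to enable:**
Set the following environment variable before launch:
```
export FD_SAMPLING_CLASS=rejection
```
#### 2.2.7 Disaggregated Deployment
**How it works:** Disaggregated deployment runs Prefill and Decode as separate deployments. In some scenarios this improves hardware utilization, raises throughput, and lowers end-to-end latency. See the disaggregated deployment documentation for details.
**How to enable:** Take a single machine with 8 GPUs running 1P1D (4 GPUs each) as an example. Compared with the default mixed deployment, you need `--splitwise-role` to specify each node's role, and the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES` to keep the GPUs and logs of the two nodes separate.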
```
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill"
```
```
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note: set innode-prefill-ports to the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
--quantization wint4 \
--innode-prefill-ports 8182 \
--splitwise-role "decode"
```
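A quick consistency check for the two launch commands above can be sketched as follows. The dictionaries simply restate the port flags from the example, and `check_pd_ports` is a hypothetical helper written for illustration, not a FastDeploy API.

```python
def check_pd_ports(prefill: dict, decode: dict) -> list:
    """Return a list of problems found in a 1P1D port layout (empty if none)."""
    problems = []
    # The decode instance must point at the prefill instance's engine worker queue.
    if decode["innode_prefill_ports"] != prefill["engine_worker_queue_port"]:
        problems.append("innode-prefill-ports does not match the prefill engine-worker-queue-port")
    # Both instances run on the same machine, so no listening port may be reused.
    listen_keys = ("port", "metrics_port", "engine_worker_queue_port", "cache_queue_port")
    used = [cfg[key] for cfg in (prefill, decode) for key in listen_keys]
    duplicates = sorted({p for p in used if used.count(p) > 1})
    if duplicates:
        problems.append(f"ports used more than once: {duplicates}")
    return problems


# Port values copied from the prefill and decode commands in this section
prefill = {"port": 8180, "metrics_port": 8181,
           "engine_worker_queue_port": 8182, "cache_queue_port": 8183}
decode = {"port": 8184, "metrics_port": 8185,
          "engine_worker_queue_port": 8186, "cache_queue_port": 8187,
          "innode_prefill_ports": 8182}
print(check_pd_ports(prefill, decode))
```

An empty list means the decode instance points at the prefill worker queue and no listening port is shared between the two processes.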
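#### 2.2.8 CUDAGraph
**How it works:**
CUDAGraph is a GPU acceleration technique from NVIDIA that captures a sequence of CUDA operations as a graph so that GPU work can be executed and optimized efficiently. The core idea is to package a series of GPU compute and memory operations into a graph that can be replayed, which reduces CPU-GPU communication overhead, lowers kernel launch latency, and improves overall performance.
**How to enable:**
Add the following to the launch command:
```
--use-cudagraph
--enable-custom-all-reduce
```
Notes:
1. Usually no other parameters are needed. CUDAGraph does add some GPU memory overhead, so memory-constrained setups may need tuning. For details, see the configuration parameters described in [GraphOptimizationBackend](../features/graph_optimization.md).
2. With CUDAGraph enabled, multi-card inference with TP>1 also requires `--enable-custom-all-reduce`.
3. With CUDAGraph enabled, `max-model-len > 32768` is not yet supported.
## 3. FAQ
If you run into problems, see the [FAQ](./FAQ.md).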

View File

@@ -0,0 +1,134 @@
# ERNIE-4.5-VL-28B-A3B-Paddle
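## 1. Environment Preparation
### 1.1 Hardware Support
The minimum number of cards required for deployment on the following hardware is:
| Device [GPU memory] | WINT4 | WINT8 | BFLOAT16 |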
|:----------:|:----------:|:------:| :------:|
| A30 [24G] | 2 | 2 | 4 |
| L20 [48G] | 1 | 1 | 2 |
| H20 [144G] | 1 | 1 | 1 |
| A100 [80G] | 1 | 1 | 1 |
| H800 [80G] | 1 | 1 | 1 |
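### 1.2 Install FastDeploy
For the installation procedure, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).
> ⚠️ Notes
> - FastDeploy only supports models in Paddle format, so download the model with the Paddle suffix.
> - Using the model name downloads the model automatically. If the model is already downloaded, you can pass the absolute path of the download location instead.
## 2. Usage
### 2.1 Basics: Starting the Service
**Example 1:** Deploying a service with 32K context on a single 4090 card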
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 32 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
**Example 2:** Deploying a service with 128K context on two H800 cards
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 128 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
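> ⚠️ In version 2.1 and later, enable the new scheduler with the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise some requests may be truncated before reaching the maximum length or return empty results.
The examples above are configurations that run stably and also deliver good performance.
If you have further requirements for accuracy or performance, read on.
### 2.2 Advanced: Getting Better Performance
#### 2.2.1 Assess the Workload and Set Parameters Accordingly
> **Context length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can process.
- **Recommendation:** Longer contexts reduce throughput, so set this according to actual needs. `ERNIE-4.5-VL-28B-A3B-Paddle` supports contexts of up to **128k** (131072) tokens.
⚠️ Note: Longer contexts significantly increase GPU memory requirements. Make sure the hardware can support the setting before raising it.
> **Maximum number of sequences**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can process. Supported range: 1~256.
- **Recommendation:** If you do not know the average number of sequences per request in your workload, we recommend **256**. If the average is clearly below 256, we recommend a smaller value slightly above the average to further reduce GPU memory usage and improve service performance.
> **Multiple image and video inputs**
- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request so that resources are used efficiently.
- **Recommendation:** We recommend setting both the image and video limits per prompt to 100 to balance performance and memory usage.
> **Fraction of GPU memory available at initialization**
- **Parameter:** `--gpu-memory-utilization`
- **Purpose:** Controls the GPU memory available to FastDeploy when initializing the service. The default is 0.9, which keeps 10% of GPU memory in reserve.
- **Recommendation:** Use the default of 0.9. If stress testing reports out-of-memory errors, try lowering the value.
#### 2.2.2 Chunked Prefill
- **Parameter:** `--enable-chunked-prefill`
- **Purpose:** Enabling `chunked prefill` can **lower peak GPU memory** and **increase service throughput**.
- **Related configuration**:
`--max-num-batched-tokens`: limits the maximum number of tokens per chunk. In multimodal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step can exceed this value. We recommend 384.
#### 2.2.3 **Quantization Precision**
- **Parameter:** `--quantization`
- **Supported precision types:**
    - WINT4 (suitable for most users)
    - WINT8
    - BFLOAT16 (used by default when `--quantization` is not set)
- **Recommendation:**
    - Unless you have extremely strict accuracy requirements, we recommend WINT4 quantization, which significantly reduces memory usage and increases throughput.
    - If you need slightly higher accuracy, try WINT8.
    - Use BFLOAT16 only when your application demands the highest accuracy, because it requires more GPU memory.
#### 2.2.4 **Adjustable Environment Variables**
> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting, speeds up sampling, and can improve inference performance.
- **Recommendation:** This is a fairly aggressive optimization that affects output quality, and we are still validating its impact. Try it if you need higher performance and can accept the effect on output quality.
> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
- **Description:** A hyperparameter of the Append Attention (default) backend. Our tests on commonly used datasets show that setting it to 1024 greatly improves decoding speed, especially for long texts.
- **Recommendation:** This will be changed to an automatic adjustment mechanism in the future. Try it if you need higher performance.
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.
### 3.1 Out of Memory (OOM)
If the service reports insufficient GPU memory at startup, try the following:
1. Make sure no other processes are using GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce the context length and maximum number of sequences as appropriate;
4. Increase the number of cards, deploying on 2 or 4 cards by setting `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
If the service starts normally but reports insufficient GPU memory at runtime, try the following:
1. Lower the fraction of GPU memory available at initialization, that is, the value of `--gpu-memory-utilization`, as appropriate;
2. Increase the number of cards, changing the parameter as described above.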

View File

@@ -0,0 +1,109 @@
# ERNIE-4.5-VL-424B-A47B-Paddle
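## 1. Environment Preparation
### 1.1 Hardware Support
The minimum number of cards required for deployment on the following hardware is:
| Device [GPU memory] | WINT4 | WINT8 | BFLOAT16 |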
|:----------:|:----------:|:------:| :------:|
| H20 [144G] | 8 | 8 | 8 |
| A100 [80G] | 8 | 8 | - |
| H800 [80G] | 8 | 8 | - |
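### 1.2 Install FastDeploy
For the installation procedure, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).
> ⚠️ Notes
> - FastDeploy only supports models in Paddle format, so download the model with the Paddle suffix.
> - Using the model name downloads the model automatically. If the model is already downloaded, you can pass the absolute path of the download location instead.
## 2. Usage
### 2.1 Basics: Starting the Service
**Example 1:** Deploying a service with 128K context on 8 H800 cards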
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 16 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.8 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
```
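> ⚠️ In version 2.1 and later, enable the new scheduler with the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise some requests may be truncated before reaching the maximum length or return empty results.
The example above is a configuration that runs stably and also delivers good performance.
If you have further requirements for accuracy or performance, read on.
### 2.2 Advanced: Getting Better Performance
#### 2.2.1 Assess the Workload and Set Parameters Accordingly
> **Context length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can process.
- **Recommendation:** Longer contexts reduce throughput, so set this according to actual needs. `ERNIE-4.5-VL-424B-A47B-Paddle` supports contexts of up to **128k** (131072) tokens.
> **Maximum number of sequences**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can process. Supported range: 1~256.
- **Recommendation:** For 128k contexts on a single machine with 80G GPUs, we recommend **16**.
> **Multiple image and video inputs**
- **Parameter:** `--limit-mm-per-prompt`
- **Description:** The model supports multiple images and videos in a single prompt. Use this parameter to limit the number of images/videos per request so that resources are used efficiently.
- **Recommendation:** We recommend setting both the image and video limits per prompt to 100 to balance performance and memory usage.
> **Fraction of GPU memory available at initialization**
- **Parameter:** `--gpu-memory-utilization`
- **Purpose:** Controls the GPU memory available to FastDeploy when initializing the service. The default is 0.9, which keeps 10% of GPU memory in reserve.
- **Recommendation:** For 128k contexts we recommend 0.8. If stress testing reports out-of-memory errors, try lowering the value.
#### 2.2.2 Chunked Prefill
- **Parameter:** `--enable-chunked-prefill`
- **Purpose:** Enabling `chunked prefill` can **lower peak GPU memory** and **increase service throughput**.
- **Related configuration**:
`--max-num-batched-tokens`: limits the maximum number of tokens per chunk. In multimodal scenarios each chunk is rounded up to keep images intact, so the actual number of tokens per inference step can exceed this value. We recommend 384.
#### 2.2.3 **Quantization Precision**
- **Parameter:** `--quantization`
- **Supported precision types:**
    - WINT4 (suitable for most users)
    - WINT8
    - BFLOAT16 (used by default when `--quantization` is not set)
- **Recommendation:**
    - Unless you have extremely strict accuracy requirements, we recommend WINT4 quantization, which significantly reduces memory usage and increases throughput.
    - If you need slightly higher accuracy, try WINT8.
    - Use BFLOAT16 only when your application demands the highest accuracy, because it requires more GPU memory.
#### 2.2.4 **Adjustable Environment Variables**
> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting, speeds up sampling, and can improve inference performance.
- **Recommendation:** This is a fairly aggressive optimization that affects output quality, and we are still validating its impact. Try it if you need higher performance and can accept the effect on output quality.
> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
- **Description:** A hyperparameter of the Append Attention (default) backend. Our tests on commonly used datasets show that setting it to 1024 greatly improves decoding speed, especially for long texts.
- **Recommendation:** This will be changed to an automatic adjustment mechanism in the future. Try it if you need higher performance.
## 3. FAQ
**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the configuration.
### 3.1 Out of Memory (OOM)
If the service reports insufficient GPU memory at startup, try the following:
1. Make sure no other processes are using GPU memory;
2. Use WINT4/WINT8 quantization and enable chunked prefill;
3. Reduce the context length and maximum number of sequences as appropriate.
If the service starts normally but reports insufficient GPU memory at runtime, try the following:
1. Lower the fraction of GPU memory available at initialization, that is, the value of `--gpu-memory-utilization`, as appropriate.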

View File

@@ -0,0 +1,37 @@
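# FAQ
## 1. Out of GPU Memory
1. Out of GPU memory when starting the service:
- Check the minimum number of cards required for the model and quantization method. If the requirement is not met, increase the number of cards.
- If CUDAGraph is enabled, try lowering `gpu_memory_utilization` to leave more GPU memory for CUDAGraph, or reduce CUDAGraph's memory usage by lowering `max_num_seqs` and setting `cudagraph_capture_sizes`.
2. Out of GPU memory while the service is running:
- Check whether the log contains messages like the following. If so, the cause is usually a shortage of output blocks, and `kv-cache-ratio` should be reduced.
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
We recommend enabling service-managed global blocks by setting the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
## 2. Poor Model Performance
1. First check whether the output length matches expectations and whether overly long decoding is the cause.
If the workload naturally produces long outputs, check whether the log contains messages like the following. If so, the cause is usually a shortage of output blocks, and `kv-cache-ratio` should be reduced.
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
As above, we recommend enabling service-managed global blocks by setting the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
2. Check whether the number of KVCache blocks allocated by automatic profiling matches expectations. If GPU memory fluctuations during profiling lead to too few blocks being allocated, set the `num_gpu_blocks_override` parameter manually to increase the number of KVCache blocks.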

View File

@@ -0,0 +1,7 @@
# Best Practices
- [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)
- [ERNIE-4.5-21B-A3B-Paddle.md](ERNIE-4.5-21B-A3B-Paddle.md)
- [ERNIE-4.5-300B-A47B-Paddle.md](ERNIE-4.5-300B-A47B-Paddle.md)
- [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
- [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)

View File

@@ -0,0 +1,117 @@
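# Early Stop
Early stop ends token generation ahead of time. It applies different strategies to decide whether the token sequence generated so far meets an early-stop condition and, if it does, ends generation early. FastDeploy currently supports the `Repetition` strategy and the `Stop Sequence` strategy.
## 1. Repetition Strategy
* The Repetition strategy decides whether to trigger early stop by counting how many times high-probability tokens are generated.
* Specifically, when the probability of the tokens generated for a batch exceeds the user-set probability threshold for a user-specified number of consecutive times, token generation for that batch ends early.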
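The counting rule described above can be sketched as a small state machine. This is an illustration of the idea only, not FastDeploy's implementation: `RepetitionStopper` is a hypothetical class, and stopping as soon as the streak reaches `window_size` is an assumption about the exact boundary.

```python
class RepetitionStopper:
    """Illustrative counter for the Repetition strategy (not FastDeploy's code)."""

    def __init__(self, window_size: int = 3000, threshold: float = 0.99):
        self.window_size = window_size
        self.threshold = threshold
        self.streak = 0  # consecutive high-probability tokens seen so far

    def update(self, token_prob: float) -> bool:
        """Record one generated token's probability; return True when generation should stop."""
        if token_prob > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # the run is broken, so counting starts again
        return self.streak >= self.window_size


# Small window so the effect is visible with a handful of made-up probabilities
stopper = RepetitionStopper(window_size=3, threshold=0.9)
for prob in [0.95, 0.5, 0.99, 0.97, 0.93]:
    if stopper.update(prob):
        print("early stop triggered")
        break
```

A single low-probability token resets the streak, so only an unbroken run of confident tokens ends generation.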
### Usage
Add the early-stop option when starting the service.
* Online inference launch examples:
* With default hyperparameters: --enable-early-stop
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--enable-early-stop
```
* With custom hyperparameters: --early-stop-config
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--early-stop-config '{"enable_early_stop":true, "window_size": 1000, "threshold": 0.9}'
```
* Offline inference examples:
* With default hyperparameters: enable_early_stop
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, enable_early_stop=True)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* With custom hyperparameters: early_stop_config
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-0.3B-Paddle"
early_stop_config = {"enable_early_stop":True, "window_size":1000, "threshold":0.9}
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=1, early_stop_config=early_stop_config)
output = llm.generate(prompts="who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
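### Parameters
* `enable_early_stop`: (bool) Whether to enable early stop. Defaults to False.
* `strategy`: (str) The early-stop strategy. Only the repetition strategy is currently supported. Defaults to "repetition".
* `window_size`: (int) In the repetition strategy, the upper limit on consecutive high-probability tokens. Exceeding it triggers early stop. Defaults to 3000.
* `threshold`: (float) The high-probability threshold in the repetition strategy. Defaults to 0.99.
## 2. Stop Sequence Strategy
* The Stop Sequence strategy decides whether to trigger early stop by checking whether the generated token sequence contains a user-specified stop sequence.
* Specifically, when the token sequence generated for a batch contains a user-specified stop sequence, token generation for that batch ends early.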
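A minimal sketch of the check, assuming it runs after every newly generated token: in that case testing whether the output ends with a stop sequence is equivalent to testing whether it contains one. `hits_stop_sequence` and the token ids are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence


def hits_stop_sequence(generated: Sequence[int], stop_seqs: List[Sequence[int]]) -> bool:
    """Return True if the generated token ids end with any of the stop sequences."""
    for stop in stop_seqs:
        n = len(stop)
        # Compare the last n generated ids against the stop sequence
        if n and len(generated) >= n and list(generated[-n:]) == list(stop):
            return True
    return False


# Made-up token ids: the output ends with the first stop sequence
print(hits_stop_sequence([5, 8, 13, 21], [[13, 21], [99]]))
```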
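### Usage
Before starting the service, set the following environment variables:
```
FD_STOP_SEQS_MAX_LEN: maximum supported length of a stop sequence, default 8
FD_MAX_STOP_SEQS_NUM: maximum supported number of stop sequences, default 5
```
When sending a request, include a `stop` field, which can be a `str` or a `List[str]`.
* Online inference request example, adding the `stop` parameter to the request: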
```
# create a chat request with "stop" parameter
import openai
ip = "0.0.0.0"
service_http_port = "8233"
client = openai.Client(base_url=f"http://{ip}:{service_http_port}/v1", api_key="EMPTY_API_KEY")
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": '今天天气真好'},
],
temperature=1.0,
top_p=0,
stream=False,
stop=["明天", "出去走走"]
)
```
* Offline inference request: add the `stop` parameter to `SamplingParams`
```
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "ERNIE-4.5-21B-A3B-Paddle"
sampling_params = SamplingParams(temperature=1, top_p=0, stop=["出去走走"])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "今天天气真好"}], use_tqdm=True, sampling_params=sampling_params)
print(output)
```

View File

@@ -0,0 +1,119 @@
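# Graph Optimization in FastDeploy
FastDeploy's `GraphOptimizationBackend` integrates several graph optimization techniques:
+ **CUDA Graph**: a mechanism that launches multiple GPU operations with a single CPU operation, which reduces overhead and improves performance
+ **Dynamic-to-static graph conversion**: converts the dynamic graph to a static graph and uses global graph structure information to optimize the computation graph and improve execution efficiency
+ **CINN neural network compiler**: applies computation-graph compilation optimizations such as IR transformation, kernel fusion, and kernel generation on top of the static graph
Any dynamic behavior, such as data-dependent control flow, Host-Device synchronization, model inputs whose addresses or shapes change, or dynamic kernel launch configurations, causes CUDAGraph Capture/Replay to fail. LLM inference involves dynamic input lengths, dynamic batch sizes, flexible Attention implementations, and multi-card communication, all of which make CUDA Graph hard to apply.
Mainstream open-source solutions implement CUDA Graph on top of static graphs, which requires a deep technology stack. FastDeploy supports not only combined optimization with static graphs, the neural network compiler, and CUDAGraph, but also applying CUDA Graph directly in dynamic graphs, which lowers development cost but has to handle more complex dynamic behavior.
The design architecture of FastDeploy's `GraphOptimizationBackend` is shown below. **Some features are still under development, so read the usage limitations in Section 1 carefully.**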
![](./images/GraphOptBackendArch.svg)
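## 1. Current Limitations of GraphOptimizationBackend
### 1.1 Multi-card setups require Custom all-reduce
Multi-card inference with CUDAGraph requires the Custom all-reduce operator for multi-card all-reduce.
Before version 2.2, neither CUDAGraph nor the Custom all-reduce operator is enabled by default. Custom all-reduce must be enabled manually by adding `--enable-custom-all-reduce` to the launch command.
### 1.2 Dynamic kernel launch configuration tied to FLAGS_max_partition_size breaks CUDAGraph
The `FLAGS_max_partition_size` environment variable controls the `gridDim` launch configuration of the kernels in CascadeAppend Attention, and a dynamic launch configuration causes CUDAGraph execution to fail.
[PR#3223](https://github.com/PaddlePaddle/FastDeploy/pull/3223) fixes this problem, but it is still present in releases before 2.2.
**How to check:**
+ Using the value of `FLAGS_max_partition_size` (default 32K) and the `max_model_len` launch argument, compute `div_up(max_model_len, max_partition_size)`. If the result is greater than `1`, execution fails. If it equals `1`, execution works normally.
**Solutions:**
1. Adjust `FLAGS_max_partition_size` and `max_model_len` so that the dynamic launch configuration is not triggered.
2. Disable CUDAGraph.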
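The self-check can be written out as a few lines of Python. This is a sketch under the assumption that the 32K default means 32768 tokens. `cudagraph_partition_ok` is a hypothetical helper, and `div_up` is ordinary ceiling division.

```python
def div_up(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def cudagraph_partition_ok(max_model_len: int, max_partition_size: int = 32 * 1024) -> bool:
    """True when div_up(max_model_len, max_partition_size) == 1, the safe case in the self-check."""
    return div_up(max_model_len, max_partition_size) == 1


# 32K fits in one partition, 128K needs four and would trigger the failure
for length in (32768, 131072):
    print(length, cudagraph_partition_ok(length))
```

With `FLAGS_max_partition_size=1024`, even a 32K context yields more than one partition, so that setting does not pass the check on affected releases.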
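## 2. GraphOptimizationBackend Configuration Parameters
Currently only the following parameters can be configured by users:
+ `use_cudagraph` : bool = False
+ `graph_optimization_config` : Dict[str, Any]
    + `graph_opt_level`: int = 0
    + `use_cudagraph`: bool = False
    + `cudagraph_capture_sizes` : List[int] = None
CudaGraph can be enabled with `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`.
The `graph_opt_level` parameter in `--graph-optimization-config` sets the graph optimization level. The options are:
+ `0`: dynamic graph (default)
+ `1`: static graph. During initialization the Paddle API is used to convert the dynamic graph to a static graph
+ `2`: on top of the static graph, use Paddle's compiler (CINN, Compiler Infrastructure for Neural Networks) for compilation optimization
Static graphs generally have lower kernel launch overhead than dynamic graphs, so static graphs are recommended.
For adapted models, FastDeploy's CudaGraph **supports both dynamic and static graphs**.
When CudaGraph is enabled with the default configuration, the list of batch sizes to capture is set automatically from the `max_num_seqs` parameter. The list is generated as follows:
1. Generate a candidate list of batch sizes in the range [1, 1024].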
```
# Batch Size [1, 2, 4, 8, 16, ... 120, 128]
candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
# Batch Size (128, 144, ... 240, 256]
candidate_capture_sizes += [16 * i for i in range(9, 17)]
# Batch Size (256, 288, ... 992, 1024]
candidate_capture_sizes += [32 * i for i in range(17, 33)]
```
2. Crop the candidate list according to the user-set `max_num_seqs` to obtain the CudaGraph capture list in the range [1, `max_num_seqs`].
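The two steps can be run end to end as below. The candidate list reuses the three comprehensions from the snippet exactly as written. The cropping rule (keep every candidate not larger than `max_num_seqs`) is an assumption, and both function names are hypothetical. Note that, as written, the third comprehension starts at 32 * 17 = 544, so batch sizes between 256 and 544 do not appear among the candidates.

```python
def candidate_capture_sizes() -> list:
    """Reproduce the candidate list from the snippet in step 1."""
    sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]   # 1, 2, 4, 8, 16, ..., 128
    sizes += [16 * i for i in range(9, 17)]             # 144, 160, ..., 256
    sizes += [32 * i for i in range(17, 33)]            # 544, 576, ..., 1024
    return sizes


def capture_sizes(max_num_seqs: int) -> list:
    """Keep only candidates that do not exceed max_num_seqs (assumed cropping rule)."""
    return [s for s in candidate_capture_sizes() if s <= max_num_seqs]


print(capture_sizes(32))
print(len(capture_sizes(128)), capture_sizes(128)[-1])
```

Fewer captured batch sizes mean fewer graphs held in GPU memory, which is why lowering `max_num_seqs` also lowers CudaGraph overhead.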
Users can also customize the list of batch sizes captured by CudaGraph through the `cudagraph_capture_sizes` parameter in `--graph-optimization-config`:
```
--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}'
```
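### 2.1 CudaGraph Parameters
Using CudaGraph adds some GPU memory overhead, which in FastDeploy falls into two categories:
+ Additional input buffer overhead
+ CudaGraph uses a dedicated memory pool, so it holds some intermediate activation memory that is isolated from the main framework
FastDeploy first uses the `gpu_memory_utilization` parameter to compute the GPU memory available to the `KVCache`, and only after the `KVCache` is initialized does it use the remaining memory to initialize CudaGraph. Because CudaGraph is not yet enabled by default, the default launch parameters may lead to an `Out Of Memory` error. Try one of the following three remedies:
+ Lower `gpu_memory_utilization` to reserve more GPU memory for CudaGraph.
+ Lower `max_num_seqs` to reduce the maximum concurrency.
+ Use `graph_optimization_config` to customize `cudagraph_capture_sizes`, the list of batch sizes captured by CudaGraph, and reduce the number of captured graphs.
Before using CudaGraph, make sure the loaded model is correctly decorated with ``@support_graph_optimization``.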
```python
# 1. Import the decorator
from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...

# 2. Add the decorator
@support_graph_optimization
class Ernie4_5_Model(nn.Layer):  # Note: the decorator goes on a subclass of nn.Layer
    ...

# 3. Change how arguments are passed to self.model() in the ModelForCasualLM subclass
class Ernie4_5_MoeForCausalLM(ModelForCasualLM):
    ...
    def forward(
        self,
        ids_remove_padding: paddle.Tensor,
        forward_meta: ForwardMeta,
    ):
        hidden_states = self.model(ids_remove_padding=ids_remove_padding,  # pass arguments by keyword
                                   forward_meta=forward_meta)
        return hidden_states

from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...

@support_graph_optimization
class Ernie45TModel(nn.Layer):  # Note: the decorator goes on a subclass of nn.Layer
    ...
```

File diff suppressed because one or more lines are too long

View File

@@ -56,6 +56,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--port 8801 \
--metrics-port 8802 \
--engine-worker-queue-port 8803 \
--model baidu/ERNIE-4.5-0.3B-Paddle \
--scheduler-name global \
--scheduler-ttl 900 \
--scheduler-host "127.0.0.1" \
@@ -63,7 +64,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-db 0 \
--scheduler-password "" \
--scheduler-topic "default" \
--scheduler-min-load_score 3 \
--scheduler-min-load-score 3 \
--scheduler-load-shards-num 1
```

View File

@@ -0,0 +1,73 @@
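# Multi-Node Deployment
## Overview
Multi-node deployment addresses the case where a single machine does not have enough GPU memory by supporting tensor-parallel execution across multiple machines.
## Environment Preparation
#### Network Requirements
1. All nodes must be on the same local network
2. Ensure bidirectional connectivity between all nodes (test with `ping` and `nc -zv`)
#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)
## Tensor Parallel Deployment
### Recommended Launch Method
We recommend launching with mpirun in a single step, which avoids starting each node manually.
### Usage
1. Run the same command on all machines
2. The order of IPs in the `ips` parameter determines the node startup order
3. The first IP is designated as the master node
4. Make sure all nodes can resolve each other's hostnames
* Online inference launch example: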
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```
* Offline launch example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM
model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
    output = llm.generate(prompts="你是谁?", use_tqdm=True, sampling_params=sampling_params)
    print(output)
```
* 注意:
- 只有主节点可以接收完成请求
- 请始终将请求发送到主节点ips列表中的第一个IP
- 主节点将在所有节点间分配工作负载
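### Parameter Reference
#### `ips` parameter
- **Type**: `string`
- **Format**: Comma-separated IPv4 addresses
- **Description**: Specifies the IP addresses of all nodes in the deployment group
- **Required**: Only for multi-node deployment
- **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`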
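It can help to check an `ips` value before launching. The sketch below uses only the Python standard library to validate the comma-separated IPv4 format described above. `parse_ips` is a hypothetical helper, not a FastDeploy API, and FastDeploy's own validation may differ.

```python
import ipaddress
from typing import List


def parse_ips(ips: str) -> List[str]:
    """Split a comma-separated ips string and verify each entry is a valid IPv4 address."""
    nodes = []
    for raw in ips.split(","):
        addr = raw.strip()
        # IPv4Address raises a ValueError subclass on malformed or empty input.
        ipaddress.IPv4Address(addr)
        nodes.append(addr)
    return nodes


print(parse_ips("192.168.1.101,192.168.1.102,192.168.1.103"))
# ['192.168.1.101', '192.168.1.102', '192.168.1.103']
```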
#### `tensor_parallel_size` parameter
- **Type**: `integer`
- **Description**: Total number of GPUs across all nodes
- **Required**: Yes
- **Example**: For 2 nodes with 8 GPUs each, set this to 16
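The arithmetic behind this value can be sketched as follows, assuming every node contributes the same number of GPUs. `total_tensor_parallel_size` is a hypothetical helper used only for illustration.

```python
def total_tensor_parallel_size(ips: str, gpus_per_node: int) -> int:
    """Total GPU count = number of nodes in ips * GPUs on each node."""
    node_count = len([ip for ip in ips.split(",") if ip.strip()])
    return node_count * gpus_per_node


print(total_tensor_parallel_size("192.168.1.101,192.168.1.102", 8))  # 16
```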

View File

@@ -0,0 +1,85 @@
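# FastDeploy Plugin Mechanism
FastDeploy supports a plugin mechanism that lets users extend functionality without modifying the core code. Plugins are discovered and loaded automatically through Python's `entry_points` mechanism.
## How Plugins Work
A plugin is essentially a registration function that is called automatically when FastDeploy starts. The system uses the `load_plugins_by_group` function to ensure that every process (including child processes in distributed training scenarios) has loaded the required plugins before it starts running.
## Plugin Discovery
FastDeploy uses Python's `entry_points` mechanism to discover and load plugins. Developers register their plugins under the designated entry point group in their own project.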
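To illustrate how entry point discovery works in general, here is a minimal sketch built on the standard library's `importlib.metadata`. It is not FastDeploy's `load_plugins_by_group` implementation; `load_group` is a hypothetical name, and the real loader may filter, log, or handle errors differently.

```python
from importlib.metadata import entry_points


def load_group(group: str) -> dict:
    """Load every entry point registered under `group` and call it once."""
    eps = entry_points()
    # Python 3.10+ returns an object with select(); older versions return a dict.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    loaded = {}
    for ep in selected:
        register = ep.load()  # import "module:function" and return the callable
        register()            # run the plugin's registration logic
        loaded[ep.name] = register
    return loaded


# Returns an empty dict when no package has registered plugins in this group.
print(load_group("fastdeploy.model_register_plugins"))
```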
### Example: Creating a Plugin
#### 1. Write the plugin logic
Suppose you have a custom model class `MyModelForCasualLM` and a pretrained class `MyPretrainedModel`. You can write a registration function like this:
```python
# File: fd_add_dummy_model/__init__.py
from fastdeploy.model_registry import ModelRegistry
from my_custom_model import MyModelForCasualLM, MyPretrainedModel
def register():
    # Skip registration if the architecture is already registered
    if "MyModelForCasualLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model_class(MyModelForCasualLM)
        ModelRegistry.register_pretrained_model(MyPretrainedModel)
```
#### 2. Register the plugin in `setup.py`
```python
# setup.py
from setuptools import setup
setup(
    name="fastdeploy-plugins",
    version="0.1",
    packages=["fd_add_dummy_model"],
    entry_points={
        # group name -> list of "plugin_name = module:function"
        "fastdeploy.model_register_plugins": [
            "fd_add_dummy_model = fd_add_dummy_model:register",
        ],
    },
)
```
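## Plugin Structure
A plugin consists of three parts:
| Component | Description |
|------|------|
| **Plugin group (Group)** | The functional group the plugin belongs to, for example:<br> - `fastdeploy.model_register_plugins`: for registering models<br> - `fastdeploy.model_runner_plugins`: for registering model runners<br> Users can define custom groups as needed. |
| **Plugin name (Name)** | The unique identifier of each plugin (e.g. `fd_add_dummy_model`). The `FD_PLUGINS` environment variable controls whether the plugin is loaded. |
| **Plugin value (Value)** | In the format `module_name:function_name`, pointing to the entry function that performs the registration. |
## Controlling Plugin Loading
By default, FastDeploy loads all registered plugins. To load only specific plugins, set the environment variable: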
```bash
export FD_PLUGINS=fastdeploy-plugins
```
Multiple plugin names can be separated by commas:
```bash
export FD_PLUGINS=plugin_a,plugin_b
```
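The filtering behaviour described above can be sketched as follows. This is an illustration only and not FastDeploy's actual loader: it assumes that an unset `FD_PLUGINS` means "load everything" and that names are compared exactly after trimming whitespace. `allowed_plugins` and `should_load` are hypothetical helpers.

```python
import os
from typing import Optional, Set


def allowed_plugins(env_value: Optional[str]) -> Optional[Set[str]]:
    """Parse an FD_PLUGINS-style value; None means no filter (load all plugins)."""
    if env_value is None:
        return None
    return {name.strip() for name in env_value.split(",") if name.strip()}


def should_load(plugin_name: str, env_value: Optional[str]) -> bool:
    """Return True if the plugin passes the (optional) allow-list."""
    allowed = allowed_plugins(env_value)
    return allowed is None or plugin_name in allowed


# Result depends on whether FD_PLUGINS is set in the current environment.
print(should_load("plugin_a", os.environ.get("FD_PLUGINS")))
print(should_load("plugin_c", "plugin_a,plugin_b"))  # False
```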
## Reference Example
See the example plugin implementation in the project directory:
```
./test/plugins/
```
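It contains a complete plugin structure and a sample `setup.py` configuration.
## Summary
With the plugin mechanism, users can add custom models or functional modules to FastDeploy without modifying the core source code. This improves the system's extensibility and makes it easier for third-party developers to extend functionality.
For further plugin development, refer to the `model_registry` and `plugin_loader` modules in the FastDeploy source code.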

View File

@@ -8,18 +8,18 @@
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | ernie-45-vl | ✓ |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | ernie-45-vl | ✓ |
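-Thinking models require a parser to be specified so that the reasoning content can be parsed. Thinking mode can be turned off with the`enable_thinking=False` parameter.
+Thinking models require a parser to be specified so that the reasoning content can be parsed. Thinking mode can be turned off with the `"enable_thinking": false` parameter.
Interfaces that support toggling thinking mode:
1. `/v1/chat/completions` requests to the OpenAI service.
2. `/v1/chat/completions` requests from the OpenAI Python client.
3. `llm.chat` requests in the offline interface.
-Thinking models also support limiting the length of the reasoning content through ```reasoning_max_tokens```: add ```metadata={"reasoning_max_tokens": 1024}``` to the request.
+Thinking models also support limiting the length of the reasoning content through `reasoning_max_tokens`: add `"reasoning_max_tokens": 1024` to the request.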
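As a rough illustration of how these two fields fit into a request body, the sketch below assembles a payload in Python. `build_chat_payload` is a hypothetical helper; the field placement (`enable_thinking` under `chat_template_kwargs`, `reasoning_max_tokens` at the top level) follows the updated request example in this document.

```python
import json


def build_chat_payload(messages, enable_thinking=True, reasoning_max_tokens=None):
    """Assemble a /v1/chat/completions request body with the thinking controls."""
    payload = {
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    # Only include the limit when the caller asks for one.
    if reasoning_max_tokens is not None:
        payload["reasoning_max_tokens"] = reasoning_max_tokens
    return payload


body = build_chat_payload(
    [{"role": "user", "content": "Hello"}],
    enable_thinking=True,
    reasoning_max_tokens=1024,
)
print(json.dumps(body, indent=2))
```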
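## Quick Start
-When starting the model service, specify the parser name with the`--reasoning-parser`parameter.
-The parser parses the output of the thinking model and extracts the`reasoning_content`field.
+When starting the model service, specify the parser name with the `--reasoning-parser` parameter.
+The parser parses the output of the thinking model and extracts the `reasoning_content` field.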
```bash
python -m fastdeploy.entrypoints.openai.api_server \
@@ -43,15 +43,16 @@ curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
{"type": "text", "text": "Which era does the cultural relic in the picture belong to?"}
]}
],
-"metadata": {"enable_thinking": true}
+"chat_template_kwargs":{"enable_thinking": true},
+"reasoning_max_tokens": 1024
}'
```
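-The`reasoning_content`field contains the reasoning steps that lead to the final conclusion, while the`content`field contains the final conclusion.
+The `reasoning_content` field contains the reasoning steps that lead to the final conclusion, while the `content` field contains the final conclusion.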
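A minimal sketch of separating the two fields in a non-streaming response is shown below. It assumes the response follows the OpenAI chat-completion layout, with both fields under `choices[0].message`. The `sample` dictionary is hand-written for illustration and is not real model output, and `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(response: dict):
    """Return (reasoning_content, content) from a chat-completion style response."""
    message = response["choices"][0]["message"]
    # reasoning_content may be absent when thinking mode is disabled.
    return message.get("reasoning_content"), message.get("content")


sample = {
    "choices": [
        {"message": {"reasoning_content": "step-by-step thoughts", "content": "final answer"}}
    ]
}
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```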
### Streaming Sessions
-In a streaming session, the`reasoning_content`field can be obtained from the `delta` in`chat completion response chunks`.
+In a streaming session, the `reasoning_content` field can be obtained from the `delta` in `chat completion response chunks`.
```python
from openai import OpenAI
@@ -69,7 +70,10 @@ chat_response = client.chat.completions.create(
],
model="vl",
stream=True,
-    metadata={"enable_thinking": True}
+    extra_body={
+        "chat_template_kwargs":{"enable_thinking": True},
+        "reasoning_max_tokens": 1024
+    }
)
for chunk in chat_response:
if chunk.choices[0].delta is not None:

Some files were not shown because too many files have changed in this diff