mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2025-10-05 08:37:06 +08:00
update hybrid-mtp-with-ngram (#3924)
@@ -18,6 +18,13 @@ This project implements an efficient **Speculative Decoding** inference framewor
- ⏳ Coming Soon: Support Chunk-prefill
- ⏳ Coming Soon: Multi-layer MTP Layer

- **Decoding with Hybrid MTP and Ngram Methods (Hybrid-MTP-with-Ngram)**
  - Overview: A hybrid method combining MTP and Ngram: MTP first generates N draft tokens, then Ngram matching is used to supplement additional draft tokens.
  - Use Cases: Suitable when higher draft-token coverage is required, leveraging both MTP's generation capability and the efficiency of Ngram matching.

---

### Coming Soon
@@ -132,7 +139,13 @@ python -m fastdeploy.entrypoints.openai.api_server \
    --scheduler-password "scheduler_mtp" \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```
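The value passed to `--speculative-config` must be valid JSON. As a generic sanity check before launching (not part of FastDeploy; the model path below is a placeholder), the string can simply be parsed:

```python
import json

# Placeholder path; substitute your actual MTP model directory.
cfg_str = '{"method": "mtp", "num_speculative_tokens": 1, "model": "/path/to/mtp_model"}'

cfg = json.loads(cfg_str)  # raises json.JSONDecodeError if the quoting is broken
print(cfg["method"], cfg["num_speculative_tokens"])
```

A stray or doubled quote inside the single-quoted shell string is a common way for this argument to become malformed.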
## Decoding with Hybrid MTP and Ngram Methods

When starting the service, you only need to modify the `--speculative-config` option. For example, use MTP to generate two draft tokens, then append three additional draft tokens from Ngram matching:

```
--speculative-config '{"method": "mtp", "num_model_steps": 2, "mtp_strategy": "with_ngram", "num_speculative_tokens": 5, "model": "'$model_path'/mtp"}'
```
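The split described above (two MTP draft tokens plus up to three from Ngram matching, five in total) can be sketched as follows. This is a minimal illustration, not FastDeploy's implementation; `mtp_propose` and `ngram_match` are hypothetical callables standing in for the two draft sources.

```python
def hybrid_drafts(context, mtp_propose, ngram_match, num_speculative_tokens):
    """Combine two draft sources: MTP proposes first, then Ngram
    matching fills any remaining draft-token slots."""
    drafts = list(mtp_propose(context))[:num_speculative_tokens]
    remaining = num_speculative_tokens - len(drafts)
    if remaining > 0:
        # Match against the prompt plus everything drafted so far.
        drafts += list(ngram_match(context + drafts, remaining))[:remaining]
    return drafts
```

With `num_model_steps: 2` and `num_speculative_tokens: 5`, MTP contributes two tokens and Ngram matching supplies at most three more.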
## 🧠 Using Ngram-Based Decoding

This method uses an n-gram sliding window to match the prompt and generated tokens to predict draft tokens. It is particularly effective in scenarios with high input-output overlap (e.g., code completion, document search).
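The sliding-window matching can be sketched roughly as below. This is a simplified illustration under an assumed longest-suffix-match strategy, not the library's actual code:

```python
def ngram_match(tokens, max_draft, max_n=3):
    """Find the longest suffix of `tokens` (up to `max_n` tokens) that
    also occurs earlier in the sequence, and propose the tokens that
    followed that earlier occurrence as draft tokens."""
    for n in range(min(max_n, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right-to-left so the most recent occurrence wins.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                follow = tokens[i + n:i + n + max_draft]
                if follow:
                    return follow
    return []
```

In code completion or document search, long runs of the output repeat the prompt, so the suffix frequently finds an earlier match and several draft tokens are proposed at once.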
@@ -14,6 +14,9 @@
- ⏳ Coming Soon: Chunk Prefill compatibility
- ⏳ Coming Soon: Multi-layer MTP layer

- **Decoding with Hybrid MTP and Ngram Methods (Hybrid-MTP-with-Ngram)**
  - Overview: A hybrid of MTP and Ngram: MTP first produces N draft tokens, then Ngram matching supplements additional draft tokens.
  - Use Cases: Suitable when more draft tokens are needed, combining MTP's generation capability with the efficiency of Ngram matching.

---

### ⏳ Coming Soon
@@ -110,7 +113,12 @@ python -m fastdeploy.entrypoints.openai.api_server \
    --scheduler-password "scheduler_mtp" \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```
## Decoding with Hybrid MTP and Ngram Methods

When starting the service, you only need to change the `--speculative-config` option. For example, use MTP to produce two draft tokens, then append three additional draft tokens from Ngram matching:

```
--speculative-config '{"method": "mtp", "num_model_steps": 2, "mtp_strategy": "with_ngram", "num_speculative_tokens": 5, "model": "'$model_path'/mtp"}'
```
## 🧠 Decoding with Ngram

This algorithm matches against the prompt and previously generated tokens through an n-gram window to produce draft tokens. It is well suited to scenarios with large overlap between input and output, such as code completion and document search.

> Using 4×H100; quantization: WINT4