# 🔮 Speculative Decoding
This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Prediction (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.
---
## ✅ Supported Speculative Decoding Methods
### Supported
- **Ngram**
- **MTP (Multi-Token Prediction)**
- ✅ Supported: TP Sharding
- ✅ Supported: Shared Prefix
- ✅ Supported: TP Sharding + PD Separation
- ⏳ Coming Soon: EP + DP + PD Separation
- ⏳ Coming Soon: Chunk-prefill support
- ⏳ Coming Soon: Multi-layer MTP
---
### Coming Soon
- Draft Model
- Eagle
- Hydra
- Medusa
- ...
---
## ⚙️ Efficient Speculative Decoding Architecture
- **Attention Mechanism**: We employ [Cascade Append Attention](https://flashinfer.ai/2024/02/02/cascade-inference.html), which allows unified processing of queries with varying token lengths, enabling efficient verification. All tokens can be verified in a single forward pass. We deeply customized the underlying kernels to fully leverage Tensor Cores and maintain high throughput even under heavy concurrency.
- **Virtual Padding Mechanism**: A virtual padding strategy is used to locate each output token's batch ID, eliminating the overhead of data copying and slicing operations.
- **Parallel Sampling and Verification**: We developed multiple fused CUDA kernels for concurrent sampling and verification. These kernels allow parallel processing for each sample in a batch, avoiding explicit loop execution on the host side.
- **Efficient Draft Model/MTP Framework**: Multiple fused CUDA kernels are used to handle pre- and post-processing within the model class, replacing traditional loop-based and slicing-based methods with a more performant and maintainable structure.
---
## 🔧 Configuration Parameters
- `method`: The speculative decoding strategy, currently supports `["mtp", "ngram"]`.
- `num_speculative_tokens`: Number of draft tokens to propose per step; the maximum is 5, and MTP currently supports only 1.
- `model`: Path to the MTP draft model when using the `"mtp"` method.
- `quantization`: Quantization method of the MTP model (e.g., WINT4).
- Maximum supported `batch_size`: 256
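These parameters are combined into a single JSON string and passed via `--speculative-config`. A minimal launch sketch showing how the fields fit together (the paths `/models/eb45t` and `/models/eb45t-mtp` are placeholders, not real checkpoints):
```bash
# Sketch only: substitute the placeholder paths with your main model and MTP draft model.
python -m fastdeploy.entrypoints.openai.api_server \
    --model /models/eb45t \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "/models/eb45t-mtp", "quantization": "WINT4"}'
```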
---
## 🚀 Using Multi-Token Prediction (MTP)
For detailed theory, refer to:
📄 [DeepSeek-V3 Paper](https://arxiv.org/pdf/2412.19437)
### TP Sharding Mode
Launch the service on 4 × H100 GPUs using WINT4 quantization (Dense layers: WINT8, MoE layers: WINT4):
> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml`
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
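The `openai.api_server` entrypoint exposes an OpenAI-compatible HTTP API, so the deployment can be smoke-tested with a plain `curl` request. A sketch, assuming the server listens on the default port 8000 (adjust the host/port to match any `--port` you pass):
```bash
# Assumption: the service listens on 127.0.0.1:8000 and serves the standard /v1/chat/completions route.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain speculative decoding in one sentence."}],
        "stream": false
      }'
```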
### PD-Separated Deployment (1P1D Mode)
Deploy a 1P1D setup on H100: one Prefill (P) node and one Decode (D) node, each with TP4 + WINT4 quantization.
Compared with a standard PD-separated deployment, this only requires switching the config files and adding `--speculative-config`.
For details, refer to the [PD Separation](./disaggregated.md) documentation.
- P Node (Prefill)
> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml`
```bash
export FD_LOG_DIR="log_prefill"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```
- D Node (Decode)
> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml`
```bash
export FD_LOG_DIR="log_decode"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8190 \
--metrics-port 8191 \
--engine-worker-queue-port 8192 \
--cache-queue-port 8193 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "decode" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```
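To check that both nodes came up, the metrics ports configured above (8181 for the P node, 8191 for the D node) can be polled; a sketch that assumes the usual Prometheus-style `/metrics` route is served there:
```bash
# Assumption: --metrics-port exposes Prometheus-format metrics under /metrics.
curl -s http://127.0.0.1:8181/metrics | head   # Prefill node
curl -s http://127.0.0.1:8191/metrics | head   # Decode node
```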
## 🧠 Using Ngram-Based Decoding
This method uses an n-gram sliding window to match the prompt and generated tokens to predict draft tokens. It is particularly effective in scenarios with high input-output overlap (e.g., code completion, document search).
Run on 4 × H100 GPUs with WINT4 quantization:
> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml`
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```