From 927bd7407559eed6093dfb43127c112297802010 Mon Sep 17 00:00:00 2001
From: chen <103103266+ckl117@users.noreply.github.com>
Date: Mon, 10 Nov 2025 21:21:33 +0800
Subject: [PATCH] [Docs] add doc for glm (#4933)
* add doc for glm
* del v1 loader
* delete mtp
---
docs/best_practices/GLM-4-MoE-Text.md | 98 ++++++++++++++++++++++
docs/index.md | 1 +
docs/supported_models.md | 1 +
docs/zh/best_practices/GLM-4-MoE-Text.md | 100 +++++++++++++++++++++++
docs/zh/supported_models.md | 1 +
5 files changed, 201 insertions(+)
create mode 100644 docs/best_practices/GLM-4-MoE-Text.md
create mode 100644 docs/zh/best_practices/GLM-4-MoE-Text.md
diff --git a/docs/best_practices/GLM-4-MoE-Text.md b/docs/best_practices/GLM-4-MoE-Text.md
new file mode 100644
index 000000000..d569914ba
--- /dev/null
+++ b/docs/best_practices/GLM-4-MoE-Text.md
@@ -0,0 +1,98 @@
+[简体中文](../zh/best_practices/GLM-4-MoE-Text.md)
+
+# GLM-4.5/4.6 Text Model
+## 1. Environment Preparation
+### 1.1 Hardware requirements
+The minimum number of GPUs required to deploy `GLM-4.5/4.6` at each quantization precision on the hardware below is as follows:
+
+| | WINT8 | WINT4 | FP8 |
+|-----|-----|-----|-----|
+|H800 80GB| 4 | 4 | 4 |
+|A800 80GB| 4 | 4 | / |
+
+**Tips:**
+1. To change the number of deployment GPUs, specify `--tensor-parallel-size 4` in the startup command.
+2. For hardware not listed in the table, estimate whether deployment is feasible based on the available GPU memory.
+3. FP8 quantization is recommended.
+
+### 1.2 Install FastDeploy and prepare the model
+- Installation: for details, please refer to [FastDeploy Installation](../get_started/installation/README.md).
+
+- Model download: for details, please refer to [Supported Models](../supported_models.md).
+
+## 2. How to Use
+### 2.1 Basic: Launching the Service
+Example 1: H100 4-GPU BF16 Deployment with 16K Context
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model zai-org/GLM-4.5-Air \
+ --tensor-parallel-size 4 \
+ --port 8185 \
+ --max-model-len 16384
+```
+
+Example 2: H100 4-GPU FP8 Inference Deployment
+```bash
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model zai-org/GLM-4.5-Air \
+ --tensor-parallel-size 4 \
+ --port 8185 \
+ --quantization wfp8afp8
+```
+- `--quantization`: the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. It can be one of `wint8` / `wint4` / `wfp8afp8` (requires Hopper architecture).
+- `--max-model-len`: the maximum number of tokens supported by the deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
+
+For more parameter meanings and default settings, see the [FastDeploy Parameter Documentation](../parameters.md).
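
Once the service is up, it exposes an OpenAI-compatible HTTP endpoint. A minimal request might look like the following sketch (port 8185 is taken from the examples above; the payload shape assumes the standard chat completions protocol, and the prompt is only an illustration):

```shell
# Hypothetical request to a service started as in the examples above.
# The port and the prompt are assumptions, not fixed values.
payload='{"messages": [{"role": "user", "content": "Briefly introduce the GLM model family."}]}'
curl -s -X POST "http://0.0.0.0:8185/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$payload"
```

Assuming the service implements the standard streaming protocol, a `"stream": true` field can be added to the payload for streaming responses.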
+
+### 2.2 Advanced: How to get better performance
+#### 2.2.1 Correctly set parameters that match the application scenario
+Evaluate the average input length, average output length, and maximum context length of your application scenario.
+- Set `max-model-len` according to the maximum context length. For example, with an average input length of 1000 and an average output length of 30000, a value of 32768 is recommended.
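
The sizing rule above can be sketched as a small calculation. One choice consistent with the example is rounding the sum of average input and output lengths up to the next power of two (the numbers below are the example's assumptions, not fixed recommendations):

```shell
# Sizing sketch: pick max-model-len as the smallest power of two at or above
# (average input + average output), matching the 1000 + 30000 -> 32768 example.
avg_input=1000
avg_output=30000
need=$((avg_input + avg_output))
max_model_len=1024
while [ "$max_model_len" -lt "$need" ]; do
  max_model_len=$((max_model_len * 2))
done
echo "--max-model-len ${max_model_len}"   # 31000 rounds up to 32768
```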
+
+#### 2.2.2 Prefix Caching
+**Idea:** The core idea of Prefix Caching is to cache the intermediate computation results (KV Cache) of the input sequence to avoid repeated computation, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md).
+
+**How to enable:**
+Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. Its value is in GB and should be adjusted to the actual machine; a recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
+```
+--enable-prefix-caching
+--swap-space 50
+```
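
The recommended `--swap-space` value can be estimated with a quick calculation. The memory and model sizes below are illustrative assumptions, not measurements:

```shell
# Sketch of the swap-space rule: (total machine memory - model size) * 20%.
# 512 GB of RAM and a 110 GB model are assumed example values.
total_mem_gb=512
model_size_gb=110
swap_gb=$(( (total_mem_gb - model_size_gb) * 20 / 100 ))
echo "--swap-space ${swap_gb}"   # (512 - 110) * 20% = 80 GB
```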
+
+#### 2.2.3 Chunked Prefill
+**Idea:** Chunked Prefill splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of a single Prefill, thereby lowering peak memory usage and avoiding out-of-memory problems. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).
+
+**How to enable:**
+Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding the following startup parameter:
+```
+--enable-chunked-prefill
+```
+
+#### 2.2.4 CUDAGraph
+**Idea:**
+CUDAGraph is a GPU computing acceleration technology provided by NVIDIA. It captures a sequence of CUDA operations into a graph structure so that GPU tasks can be executed and optimized efficiently. The core idea is to encapsulate a series of GPU computation and memory operations into a re-executable graph, thereby reducing CPU-GPU communication overhead, lowering kernel launch latency, and improving overall computing performance.
+
+**How to enable:**
+Before version 2.3, it needs to be enabled with `--use-cudagraph`.
+Since version 2.3, CUDAGraph has been enabled by default in some scenarios, and it is automatically disabled for features that are not yet compatible with it (speculative decoding, RL training, multimodal models).
+Notes:
+- Usually no additional parameters need to be set. However, CUDAGraph incurs some extra memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter tuning, please refer to the [GraphOptimizationBackend](../features/graph_optimization.md) configuration parameter descriptions.
+
+#### 2.2.5 Rejection Sampling
+**Idea:**
+Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby increasing sampling speed. It brings a noticeable improvement for small models.
+
+**How to enable:**
+Add the following environment variable before starting the service:
+```
+export FD_SAMPLING_CLASS=rejection
+```
+
+## FAQ
+If you encounter any problems during use, you can refer to [FAQ](./FAQ.md).
diff --git a/docs/index.md b/docs/index.md
index d8c724f51..5ca64e32c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -29,6 +29,7 @@
|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
|DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
|DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|GLM-4.5/4.6|BF16/FP8/WINT4|⛔|✅|✅|🚧|✅|128K|
```
✅ Supported 🚧 In Progress ⛔ No Plan
diff --git a/docs/supported_models.md b/docs/supported_models.md
index e2b6eb487..c9e8fc4ea 100644
--- a/docs/supported_models.md
+++ b/docs/supported_models.md
@@ -42,6 +42,7 @@ These models accept text input.
|⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen/qwen2-72B;<br>Qwen/Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
|⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
|⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
+|⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6;<br>[Best Practices](./best_practices/GLM-4-MoE-Text.md), etc.|
## Multimodal Language Models
diff --git a/docs/zh/best_practices/GLM-4-MoE-Text.md b/docs/zh/best_practices/GLM-4-MoE-Text.md
new file mode 100644
index 000000000..410e72e54
--- /dev/null
+++ b/docs/zh/best_practices/GLM-4-MoE-Text.md
@@ -0,0 +1,100 @@
+[English](../../best_practices/GLM-4-MoE-Text.md)
+
+# GLM-4.5/4.6 文本模型
+
+## 一、环境准备
+### 1.1 支持情况
+GLM-4.5/4.6 各量化精度,在下列硬件上部署所需要的最小卡数如下:
+
+| | WINT8 | WINT4 | FP8 |
+|-----|-----|-----|-----|
+|H800 80GB| 4 | 4 | 4 |
+|A800 80GB| 4 | 4 | / |
+
+**注:**
+1. 在启动命令后指定`--tensor-parallel-size 4` 即可修改部署卡数
+2. 表格中未列出的硬件,可根据显存大小进行预估是否可以部署
+3. 量化精度推荐FP8。
+
+### 1.2 安装 FastDeploy
+
+安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)
+
+## 二、如何使用
+### 2.1 基础:启动服务
+ **示例1:** H100上四卡部署BF16模型16K上下文的服务
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model zai-org/GLM-4.5-Air \
+ --tensor-parallel-size 4 \
+ --port 8185 \
+ --max-model-len 16384
+```
+
+ **示例2:** H100上四卡部署FP8推理服务
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model zai-org/GLM-4.5-Air \
+ --tensor-parallel-size 4 \
+ --port 8185 \
+ --quantization wfp8afp8
+```
+其中:
+- `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `wfp8afp8`(需要Hopper架构)。
+- `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。
+
+更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。
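
服务启动后会暴露 OpenAI 兼容的 HTTP 接口。下面是一个假设性的请求示例(端口 8185 取自上文示例;请求体格式假定遵循标准 chat completions 协议,提示词仅为示意):

```shell
# 假设性请求示例:端口与提示词均为示例假设,并非固定值。
payload='{"messages": [{"role": "user", "content": "请简单介绍一下GLM模型。"}]}'
curl -s -X POST "http://0.0.0.0:8185/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$payload"
```

如服务实现了标准的流式协议,可在请求体中增加 `"stream": true` 以获得流式输出。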
+
+### 2.2 进阶:如何获取更优性能
+#### 2.2.1 评估应用场景,正确设置参数
+结合应用场景,评估平均输入长度、平均输出长度、最大上下文长度。
+- 根据最大上下文长度,设置`max-model-len`。例如,平均输入长度为1000,输出长度为30000,建议设置为 32768。
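
上面的估算方法可以用一个简单的计算示意:取不小于「平均输入 + 平均输出」的最小 2 的幂,与示例中 1000 + 30000 → 32768 一致(其中的数字仅为示例假设):

```shell
# 估算示意:取不小于(平均输入 + 平均输出)的最小 2 的幂作为 max-model-len。
avg_input=1000
avg_output=30000
need=$((avg_input + avg_output))
max_model_len=1024
while [ "$max_model_len" -lt "$need" ]; do
  max_model_len=$((max_model_len * 2))
done
echo "--max-model-len ${max_model_len}"   # 31000 向上取整为 32768
```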
+
+#### 2.2.2 Prefix Caching
+**原理:** Prefix Caching的核心思想是通过缓存输入序列的中间计算结果(KV Cache),避免重复计算,从而加速具有相同前缀的多个请求的响应速度。具体参考[prefix-cache](../features/prefix_caching.md)
+
+**启用方式:**
+自2.2版本开始(包括develop分支),Prefix Caching已经默认开启。
+
+对于2.1及更早的版本,需要手动开启,即在启动参数中增加以下两行。其中`--enable-prefix-caching`表示启用前缀缓存;`--swap-space`表示在GPU缓存的基础上额外开启CPU缓存,单位为GB,应根据机器实际情况调整,建议取值为`(机器总内存 - 模型大小) * 20%`。如果因为其他程序占用内存等原因导致服务启动失败,可以尝试减小`--swap-space`的值。
+```
+--enable-prefix-caching
+--swap-space 50
+```
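
`--swap-space` 的建议取值可以用如下计算示意(机器内存和模型大小均为示例假设,并非实测值):

```shell
# swap-space 估算示意:(机器总内存 - 模型大小) * 20%。
# 512 GB 内存和 110 GB 模型均为示例假设。
total_mem_gb=512
model_size_gb=110
swap_gb=$(( (total_mem_gb - model_size_gb) * 20 / 100 ))
echo "--swap-space ${swap_gb}"   # (512 - 110) * 20% = 80 GB
```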
+
+#### 2.2.3 Chunked Prefill
+**原理:** 采用分块策略,将预填充(Prefill)阶段请求拆解为小规模子任务,与解码(Decode)请求混合批处理执行。可以更好地平衡计算密集型(Prefill)和访存密集型(Decode)操作,优化GPU资源利用率,减少单次Prefill的计算量和显存占用,从而降低显存峰值,避免显存不足的问题。 具体请参考[Chunked Prefill](../features/chunked_prefill.md)
+
+**启用方式:**
+自2.2版本开始(包括develop分支),Chunked Prefill已经默认开启。
+
+对于2.1及更早的版本,需要手动开启。
+```
+--enable-chunked-prefill
+```
+
+#### 2.2.4 CUDAGraph
+**原理:**
+CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操作序列捕获(capture)为图结构(graph),实现 GPU 任务的高效执行和优化。CUDAGraph 的核心思想是将一系列 GPU 计算和内存操作封装为一个可重复执行的图,从而减少 CPU-GPU 通信开销、降低内核启动延迟,并提升整体计算性能。
+
+**启用方式:**
+在2.3版本之前需要通过`--use-cudagraph`启用。
+
+2.3版本开始部分场景已默认开启 CUDAGraph,对于暂时不能兼容 CUDAGraph 的功能(投机解码、强化学习训练、多模态模型推理),CUDAGraph 会自动关闭。
+注:
+- 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明
+
+#### 2.2.5 拒绝采样
+**原理:**
+拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,对小尺寸的模型有较明显的提升。
+
+**启用方式:**
+启动前增加下列环境变量
+```
+export FD_SAMPLING_CLASS=rejection
+```
+
+## 三、常见问题FAQ
+如果您在使用过程中遇到问题,可以在[FAQ](./FAQ.md)中查阅。
diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md
index dc42414e1..bb77750ca 100644
--- a/docs/zh/supported_models.md
+++ b/docs/zh/supported_models.md
@@ -40,6 +40,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
|⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen/qwen2-72B;<br>Qwen/Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
|⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
|⭐GPT-OSS|BF16/WINT8|unsloth/gpt-oss-20b-BF16, etc.|
+|⭐GLM-4.5/4.6|BF16/wfp8afp8|zai-org/GLM-4.5-Air;<br>zai-org/GLM-4.6;<br>[最佳实践](./best_practices/GLM-4-MoE-Text.md), etc.|
## 多模态语言模型列表