From 36971105991018eddb327773d4f0cc22a38b8964 Mon Sep 17 00:00:00 2001
From: SunLei
Date: Thu, 4 Dec 2025 15:57:04 +0800
Subject: [PATCH] [Docs] update FAQ with logprobs MQ limits and deprecation
 (#5368)

* [doc] update FAQ with logprobs MQ limits and deprecation

* [doc] update FAQ with logprobs MQ limits and deprecation

* update faq
---
 docs/best_practices/FAQ.md    | 97 +++++++++++++++++++++++++++++++++++
 docs/zh/best_practices/FAQ.md | 92 +++++++++++++++++++++++++++++++--
 2 files changed, 185 insertions(+), 4 deletions(-)

diff --git a/docs/best_practices/FAQ.md b/docs/best_practices/FAQ.md
index 851bfbd68..4e6aa45ed 100644
--- a/docs/best_practices/FAQ.md
+++ b/docs/best_practices/FAQ.md
@@ -37,3 +37,100 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
```
2. Check whether the KVCache blocks allocated by the automatic profile are as expected. If the automatic profile is affected by GPU memory fluctuations and allocates fewer blocks than expected, you can manually set the `num_gpu_blocks_override` parameter to expand the KVCache blocks.

## 3. How much concurrency can the service support?

1. It is recommended to configure the following environment variable when deploying the service:

   ```
   export ENABLE_V1_KVCACHE_SCHEDULER=1
   ```

2. When starting the service, you need to configure `max-num-seqs`.
   This parameter specifies the maximum batch size during the Decode phase.
   If the concurrency exceeds this value, the extra requests will be queued.
   Under normal circumstances you can set `max-num-seqs` to a relatively high value such as **128**; the actual concurrency is then determined by the load-testing client.

3. `max-num-seqs` represents only the upper limit you configure.
   The **actual** concurrency the service can handle depends on the size of the **KVCache**.
   After the service starts, check `log/worker_process.log` and look for logs similar to:

   ```
   num_blocks_global: 17131
   ```

   This indicates that the current service has **17131 KVCache blocks**.
   With `block_size = 64` (default), the total number of tokens that can be cached is:

   ```
   17131 * 64 = 1,096,384 tokens
   ```

   If the average total number of tokens per request (input + output) is **20K (20,480 tokens)**, then the service can actually support approximately:

   ```
   1,096,384 / 20,480 ≈ 53 concurrent requests
   ```

## 4. Inference Request Stalls After Enabling logprobs

When **logprobs** is enabled, the inference output includes the log-probability of each token, which **significantly increases the size of each message body**. Under default settings, this may exceed the limits of the **System V Message Queue**, causing the inference request to **stall**.

The increase in message size differs between MTP and non-MTP modes. The calculations are shown below.

### Message Size Calculation

1. **Non-MTP + logprobs enabled**
   Size of a single message:

   ```
   ((512 * (20 + 1)) + 2) * 8
   + 512 * (20 + 1) * 4
   + 512 * 8
   = 133136 bytes
   ```

2. **MTP + logprobs enabled**
   Size of a single message:

   ```
   (512 * 6 * (20 + 1) + 512 + 3) * 8
   + 512 * 6 * (20 + 1) * 4
   + 512 * 6 * 8
   = 802840 bytes
   ```
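Both formulas can be sanity-checked with a few lines of Python. The sketch below simply re-evaluates them and compares the results against the default 8192-byte limit on a single System V message; the constants are copied from the formulas above rather than taken from the source code, so treat it as an illustration only.

```
# Re-evaluate the message-size formulas above and compare them with the
# default System V limit on a single message (kernel.msgmax, usually 8192).
DEFAULT_MSGMAX = 8192

# Non-MTP + logprobs enabled
non_mtp_bytes = ((512 * (20 + 1)) + 2) * 8 + 512 * (20 + 1) * 4 + 512 * 8

# MTP + logprobs enabled
mtp_bytes = (512 * 6 * (20 + 1) + 512 + 3) * 8 + 512 * 6 * (20 + 1) * 4 + 512 * 6 * 8

for mode, size in (("non-MTP", non_mtp_bytes), ("MTP", mtp_bytes)):
    print(f"{mode}: {size} bytes, exceeds default msgmax: {size > DEFAULT_MSGMAX}")
# non-MTP: 133136 bytes, exceeds default msgmax: True
# MTP: 802840 bytes, exceeds default msgmax: True
```

Both values are far above 8192 bytes, which is exactly the limit discussed in the next section.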
### Root Cause

Running `ipcs -l` typically shows the default System V message queue limits:

```
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
```

If a single message **exceeds the `max size of message` limit (usually 8192 bytes)**, inter-process communication becomes blocked, causing the inference task to stall.

### Solution

**Increase the System V message queue size limits.**

Since message sizes can approach 800 KB in MTP mode, it is recommended to increase the **maximum message size to at least 1 MB (1048576 bytes)**.

Use the following commands on Linux:

```
# Increase maximum size of a single message
sysctl -w kernel.msgmax=1048576

# Increase maximum capacity of a message queue
sysctl -w kernel.msgmnb=268435456
```

> **Note:** If running inside a Docker container, privileged mode (`--privileged`) is required, or you must explicitly set these kernel parameters via container startup options.

### Deprecation Notice

This System V message queue–based communication mechanism will be **deprecated in future releases**. Subsequent versions will migrate to a more robust communication method that eliminates the limitations described above.

diff --git a/docs/zh/best_practices/FAQ.md b/docs/zh/best_practices/FAQ.md
index 378015c3f..b6423265c 100644
--- a/docs/zh/best_practices/FAQ.md
+++ b/docs/zh/best_practices/FAQ.md
@@ -39,11 +39,95 @@ export ENABLE_V1_KVCACHE_SCHEDULER=1
2. 检查自动profile分配的KVCache block是否符合预期,如果自动profile中受到显存波动影响可能导致分配偏少,可以通过手工设置`num_gpu_blocks_override`参数扩大KVCache block。

## 3.服务可以支持多大并发?

1. 服务部署时推荐配置以下环境变量

   ```
   export ENABLE_V1_KVCACHE_SCHEDULER=1
   ```

2. 服务启动时需要配置 `max-num-seqs`
   该参数表示 Decode 阶段的**最大 Batch 数**,当并发超过该值时,多余的请求会进入排队等待处理。
   一般情况下,你可以将 `max-num-seqs` 配置为 **128**,保持在较高范围;实际并发能力由压测客户端决定。

3. `max-num-seqs` 仅表示**配置的上限**,但服务真正能支持的并发量取决于 **KVCache 的总大小**。
   服务启动后,在 `log/worker_process.log` 中会看到类似:

   ```
   num_blocks_global: 17131
   ```

   这表示当前服务的 KVCache Block 数量为 **17131**,若 `block_size = 64`(默认),则可缓存 Token 总量为:

   ```
   17131 * 64 = 1,096,384 tokens
   ```

   如果你的请求平均(输入 + 输出)为 **20K(约 20,480)tokens**,那么服务实际能支持的并发大约为:

   ```
   1,096,384 / 20,480 ≈ 53
   ```
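按照上面的思路,也可以用一小段脚本粗略估算,示意如下(其中 `num_blocks_global`、`block_size`、平均 token 数均为示例取值,请根据实际日志与业务流量替换):

```
# 根据 KVCache block 数量粗略估算可支持的并发(示例取值,仅供参考)
num_blocks_global = 17131           # 取自 log/worker_process.log
block_size = 64                     # 默认 block_size
avg_tokens_per_request = 20 * 1024  # 平均每请求(输入 + 输出)约 20K tokens

total_cached_tokens = num_blocks_global * block_size          # 1,096,384
estimated_concurrency = total_cached_tokens // avg_tokens_per_request

print(total_cached_tokens, estimated_concurrency)  # 1096384 53
```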
## 4. 启用 logprobs 后推理请求卡住

启用 **logprobs** 后,推理结果会附带每个 token 的 logprobs 信息,使**单条消息体显著变大**。在默认配置下,这可能超出 **System V Message Queue** 的消息大小限制,从而导致推理任务的 token 输出**卡住**。

不同模式下(MTP / 非 MTP),logprobs 导致的消息体膨胀规模不同,具体计算如下。

### 消息体大小计算

1. **非 MTP 模式 + logprobs**
   单条消息体大小:

   ```
   ((512 * (20 + 1)) + 2) * 8
   + 512 * (20 + 1) * 4
   + 512 * 8
   = 133136 bytes
   ```

2. **MTP 模式 + logprobs**
   单条消息体大小:

   ```
   (512 * 6 * (20 + 1) + 512 + 3) * 8
   + 512 * 6 * (20 + 1) * 4
   + 512 * 6 * 8
   = 802840 bytes
   ```

### 问题原因

通过 `ipcs -l` 查看系统默认的 System V 消息队列限制,常见设置如下:

```
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
```

当单条消息体大小**超过 max size of message(默认 8192 bytes)** 时,进程间通信会被阻塞,最终表现为推理请求卡住。

### 解决方案

**调大 System V Message Queue 的消息大小限制。**

由于 MTP 模式下的消息体可接近 800 KB,建议将**单条消息大小限制提升至 1MB(1048576 bytes)**。

Linux 系统可通过以下命令调整(调整后可用文末的脚本确认参数已生效):

```
# 提高单条消息的最大允许大小
sysctl -w kernel.msgmax=1048576

# 提高单个消息队列的最大容量
sysctl -w kernel.msgmnb=268435456
```

> **注意**: 若在 Docker 容器中运行,需要启用特权模式(`--privileged`),或在启动参数中显式设置相关内核参数。

### 废弃说明

当前基于 System V Message Queue 的通信机制将在后续版本中被废弃。未来将迁移到更稳定、更高效的通信方式,以彻底解决上述限制问题。
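补充说明:调整完成后,可以用下面的小脚本快速确认内核参数是否已生效(仅为示意,假设运行在 Linux 上,直接读取 `/proc/sys/kernel/` 下的对应文件):

```
# 读取当前生效的 System V 消息队列限制,确认 sysctl 修改已生效
def read_kernel_param(name: str) -> int:
    with open(f"/proc/sys/kernel/{name}") as f:
        return int(f.read().strip())

msgmax = read_kernel_param("msgmax")  # 单条消息的最大字节数
msgmnb = read_kernel_param("msgmnb")  # 单个消息队列的最大字节数

# MTP 模式下单条消息约 802840 bytes,应当小于 msgmax
print(f"msgmax={msgmax}, msgmnb={msgmnb}, 满足 MTP 消息需求: {msgmax >= 802840}")
```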