[简体中文](../zh/best_practices/FAQ.md)
# FAQ
## 1. CUDA out of memory
1. When starting the service:
- Check the minimum number of deployment GPUs required by the model and quantization method; if the requirement is not met, increase the number of deployment GPUs.
- If CUDAGraph is enabled, try reserving more GPU memory for CUDAGraph by lowering `gpu_memory_utilization`, or reduce CUDAGraph's GPU memory usage by lowering `max_num_seqs` and setting `cudagraph_capture_sizes`.
2. During service operation:
- Check whether the log contains messages similar to the following. If so, the cause is usually insufficient output blocks, and you need to reduce `kv-cache-ratio`:
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
It is recommended to enable service-managed global blocks by adding the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
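To confirm that a running service is hitting this condition, you can search the service log for the block-recovery messages shown above. A minimal sketch, assuming the `log/worker_process.log` path used in section 3 (adjust it to wherever your deployment writes worker logs):
```
grep -E "need_block_len|free_list_len" log/worker_process.log | tail
```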
## 2. Poor model performance
1. First, check whether the output length meets expectations, i.e. whether the slowdown is caused by excessively long decoding. If the output is long, check whether the log contains messages similar to the following. If so, the cause is usually insufficient output blocks, and you need to reduce `kv-cache-ratio`:
```
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 133 encoder block len: 24
recover seq_id: 2 free_list_len: 144 used_list_len: 134
need_block_len: 1 free_list_len: 0
step max_id: 2 max_num: 144 encoder_block_len: 24
```
It is also recommended to enable service-managed global blocks by adding the following environment variable before starting the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
2. Check whether the number of KVCache blocks allocated by automatic profiling meets expectations. If GPU memory fluctuations cause profiling to allocate fewer blocks than expected, you can manually set the `num_gpu_blocks_override` parameter to enlarge the KVCache allocation, as shown in the sketch below.
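A minimal sketch of such an override, assuming the OpenAI-compatible entrypoint and a kebab-case flag spelling derived from the parameter name above (verify both against your FastDeploy version):
```
# Hypothetical example: pin the KVCache block count instead of relying on profiling
python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> \
    --num-gpu-blocks-override 17131
```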
## 3. How much concurrency can the service support?
1. It is recommended to configure the following environment variable when deploying the service:
```
export ENABLE_V1_KVCACHE_SCHEDULER=1
```
2. When starting the service, you need to configure `max-num-seqs`.
This parameter specifies the maximum batch size during the Decode phase.
If the concurrency exceeds this value, the extra requests will be queued.
Under normal circumstances, you can set `max-num-seqs` to a relatively high value such as **128**; the actual concurrency is then determined by the load-testing client.
3. `max-num-seqs` represents only the upper limit you configure.
The **actual** concurrency the service can handle depends on the size of the **KVCache**.
After the service starts, check `log/worker_process.log` and look for logs similar to:
```
num_blocks_global: 17131
```
This indicates that the current service has **17131 KVCache blocks**.
With `block_size = 64` (default), the total number of tokens that can be cached is:
```
17131 * 64 = 1,096,384 tokens
```
If the average total number of tokens per request (input + output) is **20K** (20,480 tokens), then the service can actually support approximately:
```
1,096,384 / 20,480 ≈ 53 concurrent requests
```
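The same estimate can be scripted. A minimal sketch, assuming `num_blocks_global` is taken from `log/worker_process.log` and the default `block_size` of 64:
```
num_blocks_global=17131       # from log/worker_process.log
block_size=64                 # default block size
avg_tokens_per_request=20480  # ~20K tokens (input + output)
echo $(( num_blocks_global * block_size / avg_tokens_per_request ))  # prints 53
```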
## 4. Inference Request Stalls After Enabling logprobs
When **logprobs** is enabled, the inference output includes the log-probability of each token, which **significantly increases the size of each message body**. Under default settings, this may exceed the limits of the **System V Message Queue**, causing the inference request to **stall**.
The increase in message size differs between MTP (Multi-Token Prediction) and non-MTP modes. The calculations are shown below.
### Message Size Calculation
1. **Non-MTP + logprobs enabled**
Size of a single message:
```
((512 * (20 + 1)) + 2) * 8
+ 512 * (20 + 1) * 4
+ 512 * 8
= 133136 bytes
```
2. **MTP + logprobs enabled**
Size of a single message:
```
(512 * 6 * (20 + 1) + 512 + 3) * 8
+ 512 * 6 * (20 + 1) * 4
+ 512 * 6 * 8
= 802840 bytes
```
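Both formulas can be reproduced with shell arithmetic. In the sketch below, the variable names are assumptions inferred from the numbers above: 512 output slots per message, top-20 logprobs plus the sampled token, and a factor of 6 for MTP:
```
tokens=512; k=20; mtp=6
# Non-MTP message size
echo $(( ((tokens * (k + 1)) + 2) * 8 + tokens * (k + 1) * 4 + tokens * 8 ))  # 133136
# MTP message size
echo $(( (tokens * mtp * (k + 1) + tokens + 3) * 8 + tokens * mtp * (k + 1) * 4 + tokens * mtp * 8 ))  # 802840
```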
### Root Cause
Running `ipcs -l` typically shows the default System V message queue limits:
```
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
```
If a single message **exceeds the `max size of message` limit (usually 8192 bytes)**, inter-process communication becomes blocked, causing the inference task to stall.
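You can also read the current limits directly from procfs to check whether a host has already been tuned:
```
cat /proc/sys/kernel/msgmax   # max size of a single message (bytes)
cat /proc/sys/kernel/msgmnb   # max capacity of a single queue (bytes)
```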
### Solution
**Increase the System V message queue size limits.**
Since message sizes can approach 800 KB in MTP mode, it is recommended to increase the **maximum message size to at least 1 MB (1048576 bytes)**.
Use the following commands on Linux:
```
# Increase maximum size of a single message
sysctl -w kernel.msgmax=1048576
# Increase maximum capacity of a message queue
sysctl -w kernel.msgmnb=268435456
```
> **Note:** If running inside a Docker container, privileged mode (`--privileged`) is required, or you must explicitly set these kernel parameters via container startup options.
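Also note that `sysctl -w` changes are lost on reboot. To make them persistent on a typical Linux host, append them to `/etc/sysctl.conf` (or a file under `/etc/sysctl.d/`) and reload:
```
echo "kernel.msgmax=1048576" >> /etc/sysctl.conf
echo "kernel.msgmnb=268435456" >> /etc/sysctl.conf
sysctl -p
```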
### Deprecation Notice
This System V message-queue-based communication mechanism will be **deprecated in future releases**. Subsequent versions will migrate to a more robust communication method that eliminates the limitations described above.