[metrics] update metrics markdown file (#4061)

* adjust md

* trigger ci

---------

Co-authored-by: K11OntheBoat <your_email@example.com>
This commit is contained in:
qwes5s5
2025-09-12 11:13:43 +08:00
committed by GitHub
parent 8466219ec8
commit 58e0785bab
3 changed files with 30 additions and 4 deletions


@@ -26,6 +26,19 @@ After FastDeploy is launched, it supports continuous monitoring of the FastDeplo
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of cached tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of cached tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Actual batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size, determined at service startup | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in cache, including prefix cache blocks that have not yet been formally released | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks, determined at service startup | Count |
| `available_gpu_resource` | Gauge | Share of blocks that are available, i.e. `available_gpu_block_num / max_gpu_block_num` | Ratio |
| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of failed cache sends | Count |
| `fastdeploy:first_token_latency` | Gauge | Most recent time taken to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Most recent time taken to generate one token | Seconds |
## Accessing Metrics
- Access URL: `http://localhost:8000/metrics`
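
As a rough illustration of how a client might consume this endpoint, the sketch below parses Prometheus text-format output and derives the available-block ratio from two of the gauges listed above. The helper names (`parse_metrics`, `available_gpu_ratio`, `fetch_metrics`) and the sample values are made up for illustration and are not part of FastDeploy; the parser also discards labels, which may be too simple for some of the exported metrics.

```python
import urllib.request

# Hypothetical sample of what the endpoint might return (values are invented).
SAMPLE = """\
# HELP fastdeploy:available_gpu_block_num Number of available GPU blocks
# TYPE fastdeploy:available_gpu_block_num gauge
fastdeploy:available_gpu_block_num 750.0
# TYPE fastdeploy:max_gpu_block_num gauge
fastdeploy:max_gpu_block_num 1000.0
"""


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw metrics text from a running service."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict[str, float]:
    """Parse samples from Prometheus text format into {name: value}.

    Labels are dropped, so if a metric has several label sets
    the last one seen wins. Comment lines (# HELP / # TYPE) are skipped.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        brace = line.find("{")
        if brace != -1:
            # name{label="x"} value [timestamp]
            name = line[:brace]
            rest = line[line.rfind("}") + 1:].split()
            value = rest[0]
        else:
            # name value [timestamp]
            parts = line.split()
            name, value = parts[0], parts[1]
        samples[name] = float(value)
    return samples


def available_gpu_ratio(samples: dict[str, float]) -> float:
    """available_gpu_block_num / max_gpu_block_num, guarding against zero."""
    max_blocks = samples["fastdeploy:max_gpu_block_num"]
    if max_blocks == 0:
        return 0.0
    return samples["fastdeploy:available_gpu_block_num"] / max_blocks


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)  # or parse_metrics(fetch_metrics())
    print(f"available GPU block ratio: {available_gpu_ratio(metrics):.2f}")
```

Against a live service, `parse_metrics(fetch_metrics())` would replace the hard-coded sample, assuming the service listens on the default address shown above.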


@@ -26,6 +26,19 @@
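| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of prefix cache tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of prefix cache tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of prefix cache tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Actual batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size, determined at service startup | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in cache, including prefix cache blocks that have not yet been formally released | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks, determined at service startup | Count |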
| `available_gpu_resource` | Gauge | Share of blocks that are available, i.e. number of available GPU blocks / maximum number of GPU blocks | Ratio |
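| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of failed cache sends | Count |
| `fastdeploy:first_token_latency` | Gauge | Most recent time taken to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Most recent time taken to generate one token | Seconds |
## Accessing Metrics
- Access URL: `http://localhost:8000/metrics`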


@@ -152,10 +152,10 @@ class MetricsManager:
     spec_decode_draft_single_head_acceptance_rate: "list[Gauge]"
     # for YIYAN Adapter
-    prefix_cache_token_num: "Gauge"
-    prefix_gpu_cache_token_num: "Gauge"
-    prefix_cpu_cache_token_num: "Gauge"
-    prefix_ssd_cache_token_num: "Gauge"
+    prefix_cache_token_num: "Counter"
+    prefix_gpu_cache_token_num: "Counter"
+    prefix_cpu_cache_token_num: "Counter"
+    prefix_ssd_cache_token_num: "Counter"
     batch_size: "Gauge"
     max_batch_size: "Gauge"
     available_gpu_block_num: "Gauge"
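
The type change in this hunk affects how the `prefix_*_cache_token_num` values should be read: a Counter only increases (apart from resets when the process restarts), so consumers usually look at its rate of change rather than its absolute value. A minimal sketch of that calculation, assuming two scraped samples taken some seconds apart; `counter_rate` is a hypothetical helper, not a FastDeploy or Prometheus API:

```python
def counter_rate(prev_value: float, prev_time: float,
                 curr_value: float, curr_time: float) -> float:
    """Per-second increase of a monotonically increasing counter.

    If the current value is lower than the previous one, the counter is
    assumed to have been reset to zero (e.g. a service restart), so the
    whole current value is treated as the increase since the reset.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value  # counter reset between the two samples
    return delta / elapsed


if __name__ == "__main__":
    # Invented samples of a cached-token counter, scraped 30 seconds apart.
    rate = counter_rate(prev_value=100, prev_time=0.0,
                        curr_value=160, curr_time=30.0)
    print(f"cached tokens per second: {rate}")
```

With a Gauge, the same subtraction would not be meaningful, since a gauge may legitimately go down between samples.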