[metrics] update metrics markdown file (#4061)

* adjust md
* trigger ci

---------

Co-authored-by: K11OntheBoat <your_email@example.com>
2025-09-26 20:41:53 +08:00 · 2025-09-12 11:13:43 +08:00
parent 8466219ec8
commit 58e0785bab
3 changed files with 30 additions and 4 deletions
--- a/docs/online_serving/metrics.md
+++ b/docs/online_serving/metrics.md
@@ -26,6 +26,19 @@ After FastDeploy is launched, it supports continuous monitoring of the FastDeplo
 | `fastdeploy:hit_token_rate`                  | Gauge     | Token-level prefix cache hit rate                   | Percentage   |
 | `fastdeploy:cpu_hit_token_rate`              | Gauge     | Token-level CPU prefix cache hit rate               | Percentage   |
 | `fastdeploy:gpu_hit_token_rate`              | Gauge     | Token-level GPU prefix cache hit rate               | Percentage   |
+| `fastdeploy:prefix_cache_token_num`          | Counter   | Total number of cached tokens                       | Count   |
+| `fastdeploy:prefix_gpu_cache_token_num`      | Counter   | Total number of cached tokens on GPU                | Count   |
+| `fastdeploy:prefix_cpu_cache_token_num`      | Counter   | Total number of cached tokens on CPU                | Count   |
+| `fastdeploy:batch_size`                      | Gauge     | Actual batch size during inference                  | Count   |
+| `fastdeploy:max_batch_size`                  | Gauge     | Maximum batch size, determined at service startup   | Count   |
+| `fastdeploy:available_gpu_block_num`         | Gauge     | Number of available GPU blocks in cache, including prefix cache blocks that have not yet been released | Count   |
+| `fastdeploy:free_gpu_block_num`              | Gauge     | Number of free blocks in cache                      | Count   |
+| `fastdeploy:max_gpu_block_num`               | Gauge     | Total number of blocks, determined at service startup | Count   |
+| `available_gpu_resource`                     | Gauge     | Percentage of available GPU blocks, i.e. `available_gpu_block_num / max_gpu_block_num` | Percentage   |
+| `fastdeploy:requests_number`                 | Counter   | Total number of requests received                   | Count   |
+| `fastdeploy:send_cache_failed_num`           | Counter   | Total number of failed cache sends                  | Count   |
+| `fastdeploy:first_token_latency`             | Gauge     | Most recent time taken to generate the first token  | Seconds   |
+| `fastdeploy:infer_latency`                   | Gauge     | Most recent time taken to generate one token        | Seconds   |
 ## Accessing Metrics

 - Access URL: `http://localhost:8000/metrics`
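
Review note: the new block gauges can be read straight from the text served at the access URL above. The following is a minimal sketch, not part of this PR, that parses unlabelled samples from Prometheus text exposition output and recomputes the ratio described for `available_gpu_resource`. The helper names (`fetch_metrics`, `parse_metrics`, `available_ratio`) and the sample payload are hypothetical, and the sketch assumes the two block gauges are exposed without labels.

```python
import urllib.request

# Invented sample payload in Prometheus text exposition format (illustration only).
SAMPLE = """\
# HELP fastdeploy:available_gpu_block_num Number of available GPU blocks
# TYPE fastdeploy:available_gpu_block_num gauge
fastdeploy:available_gpu_block_num 750.0
# HELP fastdeploy:max_gpu_block_num Total number of GPU blocks
# TYPE fastdeploy:max_gpu_block_num gauge
fastdeploy:max_gpu_block_num 1000.0
"""


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw metrics text (hypothetical helper, not called below)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict[str, float]:
    """Collect unlabelled `name value` samples, skipping comments and labelled lines."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


def available_ratio(samples: dict[str, float]) -> float:
    """available_gpu_block_num / max_gpu_block_num, guarding against a zero total."""
    total = samples["fastdeploy:max_gpu_block_num"]
    if total == 0:
        return 0.0
    return samples["fastdeploy:available_gpu_block_num"] / total


if __name__ == "__main__":
    print(available_ratio(parse_metrics(SAMPLE)))  # 0.75
```

Against a running service, `parse_metrics(fetch_metrics())` would replace the sample payload. Labelled samples are deliberately skipped to keep the parser short.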
--- a/docs/zh/online_serving/metrics.md
+++ b/docs/zh/online_serving/metrics.md
@@ -26,6 +26,19 @@
 | `fastdeploy:hit_token_rate`               | Gauge     | token级别前缀缓存命中率      | 百分比   |
 | `fastdeploy:cpu_hit_token_rate`           | Gauge     | token级别CPU前缀缓存命中率   | 百分比   |
 | `fastdeploy:gpu_hit_token_rate`           | Gauge     | token级别GPU前缀缓存命中率   | 百分比   |
+| `fastdeploy:prefix_cache_token_num`       | Counter   | 前缀缓存token总数           | 个   |
+| `fastdeploy:prefix_gpu_cache_token_num`   | Counter   | 位于GPU上的前缀缓存token总数  | 个   |
+| `fastdeploy:prefix_cpu_cache_token_num`   | Counter   | 位于CPU上的前缀缓存token总数  | 个   |
+| `fastdeploy:batch_size`                   | Gauge     | 推理时的真实批处理大小        | 个   |
+| `fastdeploy:max_batch_size`               | Gauge     | 服务启动时确定的最大批处理大小  | 个   |
+| `fastdeploy:available_gpu_block_num`      | Gauge     | 缓存中可用的GPU块数量（包含尚未正式释放的前缀缓存块）| 个   |
+| `fastdeploy:free_gpu_block_num`           | Gauge     | 缓存中的空闲块数             | 个   |
+| `fastdeploy:max_gpu_block_num`            | Gauge     | 服务启动时确定的总块数        | 个   |
+| `available_gpu_resource`                  | Gauge     | 可用块占比，即可用GPU块数量 / 最大GPU块数量| 百分比   |
+| `fastdeploy:requests_number`              | Counter   | 已接收的请求总数             | 个   |
+| `fastdeploy:send_cache_failed_num`        | Counter   | 发送缓存失败的总次数          | 个   |
+| `fastdeploy:first_token_latency`          | Gauge     | 最近一次生成首token耗时       | 秒   |
+| `fastdeploy:infer_latency`                | Gauge     | 最近一次生成单个token的耗时   | 秒   |
 ## 指标访问

 - 访问地址：`http://localhost:8000/metrics`
--- a/fastdeploy/metrics/metrics.py
+++ b/fastdeploy/metrics/metrics.py
@@ -152,10 +152,10 @@ class MetricsManager:
    spec_decode_draft_single_head_acceptance_rate: "list[Gauge]"

    # for YIYAN Adapter
-    prefix_cache_token_num: "Gauge"
-    prefix_gpu_cache_token_num: "Gauge"
-    prefix_cpu_cache_token_num: "Gauge"
-    prefix_ssd_cache_token_num: "Gauge"
+    prefix_cache_token_num: "Counter"
+    prefix_gpu_cache_token_num: "Counter"
+    prefix_cpu_cache_token_num: "Counter"
+    prefix_ssd_cache_token_num: "Counter"
    batch_size: "Gauge"
    max_batch_size: "Gauge"
    available_gpu_block_num: "Gauge"
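
Review note: this hunk changes only the type annotations of the `prefix_*_cache_token_num` attributes from `Gauge` to `Counter`. Assuming the exported values then behave as monotonically increasing totals, dashboards would normally look at the increase over a window rather than the raw value. If the metrics are exported through `prometheus_client`, counter samples are also usually exposed with a `_total` suffix, which is worth checking against existing queries. Below is a minimal sketch of computing an increase from successive scrapes while tolerating a restart; `counter_increase` is a hypothetical helper and the sample values are invented.

```python
def counter_increase(samples: list[float]) -> float:
    """Sum the growth between successive counter samples.

    A drop between two samples is treated as a counter reset (for example,
    a service restart), so the new value is counted from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Reset detected: everything in `cur` accumulated after the restart.
            total += cur
    return total


if __name__ == "__main__":
    # Invented scrape values for a token counter; the drop to 20 models a restart.
    scrapes = [100.0, 150.0, 20.0, 60.0]
    print(counter_increase(scrapes))  # 110.0
```

A gauge, by contrast, can be read directly at any time, which is why `batch_size` and the block counts stay as `Gauge`.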