diff --git a/docs/online_serving/metrics.md b/docs/online_serving/metrics.md
index f71fa6390..a0de820e2 100644
--- a/docs/online_serving/metrics.md
+++ b/docs/online_serving/metrics.md
@@ -26,6 +26,19 @@ After FastDeploy is launched, it supports continuous monitoring of the FastDeplo
 | `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
 | `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
 | `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
+| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached tokens | Count |
+| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of cached tokens on GPU | Count |
+| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of cached tokens on CPU | Count |
+| `fastdeploy:batch_size` | Gauge | Real batch size during inference | Count |
+| `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined when service started | Count |
+| `fastdeploy:available_gpu_block_num` | Gauge | Number of available gpu blocks in cache, including prefix caching blocks that are not officially released | Count |
+| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in cache | Count |
+| `fastdeploy:max_gpu_block_num` | Gauge | Number of total blocks determined when service started| Count |
+| `available_gpu_resource` | Gauge | Available blocks percentage, i.e. available_gpu_block_num / max_gpu_block_num | Count |
+| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
+| `fastdeploy:send_cache_failed_num` | Counter | Total number of failures of sending cache | Count |
+| `fastdeploy:first_token_latency` | Gauge | Latest time to generate first token in seconds | Seconds |
+| `fastdeploy:infer_latency` | Gauge | Latest time to generate one token in seconds | Seconds |
 
 ## Accessing Metrics
 - Access URL: `http://localhost:8000/metrics`
diff --git a/docs/zh/online_serving/metrics.md b/docs/zh/online_serving/metrics.md
index cb9448257..8220e43dc 100644
--- a/docs/zh/online_serving/metrics.md
+++ b/docs/zh/online_serving/metrics.md
@@ -26,6 +26,19 @@
 | `fastdeploy:hit_token_rate` | Gauge | token级别前缀缓存命中率 | 百分比 |
 | `fastdeploy:cpu_hit_token_rate` | Gauge | token级别CPU前缀缓存命中率 | 百分比 |
 | `fastdeploy:gpu_hit_token_rate` | Gauge | token级别GPU前缀缓存命中率 | 百分比 |
+| `fastdeploy:prefix_cache_token_num` | Counter | 前缀缓存token总数 | 个 |
+| `fastdeploy:prefix_gpu_cache_token_num` | Counter | 位于GPU上的前缀缓存token总数 | 个 |
+| `fastdeploy:prefix_cpu_cache_token_num` | Counter | 位于GPU上的前缀缓存token总数 | 个 |
+| `fastdeploy:batch_size` | Gauge | 推理时的真实批处理大小 | 个 |
+| `fastdeploy:max_batch_size` | Gauge | 服务启动时确定的最大批处理大小 | 个 |
+| `fastdeploy:available_gpu_block_num` | Gauge | 缓存中可用的GPU块数量(包含尚未正式释放的前缀缓存块)| 个 |
+| `fastdeploy:free_gpu_block_num` | Gauge | 缓存中的可用块数 | 个 |
+| `fastdeploy:max_gpu_block_num` | Gauge | 服务启动时确定的总块数 | 个 |
+| `available_gpu_resource` | Gauge | 可用块占比,即可用GPU块数量 / 最大GPU块数量| 个 |
+| `fastdeploy:requests_number` | Counter | 已接收的请求总数 | 个 |
+| `fastdeploy:send_cache_failed_num` | Counter | 发送缓存失败的总次数 | 个 |
+| `fastdeploy:first_token_latency` | Gauge | 最近一次生成首token耗时 | 秒 |
+| `fastdeploy:infer_latency` | Gauge | 最近一次生成单个token的耗时 | 秒 |
 
 ## 指标访问
 - 访问地址:`http://localhost:8000/metrics`
diff --git a/fastdeploy/metrics/metrics.py b/fastdeploy/metrics/metrics.py
index 230aa9f66..c6a6bbba7 100644
--- a/fastdeploy/metrics/metrics.py
+++ b/fastdeploy/metrics/metrics.py
@@ -152,10 +152,10 @@ class MetricsManager:
     spec_decode_draft_single_head_acceptance_rate: "list[Gauge]"
 
     # for YIYAN Adapter
-    prefix_cache_token_num: "Gauge"
-    prefix_gpu_cache_token_num: "Gauge"
-    prefix_cpu_cache_token_num: "Gauge"
-    prefix_ssd_cache_token_num: "Gauge"
+    prefix_cache_token_num: "Counter"
+    prefix_gpu_cache_token_num: "Counter"
+    prefix_cpu_cache_token_num: "Counter"
+    prefix_ssd_cache_token_num: "Counter"
     batch_size: "Gauge"
     max_batch_size: "Gauge"
     available_gpu_block_num: "Gauge"
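
The tables added above define `available_gpu_resource` as `available_gpu_block_num / max_gpu_block_num`. As a rough illustration of how a client might check that relationship, the sketch below parses Prometheus text-format output and recomputes the ratio using only the standard library. The helper names and the sample payload are hypothetical, and the `_total` suffix on the counter sample is an assumption: Prometheus client libraries commonly append it to counters, which may matter for dashboards now that the `prefix_*_cache_token_num` fields are typed as `Counter`.

```python
from __future__ import annotations

from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the raw metrics text from a running service (URL from the docs)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_unlabelled_samples(text: str) -> dict[str, float]:
    """Collect `name value` samples, skipping comments and labelled series."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


def gpu_block_availability(samples: dict[str, float]) -> float:
    """Recompute available_gpu_block_num / max_gpu_block_num."""
    total = samples["fastdeploy:max_gpu_block_num"]
    if total <= 0:
        raise ValueError("max_gpu_block_num must be positive")
    return samples["fastdeploy:available_gpu_block_num"] / total


# Hypothetical payload for illustration only; real output will differ.
SAMPLE = """\
# HELP fastdeploy:available_gpu_block_num Number of available gpu blocks in cache
# TYPE fastdeploy:available_gpu_block_num gauge
fastdeploy:available_gpu_block_num 750.0
fastdeploy:free_gpu_block_num 600.0
fastdeploy:max_gpu_block_num 1000.0
fastdeploy:requests_number_total 42.0
"""

samples = parse_unlabelled_samples(SAMPLE)
ratio = gpu_block_availability(samples)
print(f"available GPU blocks: {ratio:.2%}")  # prints "available GPU blocks: 75.00%"
```

Against a live service, `parse_unlabelled_samples(fetch_metrics())` would replace the canned payload. Since counters only ever increase, a monitoring query over the `prefix_*_cache_token_num` series would normally look at their rate of change rather than the raw value.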