# Monitoring Metrics
After FastDeploy is launched, the service status can be continuously monitored through Metrics. When starting FastDeploy, you can specify the port for the Metrics service with the `metrics-port` parameter.
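
As a quick sanity check after launch, the sketch below polls the endpoint until it answers. It is an illustrative example, not part of FastDeploy: it assumes the `requests` package is installed and uses the default access URL `http://localhost:8000/metrics` described later on this page; substitute the port you passed to `metrics-port` if you set one.

```python
import time

import requests  # third-party HTTP client, used here only for illustration

# Assumption: metrics are served at the default access URL shown on this page.
# If you launched FastDeploy with a custom --metrics-port, point this at that port.
METRICS_URL = "http://localhost:8000/metrics"


def wait_for_metrics(url: str = METRICS_URL, timeout_s: float = 60.0) -> bool:
    """Poll the metrics endpoint until it responds with FastDeploy metrics, or give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(url, timeout=2)
            if resp.status_code == 200 and "fastdeploy:" in resp.text:
                return True
        except requests.RequestException:
            pass  # service not ready yet
        time.sleep(1)
    return False


if __name__ == "__main__":
    print("metrics endpoint ready:", wait_for_metrics())
```
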

| Metric Name | Type | Description | Unit |
| --------------------------------------------- | --------- | ------------------------------------------------------------ | ---------- |
| `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | Count |
| `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | Count |
| `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | Seconds |
| `fastdeploy:time_per_output_token_seconds` | Histogram | Time between consecutive output tokens | Seconds |
| `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | Seconds |
| `fastdeploy:request_inference_time_seconds` | Histogram | Time spent by requests in the RUNNING phase | Seconds |
| `fastdeploy:request_queue_time_seconds` | Histogram | Time spent by requests in the WAITING phase | Seconds |
| `fastdeploy:request_prefill_time_seconds` | Histogram | Time spent by requests in the prefill phase | Seconds |
| `fastdeploy:request_decode_time_seconds` | Histogram | Time spent by requests in the decode phase | Seconds |
| `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | Count |
| `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | Count |
| `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | Count |
| `fastdeploy:request_generation_tokens` | Histogram | Number of tokens generated per request | Count |
| `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV-cache usage rate | Percentage |
| `fastdeploy:request_params_max_tokens` | Histogram | Distribution of max_tokens across requests | Count |
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Information about the engine's CacheConfig | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the decode phase | Count |
| `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | Percentage |
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of cached tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of cached tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Actual batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined when the service started | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in the cache, including prefix-cached blocks that have not yet been released | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in the cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks determined when the service started | Count |
| `fastdeploy:available_gpu_resource` | Gauge | Fraction of available blocks, i.e. available_gpu_block_num / max_gpu_block_num | Percentage |
| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of cache send failures | Count |
| `fastdeploy:first_token_latency` | Gauge | Latest time to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Latest time to generate one token | Seconds |

## Accessing Metrics
- Access URL: `http://localhost:8000/metrics`
- Metric Type: Prometheus format
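
For programmatic access, here is a minimal sketch that fetches the endpoint above and prints every FastDeploy sample. It assumes the `requests` and `prometheus_client` packages are installed; neither is required by FastDeploy itself.

```python
import requests  # third-party HTTP client
from prometheus_client.parser import text_string_to_metric_families

# Assumption: the default access URL from this page; change the port if you set metrics-port.
METRICS_URL = "http://localhost:8000/metrics"


def dump_fastdeploy_metrics(url: str = METRICS_URL) -> None:
    """Fetch the Prometheus text exposition and print all FastDeploy metric samples."""
    text = requests.get(url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if not family.name.startswith("fastdeploy:"):
            continue
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)


if __name__ == "__main__":
    dump_fastdeploy_metrics()
```

The same endpoint can of course be scraped directly by a Prometheus server; the snippet is only a quick way to inspect the exposed values by hand.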