mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2025-10-05 00:33:03 +08:00
33 lines
3.2 KiB
Markdown
33 lines
3.2 KiB
Markdown
# Monitoring Metrics
|
|
|
|
After FastDeploy is launched, it supports continuous monitoring of the FastDeploy service status through Metrics. When starting FastDeploy, you can specify the port for the Metrics service by configuring the `metrics-port` parameter.
|
|
|
|
| Metric Name | Type | Description | Unit |
|
|
| --------------------------------------------- | --------- |-------------------------------------|------|
|
|
| `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | Count |
|
|
| `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | Count |
|
|
| `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | Seconds |
|
|
| `fastdeploy:time_per_output_token_seconds` | Histogram | Generation time for interval output tokens | Seconds |
|
|
| `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | Seconds |
|
|
| `fastdeploy:request_inference_time_seconds` | Histogram | Time consumed by requests in the RUNNING phase | Seconds |
|
|
| `fastdeploy:request_queue_time_seconds` | Histogram | Time consumed by requests in the WAITING phase | Seconds |
|
|
| `fastdeploy:request_prefill_time_seconds` | Histogram | Time consumed by requests in the prefill phase | Seconds |
|
|
| `fastdeploy:request_decode_time_seconds` | Histogram | Time consumed by requests in the decode phase | Seconds |
|
|
| `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | Count |
|
|
| `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | Count |
|
|
| `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | Count |
|
|
| `fastdeploy:request_generation_tokens` | Histogram | Number of tokens generated per request | Count |
|
|
| `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV-cache usage rate | Percentage |
|
|
| `fastdeploy:request_params_max_tokens` | Histogram | Distribution of max_tokens for requests | Count |
|
|
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
|
|
| `fastdeploy:cache_config_info` | Gauge | Information of the engine's CacheConfig | Count |
|
|
| `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the Decode phase| Count |
|
|
| `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | Percentage |
|
|
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
|
|
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
|
|
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
|
|
## Accessing Metrics
|
|
|
|
- Access URL: `http://localhost:8000/metrics`
|
|
- Metric Type: Prometheus format
|