
# Monitoring Metrics

After FastDeploy is launched, the status of the FastDeploy service can be continuously monitored through metrics. When starting FastDeploy, you can specify the port for the metrics service with the `metrics-port` parameter.
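For example, the metrics port can be set at launch time. This is an illustrative sketch; the exact entrypoint and model path depend on your deployment, and only `--metrics-port` is the parameter described above:

```shell
# Illustrative launch: serve requests on port 8000 and expose
# Prometheus metrics on port 8001. The entrypoint module and model
# path shown here are placeholders for your actual deployment.
python -m fastdeploy.entrypoints.openai.api_server \
    --model <your-model-path> \
    --port 8000 \
    --metrics-port 8001
```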

| Metric Name | Type | Description | Unit |
|---|---|---|---|
| `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | Count |
| `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | Count |
| `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | Seconds |
| `fastdeploy:time_per_output_token_seconds` | Histogram | Time between consecutive output tokens | Seconds |
| `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | Seconds |
| `fastdeploy:request_inference_time_seconds` | Histogram | Time consumed by requests in the RUNNING phase | Seconds |
| `fastdeploy:request_queue_time_seconds` | Histogram | Time consumed by requests in the WAITING phase | Seconds |
| `fastdeploy:request_prefill_time_seconds` | Histogram | Time consumed by requests in the prefill phase | Seconds |
| `fastdeploy:request_decode_time_seconds` | Histogram | Time consumed by requests in the decode phase | Seconds |
| `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | Count |
| `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | Count |
| `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | Count |
| `fastdeploy:request_generation_tokens` | Histogram | Number of tokens generated per request | Count |
| `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV-cache usage rate | Percentage |
| `fastdeploy:request_params_max_tokens` | Histogram | Distribution of `max_tokens` across requests | Count |
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Information of the engine's CacheConfig | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the decode phase | Count |
| `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | Percentage |
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of cached tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of cached tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Actual batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined when the service started | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in the cache, including prefix-caching blocks that have not yet been formally released | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in the cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks determined when the service started | Count |
| `available_gpu_resource` | Gauge | Ratio of available blocks, i.e. `available_gpu_block_num / max_gpu_block_num` | Percentage |
| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of cache send failures | Count |
| `fastdeploy:first_token_latency` | Gauge | Latest time to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Latest time to generate one token | Seconds |
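Histogram metrics follow the standard Prometheus convention of exposing `_bucket`, `_sum`, and `_count` series, so latency percentiles and averages can be derived with standard PromQL functions. A sketch, assuming Prometheus is already scraping the metrics endpoint:

```promql
# p95 time-to-first-token over the last 5 minutes
histogram_quantile(0.95, rate(fastdeploy:time_to_first_token_seconds_bucket[5m]))

# Average end-to-end request latency over the last 5 minutes
rate(fastdeploy:e2e_request_latency_seconds_sum[5m])
  / rate(fastdeploy:e2e_request_latency_seconds_count[5m])
```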

## Accessing Metrics

- Access URL: `http://localhost:8000/metrics`
- Metric format: Prometheus text format
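The endpoint returns plain text in the Prometheus exposition format, which is straightforward to read programmatically. A minimal sketch of parsing it, using only the standard library; the sample text and values below are illustrative, and a live deployment would fetch the text with, e.g., `urllib.request.urlopen("http://localhost:8000/metrics")`:

```python
def parse_metrics(text):
    """Return {metric_name: value} for simple (non-comment) sample lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # Each sample line is "<name>[{labels}] <value>"; split on the
        # last space so labeled series keep their label set in the key.
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines that do not end in a numeric value

    return metrics

# Illustrative sample in the format the /metrics endpoint returns.
sample = """\
# HELP fastdeploy:num_requests_running Number of currently running requests
# TYPE fastdeploy:num_requests_running gauge
fastdeploy:num_requests_running 2.0
fastdeploy:num_requests_waiting 0.0
"""

print(parse_metrics(sample)["fastdeploy:num_requests_running"])  # 2.0
```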