Mirror of https://github.com/PaddlePaddle/FastDeploy.git (synced 2025-12-24 13:28:13 +08:00)
[Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics (#4993)
* [update] update time_to_first_tokens to include queue time, and remove first_token_latency and infer_latency
* [doc] update docs
* [ci] fix test
* [chore] delete redundant code

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
@@ -4,43 +4,42 @@
After FastDeploy is launched, the service status can be continuously monitored through Metrics. When starting FastDeploy, you can specify the port of the Metrics service with the `metrics-port` parameter.
| Metric Name | Type | Description | Unit |
| --------------------------------------------- | --------- | ------------------------------------------------ | ------ |
| `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | Count |
| `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | Count |
| `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | Seconds |
| `fastdeploy:time_per_output_token_seconds` | Histogram | Generation time between consecutive output tokens | Seconds |
| `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | Seconds |
| `fastdeploy:request_inference_time_seconds` | Histogram | Time consumed by requests in the RUNNING phase | Seconds |
| `fastdeploy:request_queue_time_seconds` | Histogram | Time consumed by requests in the WAITING phase | Seconds |
| `fastdeploy:request_prefill_time_seconds` | Histogram | Time consumed by requests in the prefill phase | Seconds |
| `fastdeploy:request_decode_time_seconds` | Histogram | Time consumed by requests in the decode phase | Seconds |
| `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | Count |
| `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | Count |
| `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | Count |
| `fastdeploy:request_generation_tokens` | Histogram | Number of tokens generated per request | Count |
| `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV-cache usage rate | Percentage |
| `fastdeploy:request_params_max_tokens` | Histogram | Distribution of max_tokens for requests | Count |
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Information of the engine's CacheConfig | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the Decode phase | Count |
| `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | Percentage |
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of cached tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of cached tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Real batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined when the service started | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in the cache, including prefix-cache blocks that are not yet officially released | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in the cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks determined when the service started | Count |
| `fastdeploy:available_gpu_resource` | Gauge | Percentage of available blocks, i.e. available_gpu_block_num / max_gpu_block_num | Count |
| `fastdeploy:requests_number` | Counter | Total number of requests received | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of failures when sending the cache | Count |
| `fastdeploy:first_token_latency` | Gauge | Latest time to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Latest time to generate one token | Seconds |

| Category | Metric Name | Type | Description | Unit |
| :---: | ----------------------------------------- | --------- | ------------------------------ | ------ |
| Request | `fastdeploy:requests_number` | Counter | Total number of received requests | count |
| Request | `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | count |
| Request | `fastdeploy:num_requests_running` | Gauge | Number of requests currently running | count |
| Request | `fastdeploy:num_requests_waiting` | Gauge | Number of requests currently waiting | count |
| Latency | `fastdeploy:time_to_first_token_seconds` | Histogram | Time to generate the first token (TTFT) | s |
| Latency | `fastdeploy:time_per_output_token_seconds` | Histogram | Time interval between generated tokens (TPOT) | s |
| Latency | `fastdeploy:e2e_request_latency_seconds` | Histogram | End-to-end request latency distribution | s |
| Latency | `fastdeploy:request_inference_time_seconds` | Histogram | Time spent in the RUNNING phase | s |
| Latency | `fastdeploy:request_queue_time_seconds` | Histogram | Time spent in the WAITING phase | s |
| Latency | `fastdeploy:request_prefill_time_seconds` | Histogram | Time spent in the Prefill phase | s |
| Latency | `fastdeploy:request_decode_time_seconds` | Histogram | Time spent in the Decode phase | s |
| Token | `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | count |
| Token | `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | count |
| Token | `fastdeploy:request_prompt_tokens` | Histogram | Prompt token count per request | count |
| Token | `fastdeploy:request_generation_tokens` | Histogram | Generation token count per request | count |
| Token | `fastdeploy:request_params_max_tokens` | Histogram | Distribution of `max_tokens` per request | count |
| Batch | `fastdeploy:available_batch_size` | Gauge | Number of additional requests that can be inserted during Decode | count |
| Batch | `fastdeploy:batch_size` | Gauge | Actual batch size during inference | count |
| Batch | `fastdeploy:max_batch_size` | Gauge | Maximum batch size configured at service startup | count |
| KV Cache | `fastdeploy:cache_config_info` | Gauge | Cache configuration info of the inference engine | count |
| KV Cache | `fastdeploy:hit_req_rate` | Gauge | Prefix cache hit rate at the request level | % |
| KV Cache | `fastdeploy:hit_token_rate` | Gauge | Prefix cache hit rate at the token level | % |
| KV Cache | `fastdeploy:cpu_hit_token_rate` | Gauge | CPU-side token-level prefix cache hit rate | % |
| KV Cache | `fastdeploy:gpu_hit_token_rate` | Gauge | GPU-side token-level prefix cache hit rate | % |
| KV Cache | `fastdeploy:prefix_cache_token_num` | Counter | Total number of tokens in the prefix cache | count |
| KV Cache | `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of prefix cache tokens on GPU | count |
| KV Cache | `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of prefix cache tokens on CPU | count |
| KV Cache | `fastdeploy:available_gpu_block_num` | Gauge | Available GPU blocks in cache (including unreleased prefix blocks) | count |
| KV Cache | `fastdeploy:free_gpu_block_num` | Gauge | Number of free GPU blocks in cache | count |
| KV Cache | `fastdeploy:max_gpu_block_num` | Gauge | Total number of GPU blocks initialized at startup | count |
| KV Cache | `fastdeploy:available_gpu_resource` | Gauge | Ratio of available GPU blocks to total GPU blocks | % |
| KV Cache | `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV cache utilization | % |
| KV Cache | `fastdeploy:send_cache_failed_num` | Counter | Total number of cache send failures | count |

## Accessing Metrics
- Access URL: `http://localhost:8000/metrics`
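
The snippet below is a minimal sketch (not part of FastDeploy) showing one way to consume this endpoint: it fetches the Prometheus text output with the standard library and derives the mean TTFT from the histogram's `_sum` and `_count` series. It assumes the default port `8000`; adjust the URL to match your `metrics-port` setting.

```python
# Minimal sketch, assuming the service exposes Prometheus text format on port 8000.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"


def mean_ttft(url: str = METRICS_URL) -> float:
    """Average time-to-first-token in seconds, computed as histogram _sum / _count."""
    text = urllib.request.urlopen(url).read().decode("utf-8")
    total, count = 0.0, 0.0
    for line in text.splitlines():
        if line.startswith("fastdeploy:time_to_first_token_seconds_sum"):
            total = float(line.split()[-1])
        elif line.startswith("fastdeploy:time_to_first_token_seconds_count"):
            count = float(line.split()[-1])
    return total / count if count else 0.0


if __name__ == "__main__":
    print(f"mean TTFT: {mean_ttft():.3f} s")
```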
@@ -4,43 +4,42 @@
After FastDeploy is launched, the FastDeploy service status can be continuously monitored through Metrics. When starting FastDeploy, you can specify the port of the Metrics service with the `metrics-port` parameter.

| Metric Name | Type | Description | Unit |
| ----------------------------------------- | --------- | ------------------------------ | ------ |
| `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | Count |
| `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | Count |
| `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | Seconds |
| `fastdeploy:time_per_output_token_seconds` | Histogram | Generation time between consecutive output tokens | Seconds |
| `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | Seconds |
| `fastdeploy:request_inference_time_seconds` | Histogram | Time consumed by requests in the RUNNING phase | Seconds |
| `fastdeploy:request_queue_time_seconds` | Histogram | Time consumed by requests in the WAITING phase | Seconds |
| `fastdeploy:request_prefill_time_seconds` | Histogram | Time consumed by requests in the prefill phase | Seconds |
| `fastdeploy:request_decode_time_seconds` | Histogram | Time consumed by requests in the decode phase | Seconds |
| `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | Count |
| `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | Count |
| `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | Count |
| `fastdeploy:request_generation_tokens` | Histogram | Number of tokens generated per request | Count |
| `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV-cache usage rate | Percentage |
| `fastdeploy:request_params_max_tokens` | Histogram | Distribution of max_tokens for requests | Count |
| `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | Count |
| `fastdeploy:cache_config_info` | Gauge | Cache configuration info of the inference engine | Count |
| `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the Decode phase | Count |
| `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | Percentage |
| `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | Percentage |
| `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | Percentage |
| `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | Percentage |
| `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached prefix tokens | Count |
| `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of prefix cache tokens on GPU | Count |
| `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of prefix cache tokens on CPU | Count |
| `fastdeploy:batch_size` | Gauge | Actual batch size during inference | Count |
| `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined at service startup | Count |
| `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in the cache (including prefix cache blocks not yet officially released) | Count |
| `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in the cache | Count |
| `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks determined at service startup | Count |
| `fastdeploy:available_gpu_resource` | Gauge | Ratio of available blocks, i.e. available GPU blocks / maximum GPU blocks | Count |
| `fastdeploy:requests_number` | Counter | Total number of received requests | Count |
| `fastdeploy:send_cache_failed_num` | Counter | Total number of cache send failures | Count |
| `fastdeploy:first_token_latency` | Gauge | Latest time to generate the first token | Seconds |
| `fastdeploy:infer_latency` | Gauge | Latest time to generate a single token | Seconds |

| Category | Metric Name | Type | Description | Unit |
| :---: | ----------------------------------------- | --------- | ------------------------------ | ------ |
| Request | `fastdeploy:requests_number` | Counter | Total number of received requests | count |
| Request | `fastdeploy:request_success_total` | Counter | Number of successfully processed requests | count |
| Request | `fastdeploy:num_requests_running` | Gauge | Number of currently running requests | count |
| Request | `fastdeploy:num_requests_waiting` | Gauge | Number of currently waiting requests | count |
| Latency | `fastdeploy:time_to_first_token_seconds` | Histogram | Time required to generate the first token | s |
| Latency | `fastdeploy:time_per_output_token_seconds` | Histogram | Time interval between generated tokens | s |
| Latency | `fastdeploy:e2e_request_latency_seconds` | Histogram | Distribution of end-to-end request latency | s |
| Latency | `fastdeploy:request_inference_time_seconds` | Histogram | Time spent in the RUNNING phase | s |
| Latency | `fastdeploy:request_queue_time_seconds` | Histogram | Time spent in the WAITING phase | s |
| Latency | `fastdeploy:request_prefill_time_seconds` | Histogram | Time spent in the Prefill phase | s |
| Latency | `fastdeploy:request_decode_time_seconds` | Histogram | Time spent in the Decode phase | s |
| Token | `fastdeploy:prompt_tokens_total` | Counter | Total number of processed prompt tokens | count |
| Token | `fastdeploy:generation_tokens_total` | Counter | Total number of generated tokens | count |
| Token | `fastdeploy:request_prompt_tokens` | Histogram | Number of prompt tokens per request | count |
| Token | `fastdeploy:request_generation_tokens` | Histogram | Number of generation tokens per request | count |
| Token | `fastdeploy:request_params_max_tokens` | Histogram | Distribution of max_tokens per request | count |
| Batch | `fastdeploy:available_batch_size` | Gauge | Number of requests that can still be inserted during the Decode phase | count |
| Batch | `fastdeploy:batch_size` | Gauge | Actual batch size during inference | count |
| Batch | `fastdeploy:max_batch_size` | Gauge | Maximum batch size determined at service startup | count |
| KV Cache | `fastdeploy:cache_config_info` | Gauge | Cache configuration info of the inference engine | count |
| KV Cache | `fastdeploy:hit_req_rate` | Gauge | Request-level prefix cache hit rate | % |
| KV Cache | `fastdeploy:hit_token_rate` | Gauge | Token-level prefix cache hit rate | % |
| KV Cache | `fastdeploy:cpu_hit_token_rate` | Gauge | Token-level CPU prefix cache hit rate | % |
| KV Cache | `fastdeploy:gpu_hit_token_rate` | Gauge | Token-level GPU prefix cache hit rate | % |
| KV Cache | `fastdeploy:prefix_cache_token_num` | Counter | Total number of cached prefix tokens | count |
| KV Cache | `fastdeploy:prefix_gpu_cache_token_num` | Counter | Total number of prefix cache tokens on GPU | count |
| KV Cache | `fastdeploy:prefix_cpu_cache_token_num` | Counter | Total number of prefix cache tokens on CPU | count |
| KV Cache | `fastdeploy:available_gpu_block_num` | Gauge | Number of available GPU blocks in the cache (including prefix cache blocks not yet officially released) | count |
| KV Cache | `fastdeploy:free_gpu_block_num` | Gauge | Number of free blocks in the cache | count |
| KV Cache | `fastdeploy:max_gpu_block_num` | Gauge | Total number of blocks determined at service startup | count |
| KV Cache | `fastdeploy:available_gpu_resource` | Gauge | Ratio of available blocks, i.e. available GPU blocks / maximum GPU blocks | % |
| KV Cache | `fastdeploy:gpu_cache_usage_perc` | Gauge | GPU KV cache usage rate | % |
| KV Cache | `fastdeploy:send_cache_failed_num` | Counter | Total number of cache send failures | count |

## Accessing Metrics

- Access URL: `http://localhost:8000/metrics`

@@ -1820,6 +1820,7 @@ class PrefixCacheManager:
        # reset metrics
        self.metrics.reset_metrics()
        main_process_metrics.free_gpu_block_num.set(len(self.gpu_free_block_list))
        main_process_metrics.available_gpu_block_num.set(len(self.gpu_free_block_list))
        main_process_metrics.available_gpu_resource.set(self.available_gpu_resource)

    def clear_prefix_cache(self):
@@ -702,6 +702,7 @@ class EngineService:
                batch=num_prefill_batch,
            )
            for task in tasks:
                task.schedule_start_time = time.time()
                trace_print(LoggingEventName.REQUEST_QUEUE_END, task.request_id, getattr(task, "user", ""))

        if self.cfg.scheduler_config.splitwise_role == "decode":
@@ -814,6 +815,7 @@ class EngineService:
                    break
                else:
                    raise

            # 2. Schedule requests
            tasks, error_tasks = self.resource_manager.schedule()
@@ -75,8 +75,9 @@ class Request:
        pooling_params: Optional[PoolingParams] = None,
        preprocess_start_time: Optional[float] = None,
        preprocess_end_time: Optional[float] = None,
        inference_start_time: float = 0,
        llm_engine_recv_req_timestamp: float = 0,
        schedule_start_time: Optional[float] = None,
        inference_start_time: Optional[float] = None,
        llm_engine_recv_req_timestamp: Optional[float] = None,
        multimodal_inputs: Optional[dict] = None,
        multimodal_data: Optional[dict] = None,
        disable_chat_template: bool = False,
@@ -122,10 +123,9 @@ class Request:
        self.arrival_time = arrival_time
        self.preprocess_start_time = preprocess_start_time
        self.preprocess_end_time = preprocess_end_time
        self.schedule_start_time = schedule_start_time
        self.inference_start_time = inference_start_time
        self.llm_engine_recv_req_timestamp = (
            llm_engine_recv_req_timestamp if llm_engine_recv_req_timestamp else time.time()
        )
        self.llm_engine_recv_req_timestamp = llm_engine_recv_req_timestamp or time.time()
        self.disable_chat_template = disable_chat_template
        self.disaggregate_info = disaggregate_info
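
The constructor change above replaces the explicit if/else with `llm_engine_recv_req_timestamp or time.time()`. A small illustration of that fallback follows; `resolve_recv_timestamp` is a hypothetical helper, not part of the codebase, and the values are made up.

```python
import time


def resolve_recv_timestamp(llm_engine_recv_req_timestamp=None):
    # Equivalent to the removed if/else: fall back to "now" when the timestamp is unset.
    # Note that 0 is falsy, so an explicit 0 also falls back to the current time.
    return llm_engine_recv_req_timestamp or time.time()


assert resolve_recv_timestamp(123.456) == 123.456  # explicit timestamp is kept
assert resolve_recv_timestamp() > 1.0              # None -> current time
assert resolve_recv_timestamp(0) > 1.0             # 0 also treated as "not provided"
```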
@@ -691,7 +691,6 @@ class ResourceManagerV1(ResourceManager):
                self.running.append(request)
                scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
                request.inference_start_time = time.time()
                request.schedule_start_time = time.time()
                token_budget -= num_new_tokens
                request.num_computed_tokens += num_new_tokens
                if self.config.cache_config.enable_prefix_caching:
@@ -926,7 +925,6 @@ class ResourceManagerV1(ResourceManager):
        with self.lock:
            for request in requests:
                request.inference_start_time = time.time()
                request.schedule_start_time = time.time()
                self.running.append(request)

    def preallocate_resource_in_p(self, request: Request):
@@ -164,8 +164,6 @@ class MetricsManager:
    available_gpu_resource: "Gauge"
    requests_number: "Counter"
    send_cache_failed_num: "Counter"
    first_token_latency: "Gauge"
    infer_latency: "Gauge"
    cache_config_info: "Gauge"
    available_batch_size: "Gauge"
    hit_req_rate: "Gauge"
@@ -349,18 +347,6 @@ class MetricsManager:
            "description": "Total number of failures of sending cache",
            "kwargs": {},
        },
        "first_token_latency": {
            "type": Gauge,
            "name": "fastdeploy:first_token_latency",
            "description": "Latest time to first token in seconds",
            "kwargs": {},
        },
        "infer_latency": {
            "type": Gauge,
            "name": "fastdeploy:infer_latency",
            "description": "Latest time to generate one token in seconds",
            "kwargs": {},
        },
        "available_batch_size": {
            "type": Gauge,
            "name": "fastdeploy:available_batch_size",
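
For orientation, here is a hedged, standalone sketch of how spec entries like the ones above can be turned into `prometheus_client` metric objects. `METRIC_SPECS` and `build_metrics` are illustrative names, not FastDeploy's actual helpers, and the histogram uses the library's default buckets.

```python
# Illustrative sketch only; FastDeploy's MetricsManager is more elaborate.
from prometheus_client import Gauge, Histogram

METRIC_SPECS = {
    "time_to_first_token": {
        "type": Histogram,
        "name": "fastdeploy:time_to_first_token_seconds",
        "description": "Time required to generate the first token",
        "kwargs": {},
    },
    "available_batch_size": {
        "type": Gauge,
        "name": "fastdeploy:available_batch_size",
        "description": "Number of requests that can still be inserted during the Decode phase",
        "kwargs": {},
    },
}


def build_metrics(specs):
    """Instantiate each spec as type(name, description, **kwargs)."""
    return {key: spec["type"](spec["name"], spec["description"], **spec["kwargs"]) for key, spec in specs.items()}


metrics = build_metrics(METRIC_SPECS)
metrics["available_batch_size"].set(4)        # Gauge: record the current value
metrics["time_to_first_token"].observe(0.42)  # Histogram: record one sample
```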
@@ -15,7 +15,6 @@
"""

import copy
import os
import threading
import time
import traceback
@@ -37,7 +36,7 @@ from fastdeploy.engine.request import (
    RequestMetrics,
    RequestOutput,
)
from fastdeploy.inter_communicator import IPCSignal, ZmqIpcServer
from fastdeploy.inter_communicator import ZmqIpcServer
from fastdeploy.metrics.metrics import main_process_metrics
from fastdeploy.platforms import current_platform
from fastdeploy.trace.constants import LoggingEventName
@@ -114,14 +113,6 @@ class TokenProcessor:
        self.num_accept_requests_per_head = [
            0,
        ] * MAX_DRAFT_TOKENS
        prefill_time_data = np.zeros([100], dtype=np.float32)
        self.prefill_time_signal = IPCSignal(
            name="prefill_time_signal",
            array=prefill_time_data,
            dtype=np.float32,
            suffix=os.getpid(),
            create=True,
        )
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.prefill_result_status = dict()
        self._finalizer = weakref.finalize(self, self._cleanup_resources)
@@ -129,9 +120,6 @@ class TokenProcessor:
    def _cleanup_resources(self):
        """Cleaning up shared memory resources"""
        if hasattr(self, "prefill_time_signal"):
            self.prefill_time_signal.clear()

        if hasattr(self, "executor"):
            self.executor.shutdown(wait=False)
@@ -255,7 +243,7 @@ class TokenProcessor:
                arrival_time=task.arrival_time,
                inference_start_time=task.inference_start_time,
                first_token_time=time.time() - task.inference_start_time,
                time_in_queue=task.schedule_start_time - task.preprocess_end_time,
                time_in_queue=task.inference_start_time - task.preprocess_end_time,
                preprocess_cost_time=task.preprocess_end_time - task.preprocess_start_time,
                request_start_time=task.arrival_time,
            )
@@ -416,28 +404,10 @@ class TokenProcessor:
                if self.output_tokens[0, 0] == -2:
                    continue
                llm_logger.debug(f"rank_id {rank_id} self.output_tokens[0, 0] {self.output_tokens[0, 0]}")
                self._process_prefill_metrics()
                self._process_batch_output()
            except Exception as e:
                llm_logger.info(f"while get input_data error: {e} {traceback.format_exc()!s}")

    def _process_prefill_metrics(self):
        """Asynchronous processing of prefill time metrics"""

        def process_metrics():
            try:
                current_index = 0
                while current_index < len(self.prefill_time_signal.value):
                    prefill_time = self.prefill_time_signal.value[current_index]
                    if prefill_time > 0:
                        main_process_metrics.request_prefill_time.observe(prefill_time)
                        self.prefill_time_signal.value[current_index] = 0
                    current_index += 1
            except Exception as e:
                llm_logger.error(f"Error processing prefill metrics: {e}, {str(traceback.format_exc())}")

        self.executor.submit(process_metrics)

    def postprocess(self, batch_result: List[RequestOutput], mtype=3):
        """
        single post-processing function
@@ -654,7 +624,7 @@ class TokenProcessor:
                inference_start_time=task.inference_start_time,
                model_execute_time=time.time() - task.inference_start_time,
                first_token_time=time.time() - task.inference_start_time,
                time_in_queue=task.schedule_start_time - task.preprocess_end_time,
                time_in_queue=task.inference_start_time - task.preprocess_end_time,
                preprocess_cost_time=task.preprocess_end_time - task.preprocess_start_time,
                request_start_time=task.arrival_time,
                llm_engine_recv_req_timestamp=task.llm_engine_recv_req_timestamp,
@@ -784,9 +754,9 @@ class TokenProcessor:
        task.first_token_time = current_time
        trace_print(LoggingEventName.FIRST_TOKEN_GENERATED, task.request_id, getattr(task, "user", ""))
        trace_print(LoggingEventName.DECODE_START, task.request_id, getattr(task, "user", ""))
        main_process_metrics.first_token_latency.set(current_time - task.inference_start_time)
        main_process_metrics.time_to_first_token.observe(current_time - task.inference_start_time)
        main_process_metrics.request_queue_time.observe(task.schedule_start_time - task.preprocess_end_time)
        main_process_metrics.time_to_first_token.observe(current_time - task.arrival_time)
        main_process_metrics.request_queue_time.observe(task.inference_start_time - task.preprocess_end_time)
        main_process_metrics.request_prefill_time.observe(current_time - task.inference_start_time)

    def _record_completion_metrics(self, task, current_time):
        """Record metrics when request completes"""
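
The hunk above is the core of the change: with the redundant `first_token_latency` gauge removed, `time_to_first_token` is now observed from `arrival_time`, so it covers tokenization, queue wait, and prefill rather than only the time after inference starts, and queue time is measured up to `inference_start_time`. The sketch below restates that arithmetic with an illustrative `Task` stand-in and made-up timestamps; it is not FastDeploy code.

```python
import time
from dataclasses import dataclass


@dataclass
class Task:  # hypothetical stand-in for the real request/task object
    arrival_time: float
    preprocess_end_time: float
    inference_start_time: float


def first_token_observations(task: Task, current_time: float) -> dict:
    """Intervals observed when the first token is produced, under the new scheme."""
    return {
        # TTFT now spans tokenization + queue + prefill, measured from arrival
        "time_to_first_token": current_time - task.arrival_time,
        # queue time: end of preprocessing until inference starts
        "request_queue_time": task.inference_start_time - task.preprocess_end_time,
        # prefill time: start of inference until the first token
        "request_prefill_time": current_time - task.inference_start_time,
    }


now = time.time()
task = Task(arrival_time=now - 1.2, preprocess_end_time=now - 1.0, inference_start_time=now - 0.4)
print(first_token_observations(task, current_time=now))  # TTFT ~= 1.2 s = 0.2 + 0.6 + 0.4
```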
@@ -797,7 +767,6 @@ class TokenProcessor:
        trace_print(LoggingEventName.POSTPROCESSING_START, task.request_id, getattr(task, "user", ""))
        main_process_metrics.num_requests_running.dec(1)
        main_process_metrics.request_success_total.inc()
        main_process_metrics.infer_latency.set(current_time - task.inference_start_time)
        main_process_metrics.request_inference_time.observe(current_time - task.inference_start_time)
        main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
@@ -59,13 +59,12 @@ class TestCoverageFix(unittest.TestCase):
        self.mock_split_connector = MagicMock()
        self.mock_resource_manager = MagicMock()

        with patch("fastdeploy.output.token_processor.IPCSignal"):
            self.processor = TokenProcessor(
                cfg=self.mock_cfg,
                cached_generated_tokens=self.mock_cached_generated_tokens,
                engine_worker_queue=self.mock_engine_worker_queue,
                split_connector=self.mock_split_connector,
            )
        self.processor = TokenProcessor(
            cfg=self.mock_cfg,
            cached_generated_tokens=self.mock_cached_generated_tokens,
            engine_worker_queue=self.mock_engine_worker_queue,
            split_connector=self.mock_split_connector,
        )
        self.processor.resource_manager = self.mock_resource_manager

        # Use patch to mock the main_process_metrics referenced in the token_processor module