[Executor] Default use CUDAGraph (#3594)

* add start intercept
* adjust GraphOptConfig
* pre-commit
* default use cudagraph
* set default value
* default use cuda graph
* pre-commit
* fix test case bug
* disable rl
* fix moba attention
* only support gpu
* Temporarily disable PD Disaggregation
* set max_num_seqs of test case as 1
* set max_num_seqs and temperature
* fix max_num_batched_tokens bug
* close cuda graph
* successfully run wint2
* profile run with max_num_batched_tokens
* 1. add c++ memchecker 2. successfully run wint2
* update a800 yaml
* update docs
* 1. delete check 2. fix plas attn test case
* enable use_unique_memory_pool by default
* add try-except for warmup
* ban mtp, mm, rl
* fix test case mock
* fix ci bug
* fix form_model_get_output_topp0 bug
* fix ci bug
* refine deepseek ci
* refine code
* Disable PD
* fix sot yaml
2025-12-24 13:28:13 +08:00 · 2025-10-21 14:25:45 +08:00
parent 99564349a7
commit 775edcc09a
32 changed files with 417 additions and 144 deletions
--- a/benchmarks/yaml/eb45-32k-wint4-a800-tp4.yaml
+++ b/benchmarks/yaml/eb45-32k-wint4-a800-tp4.yaml
@@ -1,6 +1,6 @@
 max_model_len: 32768
 max_num_seqs: 96
-gpu_memory_utilization: 0.9
+gpu_memory_utilization: 0.85
 kv_cache_ratio: 0.71
 tensor_parallel_size: 4
 quantization: wint4
--- a/benchmarks/yaml/eb45-32k-wint8-a800-tp8.yaml
+++ b/benchmarks/yaml/eb45-32k-wint8-a800-tp8.yaml
@@ -1,6 +1,6 @@
 max_model_len: 32768
 max_num_seqs: 96
-gpu_memory_utilization: 0.9
+gpu_memory_utilization: 0.85
 kv_cache_ratio: 0.71
 tensor_parallel_size: 8
 quantization: wint8
--- a/benchmarks/yaml/x1-64k-w4a8c8-tp4.yaml
+++ b/benchmarks/yaml/x1-64k-w4a8c8-tp4.yaml
@@ -6,5 +6,5 @@ max_num_seqs: 128
 enable_prefix_caching: True
 enable_chunked_prefill: True
 gpu_memory_utilization: 0.85
-use_cudagraph: True
-enable_custom_all_reduce: True
+graph_optimization_config:
+  use_cudagraph: True
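
The last hunk changes the shape of the config rather than only a value: the flat `use_cudagraph` key moves under a nested `graph_optimization_config` mapping, and `enable_custom_all_reduce` is removed from this file. The sketch below shows that restructuring as a plain-dict transform. `nest_cudagraph_flag` is a hypothetical helper for illustration, not part of FastDeploy, and it assumes the YAML has already been loaded into a Python dict. It leaves every other key, including `enable_custom_all_reduce`, untouched, because the diff does not say how that setting is handled elsewhere.

```python
from typing import Any, Dict


def nest_cudagraph_flag(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: move a top-level `use_cudagraph` key under
    `graph_optimization_config`, mirroring the x1-64k-w4a8c8-tp4.yaml hunk.

    Returns a new dict; the input dict is not modified.
    """
    out = dict(cfg)  # shallow copy of the top level
    if "use_cudagraph" in out:
        flag = out.pop("use_cudagraph")
        # Copy any existing nested section so the caller's dict is not mutated.
        graph_cfg = dict(out.get("graph_optimization_config") or {})
        # Assumption: an explicit nested value wins over the legacy top-level key.
        graph_cfg.setdefault("use_cudagraph", flag)
        out["graph_optimization_config"] = graph_cfg
    return out


if __name__ == "__main__":
    old = {"gpu_memory_utilization": 0.85, "use_cudagraph": True}
    print(nest_cudagraph_flag(old))
    # {'gpu_memory_utilization': 0.85, 'graph_optimization_config': {'use_cudagraph': True}}
```

If the nested section already sets `use_cudagraph`, the sketch keeps that value and drops the top-level one; whether the real config loader resolves such a conflict the same way is an assumption.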