Mirror of https://github.com/PaddlePaddle/FastDeploy.git (synced 2025-10-31 03:46:40 +08:00)
Compare commits: feature/on ... v2.0.0 (1 commit)
Commit: aa9d17b5ae
@@ -27,7 +27,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```kv_cache_ratio``` | `float` | KVCache blocks are divided between the Prefill phase and the Decode phase according to kv_cache_ratio, default: 0.75 |
 | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: False |
 | ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size for KVCache swapping, unit: GB, default: None |
-| ```enable_chunk_prefill``` | `bool` | Enable Chunked Prefill, default: False |
+| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default: False |
 | ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrent number of partial prefill batches, default: 1 |
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests whose token count exceeds this value are considered long requests, default: max_model_len*0.04 |
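As a usage note for the renamed flag, a minimal offline-inference sketch follows. The `fastdeploy.LLM` entry point, the keyword-argument surface, and the model path are assumptions for illustration; only the parameter names come from the table above.

```python
# Hypothetical sketch: enabling Chunked Prefill offline with the renamed
# parameter. The `fastdeploy.LLM` entry point and keyword-argument style
# are assumed here, not confirmed by this diff.
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="./your-model-path",       # placeholder model path
    enable_chunked_prefill=True,     # renamed from enable_chunk_prefill
    max_num_partial_prefills=1,      # max concurrent partial prefill batches
    max_long_partial_prefills=1,     # max long requests among those batches
    kv_cache_ratio=0.75,             # Prefill/Decode KVCache split (default)
)

outputs = llm.generate(["Hello, FastDeploy!"], SamplingParams(max_tokens=32))
```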
@@ -115,5 +115,5 @@ FastDeploy initialization sequence first uses `gpu_memory_utilization` parameter
       ...
   ```
 - When ```use_cudagraph``` is enabled, only single-GPU inference is currently supported, i.e. ```tensor_parallel_size``` must be set to 1.
-- When ```use_cudagraph``` is enabled, ```enable_prefix_caching``` and ```enable_chunk_prefill``` cannot be enabled.
+- When ```use_cudagraph``` is enabled, ```enable_prefix_caching``` and ```enable_chunked_prefill``` cannot be enabled.
 - When ```use_cudagraph``` is enabled, batches with size ≤ ```max_capture_batch_size``` are executed by CudaGraph, while batches larger than ```max_capture_batch_size``` fall back to the original dynamic/static graph. To have all batch sizes executed by CudaGraph, ```max_capture_batch_size``` should match ```max_num_seqs```. Setting ```max_capture_batch_size``` greater than ```max_num_seqs``` is wasteful: it captures batches that will never be encountered during inference, costing extra time and memory.
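The compatibility constraints above compose into a simple up-front check. A hedged sketch in plain Python (the dictionary layout is hypothetical, not FastDeploy's actual config object):

```python
# Hypothetical config guard reflecting the constraints documented above:
# use_cudagraph currently implies single-GPU inference and is incompatible
# with prefix caching and chunked prefill.
def validate_cudagraph_config(cfg: dict) -> None:
    if not cfg.get("use_cudagraph"):
        return
    if cfg.get("tensor_parallel_size", 1) != 1:
        raise ValueError("use_cudagraph requires tensor_parallel_size == 1")
    for flag in ("enable_prefix_caching", "enable_chunked_prefill"):
        if cfg.get(flag):
            raise ValueError(f"use_cudagraph cannot be combined with {flag}")

validate_cudagraph_config({
    "use_cudagraph": True,
    "tensor_parallel_size": 1,
    "enable_chunked_prefill": False,
})  # passes silently
```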
(The same change in the Chinese version of the document, translated below.)

@@ -26,7 +26,7 @@
 | ```kv_cache_ratio``` | `float` | KVCache blocks are split between the Prefill phase and the Decode phase according to kv_cache_ratio, default: 0.75 |
 | ```enable_prefix_caching``` | `bool` | Whether to enable Prefix Caching, default: False |
 | ```swap_space``` | `float` | When Prefix Caching is enabled, CPU memory size used for swapping KVCache, unit: GB, default: None |
-| ```enable_chunk_prefill``` | `bool` | Enable Chunked Prefill, default: False |
+| ```enable_chunked_prefill``` | `bool` | Enable Chunked Prefill, default: False |
 | ```max_num_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum concurrency of the Prefill phase, default: 1 |
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests included in concurrent Prefill batches, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests whose token count exceeds this value are treated as long requests, default: max_model_len*0.04 |
@@ -113,5 +113,5 @@ FastDeploy initialization first uses the `gpu_memory_utilization` parameter to compute
       ...
   ```
 - When ```use_cudagraph``` is enabled, only single-GPU inference is currently supported, i.e. ```tensor_parallel_size``` must be set to 1.
-- When ```use_cudagraph``` is enabled, enabling ```enable_prefix_caching``` or ```enable_chunk_prefill``` is not yet supported.
+- When ```use_cudagraph``` is enabled, enabling ```enable_prefix_caching``` or ```enable_chunked_prefill``` is not yet supported.
 - When ```use_cudagraph``` is enabled, batches with size ≤ ```max_capture_batch_size``` run their forward pass under CudaGraph, while larger batches fall back to the original dynamic/static graph. If you want all batch sizes to run under CudaGraph, ```max_capture_batch_size``` should match ```max_num_seqs```. Setting ```max_capture_batch_size``` greater than ```max_num_seqs``` is wasteful: it captures batches that will never occur during inference, costing extra time and memory.
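The sizing advice in the last bullet (align ```max_capture_batch_size``` with ```max_num_seqs```) reads as follows in engine-argument form; a sketch under the same assumed keyword surface as above:

```python
# Hypothetical engine-argument dictionary: align the CudaGraph capture size
# with the scheduler's maximum batch so every batch is captured, and none
# are captured in vain.
MAX_NUM_SEQS = 8

engine_args = {
    "use_cudagraph": True,
    "tensor_parallel_size": 1,           # single-GPU only, per the note above
    "enable_prefix_caching": False,      # incompatible with use_cudagraph
    "enable_chunked_prefill": False,     # incompatible with use_cudagraph
    "max_num_seqs": MAX_NUM_SEQS,
    "max_capture_batch_size": MAX_NUM_SEQS,  # == max_num_seqs: all sizes captured
}
```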