[Doc] Update cpp benchmark docs for CPU/GPU (#1377)

* [Benchmark] Init benchmark precision api
* [Benchmark] Init benchmark precision api
* [Benchmark] Add benchmark precision api
* [Benchmark] Calculate the statis of diff
* [Benchmark] Calculate the statis of diff
* [Benchmark] Calculate the statis of diff
* [Benchmark] Calculate the statis of diff
* [Benchmark] Calculate the statis of diff
* [Benchmark] Add SplitDataLine utils
* [Benchmark] Add LexSortByXY func
* [Benchmark] Add LexSortByXY func
* [Benchmark] Add LexSortDetectionResultByXY func
* [Benchmark] Add LexSortDetectionResultByXY func
* [Benchmark] Add tensor diff presicion test
* [Benchmark] fixed conflicts
* [Benchmark] fixed calc tensor diff
* fixed build bugs
* fixed ci bugs when WITH_TESTING=ON
* [Docs] init cpp benchmark docs
* [Doc] update cpp benchmark docs
* [Doc] update cpp benchmark docs
* [Doc] update cpp benchmark docs
* [Doc] update cpp benchmark docs
2025-10-26 18:10:32 +08:00 · 2023-02-21 15:41:37 +08:00
parent 5f8157c398
commit 42817ddc18
4 changed files with 141 additions and 2 deletions
--- a/benchmark/cpp/README.md
+++ b/benchmark/cpp/README.md
@@ -0,0 +1,137 @@
+# FastDeploy C++ Benchmarks
+
+## 1. Build Options
+The following build options are benchmark-related and must be enabled when building the SDK used to run benchmarks.
+
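+|Option|Required value|Description|
+|---|---|---|
+| ENABLE_BENCHMARK  | ON | Default OFF. Whether to enable benchmark mode |
+| ENABLE_VISION     | ON | Default OFF. Whether to build the vision model deployment module |
+| ENABLE_TEXT       | ON | Default OFF. Whether to build the text (NLP) model deployment module |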
+
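+To run the FastDeploy C++ benchmarks, first prepare the environment and build the FastDeploy C++ SDK from source with ENABLE_BENCHMARK=ON. The sections below describe the system requirements by hardware type. For the detailed requirements of each environment, see [FastDeploy environment requirements](../../docs/cn/build_and_install).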
+
+## 2. Benchmark Parameters
+
+<div id="参数设置说明"></div>  
+
+
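+| Parameter            | Description                                |
+| -------------------- | ------------------------------------------ |
+| --model              | Path to the model |
+| --image              | Path to the image |
+| --device             | Device to run on: CPU/GPU/XPU. Default: CPU |
+| --cpu_thread_nums    | Number of CPU threads. Default: 8 |
+| --device_id          | GPU/XPU card ID. Default: 0 |
+| --warmup             | Number of warmup iterations before benchmarking. Default: 200 |
+| --repeat             | Number of benchmark iterations. Default: 1000 |
+| --profile_mode       | Which performance mode to measure, one of `[runtime, end2end]`. Default: runtime |
+| --include_h2d_d2h    | Whether to include H2D + D2H time in the measurement. Only takes effect when profile_mode is runtime. Default: false |
+| --backend            | Backend type: default, ort, ov, trt, paddle, paddle_trt, lite, etc. With default, the best backend is selected automatically; setting an explicit backend is recommended. Default: default |
+| --use_fp16           | Whether to enable fp16. Currently only takes effect for the trt, paddle_trt and lite backends. Default: false |
+| --collect_memory_info | Whether to record CPU/GPU memory information. Default: false |
+| --sampling_interval  | Sampling interval for recording CPU/GPU memory information, in ms. Default: 50 |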
+
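
Some of the parameters above only take effect in combination with others (`--use_fp16` with the trt, paddle_trt and lite backends, `--include_h2d_d2h` with the runtime profile mode). The following is a minimal sketch of a pre-flight check that encodes those rules from the table. `check_flags` and its positional arguments are hypothetical and are not part of FastDeploy; the benchmark binaries may handle these cases differently.

```bash
# check_flags: hypothetical helper, not part of FastDeploy.
# usage: check_flags <profile_mode> <backend> <use_fp16> <include_h2d_d2h>
# The last two arguments are the strings "true" or "false".
check_flags() {
  # --profile_mode accepts only the two values listed in the table.
  case "$1" in
    runtime|end2end) ;;
    *) echo "error: --profile_mode must be runtime or end2end, got '$1'"
       return 1 ;;
  esac
  # --use_fp16 is documented as effective only for trt, paddle_trt and lite.
  if [ "$3" = "true" ]; then
    case "$2" in
      trt|paddle_trt|lite) ;;
      *) echo "warning: --use_fp16 has no effect with backend '$2'" ;;
    esac
  fi
  # --include_h2d_d2h is documented as effective only in runtime mode.
  if [ "$4" = "true" ] && [ "$1" != "runtime" ]; then
    echo "warning: --include_h2d_d2h only applies when --profile_mode is runtime"
  fi
  return 0
}

check_flags runtime trt true false   # consistent combination, prints nothing
check_flags end2end ort true true    # prints two warnings
```

The check only prints messages and never launches a benchmark, so it can run before any of the commands in this document.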
+## 3. Running the Benchmark on X86_64 CPU and NVIDIA GPU
+
+### 3.1 Environment Setup
+
+Building on Linux requires:
+  - gcc/g++ >= 5.4 (8.2 recommended)
+  - cmake >= 3.18.0
+  - CUDA >= 11.2
+  - cuDNN >= 8.2
+  - TensorRT >= 8.5
+
+Building FastDeploy for GPU requires a matching CUDA environment and TensorRT. For details, see the [GPU build documentation](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/gpu.md).
+
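
The minimum versions listed above can be checked before starting a long build. The sketch below is one possible approach, assuming GNU coreutils (`sort -V`) is available; `version_ge` and `check_tool` are hypothetical helper names, and the version strings passed in the example are placeholders rather than detected values.

```bash
# version_ge: succeeds when version $1 >= version $2.
# Assumes GNU `sort -V` (version sort); both helpers are hypothetical.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# usage: check_tool <name> <found-version> <minimum-version>
check_tool() {
  if version_ge "$2" "$3"; then
    echo "OK   $1 $2 (>= $3)"
  else
    echo "FAIL $1 $2 (< $3)"
    return 1
  fi
}

# Placeholder versions. On a real machine the second argument would come
# from the tool itself, e.g. `gcc -dumpversion` or `cmake --version`.
check_tool gcc 8.2.0 5.4
check_tool cmake 3.22.1 3.18.0
```

Version sort is used instead of plain string comparison so that, for example, 3.9 is correctly treated as older than 3.18.0.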
+### 3.2 Build the FastDeploy C++ SDK
+```bash
+# Build the SDK from source
+git clone https://github.com/PaddlePaddle/FastDeploy.git -b develop
+cd FastDeploy
+mkdir build && cd build
+# -DENABLE_BENCHMARK=ON turns on benchmark mode
+cmake .. -DWITH_GPU=ON \
+         -DENABLE_ORT_BACKEND=ON \
+         -DENABLE_PADDLE_BACKEND=ON \
+         -DENABLE_OPENVINO_BACKEND=ON \
+         -DENABLE_TRT_BACKEND=ON \
+         -DENABLE_VISION=ON \
+         -DENABLE_TEXT=ON \
+         -DENABLE_BENCHMARK=ON \
+         -DTRT_DIRECTORY=/Paddle/TensorRT-8.5.2.2 \
+         -DCUDA_DIRECTORY=/usr/local/cuda \
+         -DCMAKE_INSTALL_PREFIX=${PWD}/compiled_fastdeploy_sdk
+
+make -j12
+make install  
+
+# Set the SDK path
+cd ..  
+export FD_GPU_SDK=${PWD}/build/compiled_fastdeploy_sdk
+```  
+### 3.3 Build the Benchmark Examples
+```bash  
+cd benchmark/cpp
+mkdir build && cd build  
+cmake .. -DFASTDEPLOY_INSTALL_DIR=${FD_GPU_SDK}  
+make -j4
+```
+
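+### 3.4 Run the Benchmark Examples
+
+On X86 CPU + NVIDIA GPU, FastDeploy currently supports multiple inference backends. The following uses PaddleYOLOv8 as an example to collect benchmark data for each backend on CPU and GPU.
+
+- Download the model files and the test image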
+```bash  
+wget https://bj.bcebos.com/paddlehub/fastdeploy/yolov8_s_500e_coco.tgz  
+wget https://gitee.com/paddlepaddle/PaddleDetection/raw/release/2.4/demo/000000014439.jpg
+tar -zxvf yolov8_s_500e_coco.tgz
+```
+
+- Run the yolov8 benchmark example
+
+```bash  
+
+# Measure performance
+# CPU
+# Paddle Inference
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device cpu --cpu_thread_nums 8 --backend paddle --profile_mode runtime
+
+# ONNX Runtime
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device cpu --cpu_thread_nums 8 --backend ort --profile_mode runtime
+
+# OpenVINO
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device cpu --cpu_thread_nums 8 --backend ov --profile_mode runtime
+
+# GPU
+# Paddle Inference
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend paddle --profile_mode runtime --warmup 200 --repeat 2000
+
+# Paddle Inference + TensorRT
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend paddle_trt --profile_mode runtime --warmup 200 --repeat 2000
+
+# Paddle Inference + TensorRT + FP16
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend paddle_trt --profile_mode runtime --warmup 200 --repeat 2000 --use_fp16
+
+# ONNX Runtime
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend ort --profile_mode runtime --warmup 200 --repeat 2000
+
+# TensorRT
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend trt --profile_mode runtime --warmup 200 --repeat 2000
+
+# TensorRT + FP16
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device gpu --device_id 0 --backend trt --profile_mode runtime --warmup 200 --repeat 2000 --use_fp16
+
+# Measure memory and GPU memory usage
+# Add the --collect_memory_info option
+./benchmark_ppyolov8 --model yolov8_s_500e_coco --image 000000014439.jpg --device cpu --cpu_thread_nums 8 --backend paddle --profile_mode runtime --collect_memory_info
+```
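+Note: to avoid skewing the performance numbers, it is best not to enable memory and GPU memory collection when measuring performance. When --collect_memory_info is specified, only the memory and GPU memory figures are stable and reliable. For more parameter settings, see [Benchmark Parameters](#参数设置说明).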
+
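
The CPU commands in section 3.4 differ only in the value of `--backend`, so they can be generated with a loop. The following dry-run sketch prints one command per CPU backend instead of running it. The function name `cpu_commands` and the `MODEL`/`IMAGE` variables are hypothetical; the flags and values are the ones used in the commands above.

```bash
# Dry-run sketch: print the CPU benchmark command for each backend.
# MODEL, IMAGE and cpu_commands are hypothetical names for illustration.
MODEL=yolov8_s_500e_coco
IMAGE=000000014439.jpg

cpu_commands() {
  for backend in paddle ort ov; do
    echo "./benchmark_ppyolov8 --model ${MODEL} --image ${IMAGE}" \
         "--device cpu --cpu_thread_nums 8 --backend ${backend} --profile_mode runtime"
  done
}

cpu_commands          # print the commands for review
# cpu_commands | sh   # uncomment to run them from the benchmark build directory
```

Printing first and piping to a shell second keeps the sketch safe to run on a machine where the benchmark binary has not been built yet.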
+
+## 4. Running the Benchmark on ARM CPU
+- TODO
+
+## 5. Running the Benchmark on Kunlunxin XPU
+- TODO
--- a/benchmark/cpp/flags.h
+++ b/benchmark/cpp/flags.h
@@ -63,6 +63,7 @@ static void PrintUsage() {
 }

 static void PrintBenchmarkInfo() {
+#if defined(ENABLE_BENCHMARK) && defined(ENABLE_VISION)
  // Get model name
  std::vector<std::string> model_names;
  fastdeploy::benchmark::Split(FLAGS_model, model_names, sep);
@@ -97,5 +98,6 @@ static void PrintBenchmarkInfo() {
       << "ms" << std::endl;
  }
  std::cout << ss.str() << std::endl;
+#endif
  return;
 }
--- a/benchmark/cpp/run_benchmark_ppyolov8.sh
+++ b/benchmark/cpp/run_benchmark_ppyolov8.sh
--- a/benchmark/python/README.md
+++ b/benchmark/python/README.md
@@ -2,8 +2,8 @@

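 Before running the benchmark, confirm the following two steps:

-* 1. The software and hardware environment meets the requirements, see [FastDeploy environment requirements](..//docs/cn/build_and_install/download_prebuilt_libraries.md)
-* 2. The FastDeploy Python whl package is installed, see [FastDeploy Python installation](../docs/cn/build_and_install/download_prebuilt_libraries.md)
+* 1. The software and hardware environment meets the requirements, see [FastDeploy environment requirements](../../docs/cn/build_and_install/download_prebuilt_libraries.md)
+* 2. The FastDeploy Python whl package is installed, see [FastDeploy Python installation](../../docs/cn/build_and_install/download_prebuilt_libraries.md)

 FastDeploy currently supports multiple inference backends. The following uses PaddleClas MobileNetV1 as an example to collect benchmark data for each backend on CPU and GPU.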