# Poros AI Inference Accelerator

## Description

Poros is an AI inference accelerator for deep learning frameworks. It provides significantly lower inference latency than the original model and offers great flexibility for dynamic graphs.

Poros currently works mainly on the TorchScript IR, which means it supports models from PyTorch, ONNX, TensorFlow, and any other framework whose models can be converted to TorchScript. We are also planning to support more IRs in the future.

Poros is designed to support multiple hardware backends conveniently. For now, Poros supports GPU and XPU (BAIDU-Kunlun) devices. Contributions that add support for additional devices are welcome.

## How It Works

Figure 1 shows the architecture of Poros. The central part, marked by the red dotted line, is the Model Optimizer, the main module of Poros. IR graphs are optimized by IR lowering, op fusing, op converting, and auto-tuning, and then segmented into engine-related subgraphs by maximizing the number of ops in each engine kernel and minimizing the total number of engine kernels.



To achieve these goals on GPU, we have rewritten hundreds of TorchScript ops, which reduces the extra subgraphs caused by unsupported ops during subgraph partitioning. Dozens of lowering strategies, including op fusions, are employed to reduce the actual computational load of the CUDA kernels.

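
The effect of op coverage on partitioning can be illustrated with a small sketch. This is not the Poros implementation: it assumes a purely linear op sequence (real IR graphs are DAGs), and `segment_ops`, `engine_segment_count`, and the two support sets are hypothetical names chosen for illustration. It only shows why supporting one more op can merge two engine subgraphs into one.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, List[str]]  # ("engine" | "fallback", ops in that subgraph)


def segment_ops(ops: List[str], engine_supports: Callable[[str], bool]) -> List[Segment]:
    """Group consecutive ops by target so that each run is as long as possible."""
    segments: List[Segment] = []
    for op in ops:
        target = "engine" if engine_supports(op) else "fallback"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)  # extend the current subgraph
        else:
            segments.append((target, [op]))  # target changed: start a new subgraph
    return segments


def engine_segment_count(segments: List[Segment]) -> int:
    """Count the subgraphs that would be handed to the engine."""
    return sum(1 for target, _ in segments if target == "engine")


ops = ["aten::conv2d", "aten::relu", "aten::hardswish", "aten::linear", "aten::softmax"]

# Hypothetical support sets: the second one also covers aten::hardswish.
partial = {"aten::conv2d", "aten::relu", "aten::linear", "aten::softmax"}
full = partial | {"aten::hardswish"}

before = segment_ops(ops, partial.__contains__)
after = segment_ops(ops, full.__contains__)
print(engine_segment_count(before), engine_segment_count(after))  # prints "2 1"
```

With the unsupported op in the middle, the sequence splits into two engine subgraphs plus a fallback; once that op is supported, the whole sequence becomes a single engine subgraph.
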
## Dependencies

Poros is developed based on PyTorch, CUDA, TensorRT (TRT Engine), and cuDNN. The minimum required and recommended versions of these packages are listed below:

| Package  | Minimum Version | Recommended Version |
|----------|-----------------|---------------------|
| PyTorch  | 1.9.0           | 1.12.1              |
| CUDA     | 10.2            | 11.3                |
| TensorRT | 8.2             | 8.4                 |
| cuDNN    | 7.6.5           | 8.4                 |
| Python   | 3.6.5           | 3.8                 |

If you want to build for GPU inference, it is best to align the CUDA version with the version that PyTorch was built against. For example, we recommend CUDA 11.1+ if the installed PyTorch version is 1.11.0+cu111; otherwise "undefined reference CUDA...." errors may appear during the build.

> There is a known cuBLAS-related issue with CUDA 10.2. If you are using CUDA 10.2, make sure the two patches listed on this page have been installed:
> https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

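
To see which CUDA version a PyTorch build expects, the build tag in its version string can be decoded. The helper below is a minimal sketch; `cuda_version_from_torch` is a hypothetical name, and it assumes the tag follows the `+cuXYZ` pattern used in the example above. In an environment with PyTorch installed, `torch.version.cuda` reports the same information directly.

```python
import re
from typing import Optional


def cuda_version_from_torch(torch_version: str) -> Optional[str]:
    """Decode the CUDA version from a PyTorch version string such as '1.11.0+cu111'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only build, or no build tag in the string
    digits = match.group(1)
    # The last digit is the minor version: 'cu102' -> '10.2', 'cu113' -> '11.3'
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_version_from_torch("1.11.0+cu111"))  # prints "11.1"
print(cuda_version_from_torch("1.12.1+cu113"))  # prints "11.3"
```
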
## How To Build

### 0. Install Dependencies

Get the Poros source code:

```shell
git clone https://github.com/PaddlePaddle/FastDeploy.git
cd FastDeploy/poros
git submodule update --init --recursive --jobs 0 -f
```

We strongly recommend preparing the build environment with anaconda3:

```shell
conda create --name poros python=3.8
conda activate poros
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
conda install cmake==3.22.1 pytorch==1.12.1 cudatoolkit=11.3 numpy -c pytorch
```

**If CUDA is already installed system-wide, `cudatoolkit` is not necessary. CMake >= 3.21 and GCC >= 8.2 are required.**

Poros uses CMake to manage dependencies. It finds all dependency packages automatically as long as they are installed in the usual locations. Otherwise, you need to specify the install locations of these packages manually:

```shell
export CUDAToolkit_ROOT=/cuda/install/dir/ # point CUDAToolkit_ROOT to the CUDA installation dir
export TENSORRT_ROOT=/tensorrt/install/dir/ # download from NVIDIA and unpack; no need to install into the system
export CUDNN_ROOT=/cudnn/install/dir/ # download from NVIDIA and unpack; no need to install into the system
```

Add CUDA, TensorRT, and cuDNN to your environment variables:

```shell
export PATH=$CUDAToolkit_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDAToolkit_ROOT/lib64:$TENSORRT_ROOT/lib:$CUDNN_ROOT/lib:$LD_LIBRARY_PATH
```

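
Before running CMake, it can help to confirm that the three root variables point at real directories. The script below is only a convenience sketch and not part of Poros; `REQUIRED_VARS` and `missing_roots` are hypothetical names, and the check verifies only that each directory exists, not that it contains a valid installation.

```python
import os
from typing import List, Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = ("CUDAToolkit_ROOT", "TENSORRT_ROOT", "CUDNN_ROOT")


def missing_roots(env: Mapping[str, str]) -> List[str]:
    """Return the variables that are unset or do not point to an existing directory."""
    return [name for name in REQUIRED_VARS if not os.path.isdir(env.get(name, ""))]


missing = missing_roots(os.environ)
if missing:
    print("Unset or invalid:", ", ".join(missing))
else:
    print("All dependency roots point to existing directories.")
```
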
An additional dependency, `mkl`, is needed when building with PyTorch 1.11 + CUDA 11.1. CMake can find it once it is installed; if it is missing, you can add it with:

```shell
conda install mkl
```

The other packages that Poros depends on (gflags, googletest, etc.) are downloaded by `git submodule update --init --recursive --jobs 0 -f`.

### 1. Build Project with CMake

```shell
cd poros # skip this step if you are already in the poros directory
mkdir build
cd build
cmake ..
make
```

By default, only the shared library (libporos.so) is built.

**To build a static lib (libporos.a):**

```shell
cmake -DBUILD_STATIC=on ..
make
```

The Poros `kernel` contains the Poros framework, the IR lowering strategy, the subgraph segmentation strategy, and the engine manager, without any specific engine (e.g. TensorRT). For developers who want to use their own engines, `kernel` can be built separately with the options below:

**To build a shared kernel lib (libporos-kernel.so):**

```shell
cmake -DBUILD_KERNEL=on ..
make
```

**To build a static kernel lib (libporos-kernel.a):**

```shell
cmake -DBUILD_STATIC_KERNEL=on ..
make
```

### 2. Build Distributing Package with setuptools (Python3)

After libporos.so has been built, you can build the `.whl` package for Python3:

```shell
cd ../python
python3 setup.py bdist_wheel
```

The output looks like `poros-0.1.0-cp38-cp38m-linux_x86_64.whl`. It can be installed with pip:

```shell
cd dist
pip3 install poros-0.1.0-cp38-cp38m-linux_x86_64.whl
```

Alternatively, you can use `python3 setup.py develop` to create a symbolic link to the `python` dir.

### 3. Build Executable Binary

We provide an example C++ shell for users who want to build an executable binary. The `main.cpp` file is located at `tools/main.cpp`, and you can modify the code according to your needs. The executable binary `poros-tool` can be built with these commands:

```shell
mkdir build
cd build
cmake -DBUILD_TOOL=on ..
make
```

### 4. Build Test

```shell
cmake -DUT=on ..
make
./unit_test # run unit test
```

## How To Use

### 1. Python Usage:

```python
import poros
import torch
from torchvision import models

original_model = models.resnet50(pretrained=True).cuda().eval() # load/download the pre-trained model
option = poros.PorosOptions() # set Poros options
# example inputs used during compilation (assumed here to be a list of tensors)
input_datas = [torch.randn(1, 3, 224, 224, dtype=torch.float32).cuda()]
poros_model = poros.compile(torch.jit.script(original_model), input_datas, option) # build the model

input = torch.randn(1, 3, 224, 224, dtype=torch.float32).cuda()
poros_res = poros_model(input) # use the compiled model in the same way as the original model
```

The complete benchmark example (resnet50) script is `python/example/test_resnet.py`:

```shell
python3 python/example/test_resnet.py
```

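
To compare the original and compiled models by hand, a small timing helper is enough. The sketch below is not taken from `test_resnet.py`; `mean_latency_ms` is a hypothetical helper that times any callable. GPU kernels run asynchronously, so on GPU pass `torch.cuda.synchronize` as `sync` so that each sample covers the finished computation.

```python
import time
from statistics import mean
from typing import Callable, Optional


def mean_latency_ms(fn: Callable[[], object], warmup: int = 10, iters: int = 100,
                    sync: Optional[Callable[[], object]] = None) -> float:
    """Return the mean wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not measured
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


# Stand-in workload so the sketch runs without a GPU.
print(f"{mean_latency_ms(lambda: sum(range(1000))):.4f} ms")

# With the models from the example above (requires a GPU and Poros):
# mean_latency_ms(lambda: original_model(input), sync=torch.cuda.synchronize)
# mean_latency_ms(lambda: poros_model(input), sync=torch.cuda.synchronize)
```
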
### 2. CPP Usage:

If the executable binary `poros-tool` has been built, you can run the benchmark like this:

```shell
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=original # original PyTorch model
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=poros # Poros compiled model
```

> PyTorch has changed its model packaging format since 1.4+, while the pre-trained resnet50 model still uses the old format (.tar).
> You may need to convert it to the newer format (.zip) yourself, with a command like this:
> ```python
> import torch
> from torchvision import models
>
> original_model = models.resnet50(pretrained=True).cuda().eval()
> torch.save(original_model, 'std_pretrained_resnet50_gpu.pt', _use_new_zipfile_serialization=False)
> ```

## Benchmark

Take a look at the [Benchmark](docs/Benchmark.md).

## Acknowledgement

Poros has been incubated for more than 2 years. NVIDIA helped us a lot in this project (especially Gary Ji, Vincent Zhang, and Jie Fang). They answered many technical questions about GPUs and gave us many suggestions. We appreciate their great support.