
# Poros AI Inference Accelerator

## Description

Poros is an AI inference accelerator for deep learning frameworks. It provides significantly lower inference latency than the original model and offers great flexibility for dynamic graphs.
Poros currently works mainly on the TorchScript IR, which means it supports models from PyTorch, ONNX, TensorFlow and any other framework that can be converted to TorchScript. We also plan to support more IRs in the future.
Poros is designed to support multiple hardware backends conveniently. For now, Poros supports GPU and XPU (BAIDU-Kunlun) devices. Contributions that add support for other devices are welcome.

## How It Works

Figure 1 shows the architecture of Poros. The central part marked by the red dotted line is the Model Optimizer, the main module of Poros. IR graphs are optimized by IR lowering, op fusing, op converting and auto-tuning, and then segmented into engine-related subgraphs by maximizing the number of ops in each engine kernel and minimizing the total number of engine kernels.

![image](https://user-images.githubusercontent.com/54064850/203691621-e75d7c17-320c-4dff-8abe-58c3c9db99a2.png)

To achieve these goals on GPU, we have rewritten hundreds of TorchScript ops, which reduces the extra subgraphs caused by unsupported ops during subgraph partitioning. Dozens of lowering strategies, including op fusions, are employed to reduce the actual computational load of the CUDA kernels.
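
To illustrate why an unsupported op creates extra subgraphs, here is a minimal toy sketch. It is not the Poros partitioner: the function `partition_ops`, the op names and the linear op list are all hypothetical, and a real IR graph is not a flat sequence. The sketch only groups consecutive ops into "engine" and "fallback" segments.

```python
from typing import List, Set, Tuple


def partition_ops(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ("engine", [...]) or ("fallback", [...]) segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        kind = "engine" if op in supported else "fallback"
        if segments and segments[-1][0] == kind:
            # Same kind as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            # Kind changed: a new segment (subgraph) has to start here.
            segments.append((kind, [op]))
    return segments


# Hypothetical op sequence with one op the engine cannot handle at first.
ops = ["conv", "relu", "custom_op", "conv", "relu"]
before = partition_ops(ops, {"conv", "relu"})
after = partition_ops(ops, {"conv", "relu", "custom_op"})
print(len(before), len(after))  # 3 1
```

In this toy model, adding support for a single op merges three segments into one, which is the effect that rewriting ops aims for.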

## Dependencies

Poros is built on PyTorch, CUDA, TensorRT (TRT Engine) and CuDNN. The minimum and recommended versions of
these packages are listed below:

| Package  | Minimum Version | Recommended Version |
|----------|-----------------|---------------------|
| PyTorch  | 1.9.0           | 1.12.1              |
| CUDA     | 10.2            | 11.3                |
| TensorRT | 8.2             | 8.4                 |
| CuDNN    | 7.6.5           | 8.4                 |
| Python   | 3.6.5           | 3.8                 |
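
The minimum versions in the table can be checked with a few lines of Python. This is only a sketch: `MINIMUM_VERSIONS`, `parse_version` and `meets_minimum` are hypothetical helpers, not part of Poros, and the version strings passed in are assumed to come from your own environment.

```python
import platform
from typing import Dict, Tuple

# Minimum versions copied from the table above.
MINIMUM_VERSIONS: Dict[str, str] = {
    "PyTorch": "1.9.0",
    "CUDA": "10.2",
    "TensorRT": "8.2",
    "CuDNN": "7.6.5",
    "Python": "3.6.5",
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.12.1+cu113' into (1, 12, 1); a local '+...' suffix is ignored."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def meets_minimum(package: str, installed: str) -> bool:
    """Compare an installed version string with the minimum from the table."""
    return parse_version(installed) >= parse_version(MINIMUM_VERSIONS[package])


print(meets_minimum("PyTorch", "1.12.1+cu113"))  # True
print(meets_minimum("Python", platform.python_version()))
```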

If you want to build for GPU inference, it is best to align the CUDA version with the version that PyTorch was built with.
For example, we recommend using CUDA 11.1+ if the installed PyTorch version is 1.11.0+cu111; otherwise, "undefined
reference CUDA...." errors may appear during the build.
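
The alignment check can be sketched as follows. The helpers `cuda_from_torch_version` and `toolkit_is_aligned` are hypothetical, and treating "aligned" as "same major version, equal or newer minor version" is an assumption made for this example. With PyTorch installed, `torch.__version__` and `torch.version.cuda` provide the real values.

```python
import re
from typing import Optional, Tuple


def cuda_from_torch_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract the CUDA (major, minor) from a tag such as '1.11.0+cu111'."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None  # no '+cuXYZ' tag, e.g. a CPU-only build
    return int(match.group(1)), int(match.group(2))


def toolkit_is_aligned(torch_version: str, toolkit_version: str) -> bool:
    """Assumed rule: same CUDA major version and an equal or newer minor version."""
    built_with = cuda_from_torch_version(torch_version)
    if built_with is None:
        return False
    major, minor = (int(part) for part in toolkit_version.split(".")[:2])
    return major == built_with[0] and minor >= built_with[1]


print(cuda_from_torch_version("1.11.0+cu111"))     # (11, 1)
print(toolkit_is_aligned("1.11.0+cu111", "11.3"))  # True
print(toolkit_is_aligned("1.11.0+cu111", "10.2"))  # False
```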

> There is a known cuBLAS-related issue in CUDA 10.2. If you are using CUDA 10.2, make sure these two patches have been installed:
> https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

## How To Build

### 0. Install Dependencies

Get the Poros source code:

```shell
git clone https://github.com/PaddlePaddle/FastDeploy.git
cd FastDeploy/poros
git submodule update --init --recursive --jobs 0 -f
```

We strongly recommend preparing the build environment with anaconda3:

```shell
conda create --name poros python=3.8
conda activate poros
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
conda install cmake==3.22.1 pytorch==1.12.1 cudatoolkit=11.3 numpy -c pytorch
```
**If CUDA is already installed system-wide, cudatoolkit is not necessary. CMake >= 3.21 and GCC >= 8.2 are required.**


Poros uses CMake to manage dependencies. It finds all dependency packages automatically as long as they are
installed in the usual locations. Otherwise, you should specify the install locations of these packages manually.

```shell
export CUDAToolkit_ROOT=/cuda/install/dir/  # point CUDAToolkit_ROOT to the CUDA installation dir
export TENSORRT_ROOT=/tensorrt/install/dir/ # download from NVIDIA and unpack, no need to install into the system
export CUDNN_ROOT=/cudnn/install/dir/       # download from NVIDIA and unpack, no need to install into the system
```
Add CUDA, TensorRT and CuDNN to your environment variables:

```shell
export PATH=$CUDAToolkit_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$CUDAToolkit_ROOT/lib64:$TENSORRT_ROOT/lib:$CUDNN_ROOT/lib:$LD_LIBRARY_PATH
```
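
Before running CMake, it can help to confirm that the three root variables point to real directories. The sketch below is a hypothetical helper (`missing_roots` is not part of Poros) that only checks whether each variable is set and names an existing directory. It does not verify that the directory contains a valid installation.

```python
import os
from typing import List, Mapping

# Variable names taken from the export commands above.
REQUIRED_ROOTS = ("CUDAToolkit_ROOT", "TENSORRT_ROOT", "CUDNN_ROOT")


def missing_roots(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variables that are unset or do not name an existing directory."""
    return [name for name in REQUIRED_ROOTS if not os.path.isdir(env.get(name, ""))]


problems = missing_roots()
if problems:
    print("Please set these variables to valid directories:", ", ".join(problems))
else:
    print("All dependency roots look usable.")
```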

An additional dependency, `mkl`, is needed when building with PyTorch 1.11 + CUDA 11.1.
CMake finds it automatically once it is installed. If it is not installed, you can try adding it with:
```shell
conda install mkl
```

Other packages that Poros depends on, such as gflags and googletest, can be downloaded
with `git submodule update --init --recursive --jobs 0 -f`.

### 1. Build Project with CMake

```shell
cd poros
mkdir build
cd build
cmake ..
make
```

By default, only the shared library (libporos.so) will be built.

**To build a static lib (libporos.a):**

```shell
cmake -DBUILD_STATIC=on ..
make
```

Poros `kernel` contains the framework of Poros, as well as the IR lowering strategy, the subgraph segmentation strategy
and the engine manager, without any specific engine (e.g. TensorRT). For developers who want to use their own
engines, `kernel` can be built separately with the options below:

**To build a shared kernel lib (libporos-kernel.so):**

```shell
cmake -DBUILD_KERNEL=on ..
make
```

**To build a static kernel lib (libporos-kernel.a):**

```shell
cmake -DBUILD_STATIC_KERNEL=on ..
make
```

### 2. Build Distributing Package with setuptools (Python3)

After libporos.so has been built, you can build the `.whl` package for Python3:

```shell
cd ../python
python3 setup.py bdist_wheel
```

The output looks like `poros-0.1.0-cp38-cp38m-linux_x86_64.whl`. It can be installed with pip:

```shell
cd dist
pip3 install poros-0.1.0-cp38-cp38m-linux_x86_64.whl
```
Alternatively, you can use `python3 setup.py develop` to create a symbolic link to the `python` dir.
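
After either installation route, a quick way to confirm that the package is visible to the current interpreter is to look it up without importing it. This is a generic sketch using the standard library. `is_installed` is a hypothetical helper, and `poros` is assumed to be the top-level module name, as in the usage example in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Check whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("poros"):
    print("poros is available in this environment.")
else:
    print("poros was not found; check the active conda environment and the install step.")
```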

### 3. Build Executable Binary

We provide an example C++ shell for users who want to build an executable binary. The `main.cpp` file is located
at `tools/main.cpp`, and you can modify the code according to your needs. The executable binary `poros-tool` can be built with
this command:

```shell
mkdir build
cd build
cmake -DBUILD_TOOL=on ..
make
```

### 4. Build Test
```shell
cmake -DUT=on ..
make
./unit_test # run unit test
```


## How To Use

### 1. Python Usage:

```python
import poros
import torch
from torchvision import models

original_model = models.resnet50(pretrained=True).cuda().eval()  # load/download the pre-trained model
option = poros.PorosOptions()  # set Poros options

# Example input; assumed here to be passed to poros.compile as a list of tensors.
input = torch.randn(1, 3, 224, 224, dtype=torch.float32).cuda()
input_datas = [input]
poros_model = poros.compile(torch.jit.script(original_model), input_datas, option)  # build the model

poros_res = poros_model(input)  # use the compiled model in the same way as the original model

```

The complete benchmark example script (resnet50) is `python/example/test_resnet.py`:

```shell
python3 python/example/test_resnet.py
```
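
To compare the original and the compiled model yourself, a simple timing helper is enough. The sketch below is not the code in `test_resnet.py`. `average_latency_ms` is a hypothetical helper, and the warm-up and iteration counts are arbitrary choices. For GPU models, pass `torch.cuda.synchronize` as `sync` so that asynchronous kernels are finished before the clock is read.

```python
import time
from statistics import mean
from typing import Callable, List, Optional


def average_latency_ms(run: Callable[[], object], warmup: int = 10, iters: int = 100,
                       sync: Optional[Callable[[], None]] = None) -> float:
    """Average wall-clock latency of run() in milliseconds."""
    for _ in range(warmup):
        run()  # warm-up calls are not timed
    if sync is not None:
        sync()
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        if sync is not None:
            sync()  # wait for pending device work before stopping the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


# Stand-in workload so the sketch runs without a GPU.
print(average_latency_ms(lambda: sum(range(1000)), warmup=2, iters=5))

# With the models from the example above (requires a GPU and Poros):
# average_latency_ms(lambda: original_model(input), sync=torch.cuda.synchronize)
# average_latency_ms(lambda: poros_model(input), sync=torch.cuda.synchronize)
```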

### 2. CPP Usage:

If the executable binary `poros-tool` is built, you can run the benchmark like this:

```shell
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=original #original PyTorch model
./poros-tool --module_file_path ../../poros/tools/std_pretrained_resnet50_gpu.pt --test_mode=poros #poros compiled model
```
> PyTorch changed the model packaging format in 1.4+, while the pretrained resnet50 model still uses the old format (.tar).
> You may need to convert it to the newer format (.zip) yourself, with a command like this:
> ```python
> import torch
> from torchvision import models
> original_model = models.resnet50(pretrained=True).cuda().eval()
> torch.save(original_model, 'std_pretrained_resnet50_gpu.pt', _use_new_zipfile_serialization=False)
> ```

## Benchmark

Take a look at the [Benchmark](docs/Benchmark.md).

## Acknowledgement
Poros has been incubated for more than 2 years. NVIDIA helped us a lot in this project (especially Gary Ji, Vincent Zhang and Jie Fang). They answered many technical questions about GPUs and gave us many suggestions. We appreciate their great support.