English | [中文](../zh_CN/model_configuration.md)
# Model Configuration
Each model in the model repository must contain a configuration that provides required and optional information about the model. The configuration information is generally written in [ModelConfig protobuf](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto) format in file *config.pbtxt*.
## Minimum Model General Configuration
Please see the official website for detailed general configuration: [model_configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md). Minimum model configuration of Triton must include: attribute *platform* or *backend*, attribute *max_batch_size* and input and output of the model.
For example, the minimum configuration of a Paddle model with two inputs *input0* and *input1*, one output *output0* (all float32 tensors), and a maximum batch size of 8 would be (note that *dims* does not include the batch dimension when *max_batch_size* is greater than 0):
```
backend: "fastdeploy"
max_batch_size: 8
input [
{
name: "input0"
data_type: TYPE_FP32
dims: [ 16 ]
},
{
name: "input1"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
```
## Configuration of CPU, GPU and Instances number
The *instance_group* attribute lets you configure the hardware resources and the number of model inference instances.
Here is an example of CPU deployment:
```
instance_group [
  {
    # Create two CPU instances
    count: 2
    # Use CPU for deployment
    kind: KIND_CPU
  }
]
```
Here is another example that deploys two instances on *GPU 0*, and one instance each on *GPU 1* and *GPU 2*:
```
instance_group [
  {
    # Create two GPU instances
    count: 2
    # Use GPU for inference
    kind: KIND_GPU
    # Deploy on GPU 0
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    # Deploy on GPU 1 and GPU 2
    gpus: [ 1, 2 ]
  }
]
```
### Name, Platform and Backend
The *name* attribute is optional. If it is not specified in the configuration, the model's directory name is used as the model name. When it is specified, it must match the directory name.
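For instance, a model stored in a directory named *yolov5* (a hypothetical name used only for illustration) could either omit *name* entirely or set it explicitly:
```
# Optional; if present, it must match the model's directory name (assumed here to be "yolov5")
name: "yolov5"
```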
To use the FastDeploy backend, do not set the *platform* attribute; instead set the *backend* attribute to *fastdeploy*:
```
backend: "fastdeploy"
```
### FastDeploy Backend Configuration
Currently the FastDeploy backend supports *cpu* and *gpu* inference: the *paddle*, *onnxruntime* and *openvino* engines are available on *cpu*, and the *paddle*, *onnxruntime* and *tensorrt* engines are available on *gpu*.
#### Paddle Engine Configuration
In addition to the *Instance Groups* configuration, which decides whether the model runs on CPU or GPU, the Paddle engine can be configured as follows. See [A PP-OCRv3 example for Runtime configuration](../../../examples/vision/ocr/PP-OCRv3/serving/models/cls_runtime/config.pbtxt) for a complete example.
```
optimization {
  execution_accelerators {
    # CPU inference configuration, used with KIND_CPU
    cpu_execution_accelerator : [
      {
        name : "paddle"
        # Use 4 threads for parallel inference
        parameters { key: "cpu_threads" value: "4" }
        # Enable MKL-DNN acceleration; set to 0 to disable it
        parameters { key: "use_mkldnn" value: "1" }
      }
    ],
    # GPU inference configuration, used with KIND_GPU
    gpu_execution_accelerator : [
      {
        name : "paddle"
        # Use 4 threads for parallel inference
        parameters { key: "cpu_threads" value: "4" }
        # Enable MKL-DNN acceleration; set to 0 to disable it
        parameters { key: "use_mkldnn" value: "1" }
      }
    ]
  }
}
```
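As a rough sketch (a hypothetical *config.pbtxt* fragment assembled from the snippets above, not taken from the linked example), the *optimization* block is typically combined with an *instance_group* like this for CPU deployment with the Paddle engine:
```
backend: "fastdeploy"
max_batch_size: 8

instance_group [
  {
    # Two CPU inference instances
    count: 2
    kind: KIND_CPU
  }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        # Use the Paddle engine with 4 threads and MKL-DNN enabled
        name : "paddle"
        parameters { key: "cpu_threads" value: "4" }
        parameters { key: "use_mkldnn" value: "1" }
      }
    ]
  }
}
```
The *input* and *output* sections are omitted here for brevity.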
#### ONNXRuntime Engine Configuration
In addition to the *Instance Groups* configuration, which decides whether the model runs on CPU or GPU, the ONNXRuntime engine can be configured as follows. See [A YOLOv5 example for Runtime configuration](../../../examples/vision/detection/yolov5/serving/models/runtime/config.pbtxt) for a complete example.
```
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        name : "onnxruntime"
        # Use 4 threads for parallel inference
        parameters { key: "cpu_threads" value: "4" }
      }
    ],
    gpu_execution_accelerator : [
      {
        name : "onnxruntime"
      }
    ]
  }
}
```
#### OpenVINO Engine Configuration
The OpenVINO engine only supports inference on CPU. It can be configured as follows:
```
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        name : "openvino"
        # Use 4 threads for parallel inference (total number of threads shared by all instances)
        parameters { key: "cpu_threads" value: "4" }
        # Set num_streams in OpenVINO (usually equal to the number of instances)
        parameters { key: "num_streams" value: "1" }
      }
    ]
  }
}
```
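For example, assuming a deployment with two CPU instances (a hypothetical setup for illustration), the comments above suggest setting *num_streams* to 2 and *cpu_threads* to the total number of threads shared by both instances:
```
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        name : "openvino"
        # 8 threads in total, shared by the two instances
        parameters { key: "cpu_threads" value: "8" }
        # One stream per instance
        parameters { key: "num_streams" value: "2" }
      }
    ]
  }
}
```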
#### TensorRT Engine Configuration
The TensorRT engine only supports inference on GPU. It can be configured as follows:
```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        # Use FP16 inference in TensorRT; trt_fp32 is also available
        # If the loaded model is a quantized model, the precision is switched to int8 automatically
        parameters { key: "precision" value: "trt_fp16" }
      }
    ]
  }
}
```
TensorRT dynamic shapes can be configured in the following format; see [A PaddleCls example for Runtime configuration](../../../examples/vision/classification/paddleclas/serving/models/runtime/config.pbtxt) for reference:
```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        # Use the TensorRT engine
        name: "tensorrt",
        # Use FP16 precision in TensorRT
        parameters { key: "precision" value: "trt_fp16" }
      },
      {
        # Minimum shapes of the dynamic-shape configuration
        name: "min_shape"
        # Minimum shape for every input, keyed by input name
        parameters { key: "input1" value: "1 3 224 224" }
        parameters { key: "input2" value: "1 10" }
      },
      {
        # Optimal shapes of the dynamic-shape configuration
        name: "opt_shape"
        # Optimal shape for every input, keyed by input name
        parameters { key: "input1" value: "2 3 224 224" }
        parameters { key: "input2" value: "2 20" }
      },
      {
        # Maximum shapes of the dynamic-shape configuration
        name: "max_shape"
        # Maximum shape for every input, keyed by input name
        parameters { key: "input1" value: "8 3 224 224" }
        parameters { key: "input2" value: "8 30" }
      }
    ]
  }
}
```