Batch Models
Overview
Typically, computer vision inference models have a single input tensor in
NHWC shape, such as [1,224,224,3]. The rknn-toolkit2 allows you to
build the model with batched tensor inputs by setting the rknn_batch_size parameter
in your Python conversion script as follows.
rknn.build(do_quantization=do_quant, dataset=DATASET_PATH, rknn_batch_size=8)
This results in a .rknn model with modified input tensor dimensions of [8,224,224,3].
When taking input from a video source frame by frame, batching is of little use, as you're only dealing with a single frame that should be processed as soon as possible. However, batching can be useful if you have many images to process at a single point in time, for example:
- Running YOLO object detection on a frame, then passing all detected objects through a ReIdentification model in batches.
- Buffering video frames and, upon an external signal, processing the buffered frames as a batch (a minimal sketch of this pattern follows the list).
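The second pattern can be implemented with plain Go channels, independent of any inference library. The sketch below is illustrative only; the Frame type and bufferUntilSignal function are hypothetical names, with Frame standing in for a decoded video frame (e.g. a gocv.Mat).

package main

import "fmt"

// Frame stands in for a decoded video frame (hypothetical; a real
// pipeline would likely hold a gocv.Mat here).
type Frame struct{ ID int }

// bufferUntilSignal collects incoming frames and, when a signal arrives,
// hands the buffered frames off as a single batch for inference.
func bufferUntilSignal(frames <-chan Frame, signal <-chan struct{}, batches chan<- []Frame) {
	var buf []Frame
	for {
		select {
		case f, ok := <-frames:
			if !ok {
				// input finished, flush any remaining frames
				if len(buf) > 0 {
					batches <- buf
				}
				close(batches)
				return
			}
			buf = append(buf, f)
		case <-signal:
			if len(buf) > 0 {
				batches <- buf
				buf = nil
			}
		}
	}
}

func main() {
	frames := make(chan Frame)
	signal := make(chan struct{})
	batches := make(chan []Frame, 1)

	go bufferUntilSignal(frames, signal, batches)

	// feed three frames, then trigger processing of the buffered batch
	for i := 0; i < 3; i++ {
		frames <- Frame{ID: i}
	}
	signal <- struct{}{}
	close(frames)

	for batch := range batches {
		fmt.Printf("processing batch of %d frames\n", len(batch))
	}
}

Each emitted batch would then be concatenated into a single input tensor using the rknnlite.NewBatch() API described at the end of this page.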
Batch Sizing
The NPUs in the RK356x, RK3576, and RK3588 platforms have different amounts of SRAM and different numbers of NPU cores, so finding the optimal batch size for your model is critical.
A benchmarking tool has been created to test different batch sizes of your own
RKNN models. Use your Python conversion script to compile the ONNX model to RKNN
with each rknn_batch_size value you would like to test, and name the resulting
models using the format <name>-batchN.rknn. For example, to test batch sizes of
1, 4, 8, and 16 of an OSNet model, create the following files and place them in
the directory /tmp/models on the host OS.
osnet-batch1.rknn
osnet-batch4.rknn
osnet-batch8.rknn
osnet-batch16.rknn
We can then pass all these models to the benchmark using the -m argument in
the format -m "/tmp/models/osnet-batch{1,4,8,16}.rknn".
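The {1,4,8,16} brace pattern simply stands in for one model file per batch size. As an illustration of how such a pattern maps to the files created above, here is a small sketch; expandBatchPattern is a hypothetical helper for illustration, not the benchmark's actual parser.

package main

import (
	"fmt"
	"strings"
)

// expandBatchPattern expands a path such as
// "/tmp/models/osnet-batch{1,4,8,16}.rknn" into one file path per
// comma-separated value inside the braces.
func expandBatchPattern(pattern string) []string {
	start := strings.Index(pattern, "{")
	end := strings.Index(pattern, "}")
	if start < 0 || end < start {
		// no brace pattern present, return the path as-is
		return []string{pattern}
	}
	prefix, suffix := pattern[:start], pattern[end+1:]
	var paths []string
	for _, n := range strings.Split(pattern[start+1:end], ",") {
		paths = append(paths, prefix+n+suffix)
	}
	return paths
}

func main() {
	// prints the four model paths listed above
	for _, p := range expandBatchPattern("/tmp/models/osnet-batch{1,4,8,16}.rknn") {
		fmt.Println(p)
	}
}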
To run the benchmark of your models on the rk3588 (or replace with your platform model):
# from project root directory
go test -bench=BenchmarkBatchSize -benchtime=10s \
-args -p rk3588 -m "/tmp/models/osnet-batch{1,4,8,16}.rknn"
Similarly, using Docker, we can mount the /tmp/models directory and run:
# from project root directory
docker run --rm \
--device /dev/dri:/dev/dri \
-v "$(pwd):/go/src/app" \
-v "$(pwd)/example/data:/go/src/data" \
-v "/usr/include/rknn_api.h:/usr/include/rknn_api.h" \
-v "/usr/lib/librknnrt.so:/usr/lib/librknnrt.so" \
-v "/tmp/models/:/tmp/models/" \
-w /go/src/app \
swdee/go-rknnlite:latest \
go test -bench=BenchmarkBatchSize -benchtime=10s \
-args -p rk3588 -m "/tmp/models/osnet-batch{1,4,8,16}.rknn"
Running the above benchmark command on each platform outputs the following results.
rk3588
BenchmarkBatchSize/Batch01-8 1897 8806025 ns/op 8.806 ms/batch 8.806 ms/img
BenchmarkBatchSize/Batch04-8 885 21555109 ns/op 21.55 ms/batch 5.389 ms/img
BenchmarkBatchSize/Batch08-8 534 22335645 ns/op 22.34 ms/batch 2.792 ms/img
BenchmarkBatchSize/Batch16-8 303 40253162 ns/op 40.25 ms/batch 2.516 ms/img
rk3576
BenchmarkBatchSize/Batch01-8 1312 8987117 ns/op 8.985 ms/batch 8.985 ms/img
BenchmarkBatchSize/Batch04-8 640 18836090 ns/op 18.83 ms/batch 4.709 ms/img
BenchmarkBatchSize/Batch08-8 385 31702649 ns/op 31.70 ms/batch 3.963 ms/img
BenchmarkBatchSize/Batch16-8 194 63801596 ns/op 63.80 ms/batch 3.988 ms/img
rk3566
BenchmarkBatchSize/Batch01-4 661 18658568 ns/op 18.66 ms/batch 18.66 ms/img
BenchmarkBatchSize/Batch04-4 158 74716574 ns/op 74.71 ms/batch 18.68 ms/img
BenchmarkBatchSize/Batch08-4 70 155374027 ns/op 155.4 ms/batch 19.42 ms/img
BenchmarkBatchSize/Batch16-4 37 294969497 ns/op 295.0 ms/batch 18.44 ms/img
Interpreting Benchmark Results
The ms/batch metric represents the number of milliseconds it took for the
whole batch inference to run, and ms/img represents the average number of
milliseconds it took to run inference on each image.
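The ms/img figures follow directly from the raw ns/op values; here is a quick check in Go for the rk3588 batch-8 row.

package main

import "fmt"

func main() {
	// rk3588 Batch08 row from the results above
	nsPerBatch := 22335645.0
	batchSize := 8.0

	// ms/img = total batch time divided by the number of images in the batch
	fmt.Printf("%.3f ms/img\n", nsPerBatch/batchSize/1e6) // prints: 2.792 ms/img
}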
As can be seen in the rk3588 results, the ideal batch size is 8, as it gives
a low 2.792 ms/img inference time versus a total batch inference time of
22.34 ms. The same applies to the rk3576.
The rk3566 has a single-core NPU; the results show there is no benefit to batching at all.
These results were for an OSNet model; different models may perform differently, so you should run these benchmarks for your own application and optimize accordingly.
Usage
An example batch program is provided that combines batch inferencing with a Pool of runtimes. Make sure you have downloaded the data files first for the examples; you only need to do this once for all examples.
cd example/
git clone --depth=1 https://github.com/swdee/go-rknnlite-data.git data
Run the batch example on the rk3588 (or replace with your platform model):
cd example/batch
go run batch.go -s 3 -p rk3588
This will result in the following output:
Driver Version: 0.9.6, API Version: 2.3.0 (c949ad889d@2024-11-07T11:35:33)
Model Input Number: 1, Ouput Number: 1
Input tensors:
index=0, name=input, n_dims=4, dims=[8, 224, 224, 3], n_elems=1204224, size=1204224, fmt=NHWC, type=INT8, qnt_type=AFFINE, zp=-14, scale=0.018658
Output tensors:
index=0, name=output, n_dims=2, dims=[8, 1000, 0, 0], n_elems=8000, size=8000, fmt=UNDEFINED, type=INT8, qnt_type=AFFINE, zp=-55, scale=0.141923
Running...
File ../data/imagenet/n01514859_hen.JPEG, inference time 40ms
File ../data/imagenet/n01518878_ostrich.JPEG, inference time 40ms
File ../data/imagenet/n01530575_brambling.JPEG, inference time 40ms
File ../data/imagenet/n01531178_goldfinch.JPEG, inference time 40ms
...snip...
File ../data/imagenet/n13054560_bolete.JPEG, inference time 8ms
File ../data/imagenet/n13133613_ear.JPEG, inference time 8ms
File ../data/imagenet/n15075141_toilet_tissue.JPEG, inference time 8ms
Processed 1000 images in 2.098619346s, average inference per image is 2.10ms
See the help for command line parameters.
$ go run batch.go -h
Usage of /tmp/go-build1506342544/b001/exe/batch:
-d string
A directory of images to run inference on (default "../data/imagenet/")
-m string
RKNN compiled model file (default "../data/models/rk3588/mobilenetv2-batch8-rk3588.rknn")
-p string
Rockchip CPU Model number [rk3562|rk3566|rk3568|rk3576|rk3582|rk3588] (default "rk3588")
-q Run in quiet mode, don't display individual inference results
-r int
Repeat processing image directory the specified number of times, use this if you don't have enough images (default 1)
-s int
Size of RKNN runtime pool, choose 1, 2, 3, or multiples of 3 (default 1)
Docker
To run the batch example using the prebuilt Docker image, make sure the data files have been downloaded first, then run:
# from project root directory
docker run --rm \
--device /dev/dri:/dev/dri \
-v "$(pwd):/go/src/app" \
-v "$(pwd)/example/data:/go/src/data" \
-v "/usr/include/rknn_api.h:/usr/include/rknn_api.h" \
-v "/usr/lib/librknnrt.so:/usr/lib/librknnrt.so" \
-w /go/src/app \
swdee/go-rknnlite:latest \
go run ./example/batch/batch.go -p rk3588 -s 3
API
A convenience function rknnlite.NewBatch() is provided to concatenate individual
images into a single input tensor for the model and then extract each image's
result from the combined outputs.
// create a new batch processor
batch := rt.NewBatch(batchSize, height, width, channels)
defer batch.Close()

for idx, file := range files {
	// add the image to the batch at the given index
	batch.AddAt(idx, file)

	// OR add images incrementally without specifying an index
	// (use one of the two methods, not both)
	batch.Add(file)
}

// pass the concatenated Mat to the runtime for inference
outputs, err := rt.Inference([]gocv.Mat{batch.Mat()})

// then get a single image's result by its index in the batch,
// here the image added at index 4
output, err := batch.GetOutputInt(4, outputs.Output[0], int(outputs.OutputAttributes().DimForDFL))
See the full example code for more details.