
HTTP Stream with ByteTrack Object Tracking

A demo that streams video over HTTP with object detection using YOLOv5, YOLOv8, YOLOv10, YOLOv11, or YOLOX, plus ByteTrack object tracking to assign an ID to each object and track it over time.

This demo streams a 720p MP4 video file in a loop over HTTP; it could easily be adjusted to use a camera instead.

It also makes use of the Pool feature to run multiple instances of the same YOLO model concurrently for better performance.
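
Conceptually, the pool amounts to a buffered channel that hands out model instances to concurrent workers and takes them back when each frame is done. The following is a minimal sketch of that idea with a hypothetical Runtime type; it is not the library's actual Pool API.

// Runtime stands in for one loaded YOLO model instance; this sketch is
// illustrative, not the library's actual Pool API.
type Runtime struct{}

// Pool hands out Runtime instances to concurrent workers over a
// buffered channel.
type Pool struct{ runtimes chan *Runtime }

func NewPool(size int) *Pool {
	p := &Pool{runtimes: make(chan *Runtime, size)}
	for i := 0; i < size; i++ {
		p.runtimes <- &Runtime{} // each would load the same .rknn model
	}
	return p
}

// Get blocks until a Runtime is free, capping concurrent inference at
// the pool size; Return hands it back for the next frame.
func (p *Pool) Get() *Runtime     { return <-p.runtimes }
func (p *Pool) Return(r *Runtime) { p.runtimes <- r }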

[Image: palace-tracking.jpg]

In the above image, the statistics at the top represent the following:

Frame - The current frame number
FPS - The frames per second being streamed
Lag - A negative lag means inference and processing completed in less time than the frame interval needed to stream at 30 FPS. A positive number indicates processing took longer and there is a playback lag from when the video first started.
Objects - The number of objects detected in the scene
Inference - The time it took to perform YOLO inferencing on the RKNN backend
Post Processing - The time it took to post process the YOLO detection results from inferencing
Tracking - The time it took to perform ByteTrack object tracking
Rendering - How long it took to draw objects and annotate the image with this statistical data
Total Time - Total time from receiving the video frame to sending the processed frame to the browser

Each detected object has a bounding box drawn around it with the classification label person, and the number following is the unique tracking ID assigned by ByteTrack. The pink circle in the center of the bounding box marks the center, and the yellow line is the trail (path) of the motion the tracked object took. This replicates the trail effect discussed on the Ultralytics multi-object tracking page.
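
The trail rendering boils down to keeping a short history of box centers per tracking ID and drawing a polyline through them. Below is a minimal sketch using gocv; the drawTrail helper, the 30-point history length, and the colors are illustrative choices, not the demo's actual code.

import (
	"image"
	"image/color"

	"gocv.io/x/gocv"
)

// trails maps a ByteTrack ID to the recent center points of its box.
var trails = make(map[int][]image.Point)

const maxTrail = 30 // roughly one second of history at 30 FPS

func drawTrail(img *gocv.Mat, trackID int, center image.Point) {
	pts := append(trails[trackID], center)
	if len(pts) > maxTrail {
		pts = pts[len(pts)-maxTrail:]
	}
	trails[trackID] = pts

	// pink dot at the current center, yellow polyline for the path
	gocv.Circle(img, center, 4, color.RGBA{255, 105, 180, 0}, -1)
	for i := 1; i < len(pts); i++ {
		gocv.Line(img, pts[i-1], pts[i], color.RGBA{255, 255, 0, 0}, 2)
	}
}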

Basic Demo

To run the Stream example make sure you have downloaded the data files first. You only need to do this once for all examples.

cd example/
git clone --depth=1 https://github.com/swdee/go-rknnlite-data.git data

Start the Stream server on port 8080 and run the example using a pool of 3 YOLO models, limiting object detection to the person class, on the rk3588 (or replace with your platform model).

cd example/stream/
go run bytetrack.go -a :8080 -s 3 -x person -p rk3588

This will result in output of:

Limiting object detection class to: person
Open browser and view video at http://:8080/stream

In your browser, open a connection to your Rockchip SBC at http://<ip address>:8080/stream to watch the stream with object tracking using YOLOv5 and ByteTrack.

https://github.com/user-attachments/assets/ee28fc00-1d07-4af0-bc77-171a3acde5ce

Docker

To run the Stream example using the prebuilt docker image, make sure the data files have been downloaded first, then run.

# from project root directory

docker run --rm -it \
  --device /dev/dri:/dev/dri \
  -v "$(pwd):/go/src/app" \
  -v "$(pwd)/example/data:/go/src/data" \
  -v "/usr/include/rknn_api.h:/usr/include/rknn_api.h" \
  -v "/usr/lib/librknnrt.so:/usr/lib/librknnrt.so" \
  -w /go/src/app \
  -p 8080:8080 \
  swdee/go-rknnlite:latest \
  go run ./example/stream/bytetrack.go -a :8080 -s 3 -x person -p rk3588

Detailed Usage

To see the command line options of the Stream example

go run bytetrack.go --help

This outputs options of:

Usage of /tmp/go-build3419105702/b001/exe/bytetrack:
  -a string
        HTTP Address to run server on, format address:port (default "localhost:8080")
  -c value
        Web Camera resolution in format <width>x<height>@<fps>, eg: 1280x720@30 (default 1280x720@30)
  -codec string
        Web Camera codec format [mjpg|yuyv] (default "mjpg")
  -l string
        Text file containing model labels (default "../data/coco_80_labels_list.txt")
  -m string
        RKNN compiled YOLO model file (default "../data/models/rk3588/yolov5s-rk3588.rknn")
  -p string
        Rockchip CPU Model number [rk3562|rk3566|rk3568|rk3576|rk3582|rk3588] (default "rk3588")
  -r string
        The rendering format used for instance segmentation [outline|mask] (default "outline")
  -s int
        Size of RKNN runtime pool, choose 1, 2, 3, or multiples of 3 (default 3)
  -t string
        Version of YOLO model [v5|v8|v10|v11|x|v5seg|v8seg|v8pose|v8obb] (default "v5")
  -v string
        Video file to run object detection and tracking on or device of web camera when used with -c flag (default "../data/palace.mp4")
  -x string
        Comma delimited list of labels (COCO) to restrict object tracking to

The default mode uses the YOLOv5 model; however, you can change to other models as follows.

# YOLOv8
go run bytetrack.go -a :8080 -s 3 -x person -p rk3588 -m ../data/models/rk3588/yolov8s-rk3588.rknn -t v8

# YOLOv10
go run bytetrack.go -a :8080 -s 3 -x person -p rk3588 -m ../data/models/rk3588/yolov10s-rk3588.rknn -t v10

# YOLOv11
go run bytetrack.go -a :8080 -s 3 -x person -p rk3588 -m ../data/models/rk3588/yolov11s-rk3588.rknn -t v11

# YOLOX
go run bytetrack.go -a :8080 -s 3 -x person -p rk3588 -m ../data/models/rk3588/yoloxs-rk3588.rknn -t x

The YOLO models used are the original ones from the RKNN Model Zoo and are not well tuned for the task of tracking. You are best off training your own model to get more accurate results.

To help filter the objects tracked, you can limit tracking to objects of specific classes. Using the COCO labels you can limit as follows.

# limit to just person class objects
go run bytetrack.go -x person

# limit to person and handbag objects
go run bytetrack.go -x person,handbag

Or omit the -x flag entirely and no object limiting will be done.
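
Under the hood this kind of filtering is just a set-membership check on each detection's class label before it is handed to the tracker. A minimal sketch, where buildFilter and keep are illustrative helper names rather than the example's actual functions:

import "strings"

// buildFilter parses the -x flag value (eg: "person,handbag") into a
// lookup set.
func buildFilter(csv string) map[string]bool {
	set := make(map[string]bool)
	for _, label := range strings.Split(csv, ",") {
		set[strings.TrimSpace(label)] = true
	}
	return set
}

// keep reports whether a detection's class label passes the filter.
// An empty filter means no limiting is applied.
func keep(filter map[string]bool, label string) bool {
	return len(filter) == 0 || filter[label]
}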

There is also another test video provided at ../data/cars.mp4 for object tracking.

https://github.com/user-attachments/assets/ad4e806d-dc26-44cb-9c40-f3a9a60e7451

Web Camera

A web camera can be used instead of passing video (MP4) files. Use the following commands on your Linux distribution to establish the device number of your web camera and the resolutions it supports.

List available web camera devices

v4l2-ctl --list-devices

List the resolutions and frame rates supported by the camera at /dev/video1.

v4l2-ctl --device=/dev/video1 --list-formats-ext

I find the object detection results substandard; however, the YOLOv8-pose model works quite well. To run this example, pass the camera device with -v 1 (representing /dev/video1) and the resolution and frame rate with the -c flag, eg: -c 1280x720@30 for a resolution of 1280x720 running at 30 FPS.

go run bytetrack.go -a :8080 -s 6 -v 1 -c 640x480@30 -codec MJPG -m ../data/models/rk3588/yolov8n-pose-rk3588.rknn -t v8pose -p rk3588

Performance

If you open multiple tabs in your browser pointing to the Stream server, each browser tab will be served its own stream. The Stream server is in fact performing inference and object detection on separate instances of the video streamed to each browser tab.

The test videos provided stream at 720p. On the Rock 5B (RK3588) using the YOLOv5s model, I can open three streams in total and maintain 30 FPS in each. Upon opening a 4th stream, performance drops, frames get lost, and the stream stutters. So consider the limit of the device to be around 100 FPS total at 720p resolution using the YOLOv5s model.

The YOLOv8s model requires more processing power and takes longer to process each frame, around 60ms versus YOLOv5s's ~28ms. Naturally, the object detection results under YOLOv8 are better.

To achieve the three 720p streams you need the SBC networked over a wired Ethernet connection; a WiFi connection does not have the bandwidth.

Watch the video

Lag (Parallel vs Serial Processing)

One of the nice aspects of Go is its concurrent programming, and a great feature of the RK3588 is its three NPU cores which share system memory. This allows us to load multiple instances of the same YOLO model into a Pool to process video frames concurrently and achieve superior streaming performance.

Take the YOLOv8 model, which takes around 60ms to run inference and processing on each video frame. The following graph depicts how these frames can be processed in parallel or sequentially.

[Image: lag-graph.png]

The video is rendered for playback at 30 FPS. If we were to process the frames sequentially, we would only achieve a 15 FPS playback speed due to the time it takes to process each frame.

Parallel processing of the frames, on the other hand, enables us to still achieve 30 FPS playback with object detection, with only the side effect of a 60ms lag. At point A on the graph we receive the first video frame for processing, which completes at point B. 60ms has elapsed since the start, but we can now show Frame 1 in the browser, so we have incurred a 60ms delay from when the video first started until we first see playback.

However, whilst we are halfway through processing Frame 1 we receive Frame 2 and can begin processing it in parallel. When Frame 2 completes we are at point C on the graph, which is 30ms after point B.

This overlap of processing frames allows us to maintain 30 FPS playback (with an initial playback lag) versus the 15 FPS of sequential processing.
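
As a sketch of how such an overlapping pipeline can be written in Go, the pattern below queues one result channel per frame so that output stays in frame order while the work itself runs in parallel. This is illustrative only, not the demo's actual code; Frame and process are placeholder names.

// Frame is a placeholder for a decoded video frame; process stands in
// for YOLO inference + ByteTrack (~60ms per frame for YOLOv8).
type Frame struct{}

func process(f Frame) Frame { return f } // inference, tracking, annotation

func pipeline(frames <-chan Frame, render func(Frame)) {
	// Each frame gets its own result channel, queued in arrival order.
	results := make(chan chan Frame, 3) // depth matches the pool size
	go func() {
		for f := range frames {
			ch := make(chan Frame, 1)
			results <- ch
			go func(f Frame, ch chan<- Frame) { ch <- process(f) }(f, ch)
		}
		close(results)
	}()
	// Reading in queue order keeps playback in frame order: output holds
	// 30 FPS while each frame carries one processing time (~60ms) of lag.
	for ch := range results {
		render(<-ch)
	}
}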

As mentioned above the Lag statistic is rendered on top of the demo video stream to show what lag is experienced.

Instance Segmentation

The YOLOv5-seg and YOLOv8-seg processors have been integrated into this Stream example; however, instance segmentation requires a lot of post processing on the CPU, and achieving 30 FPS on a 720p video is not possible. When either of these processors is used, the code downgrades to 10 FPS.

Furthermore, tracking objects with ByteTrack involves using YOLO object detection to detect the objects in a single image; ByteTrack then manages these tracked objects over time and returns its own object bounding boxes. This means the tracked objects per frame can differ from those found by pure YOLO object detection. When this occurs we have no segmentation mask to render, so in outline rendering mode the code reverts to drawing a bounding box of the tracked object, and in mask mode the segment overlay disappears for that frame.
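
In code the fallback described above reduces to a nil check on the mask. A minimal illustrative sketch, where TrackedObject and its fields are assumed names rather than the example's actual types:

import (
	"image"
	"image/color"

	"gocv.io/x/gocv"
)

// TrackedObject is an illustrative stand-in for the demo's type: a box
// from ByteTrack plus an optional mask from the YOLO seg detection.
type TrackedObject struct {
	Box  image.Rectangle
	Mask *gocv.Mat // nil when no YOLO detection matched this track
}

// renderTrack falls back to a plain bounding box when ByteTrack returned
// a box that has no matching segmentation mask for this frame.
func renderTrack(img *gocv.Mat, t TrackedObject, mode string) {
	if t.Mask == nil {
		if mode == "outline" {
			gocv.Rectangle(img, t.Box, color.RGBA{0, 255, 0, 0}, 2)
		}
		return // in mask mode the overlay simply disappears this frame
	}
	// ...render the mask overlay or contour outline as normal...
}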

Pose Estimation

The YOLOv8-pose processor has been added to this Stream example; run the following command to view an example using the provided video.

go run bytetrack.go -a :8080 -s 3 -v ../data/dance.mp4 -t v8pose -m ../data/models/rk3588/yolov8n-pose-rk3588.rknn -p rk3588

Tracking trails have been turned off as the trails distract from the skeleton.

Oriented Bounding Boxes

A YOLOv8-obb processor and example have been added for object detection using oriented bounding boxes.

go run bytetrack.go -a :8080 -s 3 -v ../data/aerial.mp4 -t v8obb -m ../data/models/rk3588/yolov8n-obb-rk3588.rknn -l ../data/yolov8_obb_labels_list.txt -p rk3588

Code Example Limitations

This streaming example serves as a code example for other developers; it is not production-level code, as it does not check input parameters or edge cases. This was done in an effort to keep the number of lines of code down.

When using the web camera, only a single connection from the browser is supported, which opens the web camera device for its exclusive use. Multiple connections are not possible; this would require a larger code base that opens the web camera in a separate goroutine and copies each analysed camera frame to a Hub that distributes it to all HTTP connections.
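
For reference, such a Hub could look like the sketch below: one camera goroutine calls Broadcast, while each HTTP handler subscribes for frames. This is a design illustration only, not code from this example.

import "sync"

// Hub broadcasts each analysed camera frame to all subscribers.
type Hub struct {
	mu   sync.Mutex
	subs map[chan []byte]bool // one channel per HTTP connection
}

func NewHub() *Hub { return &Hub{subs: make(map[chan []byte]bool)} }

func (h *Hub) Subscribe() chan []byte {
	ch := make(chan []byte, 1)
	h.mu.Lock()
	h.subs[ch] = true
	h.mu.Unlock()
	return ch
}

func (h *Hub) Unsubscribe(ch chan []byte) {
	h.mu.Lock()
	delete(h.subs, ch)
	h.mu.Unlock()
}

// Broadcast drops the frame for slow consumers rather than blocking
// the camera goroutine.
func (h *Hub) Broadcast(jpg []byte) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- jpg:
		default: // subscriber still busy with the previous frame
		}
	}
}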

The method used to stream JPG images to the browser using HTTP headers is not the fastest method available, but it is simple. In our production code we stream the JPG images over a websocket and render them on an HTML canvas element, which is faster and also allows us to "scrub" (fast forward and rewind) the video timeline client side.
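
The HTTP-header method referred to is the classic motion-JPEG technique: a multipart/x-mixed-replace response where each part is a single JPEG frame. A minimal sketch of such a handler, assuming frames arrive on a channel (not the example's exact code):

import (
	"fmt"
	"net/http"
)

// streamHandler writes an endless multipart response; the browser
// replaces the displayed image with each new part.
func streamHandler(w http.ResponseWriter, r *http.Request, frames <-chan []byte) {
	w.Header().Set("Content-Type", "multipart/x-mixed-replace; boundary=frame")
	for jpg := range frames {
		fmt.Fprintf(w, "--frame\r\nContent-Type: image/jpeg\r\nContent-Length: %d\r\n\r\n", len(jpg))
		if _, err := w.Write(jpg); err != nil {
			return // client disconnected
		}
		fmt.Fprint(w, "\r\n")
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // push the frame to the client immediately
		}
	}
}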

Tracking Accuracy

The ported Go ByteTrack code outputs the same results as the C++ and Python source code. Personally, I don't find object tracking to have the accuracy I would desire; the results vary depending on which YOLO model is run, and the tracking itself is not 100% accurate. Whilst this demo shows a complete solution, you would still need to train a better model and test it for your own use case.

Re-Identification (ReID)

Experimental ReID has been added, following the implementation of the FairMOT tracker but making use of the OSNet model trained on the Market1501 dataset.

Usage of ReID is expensive, typically taking around 200ms per frame to complete on the RK3588 NPU. It offers little accuracy improvement over straight ByteTrack, which adds little overhead to the YOLO object detection.
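
For background, ReID works by extracting an appearance embedding (here an OSNet feature vector) for each detection crop and matching embeddings across frames by distance, typically cosine similarity. A minimal sketch of that comparison (assuming equal-length embeddings):

import "math"

// cosineSimilarity compares two appearance embeddings of equal length;
// values near 1.0 suggest the same identity across frames.
func cosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}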

We need to wait for Rockchip's next generation RK36xx SoC before this may be useful.

Background

The ByteTrack code is a Go conversion of the C++ project. We also referenced the Python version.

Ultralytics also has their version of ByteTrack here.