English | 简体中文

PaddleSeg Quantized Model Deployment

FastDeploy supports the deployment of quantized models and provides a tool that compresses models automatically with one click. You can use this one-click model auto-compression tool to quantize and deploy your own models, or directly download the quantized models provided by FastDeploy for deployment.

FastDeploy One-Click Model Auto-Compression Tool

FastDeploy provides a one-click model auto-compression tool that can quantize a model simply by providing a configuration file. For details, please refer to the one-click model auto-compression tool. Note: the quantized segmentation model still needs the deploy.yaml file from the FP32 model folder. The folder of a self-quantized model does not contain this yaml file; copy it from the FP32 model folder into the quantized model folder, as shown in the sketch below.
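
As a concrete illustration of the note above, here is a minimal sketch that copies deploy.yaml from the FP32 model folder into a self-quantized model folder. The folder names are placeholders and should be replaced with your own paths.

```python
# Sketch: copy deploy.yaml from the FP32 model folder into the quantized model
# folder so FastDeploy can find the pre/post-processing configuration.
# Folder names below are placeholders for illustration only.
import shutil
from pathlib import Path

fp32_dir = Path("PP_LiteSeg_T_STDC1_cityscapes_infer")    # FP32 model folder (placeholder)
quant_dir = Path("PP_LiteSeg_T_STDC1_cityscapes_quant")   # self-quantized model folder (placeholder)

shutil.copy(fp32_dir / "deploy.yaml", quant_dir / "deploy.yaml")
```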

Download the Quantized PaddleSeg Model

You can also directly download the quantized models in the following table for deployment (click the model name to download).

Note:

  • Runtime latency is the model's inference latency on various Runtimes, including the CPU->GPU data copy, GPU inference, and GPU->CPU data copy time. It does not include the models' pre- and post-processing time.
  • End-to-end latency is the model's latency in an actual inference scenario, including its pre- and post-processing.
  • The measured latencies are averaged over 1000 inferences, in milliseconds.
  • INT8 + FP16 means the FP16 inference option is enabled on the Runtime while running the INT8 quantized model.
  • INT8 + FP16 + PM additionally uses pinned memory while running the INT8 quantized model with FP16 enabled, which speeds up the GPU->CPU data copy (a usage sketch follows this list).
  • The maximum speedup ratio is obtained by dividing the FP32 latency by the fastest INT8 inference latency.
  • The quantization method is quantization-aware distillation training: a small amount of unlabeled data is used to train the quantized model, and accuracy is verified on the full validation set. The reported INT8 accuracy does not represent the highest achievable INT8 accuracy.
  • The CPU is Intel(R) Xeon(R) Gold 6271C with a fixed CPU thread count of 1 in all tests. The GPU is Tesla T4, TensorRT version 8.4.15.
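
To make the INT8 + FP16 and pinned-memory settings above concrete, below is a minimal sketch of loading a quantized PaddleSeg model with FastDeploy's Python API. The paths are placeholders, and the specific RuntimeOption calls (enable_trt_fp16, enable_pinned_memory) reflect the FastDeploy Python API as understood here and may differ across FastDeploy versions.

```python
# Sketch: run the INT8 quantized PaddleSeg model with the "INT8 + FP16 + PM"
# settings described in the notes above. Paths are placeholders.
import cv2
import fastdeploy as fd

model_dir = "PP_LiteSeg_T_STDC1_cityscapes_quant"  # quantized model folder (placeholder)

option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()        # run the INT8 model on the TensorRT backend
option.enable_trt_fp16()        # "INT8 + FP16": enable FP16 inference
option.enable_pinned_memory()   # "+ PM": pinned memory for faster GPU->CPU copies

model = fd.vision.segmentation.PaddleSegModel(
    model_dir + "/model.pdmodel",
    model_dir + "/model.pdiparams",
    model_dir + "/deploy.yaml",  # the deploy.yaml copied from the FP32 folder
    runtime_option=option)

im = cv2.imread("cityscapes_demo.png")  # any test image (placeholder)
result = model.predict(im)
print(result)
```

For the Paddle Inference CPU configuration used in the benchmark tables below, the TensorRT-related lines would instead be replaced with CPU and Paddle Inference backend settings.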

Runtime Benchmark

| Model | Inference Backend | Hardware | FP32 Runtime Latency (ms) | INT8 Runtime Latency (ms) | INT8 + FP16 Runtime Latency (ms) | INT8 + FP16 + PM Runtime Latency (ms) | Max Speedup | FP32 mIoU | INT8 mIoU | Method |
|---|---|---|---|---|---|---|---|---|---|---|
| PP-LiteSeg-T(STDC1)-cityscapes | Paddle Inference | CPU | 1138.04 | 602.62 | None | None | 1.89 | 77.37 | 71.62 | Quantization-aware distillation training |

End-to-End Benchmark

| Model | Inference Backend | Hardware | FP32 End-to-End Latency (ms) | INT8 End-to-End Latency (ms) | INT8 + FP16 End-to-End Latency (ms) | INT8 + FP16 + PM End-to-End Latency (ms) | Max Speedup | FP32 mIoU | INT8 mIoU | Method |
|---|---|---|---|---|---|---|---|---|---|---|
| PP-LiteSeg-T(STDC1)-cityscapes | Paddle Inference | CPU | 4726.65 | 4134.91 | None | None | 1.14 | 77.37 | 71.62 | Quantization-aware distillation training |

Detailed Deployment Documents