English | 简体中文

PaddleSeg Quantized Model Deployment

FastDeploy supports the deployment of quantized models and provides a tool that compresses models automatically with one click. You can use this one-click model auto-compression tool to quantize and deploy your own models, or directly download the quantized models provided by FastDeploy for deployment.

FastDeploy One-Click Model Auto-Compression Tool

FastDeploy provides a one-click model auto-compression tool that can quantize a model simply by providing a configuration file. For details, please refer to the one-click model auto-compression tool. Note: the quantized segmentation model still needs the deploy.yaml file from the FP32 model folder. The folder of a self-quantized model does not contain this yaml file; copy it from the FP32 model folder into the quantized model folder, as shown in the sketch below.
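
As a concrete illustration of the note above, here is a minimal sketch that copies deploy.yaml from the FP32 model folder into a self-quantized model folder. The folder names are placeholders and should be replaced with your own paths.

```python
# Sketch: copy deploy.yaml from the FP32 model folder into the quantized model
# folder so FastDeploy can find the pre/post-processing configuration.
# Folder names below are placeholders for illustration only.
import shutil
from pathlib import Path

fp32_dir = Path("PP_LiteSeg_T_STDC1_cityscapes_infer")    # FP32 model folder (placeholder)
quant_dir = Path("PP_LiteSeg_T_STDC1_cityscapes_quant")   # self-quantized model folder (placeholder)

shutil.copy(fp32_dir / "deploy.yaml", quant_dir / "deploy.yaml")
```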

Download the Quantized PaddleSeg Model

You can also directly download the quantized models in the following table for deployment (click the model name to download).

Note:

  • Runtime latency is the model's inference latency on various Runtimes, including the CPU->GPU data copy, GPU inference, and GPU->CPU data copy time. It does not include the models' pre- and post-processing time.
  • End-to-end latency is the model's latency in an actual inference scenario, including its pre- and post-processing.
  • The measured latencies are averaged over 1000 inferences, in milliseconds.
  • INT8 + FP16 means the FP16 inference option is enabled on the Runtime while running the INT8 quantized model.
  • INT8 + FP16 + PM additionally uses pinned memory while running the INT8 quantized model with FP16 enabled, which speeds up the GPU->CPU data copy (a usage sketch follows this list).
  • The maximum speedup ratio is obtained by dividing the FP32 latency by the fastest INT8 inference latency.
  • The quantization method is quantization-aware distillation training: a small amount of unlabeled data is used to train the quantized model, and accuracy is verified on the full validation set. The reported INT8 accuracy does not represent the highest achievable INT8 accuracy.
  • The CPU is Intel(R) Xeon(R) Gold 6271C with a fixed CPU thread count of 1 in all tests. The GPU is Tesla T4, TensorRT version 8.4.15.
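
To make the INT8 + FP16 and pinned-memory settings above concrete, below is a minimal sketch of loading a quantized PaddleSeg model with FastDeploy's Python API. The paths are placeholders, and the specific RuntimeOption calls (enable_trt_fp16, enable_pinned_memory) reflect the FastDeploy Python API as understood here and may differ across FastDeploy versions.

```python
# Sketch: run the INT8 quantized PaddleSeg model with the "INT8 + FP16 + PM"
# settings described in the notes above. Paths are placeholders.
import cv2
import fastdeploy as fd

model_dir = "PP_LiteSeg_T_STDC1_cityscapes_quant"  # quantized model folder (placeholder)

option = fd.RuntimeOption()
option.use_gpu()
option.use_trt_backend()        # run the INT8 model on the TensorRT backend
option.enable_trt_fp16()        # "INT8 + FP16": enable FP16 inference
option.enable_pinned_memory()   # "+ PM": pinned memory for faster GPU->CPU copies

model = fd.vision.segmentation.PaddleSegModel(
    model_dir + "/model.pdmodel",
    model_dir + "/model.pdiparams",
    model_dir + "/deploy.yaml",  # the deploy.yaml copied from the FP32 folder
    runtime_option=option)

im = cv2.imread("cityscapes_demo.png")  # any test image (placeholder)
result = model.predict(im)
print(result)
```

For the Paddle Inference CPU configuration used in the benchmark tables below, the TensorRT-related lines would instead be replaced with CPU and Paddle Inference backend settings.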

Runtime Benchmark

| Model | Inference Backend | Hardware | FP32 Runtime Latency (ms) | INT8 Runtime Latency (ms) | INT8 + FP16 Runtime Latency (ms) | INT8 + FP16 + PM Runtime Latency (ms) | Max Speedup | FP32 mIoU | INT8 mIoU | Method |
|---|---|---|---|---|---|---|---|---|---|---|
| PP-LiteSeg-T(STDC1)-cityscapes | Paddle Inference | CPU | 1138.04 | 602.62 | None | None | 1.89 | 77.37 | 71.62 | Quantization-aware distillation training |

End-to-End Benchmark

| Model | Inference Backend | Hardware | FP32 End-to-End Latency (ms) | INT8 End-to-End Latency (ms) | INT8 + FP16 End-to-End Latency (ms) | INT8 + FP16 + PM End-to-End Latency (ms) | Max Speedup | FP32 mIoU | INT8 mIoU | Method |
|---|---|---|---|---|---|---|---|---|---|---|
| PP-LiteSeg-T(STDC1)-cityscapes | Paddle Inference | CPU | 4726.65 | 4134.91 | None | None | 1.14 | 77.37 | 71.62 | Quantization-aware distillation training |

Detailed Deployment Documents