
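Client Access Guide
This document uses a yolov5 model deployed with fastdeployserver as an example to describe how a client requests inference from the server. For how to deploy the yolov5 model with fastdeployserver, see the yolov5 serving deployment document.
Basic Principles
fastdeployserver implements the Predict Protocol API, which was proposed by kserve and designed for machine-learning model inference services. The API is simple to use and also supports high-performance deployment scenarios. It is currently accessible over two network protocols, HTTP and GRPC.
After fastdeployserver starts, port 8000 serves HTTP requests and port 8001 serves GRPC requests by default. Clients typically request two kinds of resources:
Model metadata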
HTTP
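Access: GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
A GET request to this URL path returns the metadata of a served model, where ${MODEL_NAME} is the model name and ${MODEL_VERSION} is the model version. The server returns the metadata as a JSON dictionary. Denoting the returned object as $metadata_model_response, its fields and their forms are as follows: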
$metadata_model_response =
{
"name" : $string,
"versions" : [ $string, ... ] #optional,
"platform" : $string,
"inputs" : [ $metadata_tensor, ... ],
"outputs" : [ $metadata_tensor, ... ]
}
$metadata_tensor =
{
"name" : $string,
"datatype" : $string,
"shape" : [ $number, ... ]
}
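As a rough illustration of the HTTP metadata endpoint, the sketch below builds the request URL from a model name and an optional version, then summarizes a decoded $metadata_model_response. The helper names `metadata_url` and `summarize_metadata` are hypothetical and belong to neither fastdeployserver nor any client library; the sample dictionary only mirrors the schema above with illustrative values.

```python
from typing import Optional


def metadata_url(server: str, model_name: str,
                 model_version: Optional[str] = None) -> str:
    """Build the GET URL for model metadata (hypothetical helper)."""
    url = f"http://{server}/v2/models/{model_name}"
    if model_version is not None:
        # The version segment is optional in the protocol.
        url += f"/versions/{model_version}"
    return url


def summarize_metadata(meta: dict) -> list:
    """Return one 'kind name datatype shape' line per tensor."""
    lines = []
    for kind in ("inputs", "outputs"):
        for tensor in meta.get(kind, []):
            lines.append(
                f"{kind[:-1]} {tensor['name']} "
                f"{tensor['datatype']} {tensor['shape']}")
    return lines


# Sample shaped like $metadata_model_response (values are illustrative).
sample = {
    "name": "yolov5",
    "versions": ["1"],
    "platform": "ensemble",
    "inputs": [{"name": "INPUT", "datatype": "UINT8",
                "shape": [-1, -1, -1, 3]}],
    "outputs": [{"name": "detction_result", "datatype": "BYTES",
                 "shape": [-1, -1]}],
}

print(metadata_url("localhost:8000", "yolov5", "1"))
for line in summarize_metadata(sample):
    print(line)
```

Omitting the version argument yields the shorter path without the /versions/ segment, matching the optional bracket in the access pattern.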
GRPC
The GRPC definition of the model service is:
service GRPCInferenceService
{
// Check liveness of the inference server.
rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
// Check readiness of the inference server.
rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
// Check readiness of a model in the inference server.
rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}
// Get server metadata.
rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
// Get model metadata.
rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}
// Perform inference using a specific model.
rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
}
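Access: use a GRPC client to call the ModelMetadata method defined in the model service GRPC interface.
The ModelMetadataRequest message sent in the request and the ModelMetadataResponse message returned are structured as follows. They are essentially the same as the JSON structures used over HTTP above.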
message ModelMetadataRequest
{
// The name of the model.
string name = 1;
// The version of the model to check for readiness. If not given the
// server will choose a version based on the model and internal policy.
string version = 2;
}
message ModelMetadataResponse
{
// Metadata for a tensor.
message TensorMetadata
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape. A variable-size dimension is represented
// by a -1 value.
repeated int64 shape = 3;
}
// The model name.
string name = 1;
// The versions of the model available on the server.
repeated string versions = 2;
// The model's platform. See Platforms.
string platform = 3;
// The model's inputs.
repeated TensorMetadata inputs = 4;
// The model's outputs.
repeated TensorMetadata outputs = 5;
}
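The TensorMetadata comment above notes that a variable-size dimension is reported as -1. The following is a minimal sketch of how a client might check a concrete tensor shape against such a declared shape before sending a request. `shape_matches` is a hypothetical helper, not part of the protocol or of any client library.

```python
from typing import Sequence


def shape_matches(declared: Sequence[int], actual: Sequence[int]) -> bool:
    """Check a concrete shape against a metadata shape (-1 means any size)."""
    if len(declared) != len(actual):
        return False
    return all(d == -1 or d == a for d, a in zip(declared, actual))


# Declared input shape used for illustration: three free dimensions, 3 channels.
declared_input = [-1, -1, -1, 3]
print(shape_matches(declared_input, (1, 480, 640, 3)))  # batch of one RGB image
print(shape_matches(declared_input, (480, 640, 3)))     # batch dimension missing
```

A check like this catches a missing batch dimension locally instead of waiting for the server to reject the request.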
Inference Service
HTTP
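Access: POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
A POST request to this URL path invokes the model's inference service and returns the inference result. The data in the POST request is also uploaded as JSON. Denoting the uploaded object as $inference_request, its fields and their forms are as follows: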
$inference_request =
{
"id" : $string #optional,
"parameters" : $parameters #optional,
"inputs" : [ $request_input, ... ],
"outputs" : [ $request_output, ... ] #optional
}
$request_input =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
$request_output =
{
"name" : $string,
"parameters" : $parameters #optional,
}
$parameters =
{
$parameter, ...
}
$parameter = $string : $string | $number | $boolean
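Here $tensor_data is a one-dimensional or multi-dimensional array. If the data is one-dimensional, the tensor elements must be arranged in row-major order.
After the server finishes inference, it returns the result data. Denoting the returned object as $inference_response, its fields and their forms are as follows: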
$inference_response =
{
"model_name" : $string,
"model_version" : $string #optional,
"id" : $string,
"parameters" : $parameters #optional,
"outputs" : [ $response_output, ... ]
}
$response_output =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
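To make the request schema and the row-major rule concrete, here is a small sketch that turns a nested Python list into an $inference_request with flattened data. The helpers `infer_shape`, `flatten_row_major` and `build_request` are hypothetical names introduced for illustration, and the tiny tensor stands in for real image data.

```python
import json
from math import prod


def infer_shape(nested) -> list:
    """Infer the shape of a regular (non-ragged) nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return shape


def flatten_row_major(nested) -> list:
    """Flatten a nested list so that the last axis varies fastest."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten_row_major(item))
    return flat


def build_request(name: str, datatype: str, nested,
                  request_id: str = "") -> dict:
    """Assemble a dict shaped like $inference_request."""
    shape = infer_shape(nested)
    data = flatten_row_major(nested)
    if len(data) != prod(shape):
        raise ValueError("nested list is not a regular tensor")
    request = {"inputs": [{"name": name, "shape": shape,
                           "datatype": datatype, "data": data}]}
    if request_id:
        request["id"] = request_id  # "id" is an optional field
    return request


# A 1 x 2 x 2 x 3 tensor standing in for a tiny RGB image batch.
tiny_image = [[[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [10, 11, 12]]]]
request = build_request("INPUT", "UINT8", tiny_image, request_id="demo")
print(json.dumps(request))
```

Sending the data flattened keeps the "shape" field as the single source of truth for the tensor layout.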
GRPC
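Access: use a GRPC client to call the ModelInfer method defined in the model service GRPC interface.
The ModelInferRequest message sent in the request and the ModelInferResponse message returned are structured as follows. For the complete definitions, see the GRPC part of the kserve Predict Protocol.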
message ModelInferRequest
{
// An input tensor for an inference request.
message InferInputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional inference input tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference request.
InferTensorContents contents = 5;
}
// An output tensor requested for an inference request.
message InferRequestedOutputTensor
{
// The tensor name.
string name = 1;
// Optional requested output tensor parameters.
map<string, InferParameter> parameters = 2;
}
// The name of the model to use for inferencing.
string model_name = 1;
// The version of the model to use for inference. If not given the
// server will choose a version based on the model and internal policy.
string model_version = 2;
// Optional identifier for the request. If specified will be
// returned in the response.
string id = 3;
// Optional inference parameters.
map<string, InferParameter> parameters = 4;
// The input tensors for the inference.
repeated InferInputTensor inputs = 5;
// The requested output tensors for the inference. Optional, if not
// specified all outputs produced by the model will be returned.
repeated InferRequestedOutputTensor outputs = 6;
// The data contained in an input tensor can be represented in "raw"
// bytes form or in the repeated type that matches the tensor's data
// type. To use the raw representation 'raw_input_contents' must be
// initialized with data for each tensor in the same order as
// 'inputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 and BF16 data types must be represented as
// raw content as there is no specific data type for a 16-bit float type.
//
// If this field is specified then InferInputTensor::contents must
// not be specified for any input tensor.
repeated bytes raw_input_contents = 7;
}
message ModelInferResponse
{
// An output tensor returned for an inference request.
message InferOutputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional output tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference response.
InferTensorContents contents = 5;
}
// The name of the model used for inference.
string model_name = 1;
// The version of the model used for inference.
string model_version = 2;
// The id of the inference request if one was specified.
string id = 3;
// Optional inference response parameters.
map<string, InferParameter> parameters = 4;
// The output tensors holding inference results.
repeated InferOutputTensor outputs = 5;
// The data contained in an output tensor can be represented in
// "raw" bytes form or in the repeated type that matches the
// tensor's data type. To use the raw representation 'raw_output_contents'
// must be initialized with data for each tensor in the same order as
// 'outputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 and BF16 data types must be represented as
// raw content as there is no specific data type for a 16-bit float type.
//
// If this field is specified then InferOutputTensor::contents must
// not be specified for any output tensor.
repeated bytes raw_output_contents = 6;
}
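The comments on raw_input_contents and raw_output_contents require the raw bytes to be the flattened, row-major tensor elements with no padding, and their size to match the shape and data type. The sketch below packs and unpacks such content with Python's `struct` module. The helper names, the three-entry datatype table and the little-endian byte order are all assumptions of this sketch, not statements about what fastdeployserver requires.

```python
import struct
from math import prod

# struct format codes for a few protocol datatypes (illustrative subset).
_FORMATS = {"UINT8": "B", "INT32": "i", "FP32": "f"}


def pack_raw(flat_values, shape, datatype: str) -> bytes:
    """Pack row-major flattened values into raw bytes.

    Little-endian byte order is an assumption of this sketch.
    """
    count = prod(shape)
    if len(flat_values) != count:
        raise ValueError(f"expected {count} elements, got {len(flat_values)}")
    return struct.pack(f"<{count}{_FORMATS[datatype]}", *flat_values)


def unpack_raw(raw: bytes, shape, datatype: str) -> list:
    """Inverse of pack_raw: recover the flat element list."""
    count = prod(shape)
    fmt = f"<{count}{_FORMATS[datatype]}"
    if len(raw) != struct.calcsize(fmt):
        raise ValueError("raw content size does not match shape and datatype")
    return list(struct.unpack(fmt, raw))


raw = pack_raw([1, 2, 3, 4, 5, 6], [1, 2, 3], "UINT8")
print(len(raw))  # 6 bytes: one per UINT8 element
print(unpack_raw(raw, [1, 2, 3], "UINT8"))
```

The size check in `unpack_raw` mirrors the protocol's requirement that content size agree with the declared shape and data type.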
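Client Tools
With the interfaces provided by fastdeployserver in mind, users can use an HTTP client tool to request the HTTP server, or a GRPC client tool to request the GRPC server. By default, after fastdeployserver starts, port 8000 serves HTTP requests and port 8001 serves GRPC requests.
Using an HTTP client
This section shows how to access the fastdeployserver HTTP service with tritonclient and with the requests library. The first is a client built specifically for model serving; it wraps requests and responses for convenience. The second is a general-purpose HTTP client; using it helps clarify the data structures described in the principles above.
I. Accessing the service with tritonclient
Install tritonclient[http]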
pip install tritonclient[http]
1. Get the yolov5 model metadata
import tritonclient.http as httpclient  # import the HTTP client
server_addr = 'localhost:8000'  # replace with the actual fastdeployserver address
client = httpclient.InferenceServerClient(server_addr)  # create the client
model_metadata = client.get_model_metadata(
    model_name='yolov5', model_version='1')  # request the yolov5 model metadata
Print the model's inputs and outputs:
print(model_metadata['inputs'])  # the HTTP client returns the metadata as a dict
[{'name': 'INPUT', 'datatype': 'UINT8', 'shape': [-1, -1, -1, 3]}]
print(model_metadata['outputs'])
[{'name': 'detction_result', 'datatype': 'BYTES', 'shape': [-1, -1]}]
2. Request the inference service
Construct the data according to the model's inputs and outputs, then request inference.
# Assume the image file is named 000000014439.jpg
import cv2
image = cv2.imread('000000014439.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)[None]  # BGR -> RGB, then add a batch dimension
inputs = []
infer_input = httpclient.InferInput('INPUT', image.shape, 'UINT8')  # build the input
infer_input.set_data_from_numpy(image)  # load the input data
inputs.append(infer_input)
outputs = []
infer_output = httpclient.InferRequestedOutput('detction_result')  # build the requested output
outputs.append(infer_output)
response = client.infer(
    'yolov5', inputs, model_version='1', outputs=outputs)  # request inference
response_outputs = response.as_numpy('detction_result')  # fetch the result by output name
II. Accessing the service with requests
Install requests
pip install requests
1. Get the yolov5 model metadata
import requests
url = 'http://localhost:8000/v2/models/yolov5/versions/1'  # built from the model metadata endpoint described above
response = requests.get(url)
response = response.json()  # the response body is JSON; parse it
Print the returned model metadata:
print(response)
{'name': 'yolov5', 'versions': ['1'], 'platform': 'ensemble', 'inputs': [{'name': 'INPUT', 'datatype': 'UINT8', 'shape': [-1, -1, -1, 3]}], 'outputs': [{'name': 'detction_result', 'datatype': 'BYTES', 'shape': [-1, -1]}]}
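2. Request the inference service
Construct the data according to the model's inputs and outputs, then request inference.
url = 'http://localhost:8000/v2/models/yolov5/versions/1/infer'  # built from the inference endpoint described above
# Assume the image file is named 000000014439.jpg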
import cv2
image = cv2.imread('000000014439.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)[None]
payload = {
"inputs" : [
{
"name" : "INPUT",
"shape" : image.shape,
"datatype" : "UINT8",
"data" : image.tolist()
}
],
"outputs" : [
{
"name" : "detction_result"
}
]
}
import json  # needed for json.dumps below
response = requests.post(url, data=json.dumps(payload))
response = response.json()  # parse the JSON body; this is the inference result
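Once the response is parsed, the result lives in its "outputs" list. The sketch below looks up an output by name and, assuming the server returned "data" as a flat row-major list, rebuilds the nested layout from "shape". `reshape` and `get_output` are hypothetical helpers, and `fake_response` is a hand-written stand-in, not real server output.

```python
def reshape(flat: list, shape: list):
    """Rebuild a nested list from row-major flat data."""
    if len(shape) <= 1:
        return list(flat)
    step = len(flat) // shape[0] if shape[0] else 0
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


def get_output(response: dict, name: str):
    """Find an output by name and return its data arranged per 'shape'."""
    for output in response.get("outputs", []):
        if output["name"] == name:
            return reshape(output["data"], output["shape"])
    raise KeyError(name)


# Hand-written stand-in for a decoded $inference_response (not real server output).
fake_response = {
    "model_name": "yolov5",
    "model_version": "1",
    "outputs": [{"name": "detction_result", "datatype": "BYTES",
                 "shape": [1, 2], "data": ["a", "b"]}],
}
print(get_output(fake_response, "detction_result"))
```

If the server returns "data" already nested, the reshape step should be skipped.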
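Using a GRPC client
Install tritonclient[grpc]
pip install tritonclient[grpc]
tritonclient[grpc] provides a GRPC client and wraps the GRPC interaction, so users do not have to establish the connection to the server manually or call the server interface through a grpc stub directly. Instead, it exposes the same interface as the tritonclient HTTP client.
1. Get the yolov5 model metadata
import tritonclient.grpc as grpcclient  # import the GRPC client
server_addr = 'localhost:8001'  # replace with the actual fastdeployserver GRPC address
client = grpcclient.InferenceServerClient(server_addr)  # create the client
model_metadata = client.get_model_metadata(
    model_name='yolov5', model_version='1')  # request the yolov5 model metadata
2. Request the inference service
Construct the request data from the returned model_metadata. First, look at the model's inputs and outputs: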
print(model_metadata.inputs)
[name: "INPUT"
datatype: "UINT8"
shape: -1
shape: -1
shape: -1
shape: 3
]
print(model_metadata.outputs)
[name: "detction_result"
datatype: "BYTES"
shape: -1
shape: -1
]
Construct the data according to the model's inputs and outputs, then request inference.
# Assume the image file is named 000000014439.jpg
import cv2
image = cv2.imread('000000014439.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)[None]  # BGR -> RGB, then add a batch dimension
inputs = []
infer_input = grpcclient.InferInput('INPUT', image.shape, 'UINT8')  # build the input
infer_input.set_data_from_numpy(image)  # load the input data
inputs.append(infer_input)
outputs = []
infer_output = grpcclient.InferRequestedOutput('detction_result')  # build the requested output
outputs.append(infer_output)
response = client.infer(
    'yolov5', inputs, model_version='1', outputs=outputs)  # request inference
response_outputs = response.as_numpy('detction_result')  # fetch the result by output name