<h1 align="center"><b><em>FastDeploy LLM Serving Deployment</em></b></h1>
*FastDeploy is a solution built on NVIDIA's Triton framework, designed for serving large language models in server environments. It provides gRPC and HTTP service interfaces with streaming token output. The underlying inference engine supports acceleration optimizations such as continuous batching, weight-only INT8, and post-training quantization (PTQ), delivering an easy-to-use, high-performance deployment experience.*
# Quick Start
This section deploys from the prebuilt image, using Meta-Llama-3-8B-Instruct-A8W8C8 as an example. For more models, see [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md), [Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md), and [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md). For more detailed tutorials on model inference and quantization, see the [LLM Inference Tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md).
```
# Download the model
wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8
# Mount the model files and start the container
export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8
docker run --gpus all --shm-size 5G --network=host --privileged --cap-add=SYS_PTRACE \
  -v ${MODEL_PATH}:/models/ \
  -dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 \
  bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
```
Wait for the service to start successfully (the first startup takes about 40 seconds). You can then test it with the following command:
```
curl 127.0.0.1:9965/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"text": "hello, llm"}'
```
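The curl request above can also be issued from Python. The sketch below is a minimal client using only the standard library; it assumes the service started by the docker command is reachable on `127.0.0.1:9965` and accepts the same single-field `{"text": ...}` JSON body shown in the curl example.

```python
# Minimal client for the chat endpoint started above.
# Assumes the server is local on port 9965, as in the curl example.
import json
from urllib import request, error

ENDPOINT = "http://127.0.0.1:9965/v1/chat/completions"

def build_request(text: str) -> request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(text: str, timeout: float = 60.0) -> str:
    """Send a prompt and return the raw response body as a string."""
    req = build_request(text)
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    try:
        print(chat("hello, llm"))
    except error.URLError as exc:
        # The service is not reachable yet (e.g. still starting up).
        print(f"service not reachable: {exc}")
```

If the service is still initializing, the request will fail with a connection error; retry after the ~40 s startup window.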
Note:
1. Make sure `shm-size >= 5G`; otherwise the service may fail to start.
For more on how to use FastDeploy, see the [Serving Deployment Guide](https://github.com/PaddlePaddle/FastDeploy/blob/develop/llm/docs/FastDeploy_usage_tutorial.md).
# License
FastDeploy is licensed under the [Apache-2.0 License](https://github.com/PaddlePaddle/FastDeploy/blob/develop/LICENSE).