<!doctype html>
<html lang="zh" class="no-js">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link rel="prev" href="../online_serving/graceful_shutdown_service/">
<link rel="next" href="../best_practices/ERNIE-4.5-0.3B-Paddle/">
<link rel="icon" href="../../assets/images/favicon.ico">
<meta name="generator" content="mkdocs-1.6.1, mkdocs-material-9.6.20">
<title>Offline Inference - PaddlePaddle LLM Inference and Deployment Toolkit</title>
<link rel="stylesheet" href="../../assets/stylesheets/main.e53b48f4.min.css">
<link rel="stylesheet" href="../../assets/stylesheets/palette.06af60db.min.css">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&display=fallback">
<style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
<script>__md_scope=new URL("../..",location),__md_hash=e=>[...e].reduce(((e,_)=>(e<<5)-e+_.charCodeAt(0)),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
</head>
<body dir="ltr" data-md-color-scheme="default" data-md-color-primary="indigo" data-md-color-accent="indigo">
<input class="md-toggle" data-md-toggle="drawer" type="checkbox" id="__drawer" autocomplete="off">
<input class="md-toggle" data-md-toggle="search" type="checkbox" id="__search" autocomplete="off">
<label class="md-overlay" for="__drawer"></label>
<div data-md-component="skip">
<a href="#_1" class="md-skip">
Skip to content
</a>
</div>
<div data-md-component="announce">
</div>
<header class="md-header md-header--shadow" data-md-component="header">
<nav class="md-header__inner md-grid" aria-label="Header">
<a href="../" title="PaddlePaddle LLM Inference and Deployment Toolkit" class="md-header__button md-logo" aria-label="PaddlePaddle LLM Inference and Deployment Toolkit" data-md-component="logo">
<img src="../../assets/images/logo.jpg" alt="logo">
</a>
<label class="md-header__button md-icon" for="__drawer">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 6h18v2H3zm0 5h18v2H3zm0 5h18v2H3z"/></svg>
</label>
<div class="md-header__title" data-md-component="header-title">
<div class="md-header__ellipsis">
<div class="md-header__topic">
<span class="md-ellipsis">
PaddlePaddle LLM Inference and Deployment Toolkit
</span>
</div>
<div class="md-header__topic" data-md-component="header-topic">
<span class="md-ellipsis">
Offline Inference
</span>
</div>
</div>
</div>
<form class="md-header__option" data-md-component="palette">
<input class="md-option" data-md-color-media="(prefers-color-scheme: light)" data-md-color-scheme="default" data-md-color-primary="indigo" data-md-color-accent="indigo" aria-label="Switch to dark mode" type="radio" name="__palette" id="__palette_0">
<label class="md-header__button md-icon" title="Switch to dark mode" for="__palette_1" hidden>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a4 4 0 0 0-4 4 4 4 0 0 0 4 4 4 4 0 0 0 4-4 4 4 0 0 0-4-4m0 10a6 6 0 0 1-6-6 6 6 0 0 1 6-6 6 6 0 0 1 6 6 6 6 0 0 1-6 6m8-9.31V4h-4.69L12 .69 8.69 4H4v4.69L.69 12 4 15.31V20h4.69L12 23.31 15.31 20H20v-4.69L23.31 12z"/></svg>
</label>
<input class="md-option" data-md-color-media="(prefers-color-scheme: dark)" data-md-color-scheme="slate" data-md-color-primary="black" data-md-color-accent="indigo" aria-label="Switch to system preference" type="radio" name="__palette" id="__palette_1">
<label class="md-header__button md-icon" title="Switch to system preference" for="__palette_0" hidden>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 18c-.89 0-1.74-.2-2.5-.55C11.56 16.5 13 14.42 13 12s-1.44-4.5-3.5-5.45C10.26 6.2 11.11 6 12 6a6 6 0 0 1 6 6 6 6 0 0 1-6 6m8-9.31V4h-4.69L12 .69 8.69 4H4v4.69L.69 12 4 15.31V20h4.69L12 23.31 15.31 20H20v-4.69L23.31 12z"/></svg>
</label>
</form>
<script>var palette=__md_get("__palette");if(palette&&palette.color){if("(prefers-color-scheme)"===palette.color.media){var media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent")}for(var[key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
<div class="md-header__option">
<div class="md-select">
<button class="md-header__button md-icon" aria-label="Select language">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m12.87 15.07-2.54-2.51.03-.03A17.5 17.5 0 0 0 14.07 6H17V4h-7V2H8v2H1v2h11.17C11.5 7.92 10.44 9.75 9 11.35 8.07 10.32 7.3 9.19 6.69 8h-2c.73 1.63 1.73 3.17 2.98 4.56l-5.09 5.02L4 19l5-5 3.11 3.11zM18.5 10h-2L12 22h2l1.12-3h4.75L21 22h2zm-2.62 7 1.62-4.33L19.12 17z"/></svg>
</button>
<div class="md-select__inner">
<ul class="md-select__list">
<li class="md-select__item">
<a href="../../offline_inference/" hreflang="en" class="md-select__link">
English
</a>
</li>
<li class="md-select__item">
<a href="./" hreflang="zh" class="md-select__link">
简体中文
</a>
</li>
</ul>
</div>
</div>
</div>
<label class="md-header__button md-icon" for="__search">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M9.5 3A6.5 6.5 0 0 1 16 9.5c0 1.61-.59 3.09-1.56 4.23l.27.27h.79l5 5-1.5 1.5-5-5v-.79l-.27-.27A6.52 6.52 0 0 1 9.5 16 6.5 6.5 0 0 1 3 9.5 6.5 6.5 0 0 1 9.5 3m0 2C7 5 5 7 5 9.5S7 14 9.5 14 14 12 14 9.5 12 5 9.5 5"/></svg>
</label>
<div class="md-search" data-md-component="search" role="dialog">
<label class="md-search__overlay" for="__search"></label>
<div class="md-search__inner" role="search">
<form class="md-search__form" name="search">
<input type="text" class="md-search__input" name="query" aria-label="Search" placeholder="Search" autocapitalize="off" autocorrect="off" autocomplete="off" spellcheck="false" data-md-component="search-query" required>
<label class="md-search__icon md-icon" for="__search">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M9.5 3A6.5 6.5 0 0 1 16 9.5c0 1.61-.59 3.09-1.56 4.23l.27.27h.79l5 5-1.5 1.5-5-5v-.79l-.27-.27A6.52 6.52 0 0 1 9.5 16 6.5 6.5 0 0 1 3 9.5 6.5 6.5 0 0 1 9.5 3m0 2C7 5 5 7 5 9.5S7 14 9.5 14 14 12 14 9.5 12 5 9.5 5"/></svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M20 11v2H8l5.5 5.5-1.42 1.42L4.16 12l7.92-7.92L13.5 5.5 8 11z"/></svg>
</label>
<nav class="md-search__options" aria-label="Search">
<button type="reset" class="md-search__icon md-icon" title="Clear" aria-label="Clear" tabindex="-1">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 6.41 17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12z"/></svg>
</button>
</nav>
</form>
<div class="md-search__output">
<div class="md-search__scrollwrap" tabindex="0" data-md-scrollfix>
<div class="md-search-result" data-md-component="search-result">
<div class="md-search-result__meta">
Initializing search
</div>
<ol class="md-search-result__list" role="presentation"></ol>
</div>
</div>
</div>
</div>
</div>
<div class="md-header__source">
<a href="https://github.com/PaddlePaddle/FastDeploy" title="Go to repository" class="md-source" data-md-component="source">
<div class="md-source__icon md-icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 7.0.0 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2025 Fonticons, Inc.--><path d="M439.6 236.1 244 40.5c-5.4-5.5-12.8-8.5-20.4-8.5s-15 3-20.4 8.4L162.5 81l51.5 51.5c27.1-9.1 52.7 16.8 43.4 43.7l49.7 49.7c34.2-11.8 61.2 31 35.5 56.7-26.5 26.5-70.2-2.9-56-37.3L240.3 199v121.9c25.3 12.5 22.3 41.8 9.1 55-6.4 6.4-15.2 10.1-24.3 10.1s-17.8-3.6-24.3-10.1c-17.6-17.6-11.1-46.9 11.2-56v-123c-20.8-8.5-24.6-30.7-18.6-45L142.6 101 8.5 235.1C3 240.6 0 247.9 0 255.5s3 15 8.5 20.4l195.6 195.7c5.4 5.4 12.7 8.4 20.4 8.4s15-3 20.4-8.4l194.7-194.7c5.4-5.4 8.4-12.8 8.4-20.4s-3-15-8.4-20.4"/></svg>
</div>
<div class="md-source__repository">
FastDeploy
</div>
</a>
</div>
</nav>
</header>
<div class="md-container" data-md-component="container">
<main class="md-main" data-md-component="main">
<div class="md-main__inner md-grid">
<div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--primary" aria-label="Navigation" data-md-level="0">
<label class="md-nav__title" for="__drawer">
<a href="../" title="PaddlePaddle LLM Inference and Deployment Toolkit" class="md-nav__button md-logo" aria-label="PaddlePaddle LLM Inference and Deployment Toolkit" data-md-component="logo">
<img src="../../assets/images/logo.jpg" alt="logo">
</a>
PaddlePaddle LLM Inference and Deployment Toolkit
</label>
<div class="md-nav__source">
<a href="https://github.com/PaddlePaddle/FastDeploy" title="Go to repository" class="md-source" data-md-component="source">
<div class="md-source__icon md-icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 7.0.0 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2025 Fonticons, Inc.--><path d="M439.6 236.1 244 40.5c-5.4-5.5-12.8-8.5-20.4-8.5s-15 3-20.4 8.4L162.5 81l51.5 51.5c27.1-9.1 52.7 16.8 43.4 43.7l49.7 49.7c34.2-11.8 61.2 31 35.5 56.7-26.5 26.5-70.2-2.9-56-37.3L240.3 199v121.9c25.3 12.5 22.3 41.8 9.1 55-6.4 6.4-15.2 10.1-24.3 10.1s-17.8-3.6-24.3-10.1c-17.6-17.6-11.1-46.9 11.2-56v-123c-20.8-8.5-24.6-30.7-18.6-45L142.6 101 8.5 235.1C3 240.6 0 247.9 0 255.5s3 15 8.5 20.4l195.6 195.7c5.4 5.4 12.7 8.4 20.4 8.4s15-3 20.4-8.4l194.7-194.7c5.4-5.4 8.4-12.8 8.4-20.4s-3-15-8.4-20.4"/></svg>
</div>
<div class="md-source__repository">
FastDeploy
</div>
</a>
</div>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../" class="md-nav__link">
<span class="md-ellipsis">
FastDeploy
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_2" >
<label class="md-nav__link" for="__nav_2" id="__nav_2_label" tabindex="0">
<span class="md-ellipsis">
Quick Start
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_2_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_2">
<span class="md-nav__icon md-icon"></span>
Quick Start
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_2_1" >
<label class="md-nav__link" for="__nav_2_1" id="__nav_2_1_label" tabindex="0">
<span class="md-ellipsis">
Installation
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="2" aria-labelledby="__nav_2_1_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_2_1">
<span class="md-nav__icon md-icon"></span>
Installation
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../get_started/installation/nvidia_gpu/" class="md-nav__link">
<span class="md-ellipsis">
NVIDIA GPU
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/installation/kunlunxin_xpu/" class="md-nav__link">
<span class="md-ellipsis">
Kunlunxin XPU
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/installation/hygon_dcu/" class="md-nav__link">
<span class="md-ellipsis">
Hygon DCU
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/installation/Enflame_gcu/" class="md-nav__link">
<span class="md-ellipsis">
Enflame S60
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/installation/iluvatar_gpu/" class="md-nav__link">
<span class="md-ellipsis">
Iluvatar CoreX
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/installation/metax_gpu/" class="md-nav__link">
<span class="md-ellipsis">
MetaX C550
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="../get_started/quick_start/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-0.3B Quick Deployment
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/quick_start_vl/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-VL-28B-A3B Quick Deployment
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/ernie-4.5/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-300B-A47B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/ernie-4.5-vl/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-VL-424B-A47B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../get_started/quick_start_qwen/" class="md-nav__link">
<span class="md-ellipsis">
Qwen3-0.6b Quick Deployment
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3" >
<label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="0">
<span class="md-ellipsis">
Online Serving
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_3_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
Online Serving
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../online_serving/" class="md-nav__link">
<span class="md-ellipsis">
OpenAI-Compatible Serving
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../online_serving/metrics/" class="md-nav__link">
<span class="md-ellipsis">
Monitoring Metrics
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../online_serving/scheduler/" class="md-nav__link">
<span class="md-ellipsis">
Scheduler
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../online_serving/graceful_shutdown_service/" class="md-nav__link">
<span class="md-ellipsis">
Graceful Service Shutdown
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--active">
<input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
<label class="md-nav__link md-nav__link--active" for="__toc">
<span class="md-ellipsis">
Offline Inference
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<a href="./" class="md-nav__link md-nav__link--active">
<span class="md-ellipsis">
Offline Inference
</span>
</a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#1" class="md-nav__link">
<span class="md-ellipsis">
1. Usage
</span>
</a>
<nav class="md-nav" aria-label="1. Usage">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#llmchat" class="md-nav__link">
<span class="md-ellipsis">
Chat interface (LLM.chat)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#llmgenerate" class="md-nav__link">
<span class="md-ellipsis">
Completion interface (LLM.generate)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#2" class="md-nav__link">
<span class="md-ellipsis">
2. API reference
</span>
</a>
<nav class="md-nav" aria-label="2. API reference">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#21-fastdeployllm" class="md-nav__link">
<span class="md-ellipsis">
2.1 fastdeploy.LLM
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#22-fastdeployllmchat" class="md-nav__link">
<span class="md-ellipsis">
2.2 fastdeploy.LLM.chat
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#23-fastdeployllmgenerate" class="md-nav__link">
<span class="md-ellipsis">
2.3 fastdeploy.LLM.generate
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#24-fastdeploysamplingparams" class="md-nav__link">
<span class="md-ellipsis">
2.4 fastdeploy.SamplingParams
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#25-fastdeployenginerequestrequestoutput" class="md-nav__link">
<span class="md-ellipsis">
2.5 fastdeploy.engine.request.RequestOutput
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#26-fastdeployenginerequestcompletionoutput" class="md-nav__link">
<span class="md-ellipsis">
2.6 fastdeploy.engine.request.CompletionOutput
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#27-fastdeployenginerequestrequestmetrics" class="md-nav__link">
<span class="md-ellipsis">
2.7 fastdeploy.engine.request.RequestMetrics
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_5" >
<label class="md-nav__link" for="__nav_5" id="__nav_5_label" tabindex="0">
<span class="md-ellipsis">
Best Practices
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_5_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_5">
<span class="md-nav__icon md-icon"></span>
Best Practices
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-0.3B-Paddle/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-0.3B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-21B-A3B-Paddle/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-21B-A3B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-300B-A47B-Paddle/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-300B-A47B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-21B-A3B-Thinking/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-21B-A3B-Thinking
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-VL-28B-A3B-Paddle/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-VL-28B-A3B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/ERNIE-4.5-VL-424B-A47B-Paddle/" class="md-nav__link">
<span class="md-ellipsis">
ERNIE-4.5-VL-424B-A47B
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../best_practices/FAQ/" class="md-nav__link">
<span class="md-ellipsis">
FAQ
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_6" >
<label class="md-nav__link" for="__nav_6" id="__nav_6_label" tabindex="0">
<span class="md-ellipsis">
Quantization
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_6_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_6">
<span class="md-nav__icon md-icon"></span>
Quantization
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../quantization/" class="md-nav__link">
<span class="md-ellipsis">
Overview
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../quantization/online_quantization/" class="md-nav__link">
<span class="md-ellipsis">
Online Quantization
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../quantization/wint2/" class="md-nav__link">
<span class="md-ellipsis">
WINT2 Quantization
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_7" >
<label class="md-nav__link" for="__nav_7" id="__nav_7_label" tabindex="0">
<span class="md-ellipsis">
Features
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_7_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_7">
<span class="md-nav__icon md-icon"></span>
Features
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../features/prefix_caching/" class="md-nav__link">
<span class="md-ellipsis">
Prefix Caching
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/disaggregated/" class="md-nav__link">
<span class="md-ellipsis">
Disaggregated Deployment
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/chunked_prefill/" class="md-nav__link">
<span class="md-ellipsis">
Chunked Prefill
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/load_balance/" class="md-nav__link">
<span class="md-ellipsis">
Load Balancing
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/speculative_decoding/" class="md-nav__link">
<span class="md-ellipsis">
Speculative Decoding
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/structured_outputs/" class="md-nav__link">
<span class="md-ellipsis">
Structured Outputs
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/reasoning_output/" class="md-nav__link">
<span class="md-ellipsis">
Reasoning Output
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/early_stop/" class="md-nav__link">
<span class="md-ellipsis">
Early Stop
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/plugins/" class="md-nav__link">
<span class="md-ellipsis">
Plugins
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/sampling/" class="md-nav__link">
<span class="md-ellipsis">
Sampling Strategies
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/multi-node_deployment/" class="md-nav__link">
<span class="md-ellipsis">
Multi-Node Deployment
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/graph_optimization/" class="md-nav__link">
<span class="md-ellipsis">
Graph Optimization
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/data_parallel_service/" class="md-nav__link">
<span class="md-ellipsis">
Data Parallelism
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../features/plas_attention/" class="md-nav__link">
<span class="md-ellipsis">
PLAS
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="../supported_models/" class="md-nav__link">
<span class="md-ellipsis">
Supported Models
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../benchmark/" class="md-nav__link">
<span class="md-ellipsis">
Benchmark
</span>
</a>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_10" >
<label class="md-nav__link" for="__nav_10" id="__nav_10_label" tabindex="0">
<span class="md-ellipsis">
Usage
</span>
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" data-md-level="1" aria-labelledby="__nav_10_label" aria-expanded="false">
<label class="md-nav__title" for="__nav_10">
<span class="md-nav__icon md-icon"></span>
Usage
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../usage/log/" class="md-nav__link">
<span class="md-ellipsis">
Log Description
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../usage/code_overview/" class="md-nav__link">
<span class="md-ellipsis">
Code Overview
</span>
</a>
</li>
<li class="md-nav__item">
<a href="../usage/environment_variables/" class="md-nav__link">
<span class="md-ellipsis">
Environment Variables
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-sidebar md-sidebar--secondary" data-md-component="sidebar" data-md-type="toc" >
<div class="md-sidebar__scrollwrap">
<div class="md-sidebar__inner">
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#1" class="md-nav__link">
<span class="md-ellipsis">
1. Usage
</span>
</a>
<nav class="md-nav" aria-label="1. Usage">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#llmchat" class="md-nav__link">
<span class="md-ellipsis">
Chat interface (LLM.chat)
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#llmgenerate" class="md-nav__link">
<span class="md-ellipsis">
Completion interface (LLM.generate)
</span>
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="#2" class="md-nav__link">
<span class="md-ellipsis">
2. API reference
</span>
</a>
<nav class="md-nav" aria-label="2. API reference">
<ul class="md-nav__list">
<li class="md-nav__item">
<a href="#21-fastdeployllm" class="md-nav__link">
<span class="md-ellipsis">
2.1 fastdeploy.LLM
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#22-fastdeployllmchat" class="md-nav__link">
<span class="md-ellipsis">
2.2 fastdeploy.LLM.chat
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#23-fastdeployllmgenerate" class="md-nav__link">
<span class="md-ellipsis">
2.3 fastdeploy.LLM.generate
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#24-fastdeploysamplingparams" class="md-nav__link">
<span class="md-ellipsis">
2.4 fastdeploy.SamplingParams
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#25-fastdeployenginerequestrequestoutput" class="md-nav__link">
<span class="md-ellipsis">
2.5 fastdeploy.engine.request.RequestOutput
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#26-fastdeployenginerequestcompletionoutput" class="md-nav__link">
<span class="md-ellipsis">
2.6 fastdeploy.engine.request.CompletionOutput
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#27-fastdeployenginerequestrequestmetrics" class="md-nav__link">
<span class="md-ellipsis">
2.7 fastdeploy.engine.request.RequestMetrics
</span>
</a>
</li>
</ul>
</nav>
</li>
</ul>
</nav>
</div>
</div>
</div>
<div class="md-content" data-md-component="content">
<article class="md-content__inner md-typeset">
<h1 id="_1">Offline Inference</h1>
<h2 id="1">1. Usage</h2>
<p>FastDeploy offline inference loads a model locally and processes user data. It is used as follows.</p>
<h3 id="llmchat">Chat interface (LLM.chat)</h3>
<pre><code class="language-python">from fastdeploy import LLM, SamplingParams

msg1 = [
    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;I'm a helpful AI assistant.&quot;},
    # User prompt (Chinese): rewrite Li Bai's &quot;Quiet Night Thought&quot; as a modern poem
    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;把李白的静夜思改写为现代诗&quot;},
]
msg2 = [
    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;I'm a helpful AI assistant.&quot;},
    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Write me a poem about large language model.&quot;},
]
messages = [msg1, msg2]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model
llm = LLM(model=&quot;baidu/ERNIE-4.5-0.3B-Paddle&quot;, tensor_parallel_size=1, max_model_len=8192)

# Run batch inference (the LLM queues requests internally and inserts them
# dynamically based on available resources)
outputs = llm.chat(messages, sampling_params)

# Read the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
</code></pre>
<p>For the <code>LLM</code> configuration, <code>SamplingParams</code>, <code>LLM.generate</code>, <code>LLM.chat</code>, and the <code>RequestOutput</code> structure of each output in the example above, see the API reference later on this page.</p>
<blockquote>
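<p>Note: For a reasoning model, specify the <code>reasoning_parser</code> parameter when loading the model. At request time, thinking can be switched on or off through the <code>enable_thinking</code> entry of <code>chat_template_kwargs</code>.</p>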
</blockquote>
<pre><code class="language-python">from fastdeploy.entrypoints.llm import LLM

# Load the model
llm = LLM(model=&quot;baidu/ERNIE-4.5-VL-28B-A3B-Paddle&quot;, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={&quot;image&quot;: 100}, reasoning_parser=&quot;ernie-45-vl&quot;)

outputs = llm.chat(
    messages=[
        {
            &quot;role&quot;: &quot;user&quot;,
            &quot;content&quot;: [
                {&quot;type&quot;: &quot;image_url&quot;, &quot;image_url&quot;: {&quot;url&quot;: &quot;https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg&quot;}},
                # Question (Chinese): which era does the artifact in the image belong to?
                {&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: &quot;图中的文物属于哪个年代&quot;},
            ],
        }
    ],
    chat_template_kwargs={&quot;enable_thinking&quot;: False},
)

# Read the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    reasoning_text = output.outputs.reasoning_content
</code></pre>
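<h3 id="llmgenerate">Completion interface (LLM.generate)</h3>
<pre><code class="language-python">from fastdeploy import LLM, SamplingParams

prompts = [
    # Prompt (Chinese): the user asks for a 500-character travelogue and appreciation
    # of Shenzhen Wenxin Park; the assistant turn is pre-filled with an acknowledgement
    &quot;User: 帮我写一篇关于深圳文心公园的500字游记和赏析。\nAssistant: 好的。&quot;
]

# Sampling parameters
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model
llm = LLM(model=&quot;baidu/ERNIE-4.5-21B-A3B-Base-Paddle&quot;, tensor_parallel_size=1, max_model_len=8192)

# Run batch inference (the LLM queues requests internally and inserts them
# dynamically based on available resources)
outputs = llm.generate(prompts, sampling_params)

# Read the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
</code></pre>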
<blockquote>
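<p>Note: The completion interface suits cases where the user has already prepared the full context and wants the model to output only the continuation; nothing else is added to the <code>prompt</code> during inference.
For <code>chat</code> models, the chat interface (LLM.chat) is recommended.</p>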
</blockquote>
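<p>Because nothing is added during inference, the raw prompt has to carry any role markers itself. A minimal sketch of assembling such a prompt is shown here. <code>build_raw_prompt</code> is a hypothetical helper, not part of FastDeploy, and the <code>User:</code>/<code>Assistant:</code> labels are an assumption that simply mirrors the example prompt; a given base model may expect a different layout.</p>
<pre><code class="language-python">def build_raw_prompt(turns, reply_prefix=''):
    # turns: list of (role, text) pairs, oldest first (hypothetical input shape).
    lines = [f'{role}: {text}' for role, text in turns]
    # End with an open assistant turn so the model continues from reply_prefix.
    lines.append(f'Assistant: {reply_prefix}')
    return '\n'.join(lines)


prompt = build_raw_prompt([('User', 'Summarize the plot of Hamlet.')], 'Sure.')
print(prompt)
</code></pre>
<p>The resulting string can then be placed in the <code>prompts</code> list passed to <code>llm.generate</code>.</p>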
<p>For multimodal models such as <code>baidu/ERNIE-4.5-VL-28B-A3B-Paddle</code>, a call to the <code>generate</code> interface must supply a prompt that includes the images. It is used as follows:</p>
<pre><code class="language-python">import io

import requests
from PIL import Image

from fastdeploy.entrypoints.llm import LLM
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.input.ernie_tokenizer import Ernie4_5Tokenizer

PATH = &quot;baidu/ERNIE-4.5-VL-28B-A3B-Paddle&quot;
tokenizer = Ernie4_5Tokenizer.from_pretrained(PATH)

messages = [
    {
        &quot;role&quot;: &quot;user&quot;,
        &quot;content&quot;: [
            {&quot;type&quot;: &quot;image_url&quot;, &quot;image_url&quot;: {&quot;url&quot;: &quot;https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg&quot;}},
            # Question (Chinese): which era does the artifact in the image belong to?
            {&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: &quot;图中的文物属于哪个年代&quot;},
        ],
    }
]

# Render the chat template into a single prompt string
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Download every image and video referenced in the messages
images, videos = [], []
for message in messages:
    content = message[&quot;content&quot;]
    if not isinstance(content, list):
        continue
    for part in content:
        if part[&quot;type&quot;] == &quot;image_url&quot;:
            url = part[&quot;image_url&quot;][&quot;url&quot;]
            image_bytes = requests.get(url).content
            img = Image.open(io.BytesIO(image_bytes))
            images.append(img)
        elif part[&quot;type&quot;] == &quot;video_url&quot;:
            url = part[&quot;video_url&quot;][&quot;url&quot;]
            video_bytes = requests.get(url).content
            videos.append({
                &quot;video&quot;: video_bytes,
                &quot;max_frames&quot;: 30
            })

sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={&quot;image&quot;: 100}, reasoning_parser=&quot;ernie-45-vl&quot;)
outputs = llm.generate(prompts={
    &quot;prompt&quot;: prompt,
    &quot;multimodal_data&quot;: {
        &quot;image&quot;: images,
        &quot;video&quot;: videos
    }
}, sampling_params=sampling_params)

# Read the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    reasoning_text = output.outputs.reasoning_content
</code></pre>
<blockquote>
<p>Note: The <code>generate</code> interface does not currently support the thinking switch parameter; it always uses the model's default thinking behavior.</p>
</blockquote>
<h2 id="2">2. API reference</h2>
<h3 id="21-fastdeployllm">2.1 fastdeploy.LLM</h3>
<p>For the supported configuration parameters, see the <a href="../parameters/">FastDeploy parameter documentation</a>.</p>
<blockquote>
<p>Notes on parameter configuration:</p>
<ol>
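<li>Offline inference does not require the <code>port</code> and <code>metrics_port</code> parameters.</li>
<li>After the model service starts, the log file log/fastdeploy.log prints a line such as <code>Doing profile, the total_block_num:640</code>, where 640 is the automatically computed number of KV Cache blocks. Multiplying it by block_size (default 64) gives the total number of tokens that can be cached in the KV Cache after deployment.</li>
<li><code>max_num_seqs</code> sets the maximum number of requests processed concurrently in the decode phase. A good value can be derived from the number of cacheable tokens described in item 2. For example, suppose production statistics show an average of 800 input tokens and 500 output tokens per request, and profiling reports 640 KV Cache blocks with a block_size of 64. We can then set <code>kv_cache_ratio = 800 / (800 + 500) ≈ 0.6</code> and <code>max_num_seqs = 640 * 64 / (800 + 500) ≈ 31</code>.</li>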
</ol>
</blockquote>
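<p>The arithmetic in the note above can be sketched as follows. This is an illustrative calculation only: <code>kv_cache_plan</code> and its argument names are hypothetical, and rounding the ratio to one decimal place and flooring the concurrency estimate are assumptions chosen to reproduce the figures in the note.</p>
<pre><code class="language-python">def kv_cache_plan(total_block_num, block_size, avg_input_tokens, avg_output_tokens):
    # Total tokens the KV Cache can hold: number of blocks times tokens per block.
    cacheable_tokens = total_block_num * block_size
    tokens_per_request = avg_input_tokens + avg_output_tokens
    # Share of each request's tokens that belongs to the input (prefill) side.
    kv_cache_ratio = round(avg_input_tokens / tokens_per_request, 1)
    # How many average-sized requests fit in the cache at the same time.
    decode_concurrency = cacheable_tokens // tokens_per_request
    return cacheable_tokens, kv_cache_ratio, decode_concurrency


# Figures from the note: 640 blocks, block_size 64, 800 input and 500 output tokens.
print(kv_cache_plan(640, 64, 800, 500))  # (40960, 0.6, 31)
</code></pre>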
<h3 id="22-fastdeployllmchat">2.2 fastdeploy.LLM.chat</h3>
<ul>
<li>messages(list[dict],list[list[dict]]): Input messages; batched message input is supported</li>
<li>sampling_params: Model hyperparameter settings; see 2.4 for details</li>
<li>use_tqdm: Whether to display an inference progress bar</li>
<li>chat_template_kwargs(dict): Extra arguments passed to the chat template; currently supports enable_thinking(bool)
<em>Example: <code>chat_template_kwargs={"enable_thinking": False}</code></em></li>
</ul>
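<p>The two accepted shapes of <code>messages</code> differ only in nesting: a single conversation is a list of message dicts, and a batch is a list of such lists. The sketch below shows one way to tell them apart. <code>as_batch</code> is a hypothetical helper written for illustration and is not how FastDeploy necessarily handles its input.</p>
<pre><code class="language-python">def as_batch(messages):
    # A single conversation is a list of dicts; wrap it so callers always see a batch.
    if messages and isinstance(messages[0], dict):
        return [messages]
    # Already a batch: a list of conversations.
    return list(messages)


single = [{'role': 'user', 'content': 'Hello'}]
batch = [single, [{'role': 'user', 'content': 'Goodbye'}]]
print(len(as_batch(single)), len(as_batch(batch)))  # 1 2
</code></pre>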
<h3 id="23-fastdeployllmgenerate">2.3 fastdeploy.LLM.generate</h3>
<ul>
<li>prompts(str, list[str], list[int], list[list[int]], dict[str, Any], list[dict[str, Any]]): Input prompt; supports batched prompts as well as input given as already tokenized token ids
<em>Example for the dict type: <code>prompts={"prompt": prompt, "multimodal_data": {"image": images}}</code></em></li>
<li>sampling_params: Model hyperparameter settings; see 2.4 for details</li>
<li>use_tqdm: Whether to display an inference progress bar</li>
</ul>
<h3 id="24-fastdeploysamplingparams">2.4 fastdeploy.SamplingParams</h3>
<ul>
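<li>presence_penalty(float): Penalty coefficient for content the model has already generated; positive values lower the probability of repeating topics</li>
<li>frequency_penalty(float): Penalty strength for repeated tokens; stricter than presence_penalty because it penalizes high-frequency repetition</li>
<li>repetition_penalty(float): Coefficient that directly penalizes tokens that have already been generated; values &gt;1 penalize repetition and values &lt;1 encourage it</li>
<li>temperature(float): Controls the randomness of generation; higher values give more random results and lower values give more deterministic results</li>
<li>top_p(float): Cumulative probability cutoff; only the set of most likely tokens whose cumulative probability reaches this threshold is considered</li>
<li>top_k(int): Number of highest-probability tokens to sample from; only the k most likely tokens are considered</li>
<li>min_p(float): Minimum probability threshold for a token to be eligible, expressed relative to the probability of the most likely token; setting it &gt;0 filters out low-probability tokens, which can improve generation quality</li>
<li>max_tokens(int): Maximum number of tokens the model may generate, counting both input and output</li>
<li>min_tokens(int): Minimum number of tokens the model is forced to generate, which prevents it from stopping too early</li>
<li>bad_words(list[str]): List of words that must not be generated, preventing unwanted words in the output</li>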
</ul>
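<p>To make the effect of <code>top_k</code> and <code>top_p</code> concrete, the sketch below filters a toy token distribution. It is a simplified illustration and not FastDeploy's sampler: <code>filter_candidates</code> is a hypothetical function, probabilities are not renormalized after truncation, and applying <code>top_k</code> before <code>top_p</code> is an assumption.</p>
<pre><code class="language-python">from bisect import bisect_left
from itertools import accumulate


def filter_candidates(probs, top_k=0, top_p=1.0):
    # probs maps token to probability; rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # top_k: keep only the k most likely tokens (0 disables the limit).
    if top_k:
        ranked = ranked[:top_k]
    # top_p: keep the shortest prefix whose cumulative probability reaches top_p.
    cumulative = list(accumulate(p for _, p in ranked))
    cutoff = bisect_left(cumulative, top_p) + 1
    return [token for token, _ in ranked[:cutoff]]


toy = {'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}
print(filter_candidates(toy, top_p=0.9))  # ['a', 'b', 'c']
print(filter_candidates(toy, top_k=2))    # ['a', 'b']
</code></pre>
<p>Sampling would then draw the next token from the surviving candidates, which is why lower <code>top_p</code> or <code>top_k</code> values make the output more conservative.</p>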
<h3 id="25-fastdeployenginerequestrequestoutput">2.5 fastdeploy.engine.request.RequestOutput</h3>
<ul>
<li>request_id(str): ID that identifies the request</li>
<li>prompt(str): Content of the input request</li>
<li>prompt_token_ids(list[int]): Token list of the input after prompt concatenation and tokenization</li>
<li>outputs(fastdeploy.engine.request.CompletionOutput): Output result</li>
<li>finished(bool): Whether inference for the current query has finished</li>
<li>metrics(fastdeploy.engine.request.RequestMetrics): Inference timing metrics</li>
<li>num_cached_tokens(int): Number of cached tokens; only meaningful when <code>enable_prefix_caching</code> is enabled</li>
<li>error_code(int): Error code</li>
<li>error_msg(str): Error message</li>
</ul>
<h3 id="26-fastdeployenginerequestcompletionoutput">2.6 fastdeploy.engine.request.CompletionOutput</h3>
<ul>
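<li>index(int): Batch index during inference serving</li>
<li>send_idx(int): Sequence number of the token returned for the current request</li>
<li>token_ids(list[int]): Output token list</li>
<li>text(str): Text corresponding to the token ids</li>
<li>reasoning_content(str): (reasoning models only) The chain-of-thought result</li>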
</ul>
<h3 id="27-fastdeployenginerequestrequestmetrics">2.7 fastdeploy.engine.request.RequestMetrics</h3>
<ul>
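<li>arrival_time(float): Time the data was received. For streaming responses this is the time the inference result was obtained; for non-streaming responses it is the time the inference data was received</li>
<li>inference_start_time(float): Time at which inference started</li>
<li>first_token_time(float): Time to first token on the inference side</li>
<li>time_in_queue(float): Time spent waiting in the queue before inference</li>
<li>model_forward_time(float): Time spent in the model forward pass on the inference side</li>
<li>model_execute_time(float): Model execution time, including forward inference, queuing, and preprocessing (text concatenation and tokenization)</li>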
</ul>
</article>
</div>
<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
</div>
</main>
<footer class="md-footer">
<div class="md-footer-meta md-typeset">
<div class="md-footer-meta__inner md-grid">
<div class="md-copyright">
<div class="md-copyright__highlight">
Copyright &copy; 2025 Maintained by FastDeploy
</div>
Made with
<a href="https://squidfunk.github.io/mkdocs-material/" target="_blank" rel="noopener">
Material for MkDocs
</a>
</div>
</div>
</div>
</footer>
</div>
<div class="md-dialog" data-md-component="dialog">
<div class="md-dialog__inner md-typeset"></div>
</div>
<script id="__config" type="application/json">{"base": "../..", "features": [], "search": "../../assets/javascripts/workers/search.973d3a69.min.js", "tags": null, "translations": {"clipboard.copied": "\u5df2\u590d\u5236", "clipboard.copy": "\u590d\u5236", "search.result.more.one": "\u5728\u8be5\u9875\u4e0a\u8fd8\u6709 1 \u4e2a\u7b26\u5408\u6761\u4ef6\u7684\u7ed3\u679c", "search.result.more.other": "\u5728\u8be5\u9875\u4e0a\u8fd8\u6709 # \u4e2a\u7b26\u5408\u6761\u4ef6\u7684\u7ed3\u679c", "search.result.none": "\u6ca1\u6709\u627e\u5230\u7b26\u5408\u6761\u4ef6\u7684\u7ed3\u679c", "search.result.one": "\u627e\u5230 1 \u4e2a\u7b26\u5408\u6761\u4ef6\u7684\u7ed3\u679c", "search.result.other": "# \u4e2a\u7b26\u5408\u6761\u4ef6\u7684\u7ed3\u679c", "search.result.placeholder": "\u952e\u5165\u4ee5\u5f00\u59cb\u641c\u7d22", "search.result.term.missing": "\u7f3a\u5c11", "select.version": "\u9009\u62e9\u5f53\u524d\u7248\u672c"}, "version": null}</script>
<script src="../../assets/javascripts/bundle.f55a23d4.min.js"></script>
</body>
</html>