[Docs] add language (en/cn) switch links (#4470)

* add install docs

* Update documentation

* Update documentation
yangjianfengo1
2025-10-17 15:47:41 +08:00
committed by GitHub
parent a3e0a15495
commit ba5c2b7e37
106 changed files with 206 additions and 0 deletions

View File

@@ -1,3 +1,5 @@
[简体中文](zh/benchmark.md)
# Benchmark
FastDeploy extends the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script with additional metrics, enabling more detailed performance benchmarking for FastDeploy.

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-0.3B-Paddle.md)
# ERNIE-4.5-0.3B
## Environmental Preparation
### 1.1 Hardware requirements

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
# ERNIE-4.5-21B-A3B
## Environmental Preparation
### 1.1 Hardware requirements

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
# ERNIE-4.5-21B-A3B
## Environmental Preparation
### 1.1 Hardware requirements

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
# ERNIE-4.5-300B-A47B
## Environmental Preparation
### 1.1 Hardware requirements

View File

@@ -1,3 +1,4 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
# ERNIE-4.5-VL-28B-A3B-Paddle

View File

@@ -1,3 +1,4 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
# ERNIE-4.5-VL-424B-A47B-Paddle

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/FAQ.md)
# FAQ
## 1. CUDA out of memory
1. When starting the service

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/README.md)
# Optimal Deployment
- [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/chunked_prefill.md)
# Chunked Prefill
Chunked Prefill employs a segmentation strategy that breaks down Prefill requests into smaller subtasks, which are then batched together with Decode requests. This approach better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, optimizes GPU resource utilization, reduces computational overhead and memory footprint per Prefill, thereby lowering peak memory usage and avoiding out-of-memory issues.
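
As a rough illustration of how this might be switched on from the offline Python API, here is a minimal sketch; the `enable_chunked_prefill` and `max_num_batched_tokens` argument names and the model id are assumptions, not taken from this page:

```python
# Minimal sketch (assumed argument names): enable Chunked Prefill for offline inference.
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="baidu/ERNIE-4.5-0.3B-Paddle",  # example model id (assumption)
    enable_chunked_prefill=True,          # split large Prefill requests into smaller chunks
    max_num_batched_tokens=2048,          # per-step token budget shared by Prefill chunks and Decode
)

outputs = llm.generate(["Summarize chunked prefill in one sentence."],
                       SamplingParams(max_tokens=64))
for out in outputs:
    print(out)  # inspect the returned objects for the generated text
```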

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/data_parallel_service.md)
# Data Parallelism
For MoE models, Expert Parallelism (EP) can be enabled in combination with Data Parallelism (DP): EP distributes expert workloads across devices, while DP enables parallel request processing.
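
A hedged sketch of what combining EP and DP could look like with the offline API; the `data_parallel_size` and `enable_expert_parallel` argument names and the model id are assumptions:

```python
# Minimal sketch (assumed argument names): Expert Parallelism + Data Parallelism for an MoE model.
from fastdeploy import LLM

llm = LLM(
    model="baidu/ERNIE-4.5-21B-A3B-Paddle",  # example MoE model id (assumption)
    data_parallel_size=2,                    # DP: two replicas process requests in parallel
    enable_expert_parallel=True,             # EP: expert weights/workload are sharded across devices
)
```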

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/disaggregated.md)
# Disaggregated Deployment
Large model inference consists of two phases: Prefill and Decode, which are compute-intensive and memory-access-intensive respectively. Deploying Prefill and Decode separately can, in certain scenarios, improve hardware utilization, effectively increase throughput, and reduce end-to-end response latency.

View File

@@ -1,3 +1,4 @@
[简体中文](../zh/features/early_stop.md)
# Early Stopping

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/graph_optimization.md)
# Graph optimization technology in FastDeploy
FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/load_balance.md)
# Global Scheduler: Multi-Instance Load Balancing
## Design Overview

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/multi-node_deployment.md)
# Multi-Node Deployment
## Overview

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/plas_attention.md)
# PLAS
## Introduction

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/plugins.md)
# FastDeploy Plugin Mechanism Documentation
FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
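
Since discovery goes through Python entry points, a third-party plugin package would register itself roughly as below; the entry-point group name `fastdeploy.plugins` and the module/function names are hypothetical placeholders:

```python
# setup.py of a hypothetical plugin package registering itself via entry_points.
# The "fastdeploy.plugins" group name and module path are placeholders, not the real group.
from setuptools import find_packages, setup

setup(
    name="my-fastdeploy-plugin",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        "fastdeploy.plugins": [
            # "<plugin name> = <module>:<callable run at load time>"
            "my_plugin = my_fastdeploy_plugin.register:register",
        ],
    },
)
```

At startup, the host side of such a mechanism typically iterates `importlib.metadata.entry_points()` for the group and calls each loaded object.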

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/prefix_caching.md)
# Prefix Caching
Prefix Caching is a technique to optimize the inference efficiency of generative models. Its core idea is to cache intermediate computation results (KV Cache) of input sequences, avoiding redundant computations and thereby accelerating response times for multiple requests sharing the same prefix.
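
A minimal sketch, assuming an `enable_prefix_caching` switch on the offline API (the argument name and model id are assumptions):

```python
# Minimal sketch (assumed argument name): requests sharing a prefix reuse the cached KV.
from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", enable_prefix_caching=True)

shared_prefix = "You are a concise technical assistant.\n"
prompts = [
    shared_prefix + "Explain chunked prefill.",
    shared_prefix + "Explain speculative decoding.",  # KV Cache of the shared prefix is reused
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```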

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/reasoning_output.md)
# Reasoning Outputs
Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
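
A sketch of reading that field from an OpenAI-style chat response over HTTP; the local URL, port, and model name are assumptions:

```python
# Minimal sketch: read the extra reasoning_content field from a chat completion response.
import requests

resp = requests.post(
    "http://localhost:8180/v1/chat/completions",   # assumed local endpoint
    json={
        "model": "default",                         # assumed model name
        "messages": [{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    },
    timeout=600,
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # reasoning steps, if returned
print("answer:", message.get("content"))
```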

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/sampling.md)
# Sampling Strategies
Sampling strategies are used to determine how to select the next token from the output probability distribution of a model. FastDeploy currently supports multiple sampling strategies including Top-p, Top-k_Top-p, and Min-p Sampling.
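
A sketch of per-request sampling settings; the parameter names mirror common serving APIs and are assumptions here:

```python
# Minimal sketch (assumed parameter names): the three sampling strategies side by side.
from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle")  # example model id (assumption)

top_p_only  = SamplingParams(temperature=0.8, top_p=0.9)            # Top-p (nucleus) sampling
top_k_top_p = SamplingParams(temperature=0.8, top_k=50, top_p=0.9)  # Top-k filter, then Top-p
min_p_only  = SamplingParams(temperature=0.8, min_p=0.05)           # drop tokens below 5% of the best token's probability

outputs = llm.generate(["Tell me a short joke."], top_k_top_p)
```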

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/speculative_decoding.md)
# 🔮 Speculative Decoding
This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/features/structured_outputs.md)
# Structured Outputs
## Overview

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/README.md)
# Get Started
- [Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes](quick_start.md)

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/ernie-4.5-vl.md)
# Deploy ERNIE-4.5-VL-424B-A47B Multimodal Model
This document explains how to deploy the ERNIE-4.5-VL multimodal model, which allows users to interact with the model using multimodal data (including reasoning). Before starting the deployment, please ensure that your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/ernie-4.5.md)
# Deploy ERNIE-4.5-300B-A47B Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/Enflame_gcu.md)
# Running ERNIE 4.5 Series Models with FastDeploy
The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/README.md)
# FastDeploy Installation
FastDeploy currently supports installation on the following hardware platforms:

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/hygon_dcu.md)
# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models on Hygon machines
The current software version is only a demonstration of large-model inference with the Hygon K100AI and the FastDeploy framework. Issues may occur when running the latest ERNIE 4.5 models; fixes and performance optimizations will follow, and subsequent releases will provide a more stable version.

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/iluvatar_gpu.md)
# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models on Iluvatar machines
## Machine Preparation

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/intel_gaudi.md)
# Intel Gaudi Installation for running ERNIE 4.5 Series Models
The following installation methods are available when your environment meets these requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/kunlunxin_xpu.md)
# Kunlunxin XPU
## Requirements

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/metax_gpu.md)
# Metax GPU Installation for running ERNIE 4.5 Series Models
The following installation methods are available when your environment meets these requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/nvidia_gpu.md)
# NVIDIA CUDA GPU Installation
The following installation methods are available when your environment meets these requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start.md)
# Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes
Before deployment, ensure your environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start_qwen.md)
# Deploy QWEN3-0.6b in 10 Minutes
Before deployment, ensure your environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start_vl.md)
# Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
Before deployment, please ensure your environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[简体中文](zh/index.md)
# FastDeploy
**FastDeploy** is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:

View File

@@ -1,3 +1,5 @@
[简体中文](zh/offline_inference.md)
# Offline Inference
## 1. Usage

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/README.md)
# OpenAI Protocol-Compatible API Server
FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
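
Once such a server is running, any OpenAI-compatible client can be pointed at it; a minimal sketch with the standard `openai` Python package follows (base URL, port, and model name are assumptions):

```python
# Minimal sketch: talk to a local FastDeploy server through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1",  # assumed host/port from your launch command
                api_key="EMPTY")                       # no key needed for a local deployment

completion = client.chat.completions.create(
    model="default",                                   # assumed model name
    messages=[{"role": "user", "content": "Introduce FastDeploy in one sentence."}],
)
print(completion.choices[0].message.content)
```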

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/graceful_shutdown_service.md)
# Graceful Service Node Shutdown Solution
## 1. Core Objective

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/metrics.md)
# Monitoring Metrics
After FastDeploy is launched, its service status can be continuously monitored through metrics. When starting FastDeploy, specify the port for the metrics service with the `metrics-port` parameter.
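
A sketch of polling the endpoint exposed on `metrics-port`; the `/metrics` path follows the usual Prometheus convention and the port is an assumption:

```python
# Minimal sketch: dump the Prometheus-style samples exposed on the metrics port.
import requests

METRICS_URL = "http://localhost:8001/metrics"  # assumed port; set via the metrics-port parameter at launch

for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
    if line and not line.startswith("#"):      # skip HELP/TYPE comment lines
        print(line)                            # "<metric_name>{labels} <value>"
```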

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/scheduler.md)
# Scheduler
FastDeploy currently supports two types of schedulers: **Local Scheduler** and **Global Scheduler**. The Global Scheduler is designed for large-scale clusters, enabling secondary load balancing across nodes based on real-time workload metrics.

View File

@@ -1,3 +1,5 @@
[简体中文](zh/parameters.md)
# FastDeploy Parameter Documentation
## Parameter Description

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/README.md)
# Quantization
FastDeploy supports various quantization inference precisions including FP8, INT8, INT4, 2-bits, etc. It supports different precision inference for weights, activations, and KVCache tensors, which can meet the inference requirements of different scenarios such as low cost, low latency, and long context.

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/online_quantization.md)
# Online Quantization
Online quantization refers to the inference engine quantizing weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to various precisions, including: INT4, INT8, and FP8.
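
A hedged sketch of requesting online quantization at load time; the `quantization` argument and the `"wint8"` option name are assumptions:

```python
# Minimal sketch (assumed argument/option names): quantize BF16 weights to INT8 while loading.
from fastdeploy import LLM

llm = LLM(
    model="baidu/ERNIE-4.5-21B-A3B-Paddle",  # BF16 checkpoint (example id, assumption)
    quantization="wint8",                    # online weight-only INT8 quantization
)
```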

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/wint2.md)
# WINT2 Quantization
Weights are compressed offline using the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The actual stored numerical type of weights is INT8, with 4 weights packed into each INT8 value, equivalent to 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded in real-time to BF16 numerical type, and calculations are performed using BF16 numerical type.

View File

@@ -1,3 +1,5 @@
[简体中文](zh/supported_models.md)
# Supported Models
FastDeploy currently supports the following models, which can be downloaded automatically during FastDeploy deployment. Specify the ``model`` parameter as a model name from the table below to automatically download the model weights (resumable downloads are supported for all of them). The following three download sources are supported:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/usage/code_overview.md)
# Code Overview
Below is an overview of the FastDeploy code structure and functionality organized by directory.

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/usage/environment_variables.md)
# FastDeploy Environment Variables
FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. Below is the documentation:

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/usage/fastdeploy_unit_test_guide.md)
# FastDeploy Unit Test Specification
1. Test Naming Conventions
- Test files must start with `test_` (see the sketch below).
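
A minimal file that follows the convention (the file, class, and test names are hypothetical):

```python
# tests/test_sampler.py -- hypothetical example; only the test_ prefix convention matters here.
import unittest


class TestSampler(unittest.TestCase):
    def test_probabilities_sum_to_one(self):
        probs = [0.5, 0.3, 0.2]
        self.assertAlmostEqual(sum(probs), 1.0)


if __name__ == "__main__":
    unittest.main()
```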

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/usage/kunlunxin_xpu_deployment.md)
## Supported Models
|Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Minimum Version Required|
|-|-|-|-|-|-|

View File

@@ -1,3 +1,5 @@
[简体中文](../zh/usage/log.md)
# Log Description
FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.

View File

@@ -1,3 +1,5 @@
[English](../benchmark.md)
# Benchmark
FastDeploy extends the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script with additional statistics, which can be used to benchmark FastDeploy's performance in more detail.

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-0.3B-Paddle.md)
# ERNIE-4.5-0.3B
## 1. Environment Preparation
### 1.1 Supported Configurations

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
# ERNIE-4.5-21B-A3B
## 1. Environment Preparation
### 1.1 Supported Configurations

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
# ERNIE-4.5-21B-A3B-Thinking
## 1. Environment Preparation
### 1.1 Supported Configurations

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
# ERNIE-4.5-300B-A47B
## 1. Environment Preparation
### 1.1 Supported Configurations

View File

@@ -1,3 +1,4 @@
[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
# ERNIE-4.5-VL-28B-A3B-Paddle

View File

@@ -1,3 +1,4 @@
[English](../../best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
# ERNIE-4.5-VL-424B-A47B-Paddle

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/FAQ.md)
# FAQ
## 1. Out of GPU memory
1. Out of memory when starting the service:

View File

@@ -1,3 +1,5 @@
[English](../../best_practices/README.md)
# Best Practices
- [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)

View File

@@ -1,3 +1,5 @@
[English](../../features/chunked_prefill.md)
# Chunked Prefill and 128K Long-Context Inference Deployment
Chunked Prefill uses a chunking strategy that splits Prefill-phase requests into smaller subtasks, which are then batched together with Decode requests. This better balances compute-intensive (Prefill) and memory-access-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of each Prefill, thereby lowering peak memory usage and avoiding out-of-memory issues.

View File

@@ -1,3 +1,5 @@
[English](../../features/data_parallel_service.md)
# Data Parallelism
For MoE models, Expert Parallelism (EP) is enabled in combination with Data Parallelism (DP): EP distributes expert workloads, while DP enables parallel request processing.

View File

@@ -1,3 +1,5 @@
[English](../../features/disaggregated.md)
# Disaggregated Deployment
Large model inference consists of two phases, Prefill and Decode, which are compute-intensive and memory-access-intensive respectively. Deploying Prefill and Decode separately can, in certain scenarios, improve hardware utilization, effectively increase throughput, and reduce end-to-end response latency.

View File

@@ -1,3 +1,4 @@
[English](../../features/early_stop.md)
# Early Stopping

View File

@@ -1,3 +1,5 @@
[English](../../features/graph_optimization.md)
# Graph Optimization Technology in FastDeploy
FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization techniques:

View File

@@ -1,3 +1,5 @@
[English](../../features/load_balance.md)
# Global Scheduler: Multi-Instance Load Balancing
## Design Overview

View File

@@ -1,3 +1,5 @@
[English](../../features/multi-node_deployment.md)
# Multi-Node Deployment
## Overview

View File

@@ -1,3 +1,5 @@
[English](../../features/plas_attention.md)
# PLAS
## Introduction

View File

@@ -1,3 +1,5 @@
[English](../../features/plugins.md)
# FastDeploy Plugin Mechanism Documentation
FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.

View File

@@ -1,3 +1,5 @@
[English](../../features/prefix_caching.md)
# Prefix Caching
Prefix Caching is a technique for optimizing the inference efficiency of generative models. Its core idea is to cache the intermediate computation results (KV Cache) of input sequences to avoid redundant computation, thereby speeding up responses for multiple requests that share the same prefix.

View File

@@ -1,3 +1,5 @@
[English](../../features/reasoning_output.md)
# Reasoning Outputs
Reasoning models return a `reasoning_content` field in their output, containing the chain-of-thought content, i.e. the reasoning steps that led to the final conclusion.

View File

@@ -1,3 +1,5 @@
[English](../../features/sampling.md)
# Sampling Strategies
Sampling strategies determine how the next token is selected from the model's output probability distribution. FastDeploy currently supports multiple sampling strategies, including Top-p, Top-k_Top-p, and Min-p sampling.

View File

@@ -1,3 +1,5 @@
[English](../../features/speculative_decoding.md)
# 🔮 Speculative Decoding
This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.

View File

@@ -1,3 +1,5 @@
[English](../../features/structured_outputs.md)
# Structured Outputs
## Overview

View File

@@ -1,3 +1,5 @@
[English](../../get_started/ernie-4.5-vl.md)
# ERNIE-4.5-VL Multimodal Model
This document explains how to deploy the ERNIE-4.5-VL multimodal model, which supports interacting with the model using multimodal data (including reasoning). Before starting the deployment, please ensure your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../get_started/ernie-4.5.md)
# ERNIE-4.5 Model
This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/Enflame_gcu.md)
# Running ERNIE 4.5 Series Models on the Enflame S60 with FastDeploy
The Enflame S60 ([learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale data center deployment. It meets the needs of large language models, search/advertising/recommendation systems, and traditional models, features broad model coverage, ease of use, and easy migration and deployment, and is widely applicable to mainstream inference scenarios such as image and text generation, search and recommendation, and text, image, and speech recognition.

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/README.md)
# FastDeploy Installation
FastDeploy supports the following hardware platforms:

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/hygon_dcu.md)
# Running ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B on the Hygon K100AI with FastDeploy
The current software version is only a demonstration of large-model inference with the Hygon K100AI and the FastDeploy framework. Issues may occur when running the latest ERNIE 4.5 models; fixes and performance optimizations will follow, and subsequent releases will provide a more stable version.

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/iluvatar_gpu.md)
# Running ERNIE-4.5-300B-A47B-BF16 & ERNIE-4.5-21B-A3B on Iluvatar machines
## Machine Preparation

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/intel_gaudi.md)
# Running ERNIE 4.5 Series Models with Intel Gaudi
Provided your environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/kunlunxin_xpu.md)
# Kunlunxin XPU
## Requirements

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/metax_gpu.md)
# Running ERNIE 4.5 Series Models on the Metax GPU C550
FastDeploy has been deeply adapted and optimized for the ERNIE 4.5 series models on the Metax C550, unifying the inference entry point with the GPU version so that inference workloads can be migrated without any code changes.

View File

@@ -1,3 +1,5 @@
[English](../../../get_started/installation/nvidia_gpu.md)
# NVIDIA CUDA GPU Installation
Provided your environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../get_started/quick_start.md)
# Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes
This document explains how to deploy the ERNIE-4.5-0.3B-Base-Paddle model. Before starting the deployment, please ensure your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../get_started/quick_start_qwen.md)
# Deploy Qwen3-0.6b in 10 Minutes
This document explains how to deploy the Qwen3-0.6b model. Before starting the deployment, please ensure your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../../get_started/quick_start_vl.md)
# Deploy the ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
This document explains how to deploy the ERNIE-4.5-VL-28B-A3B-Paddle model. Before starting the deployment, please ensure your hardware environment meets the following requirements:

View File

@@ -1,3 +1,5 @@
[English](../index.md)
# FastDeploy
**FastDeploy** is an inference and deployment toolkit for large language models (LLMs) and vision-language models (VLMs) based on PaddlePaddle, providing **production-ready, out-of-the-box deployment solutions**. Its core features include:

View File

@@ -1,3 +1,5 @@
[English](../offline_inference.md)
# Offline Inference
## 1. Usage

View File

@@ -1,3 +1,5 @@
[English](../../online_serving/README.md)
# OpenAI Protocol-Compatible Serving Deployment
FastDeploy provides a serving deployment solution compatible with the OpenAI protocol. It can be deployed quickly with the following command:

View File

@@ -1,3 +1,5 @@
[English](../../online_serving/graceful_shutdown_service.md)
# Graceful Service Node Shutdown Solution
## 1. Core Objective

View File

@@ -1,3 +1,5 @@
[English](../../online_serving/metrics.md)
# Monitoring Metrics
After FastDeploy is launched, its service status can be continuously monitored through metrics. When starting FastDeploy, the port for the metrics service can be specified via the `metrics-port` parameter.

View File

@@ -1,3 +1,5 @@
[English](../../online_serving/scheduler.md)
# Scheduler
FastDeploy currently supports two schedulers: the **Local Scheduler** and the **Global Scheduler**. The Global Scheduler is designed for large-scale clusters, enabling secondary load balancing across nodes based on real-time workload metrics.

View File

@@ -1,3 +1,5 @@
[English](../parameters.md)
# FastDeploy Parameter Documentation
When deploying models with FastDeploy (both offline inference and serving), the following parameters are involved. Note that for offline inference the parameters are configured under the names listed below, while when starting the service from the command line the separator in each parameter name changes from ```_``` to ```-```; for example, ```max_model_len``` becomes ```--max-model-len``` on the command line.
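
A small sketch of the same parameter in both spellings (the model id and value are illustrative):

```python
# Minimal sketch: underscore names in the offline API, hyphenated names on the command line.
from fastdeploy import LLM

# Offline inference: underscore-separated keyword argument.
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=32768)

# Serving from the command line, the same option is spelled with hyphens:
#   ... --max-model-len 32768
```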

View File

@@ -1,3 +1,5 @@
[English](../../quantization/README.md)
# Quantization
FastDeploy supports multiple quantization inference precisions, including FP8, INT8, INT4, and 2-bit. Different precisions can be used for the weight, activation, and KVCache tensors, meeting the inference needs of scenarios such as low cost, low latency, and long context.

View File

@@ -1,3 +1,5 @@
[English](../../quantization/online_quantization.md)
# Online Quantization
Online quantization means the inference engine quantizes weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to various precisions, including INT4, INT8, and FP8.

View File

@@ -1,3 +1,5 @@
[English](../../quantization/wint2.md)
# WINT2 Quantization
Weights are compressed offline with the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The weights are actually stored as INT8, with 4 weights packed into each INT8 value, i.e. 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded to BF16 in real time, and computation is performed in BF16.

View File

@@ -1,3 +1,5 @@
[English](../supported_models.md)
# Supported Models
FastDeploy currently supports the models listed below. Specify the ``model`` parameter as a model name from the table to automatically download the model weights (resumable downloads are supported for all of them). The following three download sources are supported:

Some files were not shown because too many files have changed in this diff.