Mirror of https://github.com/PaddlePaddle/FastDeploy.git (synced 2025-11-01 12:22:53 +08:00)
[Docx] add language (en/cn) switch links (#4470)
* add install docs
* update docs
* update docs
@@ -1,3 +1,5 @@
[简体中文](zh/benchmark.md)

# Benchmark

FastDeploy extends the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script with additional metrics, enabling more detailed performance benchmarking for FastDeploy.
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-0.3B-Paddle.md)

# ERNIE-4.5-0.3B
## Environmental Preparation
### 1.1 Hardware requirements
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md)

# ERNIE-4.5-21B-A3B
## Environmental Preparation
### 1.1 Hardware requirements
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md)

# ERNIE-4.5-21B-A3B
## Environmental Preparation
### 1.1 Hardware requirements
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md)

# ERNIE-4.5-300B-A47B
## Environmental Preparation
### 1.1 Hardware requirements
@@ -1,3 +1,4 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)

# ERNIE-4.5-VL-28B-A3B-Paddle
@@ -1,3 +1,4 @@
[简体中文](../zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)

# ERNIE-4.5-VL-424B-A47B-Paddle
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/FAQ.md)

# FAQ
## 1. CUDA out of memory
1. When starting the service:
@@ -1,3 +1,5 @@
[简体中文](../zh/best_practices/README.md)

# Optimal Deployment

- [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)
@@ -1,3 +1,5 @@
[简体中文](../zh/features/chunked_prefill.md)

# Chunked Prefill

Chunked Prefill employs a segmentation strategy that breaks down Prefill requests into smaller subtasks, which are then batched together with Decode requests. This approach better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, optimizes GPU resource utilization, reduces computational overhead and memory footprint per Prefill, thereby lowering peak memory usage and avoiding out-of-memory issues.
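To make the scheduling idea in this excerpt concrete, here is a small, purely illustrative Python sketch; the chunk size and request names are made up, and this is not FastDeploy's actual scheduler:

```python
# Toy scheduler: split each long prefill into fixed-size chunks and batch every
# chunk together with the pending decode requests.
from collections import deque

def build_batches(prefill_lens, decode_reqs, chunk_size=512):
    """Yield mixed batches: one prefill chunk plus one decode step per request."""
    chunks = deque(
        f"prefill-{i}[{off}:{min(off + chunk_size, n)}]"
        for i, n in enumerate(prefill_lens)
        for off in range(0, n, chunk_size)
    )
    while chunks:
        yield [chunks.popleft()] + [f"decode({r})" for r in decode_reqs]

for step, batch in enumerate(build_batches([1300], ["req-a", "req-b"], chunk_size=512)):
    print(step, batch)
```

Each batch mixes one compute-heavy prefill chunk with memory-bound decode steps, which is the balance the paragraph above describes.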
@@ -1,3 +1,5 @@
[简体中文](../zh/features/data_parallel_service.md)

# Data Parallelism
For MoE models, Expert Parallelism (EP) can be combined with Data Parallelism (DP): EP distributes expert workloads across devices, while DP processes requests in parallel.
@@ -1,3 +1,5 @@
[简体中文](../zh/features/disaggregated.md)

# Disaggregated Deployment

Large model inference consists of two phases: Prefill and Decode, which are compute-intensive and memory access-intensive respectively. Deploying Prefill and Decode separately can, in certain scenarios, improve hardware utilization, effectively increase throughput, and reduce end-to-end latency.
@@ -1,3 +1,4 @@
[简体中文](../zh/features/early_stop.md)

# Early Stopping
@@ -1,3 +1,5 @@
[简体中文](../zh/features/graph_optimization.md)

# Graph optimization technology in FastDeploy

FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:
@@ -1,3 +1,5 @@
[简体中文](../zh/features/load_balance.md)

# Global Scheduler: Multi-Instance Load Balancing

## Design Overview
@@ -1,3 +1,5 @@
[简体中文](../zh/features/multi-node_deployment.md)

# Multi-Node Deployment

## Overview
@@ -1,3 +1,5 @@
[简体中文](../zh/features/plas_attention.md)

# PLAS

## Introduction
@@ -1,3 +1,5 @@
[简体中文](../zh/features/plugins.md)

# FastDeploy Plugin Mechanism Documentation

FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
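The `entry_points` mechanism mentioned above can be illustrated with standard-library tooling; the group name below is a placeholder, not necessarily the group FastDeploy actually uses:

```python
# Discovering and loading plugins registered under an entry-point group.
# A plugin package would declare in its pyproject.toml, for example:
#   [project.entry-points."fastdeploy.plugins"]   <- placeholder group name
#   my_plugin = "my_pkg.plugin:register"
from importlib.metadata import entry_points

def load_plugins(group="fastdeploy.plugins"):
    for ep in entry_points(group=group):   # keyword filtering needs Python 3.10+
        register = ep.load()               # import the callable the plugin exported
        register()                         # let the plugin register itself

if __name__ == "__main__":
    load_plugins()
```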
@@ -1,3 +1,5 @@
[简体中文](../zh/features/prefix_caching.md)

# Prefix Caching

Prefix Caching is a technique to optimize the inference efficiency of generative models. Its core idea is to cache intermediate computation results (KV Cache) of input sequences, avoiding redundant computations and thereby accelerating response times for multiple requests sharing the same prefix.
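A toy sketch of the prefix-reuse idea follows; it is illustrative only, since the real KV cache is managed in GPU memory blocks rather than a Python dict:

```python
# Cache "KV" results keyed by token prefix so a new request only recomputes the
# suffix beyond its longest previously-seen prefix.
kv_cache: dict[tuple[int, ...], str] = {}

def prefill(tokens: list[int]) -> str:
    hit = max((n for n in range(len(tokens), 0, -1)
               if tuple(tokens[:n]) in kv_cache), default=0)
    kv = kv_cache.get(tuple(tokens[:hit]), "")
    for i in range(hit, len(tokens)):              # fresh compute only for the suffix
        kv += f"|kv({tokens[i]})"
        kv_cache[tuple(tokens[:i + 1])] = kv
    return kv

prefill([1, 2, 3, 4])        # computes KV for all four tokens
prefill([1, 2, 3, 4, 5])     # reuses the cached prefix, computes only token 5
```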
@@ -1,3 +1,5 @@
[简体中文](../zh/features/reasoning_output.md)

# Reasoning Outputs

Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
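A hedged client-side example of reading that field from an OpenAI-compatible chat completion; the port, model name, and endpoint path are assumptions for illustration:

```python
import requests

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
}
resp = requests.post("http://localhost:8180/v1/chat/completions", json=payload).json()
message = resp["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # intermediate reasoning steps
print("answer:", message.get("content"))               # final answer only
```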
@@ -1,3 +1,5 @@
[简体中文](../zh/features/sampling.md)

# Sampling Strategies

Sampling strategies are used to determine how to select the next token from the output probability distribution of a model. FastDeploy currently supports multiple sampling strategies including Top-p, Top-k_Top-p, and Min-p Sampling.
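As a reference for what such a strategy does, here is a minimal NumPy implementation of Top-p (nucleus) sampling; it illustrates the concept and is not FastDeploy's sampling kernel:

```python
import numpy as np

def top_p_sample(logits: np.ndarray, top_p: float = 0.9, rng=None) -> int:
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens from most to least likely
    cum = np.cumsum(probs[order])
    keep = cum <= top_p
    keep[0] = True                               # always keep the top token
    nucleus = order[keep]
    p = probs[nucleus] / probs[nucleus].sum()    # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=p))

print(top_p_sample(np.array([2.0, 1.0, 0.5, -1.0]), top_p=0.8))
```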
@@ -1,3 +1,5 @@
[简体中文](../zh/features/speculative_decoding.md)

# 🔮 Speculative Decoding

This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.
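The draft-and-verify loop behind speculative decoding, in its simplest greedy form, is sketched below; this is purely conceptual, and MTP in particular proposes draft tokens differently than a separate draft model would:

```python
def speculative_step(target_next, draft_next, context, k=4):
    """Propose k draft tokens, keep the prefix the target model agrees with."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    accepted = []
    for tok in draft:
        if target_next(context + accepted) == tok:   # greedy acceptance test
            accepted.append(tok)
        else:
            break
    accepted.append(target_next(context + accepted)) # target adds one token anyway
    return accepted

# Toy "models": each predicts the next value from the context length.
target_next = lambda ctx: len(ctx) % 5
draft_next = lambda ctx: len(ctx) % 5 if len(ctx) != 4 else 0
print(speculative_step(target_next, draft_next, context=[1, 2, 3]))
```

When the draft model agrees with the target on several tokens in a row, those tokens are accepted in a single target-model pass, which is where the latency win comes from.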
@@ -1,3 +1,5 @@
[简体中文](../zh/features/structured_outputs.md)

# Structured Outputs

## Overview
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/README.md)

# Get Started

- [Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes](quick_start.md)
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/ernie-4.5-vl.md)

# Deploy ERNIE-4.5-VL-424B-A47B Multimodal Model

This document explains how to deploy the ERNIE-4.5-VL multimodal model, which allows users to interact with the model using multimodal data (including reasoning capabilities). Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/ernie-4.5.md)

# Deploy ERNIE-4.5-300B-A47B Model

This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/Enflame_gcu.md)

# Running ERNIE 4.5 Series Models with FastDeploy

The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/README.md)

# FastDeploy Installation

FastDeploy currently supports installation on the following hardware platforms:
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/hygon_dcu.md)

# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models on Hygon machines
The current release is only a demonstration of large-model inference with FastDeploy on the Hygon K100AI. Issues may occur when running the latest ERNIE 4.5 models; fixes and performance optimizations are planned, and later releases will provide a more stable version.
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/iluvatar_gpu.md)

# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models on Iluvatar machines

## Machine Preparation
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/intel_gaudi.md)

# Intel Gaudi Installation for running ERNIE 4.5 Series Models

The following installation methods are available when your environment meets these requirements:
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/kunlunxin_xpu.md)

# Kunlunxin XPU

## Requirements
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/metax_gpu.md)

# Metax GPU Installation for running ERNIE 4.5 Series Models

The following installation methods are available when your environment meets these requirements:
@@ -1,3 +1,5 @@
[简体中文](../../zh/get_started/installation/nvidia_gpu.md)

# NVIDIA CUDA GPU Installation

The following installation methods are available when your environment meets these requirements:
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start.md)

# Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes

Before deployment, ensure your environment meets the following requirements:
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start_qwen.md)

# Deploy Qwen3-0.6B in 10 Minutes

Before deployment, ensure your environment meets the following requirements:
@@ -1,3 +1,5 @@
[简体中文](../zh/get_started/quick_start_vl.md)

# Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes

Before deployment, please ensure your environment meets the following requirements:
@@ -1,3 +1,5 @@
[简体中文](zh/index.md)

# FastDeploy

**FastDeploy** is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:
@@ -1,3 +1,5 @@
[简体中文](zh/offline_inference.md)

# Offline Inference

## 1. Usage
@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/README.md)

# OpenAI Protocol-Compatible API Server

FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/graceful_shutdown_service.md)

# Graceful Service Node Shutdown Solution

## 1. Core Objective
@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/metrics.md)

# Monitoring Metrics

After FastDeploy is launched, it supports continuous monitoring of the FastDeploy service status through Metrics. When starting FastDeploy, you can specify the port for the Metrics service by configuring the `metrics-port` parameter.
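A hedged example of polling that metrics service: the port should match whatever you pass as `metrics-port`, and the `/metrics` path follows the usual Prometheus convention and is an assumption here:

```python
import requests

metrics_port = 8001                                   # value configured via metrics-port
text = requests.get(f"http://localhost:{metrics_port}/metrics").text
for line in text.splitlines():
    if line and not line.startswith("#"):             # skip HELP/TYPE comment lines
        print(line)                                    # "<metric>{labels} <value>"
```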
@@ -1,3 +1,5 @@
[简体中文](../zh/online_serving/scheduler.md)

# Scheduler

FastDeploy currently supports two types of schedulers: **Local Scheduler** and **Global Scheduler**. The Global Scheduler is designed for large-scale clusters, enabling secondary load balancing across nodes based on real-time workload metrics.
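A toy illustration of the global-scheduler idea, routing each request to the least-loaded node; this is not the project's actual scheduler, which coordinates load state across instances:

```python
# Route each incoming request to the node currently reporting the lowest load.
loads = {"node-a": 3, "node-b": 1, "node-c": 2}   # e.g. queued requests per node

def dispatch(loads: dict[str, int]) -> str:
    node = min(loads, key=loads.get)
    loads[node] += 1                              # account for the new request
    return node

for _ in range(4):
    print("dispatch to", dispatch(loads))
```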
@@ -1,3 +1,5 @@
[简体中文](zh/parameters.md)

# FastDeploy Parameter Documentation

## Parameter Description
@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/README.md)

# Quantization

FastDeploy supports various quantization inference precisions including FP8, INT8, INT4, 2-bits, etc. It supports different precision inference for weights, activations, and KVCache tensors, which can meet the inference requirements of different scenarios such as low cost, low latency, and long context.
@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/online_quantization.md)

# Online Quantization

Online quantization refers to the inference engine quantizing weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to various precisions, including: INT4, INT8, and FP8.
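To ground the idea, here is a minimal per-channel INT8 weight quantization sketch in NumPy; this shows the concept only, while FastDeploy's online quantization runs inside the engine with fused kernels:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a float weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(dequantize_int8(q, s) - w).max())
```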
@@ -1,3 +1,5 @@
[简体中文](../zh/quantization/wint2.md)

# WINT2 Quantization

Weights are compressed offline using the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The actual stored numerical type of weights is INT8, with 4 weights packed into each INT8 value, equivalent to 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded in real-time to BF16 numerical type, and calculations are performed using BF16 numerical type.
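The 4-weights-per-INT8 storage layout described above can be illustrated with a simple pack/unpack round trip; this is illustrative only and not the CCQ encoder itself:

```python
import numpy as np

def pack_wint2(q: np.ndarray) -> np.ndarray:
    """Pack 2-bit values (0..3), four per byte."""
    q = q.reshape(-1, 4).astype(np.uint8)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.int8)

def unpack_wint2(packed: np.ndarray) -> np.ndarray:
    b = packed.astype(np.uint8)
    return np.stack([(b >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

weights = np.random.randint(0, 4, size=16)
assert np.array_equal(unpack_wint2(pack_wint2(weights)), weights)
print(pack_wint2(weights))   # 16 two-bit weights stored in 4 bytes
```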
@@ -1,3 +1,5 @@
[简体中文](zh/supported_models.md)

# Supported Models

FastDeploy currently supports the following models, which can be downloaded automatically during FastDeploy deployment. Specify the ``model`` parameter as the model name in the table below to automatically download model weights (all support resumable downloads). The following three download sources are supported:
@@ -1,3 +1,5 @@
[简体中文](../zh/usage/code_overview.md)

# Code Overview

Below is an overview of the FastDeploy code structure and functionality organized by directory.
@@ -1,3 +1,5 @@
[简体中文](../zh/usage/environment_variables.md)

# FastDeploy Environment Variables

FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. Below is the documentation:
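An `envs.py`-style module typically exposes environment variables with defaults parsed in one place, along the lines of the sketch below; the variable names here are illustrative, not actual FastDeploy settings:

```python
import os

# name -> callable returning the parsed value, so defaults live in one place
environment_variables = {
    "FD_EXAMPLE_LOG_LEVEL": lambda: os.getenv("FD_EXAMPLE_LOG_LEVEL", "INFO"),
    "FD_EXAMPLE_MAX_BATCH": lambda: int(os.getenv("FD_EXAMPLE_MAX_BATCH", "256")),
}

def get_env(name: str):
    return environment_variables[name]()

print(get_env("FD_EXAMPLE_LOG_LEVEL"), get_env("FD_EXAMPLE_MAX_BATCH"))
```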
@@ -1,3 +1,5 @@
[简体中文](../zh/usage/fastdeploy_unit_test_guide.md)

# FastDeploy Unit Test Specification
1. Test Naming Conventions
- Test files must start with test_.
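A minimal pytest file following the naming convention above; the function under test is a stand-in, not a FastDeploy API:

```python
# test_tokenize.py -- the file name starts with test_, as required
def simple_tokenize(text: str) -> list[str]:
    """Stand-in function under test."""
    return text.split()

def test_simple_tokenize_splits_on_whitespace():
    assert simple_tokenize("hello world") == ["hello", "world"]

def test_simple_tokenize_handles_empty_string():
    assert simple_tokenize("") == []
```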
@@ -1,3 +1,5 @@
[简体中文](../zh/usage/kunlunxin_xpu_deployment.md)

## Supported Models
|Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Minimum Version Required|
|-|-|-|-|-|-|
@@ -1,3 +1,5 @@
[简体中文](../zh/usage/log.md)

# Log Description

FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.
@@ -1,3 +1,5 @@
[English](../benchmark.md)

# Benchmark

FastDeploy builds on the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script and adds extra statistics, enabling more detailed performance benchmarking of FastDeploy.
@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-0.3B-Paddle.md)

# ERNIE-4.5-0.3B
## 1. Environment Preparation
### 1.1 Support Status
@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-21B-A3B-Paddle.md)

# ERNIE-4.5-21B-A3B
## 1. Environment Preparation
### 1.1 Support Status
@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-21B-A3B-Thinking.md)

# ERNIE-4.5-21B-A3B-Thinking
## 1. Environment Preparation
### 1.1 Support Status
@@ -1,3 +1,5 @@
[English](../../best_practices/ERNIE-4.5-300B-A47B-Paddle.md)

# ERNIE-4.5-300B-A47B
## 1. Environment Preparation
### 1.1 Support Status
@@ -1,3 +1,4 @@
[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)

# ERNIE-4.5-VL-28B-A3B-Paddle
@@ -1,3 +1,4 @@
[English](../../best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)

# ERNIE-4.5-VL-424B-A47B-Paddle
@@ -1,3 +1,5 @@
[English](../../best_practices/FAQ.md)

# FAQ
## 1. Out of GPU memory
1. Out of memory when starting the service:
@@ -1,3 +1,5 @@
[English](../../best_practices/README.md)

# Best Practices

- [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)
@@ -1,3 +1,5 @@
[English](../../features/chunked_prefill.md)

# Chunked Prefill and 128K Long-Context Inference Deployment

Chunked Prefill employs a segmentation strategy that breaks Prefill requests into smaller subtasks, which are then batched together with Decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of each Prefill pass, thereby lowering peak memory usage and avoiding out-of-memory issues.
@@ -1,3 +1,5 @@
[English](../../features/data_parallel_service.md)

# Data Parallelism
For MoE models, Expert Parallelism (EP) can be combined with Data Parallelism (DP): EP distributes expert workloads across devices, while DP processes requests in parallel.
@@ -1,3 +1,5 @@
[English](../../features/disaggregated.md)

# Disaggregated Deployment

Large model inference consists of two phases, Prefill and Decode, which are compute-intensive and memory-intensive respectively. Deploying Prefill and Decode separately can, in certain scenarios, improve hardware utilization, effectively increase throughput, and reduce end-to-end latency.
@@ -1,3 +1,4 @@
[English](../../features/early_stop.md)

# Early Stopping
@@ -1,3 +1,5 @@
[English](../../features/graph_optimization.md)

# Graph Optimization Technology in FastDeploy
FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:
@@ -1,3 +1,5 @@
[English](../../features/load_balance.md)

# Global Scheduler: Multi-Instance Load Balancing

## Design Overview
@@ -1,3 +1,5 @@
[English](../../features/multi-node_deployment.md)

# Multi-Node Deployment

## Overview
@@ -1,3 +1,5 @@
[English](../../features/plas_attention.md)

# PLAS

## Introduction
@@ -1,3 +1,5 @@
[English](../../features/plugins.md)

# FastDeploy Plugin Mechanism Documentation

FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
@@ -1,3 +1,5 @@
[English](../../features/prefix_caching.md)

# Prefix Caching

Prefix Caching is a technique for optimizing the inference efficiency of generative models. Its core idea is to cache the intermediate computation results (KV Cache) of input sequences, avoiding redundant computation and thereby speeding up responses for multiple requests that share the same prefix.
@@ -1,3 +1,5 @@
[English](../../features/reasoning_output.md)

# Reasoning Outputs

Reasoning models return a `reasoning_content` field in their output, containing the chain-of-thought steps that led to the final conclusion.
@@ -1,3 +1,5 @@
[English](../../features/sampling.md)

# Sampling Strategies

Sampling strategies determine how the next token is selected from the model's output probability distribution. FastDeploy currently supports multiple strategies, including Top-p, Top-k_Top-p, and Min-p sampling.
@@ -1,3 +1,5 @@
[English](../../features/speculative_decoding.md)

# 🔮 Speculative Decoding
This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle, supporting Multi-Token Proposing (MTP) to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.
@@ -1,3 +1,5 @@
[English](../../features/structured_outputs.md)

# Structured Outputs

## Overview
@@ -1,3 +1,5 @@
[English](../../get_started/ernie-4.5-vl.md)

# ERNIE-4.5-VL Multimodal Model

This document explains how to deploy the ERNIE-4.5-VL multimodal model, which allows users to interact with the model using multimodal data (including reasoning). Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../get_started/ernie-4.5.md)

# ERNIE-4.5 Model

This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/Enflame_gcu.md)

# Running ERNIE 4.5 Series Models on the Enflame S60 with FastDeploy

The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models, search/advertising/recommendation systems, and traditional models, featuring broad model coverage, ease of use, and easy migration and deployment. It is widely applicable to mainstream inference scenarios such as image and text generation, search and recommendation, and text, image, and speech recognition.
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/README.md)

# FastDeploy Installation

FastDeploy supports the following hardware platforms:
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/hygon_dcu.md)

# Running ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B on Hygon K100AI with FastDeploy
The current release is only a demonstration of large-model inference with FastDeploy on the K100AI. Issues may occur when running the latest ERNIE 4.5 models; fixes and performance optimizations will follow, and later releases will provide a more stable version.
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/iluvatar_gpu.md)

# How to Run ERNIE-4.5-300B-A47B-BF16 & ERNIE-4.5-21B-A3B on Iluvatar Machines

## Machine Preparation
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/intel_gaudi.md)

# Running ERNIE 4.5 Series Models on Intel Gaudi

Provided your environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/kunlunxin_xpu.md)

# Kunlunxin XPU

## Requirements
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/metax_gpu.md)

# Running ERNIE 4.5 Series Models on Metax GPU C550

FastDeploy has been deeply adapted and optimized for the ERNIE 4.5 series on the Metax C550, unifying the inference entry point with the GPU path so that inference workloads can be migrated without code changes.
@@ -1,3 +1,5 @@
[English](../../../get_started/installation/nvidia_gpu.md)

# NVIDIA CUDA GPU Installation

Provided your environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../get_started/quick_start.md)

# Deploy the ERNIE-4.5-0.3B-Paddle Model in 10 Minutes

This document explains how to deploy the ERNIE-4.5-0.3B-Base-Paddle model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../get_started/quick_start_qwen.md)

# Deploy the Qwen3-0.6B Model in 10 Minutes

This document explains how to deploy the Qwen3-0.6B model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../../get_started/quick_start_vl.md)

# Deploy the ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes

This document explains how to deploy the ERNIE-4.5-VL-28B-A3B-Paddle model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
@@ -1,3 +1,5 @@
[English](../index.md)

# FastDeploy

**FastDeploy** is an inference and deployment toolkit for large language models (LLMs) and vision-language models (VLMs) based on PaddlePaddle, delivering **production-ready, out-of-the-box deployment solutions** with the following core features:
@@ -1,3 +1,5 @@
[English](../offline_inference.md)

# Offline Inference

## 1. Usage
@@ -1,3 +1,5 @@
[English](../../online_serving/README.md)

# OpenAI Protocol-Compatible Serving Deployment

FastDeploy provides a service-oriented deployment solution compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
@@ -1,3 +1,5 @@
[English](../../online_serving/graceful_shutdown_service.md)

# Graceful Service Node Shutdown Solution

## 1. Core Objective
@@ -1,3 +1,5 @@
[English](../../online_serving/metrics.md)

# Monitoring Metrics

After FastDeploy is launched, its service status can be continuously monitored through Metrics. When starting FastDeploy, the port of the Metrics service can be specified via the `metrics-port` parameter.
@@ -1,3 +1,5 @@
[English](../../online_serving/scheduler.md)

# Scheduler

FastDeploy currently supports two schedulers: the **Local Scheduler** and the **Global Scheduler**. The Global Scheduler is designed for large-scale clusters and performs secondary load balancing across nodes based on real-time workload metrics.
@@ -1,3 +1,5 @@
[English](../parameters.md)

# FastDeploy Parameter Documentation

When deploying models with FastDeploy (for both offline inference and serving), the following parameters are involved. Note that for offline inference the parameter names are used exactly as listed below, while when launching the service from the command line the separator in each parameter changes from `_` to `-`; for example, `max_model_len` becomes `--max-model-len`.
@@ -1,3 +1,5 @@
[English](../../quantization/README.md)

# Quantization

FastDeploy supports various quantization inference precisions, including FP8, INT8, INT4, and 2-bit. It supports different inference precisions for weights, activations, and the KV Cache, meeting the needs of scenarios such as low cost, low latency, and long context.
@@ -1,3 +1,5 @@
[English](../../quantization/online_quantization.md)

# Online Quantization

Online quantization means the inference engine quantizes the weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to several precisions, including INT4, INT8, and FP8.
@@ -1,3 +1,5 @@
[English](../../quantization/wint2.md)

# WINT2 Quantization

Weights are compressed offline with the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. Weights are actually stored as INT8, with 4 weights packed into each INT8 value, i.e. 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded on the fly to BF16, and computation is performed in BF16.
@@ -1,3 +1,5 @@
[English](../supported_models.md)

# Supported Models

FastDeploy currently supports the following models. Specify the ``model`` parameter as a model name from the table below to automatically download the model weights during deployment (all support resumable downloads). The following three download sources are supported:
Some files were not shown because too many files have changed in this diff.