diff --git a/docs/benchmark.md b/docs/benchmark.md
index 46283b627..1a2e6f880 100644
--- a/docs/benchmark.md
+++ b/docs/benchmark.md
@@ -1,3 +1,5 @@
+[简体中文](zh/benchmark.md)
+
 # Benchmark
 
 FastDeploy extends the [vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/) script with additional metrics, enabling more detailed performance benchmarking for FastDeploy.
diff --git a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
index dbd244cb9..fc7158127 100644
--- a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-0.3B-Paddle.md)
+
 # ERNIE-4.5-0.3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index 306b4e715..4185a3041 100644
--- a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
+
 # ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
index 639ac6063..05328ff08 100644
--- a/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
+
 # ERNIE-4.5-21B-A3B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
diff --git a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index 1aba169a0..13668f8b9 100644
--- a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
+
 # ERNIE-4.5-300B-A47B
 ## Environmental Preparation
 ### 1.1 Hardware requirements
diff --git a/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
index ba4681a65..536c006ba 100644
--- a/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -1,3 +1,4 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
 # ERNIE-4.5-VL-28B-A3B-Paddle
 
 
diff --git a/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
index 670438968..d2ae9c42b 100644
--- a/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -1,3 +1,4 @@
+[简体中文](../zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
 # ERNIE-4.5-VL-424B-A47B-Paddle
 
 
diff --git a/docs/best_practices/FAQ.md b/docs/best_practices/FAQ.md
index 71e80ce05..851bfbd68 100644
--- a/docs/best_practices/FAQ.md
+++ b/docs/best_practices/FAQ.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/FAQ.md)
+
 # FAQ
 ## 1.CUDA out of memory
 1. when starting the service:
diff --git a/docs/best_practices/README.md b/docs/best_practices/README.md
index 6cd1a0260..f2bd584cf 100644
--- a/docs/best_practices/README.md
+++ b/docs/best_practices/README.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/best_practices/README.md)
+
 # Optimal Deployment
 
 - [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)
diff --git a/docs/features/chunked_prefill.md b/docs/features/chunked_prefill.md
index 6bfc9b8ae..cdbbc3e27 100644
--- a/docs/features/chunked_prefill.md
+++ b/docs/features/chunked_prefill.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/chunked_prefill.md)
+
 # Chunked Prefill
 
 Chunked Prefill employs a segmentation strategy that breaks down Prefill requests into smaller subtasks, which are then batched together with Decode requests. This approach better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, optimizes GPU resource utilization, reduces computational overhead and memory footprint per Prefill, thereby lowering peak memory usage and avoiding out-of-memory issues.
diff --git a/docs/features/data_parallel_service.md b/docs/features/data_parallel_service.md
index 4f496d3f3..2854bcf47 100644
--- a/docs/features/data_parallel_service.md
+++ b/docs/features/data_parallel_service.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/data_parallel_service.md)
+
 # Data Parallelism
 
 Under the MOE model, enabling Expert Parallelism (EP) combined with Data Parallelism (DP), where EP distributes expert workloads and DP enables parallel request processing.
diff --git a/docs/features/disaggregated.md b/docs/features/disaggregated.md
index 44f29b5d9..58ecaeb24 100644
--- a/docs/features/disaggregated.md
+++ b/docs/features/disaggregated.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/disaggregated.md)
+
 # Disaggregated Deployment
 
 Large model inference consists of two phases: Prefill and Decode, which are compute-intensive and memory access-intensive respectively. Deploying Prefill and Decode separately in certain scenarios can improve hardware utilization, effectively increase throughput, and reduce overall sentence latency.
diff --git a/docs/features/early_stop.md b/docs/features/early_stop.md
index f0e0e2686..1f0bb87b6 100644
--- a/docs/features/early_stop.md
+++ b/docs/features/early_stop.md
@@ -1,3 +1,4 @@
+[简体中文](../zh/features/early_stop.md)
 # Early Stopping
 
 
diff --git a/docs/features/graph_optimization.md b/docs/features/graph_optimization.md
index 09d93f105..78198333b 100644
--- a/docs/features/graph_optimization.md
+++ b/docs/features/graph_optimization.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/graph_optimization.md)
+
 # Graph optimization technology in FastDeploy
 
 FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:
diff --git a/docs/features/load_balance.md b/docs/features/load_balance.md
index 1ab8014d8..f5c20b52d 100644
--- a/docs/features/load_balance.md
+++ b/docs/features/load_balance.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/load_balance.md)
+
 # Global Scheduler: Multi-Instance Load Balancing
 
 ## Design Overview
diff --git a/docs/features/multi-node_deployment.md b/docs/features/multi-node_deployment.md
index 04bda9fc2..e6c04cc82 100644
--- a/docs/features/multi-node_deployment.md
+++ b/docs/features/multi-node_deployment.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/multi-node_deployment.md)
+
 # Multi-Node Deployment
 
 ## Overview
diff --git a/docs/features/plas_attention.md b/docs/features/plas_attention.md
index b096fceeb..1d981ffc3 100644
--- a/docs/features/plas_attention.md
+++ b/docs/features/plas_attention.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/plas_attention.md)
+
 # PLAS
 
 ## Introduction
diff --git a/docs/features/plugins.md b/docs/features/plugins.md
index ed63ed594..2477f9690 100644
--- a/docs/features/plugins.md
+++ b/docs/features/plugins.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/plugins.md)
+
 # FastDeploy Plugin Mechanism Documentation
 
 FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
diff --git a/docs/features/prefix_caching.md b/docs/features/prefix_caching.md
index 0a58336de..7eb29caaf 100644
--- a/docs/features/prefix_caching.md
+++ b/docs/features/prefix_caching.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/prefix_caching.md)
+
 # Prefix Caching
 
 Prefix Caching is a technique to optimize the inference efficiency of generative models. Its core idea is to cache intermediate computation results (KV Cache) of input sequences, avoiding redundant computations and thereby accelerating response times for multiple requests sharing the same prefix.
diff --git a/docs/features/reasoning_output.md b/docs/features/reasoning_output.md
index f98262d62..4acf512d1 100644
--- a/docs/features/reasoning_output.md
+++ b/docs/features/reasoning_output.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/reasoning_output.md)
+
 # Reasoning Outputs
 
 Reasoning models return an additional `reasoning_content` field in their output, which contains the reasoning steps that led to the final conclusion.
diff --git a/docs/features/sampling.md b/docs/features/sampling.md
index 4b2774fc2..0e3098aec 100644
--- a/docs/features/sampling.md
+++ b/docs/features/sampling.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/sampling.md)
+
 # Sampling Strategies
 
 Sampling strategies are used to determine how to select the next token from the output probability distribution of a model. FastDeploy currently supports multiple sampling strategies including Top-p, Top-k_Top-p, and Min-p Sampling.
diff --git a/docs/features/speculative_decoding.md b/docs/features/speculative_decoding.md
index 4f0b20503..ae9546c4c 100644
--- a/docs/features/speculative_decoding.md
+++ b/docs/features/speculative_decoding.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/speculative_decoding.md)
+
 # 🔮 Speculative Decoding
 
 This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.
diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
index f7ee424cb..7c81bf3e1 100644
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/features/structured_outputs.md)
+
 # Structured Outputs
 
 ## Overview
diff --git a/docs/get_started/README.md b/docs/get_started/README.md
index ded6baf66..9c80666d0 100644
--- a/docs/get_started/README.md
+++ b/docs/get_started/README.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/README.md)
+
 # Get Started
 
 - [Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes](quick_start.md)
diff --git a/docs/get_started/ernie-4.5-vl.md b/docs/get_started/ernie-4.5-vl.md
index cba75a8c7..bfd76fcaf 100644
--- a/docs/get_started/ernie-4.5-vl.md
+++ b/docs/get_started/ernie-4.5-vl.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/ernie-4.5-vl.md)
+
 # Deploy ERNIE-4.5-VL-424B-A47B Multimodal Model
 
 This document explains how to deploy the ERNIE-4.5-VL multimodal model, which supports users to interact with the model using multimodal data (including reasoning capabilities). Before starting the deployment, please ensure that your hardware environment meets the following requirements:
diff --git a/docs/get_started/ernie-4.5.md b/docs/get_started/ernie-4.5.md
index 9a6f92ecd..8e9f6f39f 100644
--- a/docs/get_started/ernie-4.5.md
+++ b/docs/get_started/ernie-4.5.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/ernie-4.5.md)
+
 # Deploy ERNIE-4.5-300B-A47B Model
 
 This document explains how to deploy the ERNIE-4.5 model. Before starting the deployment, please ensure that your hardware environment meets the following requirements:
diff --git a/docs/get_started/installation/Enflame_gcu.md b/docs/get_started/installation/Enflame_gcu.md
index 1801cf6d0..5e7b16323 100644
--- a/docs/get_started/installation/Enflame_gcu.md
+++ b/docs/get_started/installation/Enflame_gcu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/Enflame_gcu.md)
+
 # Running ERNIE 4.5 Series Models with FastDeploy
 
 The Enflame S60 ([Learn about Enflame](https://www.enflame-tech.com/)) is a next-generation AI inference accelerator card designed for large-scale deployment in data centers. It meets the demands of large language models (LLMs), search/advertising/recommendation systems, and traditional models. Characterized by broad model coverage, user-friendliness, and high portability, it is widely applicable to mainstream inference scenarios such as image and text generation applications, search and recommendation systems, and text/image/speech recognition.
diff --git a/docs/get_started/installation/README.md b/docs/get_started/installation/README.md
index 76dba9d00..2f90a446d 100644
--- a/docs/get_started/installation/README.md
+++ b/docs/get_started/installation/README.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/README.md)
+
 # FastDeploy Installation
 
 FastDeploy currently supports installation on the following hardware platforms:
diff --git a/docs/get_started/installation/hygon_dcu.md b/docs/get_started/installation/hygon_dcu.md
index 245ee4457..5484af051 100644
--- a/docs/get_started/installation/hygon_dcu.md
+++ b/docs/get_started/installation/hygon_dcu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/hygon_dcu.md)
+
 # Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on hygon machine
 
 The current version of the software merely serves as a demonstration demo for the hygon k100AI combined with the Fastdeploy inference framework for large models. There may be issues when running the latest ERNIE4.5 model, and we will conduct repairs and performance optimization in the future. Subsequent versions will provide customers with a more stable version.
diff --git a/docs/get_started/installation/iluvatar_gpu.md b/docs/get_started/installation/iluvatar_gpu.md
index 393f250a1..bc5146f62 100644
--- a/docs/get_started/installation/iluvatar_gpu.md
+++ b/docs/get_started/installation/iluvatar_gpu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/iluvatar_gpu.md)
+
 # Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on iluvatar machine
 
 ## Machine Preparation
diff --git a/docs/get_started/installation/intel_gaudi.md b/docs/get_started/installation/intel_gaudi.md
index 5e4e9edc1..3696e5d46 100644
--- a/docs/get_started/installation/intel_gaudi.md
+++ b/docs/get_started/installation/intel_gaudi.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/intel_gaudi.md)
+
 # Intel Gaudi Installation for running ERNIE 4.5 Series Models
 
 The following installation methods are available when your environment meets these requirements:
diff --git a/docs/get_started/installation/kunlunxin_xpu.md b/docs/get_started/installation/kunlunxin_xpu.md
index e9ff418c0..b024ff962 100644
--- a/docs/get_started/installation/kunlunxin_xpu.md
+++ b/docs/get_started/installation/kunlunxin_xpu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/kunlunxin_xpu.md)
+
 # Kunlunxin XPU
 
 ## Requirements
diff --git a/docs/get_started/installation/metax_gpu.md b/docs/get_started/installation/metax_gpu.md
index eb4ea84c9..68c3aafc1 100644
--- a/docs/get_started/installation/metax_gpu.md
+++ b/docs/get_started/installation/metax_gpu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/metax_gpu.md)
+
 # Metax GPU Installation for running ERNIE 4.5 Series Models
 
 The following installation methods are available when your environment meets these requirements:
diff --git a/docs/get_started/installation/nvidia_gpu.md b/docs/get_started/installation/nvidia_gpu.md
index 6c109d088..0a8e4f9ef 100644
--- a/docs/get_started/installation/nvidia_gpu.md
+++ b/docs/get_started/installation/nvidia_gpu.md
@@ -1,3 +1,5 @@
+[简体中文](../../zh/get_started/installation/nvidia_gpu.md)
+
 # NVIDIA CUDA GPU Installation
 
 The following installation methods are available when your environment meets these requirements:
diff --git a/docs/get_started/quick_start.md b/docs/get_started/quick_start.md
index 8034f6828..ceb5f304d 100644
--- a/docs/get_started/quick_start.md
+++ b/docs/get_started/quick_start.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/quick_start.md)
+
 # Deploy ERNIE-4.5-0.3B-Paddle in 10 Minutes
 
 Before deployment, ensure your environment meets the following requirements:
diff --git a/docs/get_started/quick_start_qwen.md b/docs/get_started/quick_start_qwen.md
index c0510fb6f..4f6d834ef 100644
--- a/docs/get_started/quick_start_qwen.md
+++ b/docs/get_started/quick_start_qwen.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/quick_start_qwen.md)
+
 # Deploy QWEN3-0.6b in 10 Minutes
 
 Before deployment, ensure your environment meets the following requirements:
diff --git a/docs/get_started/quick_start_vl.md b/docs/get_started/quick_start_vl.md
index b9c50a1c2..db3935284 100644
--- a/docs/get_started/quick_start_vl.md
+++ b/docs/get_started/quick_start_vl.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/get_started/quick_start_vl.md)
+
 # Deploy ERNIE-4.5-VL-28B-A3B-Paddle Multimodal Model in 10 Minutes
 
 Before deployment, please ensure your environment meets the following requirements:
diff --git a/docs/index.md b/docs/index.md
index f688c9b0c..d8c724f51 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,3 +1,5 @@
+[简体中文](zh/index.md)
+
 # FastDeploy
 
 **FastDeploy** is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers **production-ready, out-of-the-box deployment solutions** with core acceleration technologies:
diff --git a/docs/offline_inference.md b/docs/offline_inference.md
index a8efef5ca..5e8446f01 100644
--- a/docs/offline_inference.md
+++ b/docs/offline_inference.md
@@ -1,3 +1,5 @@
+[简体中文](zh/offline_inference.md)
+
 # Offline Inference
 
 ## 1. Usage
diff --git a/docs/online_serving/README.md b/docs/online_serving/README.md
index 587919095..8423530ff 100644
--- a/docs/online_serving/README.md
+++ b/docs/online_serving/README.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/online_serving/README.md)
+
 # OpenAI Protocol-Compatible API Server
 
 FastDeploy provides a service-oriented deployment solution that is compatible with the OpenAI protocol. Users can quickly deploy it using the following command:
diff --git a/docs/online_serving/graceful_shutdown_service.md b/docs/online_serving/graceful_shutdown_service.md
index ea1b13e12..28f2db289 100644
--- a/docs/online_serving/graceful_shutdown_service.md
+++ b/docs/online_serving/graceful_shutdown_service.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/online_serving/graceful_shutdown_service.md)
+
 # Graceful Service Node Shutdown Solution
 
 ## 1. Core Objective
diff --git a/docs/online_serving/metrics.md b/docs/online_serving/metrics.md
index 9b03191f7..49af50700 100644
--- a/docs/online_serving/metrics.md
+++ b/docs/online_serving/metrics.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/online_serving/metrics.md)
+
 # Monitoring Metrics
 
 After FastDeploy is launched, it supports continuous monitoring of the FastDeploy service status through Metrics. When starting FastDeploy, you can specify the port for the Metrics service by configuring the `metrics-port` parameter.
diff --git a/docs/online_serving/scheduler.md b/docs/online_serving/scheduler.md
index 8ce9fa4cd..f3703fa11 100644
--- a/docs/online_serving/scheduler.md
+++ b/docs/online_serving/scheduler.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/online_serving/scheduler.md)
+
 # Scheduler
 
 FastDeploy currently supports two types of schedulers: **Local Scheduler** and **Global Scheduler**. The Global Scheduler is designed for large-scale clusters, enabling secondary load balancing across nodes based on real-time workload metrics.
diff --git a/docs/parameters.md b/docs/parameters.md
index 7070c8fd7..e5466e74a 100644
--- a/docs/parameters.md
+++ b/docs/parameters.md
@@ -1,3 +1,5 @@
+[简体中文](zh/parameters.md)
+
 # FastDeploy Parameter Documentation
 
 ## Parameter Description
diff --git a/docs/quantization/README.md b/docs/quantization/README.md
index d564223b1..af41499c0 100644
--- a/docs/quantization/README.md
+++ b/docs/quantization/README.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/quantization/README.md)
+
 # Quantization
 
 FastDeploy supports various quantization inference precisions including FP8, INT8, INT4, 2-bits, etc. It supports different precision inference for weights, activations, and KVCache tensors, which can meet the inference requirements of different scenarios such as low cost, low latency, and long context.
diff --git a/docs/quantization/online_quantization.md b/docs/quantization/online_quantization.md
index bf8b9a536..9f590a85a 100644
--- a/docs/quantization/online_quantization.md
+++ b/docs/quantization/online_quantization.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/quantization/online_quantization.md)
+
 # Online Quantization
 
 Online quantization refers to the inference engine quantizing weights after loading BF16 weights, rather than loading pre-quantized low-precision weights. FastDeploy supports online quantization of BF16 to various precisions, including: INT4, INT8, and FP8.
diff --git a/docs/quantization/wint2.md b/docs/quantization/wint2.md
index e7c586632..d18c03a15 100644
--- a/docs/quantization/wint2.md
+++ b/docs/quantization/wint2.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/quantization/wint2.md)
+
 # WINT2 Quantization
 
 Weights are compressed offline using the [CCQ (Convolutional Coding Quantization)](https://arxiv.org/pdf/2507.07145) method. The actual stored numerical type of weights is INT8, with 4 weights packed into each INT8 value, equivalent to 2 bits per weight. Activations are not quantized. During inference, weights are dequantized and decoded in real-time to BF16 numerical type, and calculations are performed using BF16 numerical type.
diff --git a/docs/supported_models.md b/docs/supported_models.md
index e6823c0ff..5ed4517d2 100644
--- a/docs/supported_models.md
+++ b/docs/supported_models.md
@@ -1,3 +1,5 @@
+[简体中文](zh/supported_models.md)
+
 # Supported Models
 
 FastDeploy currently supports the following models, which can be downloaded automatically during FastDeploy deployment.Specify the ``model`` parameter as the model name in the table below to automatically download model weights (all supports resumable downloads). The following three download sources are supported:
diff --git a/docs/usage/code_overview.md b/docs/usage/code_overview.md
index 8008fc677..578053dd7 100644
--- a/docs/usage/code_overview.md
+++ b/docs/usage/code_overview.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/usage/code_overview.md)
+
 # Code Overview
 
 Below is an overview of the FastDeploy code structure and functionality organized by directory.
diff --git a/docs/usage/environment_variables.md b/docs/usage/environment_variables.md
index 4783e3d9f..8b4930967 100644
--- a/docs/usage/environment_variables.md
+++ b/docs/usage/environment_variables.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/usage/environment_variables.md)
+
 # FastDeploy Environment Variables
 
 FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. Below is the documentation:
diff --git a/docs/usage/fastdeploy_unit_test_guide.md b/docs/usage/fastdeploy_unit_test_guide.md
index 98184724b..52b8ce993 100644
--- a/docs/usage/fastdeploy_unit_test_guide.md
+++ b/docs/usage/fastdeploy_unit_test_guide.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/usage/fastdeploy_unit_test_guide.md)
+
 # FastDeploy Unit Test Specification
 1. Test Naming Conventions
 - Test files must start with test_.
diff --git a/docs/usage/kunlunxin_xpu_deployment.md b/docs/usage/kunlunxin_xpu_deployment.md
index 33feef89e..a376d53bf 100644
--- a/docs/usage/kunlunxin_xpu_deployment.md
+++ b/docs/usage/kunlunxin_xpu_deployment.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/usage/kunlunxin_xpu_deployment.md)
+
 ## Supported Models
 |Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Minimum Version Required|
 |-|-|-|-|-|-|
diff --git a/docs/usage/log.md b/docs/usage/log.md
index 6f7312365..1c2dbf90c 100644
--- a/docs/usage/log.md
+++ b/docs/usage/log.md
@@ -1,3 +1,5 @@
+[简体中文](../zh/usage/log.md)
+
 # Log Description
 
 FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.
diff --git a/docs/zh/benchmark.md b/docs/zh/benchmark.md
index be5c83325..e4a58d93b 100644
--- a/docs/zh/benchmark.md
+++ b/docs/zh/benchmark.md
@@ -1,3 +1,5 @@
+[English](../benchmark.md)
+
 # Benchmark
 
 FastDeploy基于[vLLM benchmark](https://github.com/vllm-project/vllm/blob/main/benchmarks/)脚本,增加了部分统计信息,可用于benchmark FastDeploy更详细的性能指标。
diff --git a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
index 498f46e9c..5c935cc71 100644
--- a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/ERNIE-4.5-0.3B-Paddle.md)
+
 # ERNIE-4.5-0.3B
 ## 一、环境准备
 ### 1.1 支持情况
diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index a0f649b2e..61f6d27b6 100644
--- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/ERNIE-4.5-21B-A3B-Paddle.md)
+
 # ERNIE-4.5-21B-A3B
 ## 一、环境准备
 ### 1.1 支持情况
diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
index cc4cc9a5b..0dc0db527 100644
--- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/ERNIE-4.5-21B-A3B-Thinking.md)
+
 # ERNIE-4.5-21B-A3B-Thinking
 ## 一、环境准备
 ### 1.1 支持情况
diff --git a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index 43f1b261c..824af9e98 100644
--- a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/ERNIE-4.5-300B-A47B-Paddle.md)
+
 # ERNIE-4.5-300B-A47B
 ## 一、环境准备
 ### 1.1 支持情况
diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
index 9377378da..ca881d9b4 100644
--- a/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -1,3 +1,4 @@
+[English](../../best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md)
 # ERNIE-4.5-VL-28B-A3B-Paddle
 
 
diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
index 75328ae08..d53aa647a 100644
--- a/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -1,3 +1,4 @@
+[English](../../best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md)
 # ERNIE-4.5-VL-424B-A47B-Paddle
 
 
diff --git a/docs/zh/best_practices/FAQ.md b/docs/zh/best_practices/FAQ.md
index 6cf65552c..5045131b5 100644
--- a/docs/zh/best_practices/FAQ.md
+++ b/docs/zh/best_practices/FAQ.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/FAQ.md)
+
 # 常见问题FAQ
 ## 1.显存不足
 1. 启动服务时显存不足:
diff --git a/docs/zh/best_practices/README.md b/docs/zh/best_practices/README.md
index 5540f23b6..bb1016247 100644
--- a/docs/zh/best_practices/README.md
+++ b/docs/zh/best_practices/README.md
@@ -1,3 +1,5 @@
+[English](../../best_practices/README.md)
+
 # 最佳实践
 
 - [ERNIE-4.5-0.3B-Paddle.md](ERNIE-4.5-0.3B-Paddle.md)
diff --git a/docs/zh/features/chunked_prefill.md b/docs/zh/features/chunked_prefill.md
index 272748505..a51209dc2 100644
--- a/docs/zh/features/chunked_prefill.md
+++ b/docs/zh/features/chunked_prefill.md
@@ -1,3 +1,5 @@
+[English](../../features/chunked_prefill.md)
+
 # Chunked Prefill 与 128K 长文推理部署
 
 Chunked Prefill 采用分块策略,将预填充(Prefill)阶段请求拆解为小规模子任务,与解码(Decode)请求混合批处理执行。可以更好地平衡计算密集型(Prefill)和访存密集型(Decode)操作,优化GPU资源利用率,减少单次Prefill的计算量和显存占用,从而降低显存峰值,避免显存不足的问题。
diff --git a/docs/zh/features/data_parallel_service.md b/docs/zh/features/data_parallel_service.md
index 8951a7679..6bfbdcd8f 100644
--- a/docs/zh/features/data_parallel_service.md
+++ b/docs/zh/features/data_parallel_service.md
@@ -1,3 +1,5 @@
+[English](../../features/data_parallel_service.md)
+
 # 数据并行
 
 在MOE模型下,开启专家并行(EP)与数据并行(DP)相结合,EP 分摊专家负载,结合 DP 实现请求并行处理。
diff --git a/docs/zh/features/disaggregated.md b/docs/zh/features/disaggregated.md
index e67c36d85..909925ea6 100644
--- a/docs/zh/features/disaggregated.md
+++ b/docs/zh/features/disaggregated.md
@@ -1,3 +1,5 @@
+[English](../../features/disaggregated.md)
+
 # 分离式部署
 
 大模型推理分为两个部分Prefill和Decode阶段,分别为计算密集型(Prefill)和存储密集型(Decode)两部分。将Prefill 和 Decode 分开部署在一定场景下可以提高硬件利用率,有效提高吞吐,降低整句时延,
diff --git a/docs/zh/features/early_stop.md b/docs/zh/features/early_stop.md
index 9f0118b1c..720134ee2 100644
--- a/docs/zh/features/early_stop.md
+++ b/docs/zh/features/early_stop.md
@@ -1,3 +1,4 @@
+[English](../../features/early_stop.md)
 # 早停功能
 
 
diff --git a/docs/zh/features/graph_optimization.md b/docs/zh/features/graph_optimization.md
index f25a2d302..09e725625 100644
--- a/docs/zh/features/graph_optimization.md
+++ b/docs/zh/features/graph_optimization.md
@@ -1,3 +1,5 @@
+[English](../../features/graph_optimization.md)
+
 # FastDeploy 中的图优化技术
 
 FastDeploy 的 `GraphOptimizationBackend` 中集成了多种图优化技术:
diff --git a/docs/zh/features/load_balance.md b/docs/zh/features/load_balance.md
index 3886a0c4d..854012799 100644
--- a/docs/zh/features/load_balance.md
+++ b/docs/zh/features/load_balance.md
@@ -1,3 +1,5 @@
+[English](../../features/load_balance.md)
+
 # 全局调度器: 多实例负载均衡
 
 ## 设计方案
diff --git a/docs/zh/features/multi-node_deployment.md b/docs/zh/features/multi-node_deployment.md
index 909a7a70f..8e2c2408a 100644
--- a/docs/zh/features/multi-node_deployment.md
+++ b/docs/zh/features/multi-node_deployment.md
@@ -1,3 +1,5 @@
+[English](../../features/multi-node_deployment.md)
+
 # 多节点部署
 
 ## 概述
diff --git a/docs/zh/features/plas_attention.md b/docs/zh/features/plas_attention.md
index 0d8fcb2b9..da7193960 100644
--- a/docs/zh/features/plas_attention.md
+++ b/docs/zh/features/plas_attention.md
@@ -1,3 +1,5 @@
+[English](../../features/plas_attention.md)
+
 # PLAS
 
 ## 介绍
diff --git a/docs/zh/features/plugins.md b/docs/zh/features/plugins.md
index e16010816..297d46550 100644
--- a/docs/zh/features/plugins.md
+++ b/docs/zh/features/plugins.md
@@ -1,3 +1,5 @@
+[English](../../features/plugins.md)
+
 # FastDeploy 插件机制说明文档
 
 FastDeploy 支持插件机制,允许用户在不修改核心代码的前提下扩展功能。插件通过 Python 的 `entry_points` 机制实现自动发现与加载。
diff --git a/docs/zh/features/prefix_caching.md b/docs/zh/features/prefix_caching.md
index b6020483f..e3244a58d 100644
--- a/docs/zh/features/prefix_caching.md
+++ b/docs/zh/features/prefix_caching.md
@@ -1,3 +1,5 @@
+[English](../../features/prefix_caching.md)
+
 # Prefix Caching
 
 Prefix Caching(前缀缓存)是一种优化生成式模型推理效率的技术,核心思想是通过缓存输入序列的中间计算结果(KV Cache),避免重复计算,从而加速具有相同前缀的多个请求的响应速度。
diff --git a/docs/zh/features/reasoning_output.md b/docs/zh/features/reasoning_output.md
index cd32e4c6c..8bcddf514 100644
--- a/docs/zh/features/reasoning_output.md
+++ b/docs/zh/features/reasoning_output.md
@@ -1,3 +1,5 @@
+[English](../../features/reasoning_output.md)
+
 # 思考链内容
 
 思考模型在输出中返回 `reasoning_content` 字段,表示思考链内容,即得出最终结论的思考步骤.
diff --git a/docs/zh/features/sampling.md b/docs/zh/features/sampling.md
index df04d1c28..9b530cb6c 100644
--- a/docs/zh/features/sampling.md
+++ b/docs/zh/features/sampling.md
@@ -1,3 +1,5 @@
+[English](../../features/sampling.md)
+
 # 采样策略
 
 采样策略用于决定如何从模型的输出概率分布中选择下一个token。FastDeploy目前支持 Top-p 、 Top-k_Top-p 和 Min-p Sampling 多种采样策略。
diff --git a/docs/zh/features/speculative_decoding.md b/docs/zh/features/speculative_decoding.md
index 58b70742c..7e81b55fa 100644
--- a/docs/zh/features/speculative_decoding.md
+++ b/docs/zh/features/speculative_decoding.md
@@ -1,3 +1,5 @@
+[English](../../features/speculative_decoding.md)
+
 # 🔮 投机解码
 
 本项目基于 PaddlePaddle 实现了高效的 **投机解码(Speculative Decoding)** 推理框架,支持多 Token 预测(Multi-token Proposing, MTP),用于加速大语言模型(LLM)的生成,显著降低时延并提升吞吐量。
diff --git a/docs/zh/features/structured_outputs.md b/docs/zh/features/structured_outputs.md
index cafda804c..50c010c14 100644
--- a/docs/zh/features/structured_outputs.md
+++ b/docs/zh/features/structured_outputs.md
@@ -1,3 +1,5 @@
+[English](../../features/structured_outputs.md)
+
 # Structured Outputs
 
 ## 概述
diff --git a/docs/zh/get_started/ernie-4.5-vl.md b/docs/zh/get_started/ernie-4.5-vl.md
index 437a127d9..485c28535 100644
--- a/docs/zh/get_started/ernie-4.5-vl.md
+++ b/docs/zh/get_started/ernie-4.5-vl.md
@@ -1,3 +1,5 @@
+[English](../../get_started/ernie-4.5-vl.md)
+
 # ERNIE-4.5-VL多模态模型
 
 本文档讲解如何部署ERNIE-4.5-VL多模态模型,支持用户使用多模态数据与模型进行对话交互(包含思考Reasoning),在开始部署前,请确保你的硬件环境满足如下条件:
diff --git a/docs/zh/get_started/ernie-4.5.md b/docs/zh/get_started/ernie-4.5.md
index 99810bffa..be7a9144d 100644
--- a/docs/zh/get_started/ernie-4.5.md
+++ b/docs/zh/get_started/ernie-4.5.md
@@ -1,3 +1,5 @@
+[English](../../get_started/ernie-4.5.md)
+
 # ERNIE-4.5模型
 
 本文档讲解如何部署ERNIE-4.5模型,在开始部署前,请确保你的硬件环境满足如下条件:
diff --git a/docs/zh/get_started/installation/Enflame_gcu.md b/docs/zh/get_started/installation/Enflame_gcu.md
index cc1042e75..67a6b4cb0 100644
--- a/docs/zh/get_started/installation/Enflame_gcu.md
+++ b/docs/zh/get_started/installation/Enflame_gcu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/Enflame_gcu.md)
+
 # 使用 FastDeploy 在燧原 S60 上运行 ERNIE 4.5 系列模型
 
 燧原 S60([了解燧原](https://www.enflame-tech.com/))是面向数据中心大规模部署的新一代人工智能推理加速卡,满足大语言模型、搜广推及传统模型的需求,具有模型覆盖面广、易用性强、易迁移易部署等特点,可广泛应用于图像及文本生成等应用、搜索与推荐、文本、图像及语音识别等主流推理场景。
diff --git a/docs/zh/get_started/installation/README.md b/docs/zh/get_started/installation/README.md
index 4c1b6016d..4ccb26ab4 100644
--- a/docs/zh/get_started/installation/README.md
+++ b/docs/zh/get_started/installation/README.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/README.md)
+
 # FastDeploy 安装
 
 FastDeploy支持如下硬件平台:
diff --git a/docs/zh/get_started/installation/hygon_dcu.md b/docs/zh/get_started/installation/hygon_dcu.md
index d9bdae0dd..21cc35ad8 100644
--- a/docs/zh/get_started/installation/hygon_dcu.md
+++ b/docs/zh/get_started/installation/hygon_dcu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/hygon_dcu.md)
+
 # 使用 FastDeploy 在海光 K100AI 上运行 ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B
 
 当前版本软件只是作为K100AI + Fastdeploy 推理大模型的一个演示 demo,跑最新ERNIE4.5模型可能存在问题,后续进行修复和性能优化,给客户提供一个更稳定的版本。
diff --git a/docs/zh/get_started/installation/iluvatar_gpu.md b/docs/zh/get_started/installation/iluvatar_gpu.md
index 1ece14ea2..46c6fa006 100644
--- a/docs/zh/get_started/installation/iluvatar_gpu.md
+++ b/docs/zh/get_started/installation/iluvatar_gpu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/iluvatar_gpu.md)
+
 # 如何在天数机器上运行 ERNIE-4.5-300B-A47B-BF16 & ERNIE-4.5-21B-A3B
 
 ## 准备机器
diff --git a/docs/zh/get_started/installation/intel_gaudi.md b/docs/zh/get_started/installation/intel_gaudi.md
index 5f0577b3b..e769ee814 100644
--- a/docs/zh/get_started/installation/intel_gaudi.md
+++ b/docs/zh/get_started/installation/intel_gaudi.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/intel_gaudi.md)
+
 # 使用 Intel Gaudi 运行ERNIE 4.5 系列模型
 
 在环境满足如下条件前提下
diff --git a/docs/zh/get_started/installation/kunlunxin_xpu.md b/docs/zh/get_started/installation/kunlunxin_xpu.md
index de1e2c8d0..918d75f2d 100644
--- a/docs/zh/get_started/installation/kunlunxin_xpu.md
+++ b/docs/zh/get_started/installation/kunlunxin_xpu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/kunlunxin_xpu.md)
+
 # 昆仑芯 XPU
 
 ## 要求
diff --git a/docs/zh/get_started/installation/metax_gpu.md b/docs/zh/get_started/installation/metax_gpu.md
index db06ebc77..43ec504bf 100644
--- a/docs/zh/get_started/installation/metax_gpu.md
+++ b/docs/zh/get_started/installation/metax_gpu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/metax_gpu.md)
+
 # 使用 Metax GPU C550 运行ERNIE 4.5 系列模型
 
 FastDeploy在Metax C550上对ERNIE 4.5系列模型进行了深度适配和优化,实现了推理入口和GPU的统一,无需修改即可完成推理任务的迁移。
diff --git a/docs/zh/get_started/installation/nvidia_gpu.md b/docs/zh/get_started/installation/nvidia_gpu.md
index 179abd751..30af18e4b 100644
--- a/docs/zh/get_started/installation/nvidia_gpu.md
+++ b/docs/zh/get_started/installation/nvidia_gpu.md
@@ -1,3 +1,5 @@
+[English](../../../get_started/installation/nvidia_gpu.md)
+
 # NVIDIA CUDA GPU Installation
 
 在环境满足如下条件前提下
diff --git a/docs/zh/get_started/quick_start.md b/docs/zh/get_started/quick_start.md
index 178c7ba02..e9cd0b709 100644
--- a/docs/zh/get_started/quick_start.md
+++ b/docs/zh/get_started/quick_start.md
@@ -1,3 +1,5 @@
+[English](../../get_started/quick_start.md)
+
 # 10分钟完成 ERNIE-4.5-0.3B-Paddle 模型部署
 
 本文档讲解如何部署ERNIE-4.5-0.3B-Base-Paddle模型,在开始部署前,请确保你的硬件环境满足如下条件:
diff --git a/docs/zh/get_started/quick_start_qwen.md b/docs/zh/get_started/quick_start_qwen.md
index ee22650e7..c678a7b6c 100644
--- a/docs/zh/get_started/quick_start_qwen.md
+++ b/docs/zh/get_started/quick_start_qwen.md
@@ -1,3 +1,5 @@
+[English](../../get_started/quick_start_qwen.md)
+
 # 10分钟完成 Qwen3-0.6b 模型部署
 
 本文档讲解如何部署Qwen3-0.6b模型,在开始部署前,请确保你的硬件环境满足如下条件:
diff --git a/docs/zh/get_started/quick_start_vl.md b/docs/zh/get_started/quick_start_vl.md
index b031378ac..27fa52384 100644
--- a/docs/zh/get_started/quick_start_vl.md
+++ b/docs/zh/get_started/quick_start_vl.md
@@ -1,3 +1,5 @@
+[English](../../get_started/quick_start_vl.md)
+
 # 10分钟完成 ERNIE-4.5-VL-28B-A3B-Paddle 多模态模型部署
 
 本文档讲解如何部署ERNIE-4.5-VL-28B-A3B-Paddle模型,在开始部署前,请确保你的硬件环境满足如下条件:
diff --git a/docs/zh/index.md b/docs/zh/index.md
index 54a43bee1..7d86d3f5c 100644
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -1,3 +1,5 @@
+[English](../index.md)
+
 # FastDeploy
 
 **FastDeploy** 是基于飞桨(PaddlePaddle)的大语言模型(LLM)与视觉语言模型(VLM)推理部署工具包,提供**开箱即用的生产级部署方案**,核心技术特性包括:
diff --git a/docs/zh/offline_inference.md b/docs/zh/offline_inference.md
index 7eb271588..45e28efb1 100644
--- a/docs/zh/offline_inference.md
+++ b/docs/zh/offline_inference.md
@@ -1,3 +1,5 @@
+[English](../offline_inference.md)
+
 # 离线推理
 
 ## 1. 使用方式
diff --git a/docs/zh/online_serving/README.md b/docs/zh/online_serving/README.md
index 35f5e1349..587a0a9fa 100644
--- a/docs/zh/online_serving/README.md
+++ b/docs/zh/online_serving/README.md
@@ -1,3 +1,5 @@
+[English](../../online_serving/README.md)
+
 # 兼容 OpenAI 协议的服务化部署
 
 FastDeploy 提供与 OpenAI 协议兼容的服务化部署方案。用户可以通过如下命令快速进行部署:
diff --git a/docs/zh/online_serving/graceful_shutdown_service.md b/docs/zh/online_serving/graceful_shutdown_service.md
index 90f0133aa..a1d44edda 100644
--- a/docs/zh/online_serving/graceful_shutdown_service.md
+++ b/docs/zh/online_serving/graceful_shutdown_service.md
@@ -1,3 +1,5 @@
+[English](../../online_serving/graceful_shutdown_service.md)
+
 # 服务节点优雅关闭方案
 
 ## 1. 核心目标
diff --git a/docs/zh/online_serving/metrics.md b/docs/zh/online_serving/metrics.md
index 77e21f6aa..82244183b 100644
--- a/docs/zh/online_serving/metrics.md
+++ b/docs/zh/online_serving/metrics.md
@@ -1,3 +1,5 @@
+[English](../../online_serving/metrics.md)
+
 # 监控 Metrics
 
 在 FastDeploy 启动后,支持通过 Metrics 持续监控的 FastDeploy 的服务状态。启动 FastDeploy 时,可以通过配置 `metrics-port` 参数指定 Metircs 服务的端口。
diff --git a/docs/zh/online_serving/scheduler.md b/docs/zh/online_serving/scheduler.md
index afbd819ba..c4d2ae82c 100644
--- a/docs/zh/online_serving/scheduler.md
+++ b/docs/zh/online_serving/scheduler.md
@@ -1,3 +1,5 @@
+[English](../../online_serving/scheduler.md)
+
 # 调度器
 
 FastDeploy 目前支持两种调度器: **本地调度器** 和 **全局调度器** 。 全局调度器专为大规模集群设计,能够基于实时工作负载指标在节点间实现二级负载均衡。
diff --git a/docs/zh/parameters.md b/docs/zh/parameters.md
index 72300638a..9599c04d6 100644
--- a/docs/zh/parameters.md
+++ b/docs/zh/parameters.md
@@ -1,3 +1,5 @@
+[English](../parameters.md)
+
 # FastDeploy参数说明
 
 在使用FastDeploy部署模型(包括离线推理、服务化部署),涉及如下参数配置,其实需要注意,在使用离线推理时,各参数配置即为如下参数名;而在使用命令行启动服务时,相应参数中的分隔符需要从```_```修改为```-```,如```max_model_len```在命令行中则为```--max-model-len```。
diff --git a/docs/zh/quantization/README.md b/docs/zh/quantization/README.md
index 77705c1e0..8429a519a 100644
--- a/docs/zh/quantization/README.md
+++ b/docs/zh/quantization/README.md
@@ -1,3 +1,5 @@
+[English](../../quantization/README.md)
+
 # 量化
 
 FastDeploy支持FP8、INT8、INT4、2-bit等多种量化推理精度,支持模型权重、激活和KVCache 3种张量的不同精度推理,可以满足低成本、低时延、长上下文等不同场景的推理需求。
diff --git a/docs/zh/quantization/online_quantization.md b/docs/zh/quantization/online_quantization.md
index 515258227..f32d77446 100644
--- a/docs/zh/quantization/online_quantization.md
+++ b/docs/zh/quantization/online_quantization.md
@@ -1,3 +1,5 @@
+[English](../../quantization/online_quantization.md)
+
 # 在线量化
 
 在线量化是指推理引擎在加载 BF16 权重后对权重做量化,而不是加载离线量化好的低精度权重。FastDeploy 支持将 BF16 在线量化到多种精度,包括:INT4, INT8 和 FP8.
diff --git a/docs/zh/quantization/wint2.md b/docs/zh/quantization/wint2.md
index 00e55a979..bb283dfbc 100644
--- a/docs/zh/quantization/wint2.md
+++ b/docs/zh/quantization/wint2.md
@@ -1,3 +1,5 @@
+[English](../../quantization/wint2.md)
+
 # WINT2量化
 
 权重经过 [CCQ(卷积编码量化)](https://arxiv.org/pdf/2507.07145) 方法进行离线压缩。权重的实际存储数值类型为INT8,每个INT8数值中打包了4个权重,等价于每个权重2bits。激活不做量化。在推理过程中,权重会被实时反量化并解码为BF16数值类型,并使用BF16数值类型进行计算。
diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md
index 8507765fb..5caa83ab8 100644
--- a/docs/zh/supported_models.md
+++ b/docs/zh/supported_models.md
@@ -1,3 +1,5 @@
+[English](../supported_models.md)
+
 # 支持模型列表
 
 FastDeploy目前支持模型列表如下,在FastDeploy部署时,指定 ``model``参数为如下表格中的模型名,即可自动下载模型权重(均支持断点续传),支持如下3种下载源,
diff --git a/docs/zh/usage/code_overview.md b/docs/zh/usage/code_overview.md
index 170652a5e..e6d6e6522 100644
--- a/docs/zh/usage/code_overview.md
+++ b/docs/zh/usage/code_overview.md
@@ -1,3 +1,5 @@
+[English](../../usage/code_overview.md)
+
 # 代码说明
 
 下边按照目录结构来介绍一下每个 FastDeploy 的代码结构及代码功能。
diff --git a/docs/zh/usage/environment_variables.md b/docs/zh/usage/environment_variables.md
index 73e19f3bd..41953ffb7 100644
--- a/docs/zh/usage/environment_variables.md
+++ b/docs/zh/usage/environment_variables.md
@@ -1,3 +1,5 @@
+[English](../../usage/environment_variables.md)
+
 # FastDeploy 环境变量说明
 
 FastDeploy 的环境变量保存在了代码库根目录下 fastdeploy/envs.py 文件中,以下是其对应的中文版说明:
diff --git a/docs/zh/usage/faq.md b/docs/zh/usage/faq.md
index 371b6678b..65df426ac 100644
--- a/docs/zh/usage/faq.md
+++ b/docs/zh/usage/faq.md
@@ -1,3 +1,5 @@
+[English](../../usage/faq.md)
+
 1. 服务可以支持多大并发?
 - 服务部署时推荐配置环境变量export ENABLE_V1_KVCACHE_SCHEDULER=1
 - 服务在启动时需要配置```max-num-seqs```,此参数用于表示Decode阶段的最大Batch数,如果并发超过此值,则超出的请求会排队等待处理, 常规情况下你可以将```max-num-seqs```配置为128,保持在较高的范围,实际并发由发压客户端来决定。
diff --git a/docs/zh/usage/fastdeploy_unit_test_guide.md b/docs/zh/usage/fastdeploy_unit_test_guide.md
index cfa499104..1e59197b3 100644
--- a/docs/zh/usage/fastdeploy_unit_test_guide.md
+++ b/docs/zh/usage/fastdeploy_unit_test_guide.md
@@ -1,3 +1,5 @@
+[English](../../usage/fastdeploy_unit_test_guide.md)
+
 # Fastdeploy 单测规范
 1. 测试命名规范
 - 测试文件以 test_ 开头;
diff --git a/docs/zh/usage/kunlunxin_xpu_deployment.md b/docs/zh/usage/kunlunxin_xpu_deployment.md
index c32297c86..9a87bf5d0 100644
--- a/docs/zh/usage/kunlunxin_xpu_deployment.md
+++ b/docs/zh/usage/kunlunxin_xpu_deployment.md
@@ -1,3 +1,5 @@
+[English](../../usage/kunlunxin_xpu_deployment.md)
+
 ## 支持的模型
 |模型名|上下文长度|量化|所需卡数|部署命令|最低版本要求|
 |-|-|-|-|-|-|
diff --git a/docs/zh/usage/log.md b/docs/zh/usage/log.md
index c9b287523..4c54cc7a0 100644
--- a/docs/zh/usage/log.md
+++ b/docs/zh/usage/log.md
@@ -1,3 +1,5 @@
+[English](../../usage/log.md)
+
 # 日志说明
 
 FastDeploy 在部署过程中,会产生如下日志文件,各日志含义说明