Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2025-12-24 13:28:13 +08:00

Commit: Sync v2.0 version of code to github repo
docs/usage/code_overview.md
# Code Overview

Below is an overview of the FastDeploy code structure and functionality, organized by directory.

- `custom_ops`: C++ operators used by FastDeploy for large model inference. Operators for different hardware are placed in corresponding subdirectories (e.g., `cpu_ops`, `gpu_ops`). The root-level `setup_*.py` files are used to compile these C++ operators.
- `dockerfiles`: Dockerfiles for building FastDeploy runtime environment images.
- `docs`: Documentation for the FastDeploy codebase.
- `fastdeploy`
  - `agent`: Scripts for launching large model services.
  - `cache_manager`: Cache management module for large models.
  - `engine`: Core engine classes for managing large model execution.
  - `entrypoints`: User-facing APIs for interaction.
  - `input`: Input processing module, covering preprocessing, multimodal input handling, tokenization, etc.
  - `model_executor`
    - `layers`: Layer modules required by large model architectures.
    - `model_runner`: Model inference execution module.
    - `models`: Built-in large model classes in FastDeploy.
    - `ops`: Python-callable operator modules compiled from `custom_ops`, organized by hardware platform.
  - `output`: Post-processing for large model outputs.
  - `platforms`: Platform-specific modules for underlying hardware support.
  - `scheduler`: Request scheduling module for large models.
  - `metrics`: Core component for collecting, managing, and exporting Prometheus metrics, tracking key runtime performance data (e.g., request latency, resource utilization, successful request counts).
  - `splitwise`: Modules related to PD disaggregation deployment.
- `scripts`/`tools`: Utility scripts for FastDeploy operations (e.g., compilation, unit testing, code style fixes).
- `test`: Code for unit testing and validation.
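To illustrate how the per-hardware layout under `model_executor/ops` can be consumed, here is a minimal, hypothetical sketch. The helper name `ops_module_for` and the set of platform names are assumptions for illustration, not FastDeploy API; only `cpu` and `gpu` are taken from the subdirectories named above.

```python
# Hypothetical helper: map a hardware platform name to the dotted path of
# the compiled operator package described above. Purely illustrative.
def ops_module_for(platform: str) -> str:
    supported = {"cpu", "gpu"}  # subdirectory names mentioned in the overview
    if platform not in supported:
        raise ValueError(f"unsupported platform: {platform}")
    return f"fastdeploy.model_executor.ops.{platform}"

print(ops_module_for("gpu"))  # fastdeploy.model_executor.ops.gpu
```

The resulting dotted path could then be handed to `importlib.import_module` in an environment where the operators have been compiled.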
docs/usage/environment_variables.md
# FastDeploy Environment Variables

FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. The excerpt below (with the imports it relies on) documents each variable:

```python
import os
from typing import Any, Callable

environment_variables: dict[str, Callable[[], Any]] = {
    # CUDA architecture versions used when building FastDeploy (string list, e.g. [80,90])
    "FD_BUILDING_ARCS":
    lambda: os.getenv("FD_BUILDING_ARCS", "[]"),

    # Log directory
    "FD_LOG_DIR":
    lambda: os.getenv("FD_LOG_DIR", "log"),

    # Enable debug mode (0 or 1)
    "FD_DEBUG":
    lambda: os.getenv("FD_DEBUG", "0"),

    # FastDeploy log retention days
    "FD_LOG_BACKUP_COUNT":
    lambda: os.getenv("FD_LOG_BACKUP_COUNT", "7"),

    # Model download cache directory
    "FD_MODEL_CACHE":
    lambda: os.getenv("FD_MODEL_CACHE", None),

    # Maximum number of stop sequences
    "FD_MAX_STOP_SEQS_NUM":
    lambda: os.getenv("FD_MAX_STOP_SEQS_NUM", "5"),

    # Maximum length of stop sequences
    "FD_STOP_SEQS_MAX_LEN":
    lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),

    # GPU devices to use (comma-separated string, e.g. 0,1,2)
    "CUDA_VISIBLE_DEVICES":
    lambda: os.getenv("CUDA_VISIBLE_DEVICES", None),

    # Whether to use HuggingFace tokenizer (0 or 1)
    "FD_USE_HF_TOKENIZER":
    lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),

    # ZMQ send high-water mark (HWM) during initialization
    "FD_ZMQ_SNDHWM":
    lambda: os.getenv("FD_ZMQ_SNDHWM", 10000),

    # Directory for caching KV quantization parameters
    "FD_CACHE_PARAMS":
    lambda: os.getenv("FD_CACHE_PARAMS", "none"),

    # Attention backend ("NATIVE_ATTN", "APPEND_ATTN", or "MLA_ATTN")
    "FD_ATTENTION_BACKEND":
    lambda: os.getenv("FD_ATTENTION_BACKEND", "APPEND_ATTN"),

    # Sampling class ("base", "air", or "rejection")
    "FD_SAMPLING_CLASS":
    lambda: os.getenv("FD_SAMPLING_CLASS", "base"),

    # MoE backend ("cutlass", "marlin", or "triton")
    "FD_MOE_BACKEND":
    lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),

    # Triton kernel JIT compilation directory
    "FD_TRITON_KERNEL_CACHE_DIR":
    lambda: os.getenv("FD_TRITON_KERNEL_CACHE_DIR", None),

    # Switch from standalone PD to centralized inference (0 or 1)
    "FD_PD_CHANGEABLE":
    lambda: os.getenv("FD_PD_CHANGEABLE", "1"),
}
```
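Note that the values are zero-argument lambdas: each variable is read from the process environment when it is accessed, not when the dict is defined. A minimal, self-contained sketch of the same pattern, reduced to two representative entries (the helper name `get_env` is an assumption for illustration, not FastDeploy's API):

```python
import os
from typing import Any, Callable

# Same lazy-lookup pattern as envs.py above: each value is a thunk, so the
# environment is consulted when the variable is *used*, not at import time.
environment_variables: dict[str, Callable[[], Any]] = {
    "FD_LOG_DIR": lambda: os.getenv("FD_LOG_DIR", "log"),
    "FD_DEBUG": lambda: os.getenv("FD_DEBUG", "0"),
}

def get_env(name: str) -> Any:
    # Resolve the thunk now; later changes to os.environ are picked up.
    return environment_variables[name]()

os.environ.pop("FD_LOG_DIR", None)  # make the default observable
print(get_env("FD_LOG_DIR"))        # -> log
os.environ["FD_DEBUG"] = "1"
print(get_env("FD_DEBUG"))          # -> 1 (read at call time)
```

The design choice matters for long-running services: a plain `{"FD_DEBUG": os.getenv("FD_DEBUG", "0")}` would freeze the value at import time, whereas the lambda form reflects the environment at each lookup.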
docs/usage/log.md
# Log Description

FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.

By default, logs are stored in the `log` directory under the execution path. To specify a custom directory, set the environment variable `FD_LOG_DIR`.
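For instance, a launcher script could set the variable before starting the service. This is a hedged sketch: `/opt/fd_logs` is an arbitrary example path, and the actual launch command is out of scope here.

```python
import os

# FD_LOG_DIR falls back to "log" under the current working directory when unset.
default_dir = os.getenv("FD_LOG_DIR", "log")

# Redirect log files by setting the variable in the environment of the
# process that will launch the FastDeploy service.
os.environ["FD_LOG_DIR"] = "/opt/fd_logs"  # arbitrary example path
```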
## Inference Service Logs

* `backup_env.*.json` : Records environment variables set during instance startup. The number of files matches the number of GPU cards.
* `envlog.*` : Logs environment variables set during instance startup. The number of files matches the number of GPU cards.
* `console.log` : Records model startup time and other information. This log is also printed to the console.
* `data_processor.log` : Logs input/output data encoding and decoding details.
* `fastdeploy.log` : Records configuration information during instance startup, as well as request and response details during runtime.
* `workerlog.*` : Tracks model loading progress and inference operator errors. Each GPU card has a corresponding file.
* `worker_process.log` : Logs engine inference data for each iteration.
* `prefix_cache_manager.log` : Records KV Cache logical index allocation for each request and cache hit status.
* `launch_worker.log` : Logs model startup information and error messages.
* `gpu_worker.log` : Records KV Cache block count information during profiling.
* `gpu_model_runner.log` : Contains model details and loading time.

## Online Inference Client Logs

* `api_server.log` : Logs startup parameters and received request information.

## Scheduler Logs

* `scheduler.log` : Records scheduler information, including node status and request allocation details.

## Speculative Decoding Logs

* `speculate.log` : Contains speculative decoding-related information.

## Prefix Caching Logs

* `cache_queue_manager.log` : Logs startup parameters and received request information.
* `cache_transfer_manager.log` : Logs startup parameters and received request information.
* `launch_cache_manager.log` : Records cache transfer startup parameters and error messages.

## PD Disaggregation Logs

* `cache_messager.log` : Logs transmission protocols and messages used by the P instance.
* `splitwise_connector.log` : Records data received from P/D instances and connection establishment details.

## CudaGraph Logs

* `cudagraph_piecewise_backend.log` : Logs CudaGraph startup and error information.