Sync v2.0 version of code to github repo

This commit is contained in:
Jiang-Jia-Jun
2025-06-29 23:29:37 +00:00
parent d151496038
commit 92c2cfa2e7
597 changed files with 78776 additions and 22905 deletions


@@ -0,0 +1,25 @@
# Code Overview
Below is an overview of the FastDeploy code structure and functionality organized by directory.
- `custom_ops`: Contains the C++ operators used by FastDeploy for large model inference. Operators for different hardware are placed in corresponding subdirectories (e.g., `cpu_ops`, `gpu_ops`). The root-level `setup_*.py` files are used to compile these C++ operators.
- `dockerfiles`: Dockerfiles for building FastDeploy runtime environment images.
- `docs`: Documentation for the FastDeploy codebase.
- `fastdeploy`
    - `agent`: Scripts for launching large model services.
    - `cache_manager`: Cache management module for large models.
    - `engine`: Core engine classes for managing large model execution.
    - `entrypoints`: User-facing APIs for interaction.
    - `input`: Input processing module, covering preprocessing, multimodal input handling, tokenization, etc.
    - `model_executor`
        - `layers`: Layer modules required by large model architectures.
        - `model_runner`: Model inference execution module.
        - `models`: Large model classes built into FastDeploy.
        - `ops`: Python-callable operator modules compiled from `custom_ops`, organized by hardware platform.
    - `output`: Post-processing for large model outputs.
    - `platforms`: Platform-specific modules for underlying hardware support.
    - `scheduler`: Request scheduling module for large models.
    - `metrics`: Core component for collecting, managing, and exporting Prometheus metrics, tracking key runtime performance data (e.g., request latency, resource utilization, successful request counts).
    - `splitwise`: Modules related to PD disaggregation deployment.
- `scripts`/`tools`: Utility scripts for FastDeploy operations (e.g., compilation, unit testing, code style fixes).
- `test`: Code for unit testing and validation.


@@ -0,0 +1,72 @@
# FastDeploy Environment Variables
FastDeploy's environment variables are defined in `fastdeploy/envs.py`. They are documented below:
```python
import os
from typing import Any, Callable

environment_variables: dict[str, Callable[[], Any]] = {
# CUDA architecture versions used when building FastDeploy (string list, e.g. [80,90])
"FD_BUILDING_ARCS":
lambda: os.getenv("FD_BUILDING_ARCS", "[]"),
# Log directory
"FD_LOG_DIR":
lambda: os.getenv("FD_LOG_DIR", "log"),
# Enable debug mode (0 or 1)
"FD_DEBUG":
lambda: os.getenv("FD_DEBUG", "0"),
# FastDeploy log retention days
"FD_LOG_BACKUP_COUNT":
lambda: os.getenv("FD_LOG_BACKUP_COUNT", "7"),
# Model download cache directory
"FD_MODEL_CACHE":
lambda: os.getenv("FD_MODEL_CACHE", None),
# Maximum number of stop sequences
"FD_MAX_STOP_SEQS_NUM":
lambda: os.getenv("FD_MAX_STOP_SEQS_NUM", "5"),
# Maximum length of stop sequences
"FD_STOP_SEQS_MAX_LEN":
lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
# GPU devices to use (comma-separated string, e.g. 0,1,2)
"CUDA_VISIBLE_DEVICES":
lambda: os.getenv("CUDA_VISIBLE_DEVICES", None),
# Whether to use HuggingFace tokenizer (0 or 1)
"FD_USE_HF_TOKENIZER":
lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),
# ZMQ send high-water mark (HWM) during initialization
"FD_ZMQ_SNDHWM":
lambda: os.getenv("FD_ZMQ_SNDHWM", 10000),
# Directory for caching KV quantization parameters
"FD_CACHE_PARAMS":
lambda: os.getenv("FD_CACHE_PARAMS", "none"),
# Attention backend ("NATIVE_ATTN", "APPEND_ATTN", or "MLA_ATTN")
"FD_ATTENTION_BACKEND":
lambda: os.getenv("FD_ATTENTION_BACKEND", "APPEND_ATTN"),
# Sampling class ("base", "air", or "rejection")
"FD_SAMPLING_CLASS":
lambda: os.getenv("FD_SAMPLING_CLASS", "base"),
# MoE backend ("cutlass", "marlin", or "triton")
"FD_MOE_BACKEND":
lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),
# Triton kernel JIT compilation directory
"FD_TRITON_KERNEL_CACHE_DIR":
lambda: os.getenv("FD_TRITON_KERNEL_CACHE_DIR", None),
# Switch from standalone PD to centralized inference (0 or 1)
"FD_PD_CHANGEABLE":
lambda: os.getenv("FD_PD_CHANGEABLE", "1"),
}
```
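Because each entry in the table above is a zero-argument callable, a variable is re-read from the process environment on every access rather than frozen at import time. The sketch below illustrates how such a table can be consumed; the `get_env` helper and the two-entry table are illustrative only, not part of FastDeploy's API:

```python
import os
from typing import Any, Callable

# Minimal reproduction of the lazy-lookup pattern: defaults live in the
# lambda, and the environment is consulted at call time.
environment_variables: dict[str, Callable[[], Any]] = {
    "FD_DEBUG": lambda: os.getenv("FD_DEBUG", "0"),
    "FD_LOG_DIR": lambda: os.getenv("FD_LOG_DIR", "log"),
}

def get_env(name: str) -> Any:
    """Resolve a variable through the table, applying its default."""
    return environment_variables[name]()

os.environ["FD_DEBUG"] = "1"
print(get_env("FD_DEBUG"))    # prints "1": the change is seen immediately
print(get_env("FD_LOG_DIR"))  # falls back to "log" unless FD_LOG_DIR is set
```

Note that since the lambdas are evaluated on every call, changes made via `os.environ` after startup are picked up on the next access.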

docs/usage/log.md

@@ -0,0 +1,38 @@
# Log Description
FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.
By default, logs are stored in the `log` directory under the execution path. To specify a custom directory, set the environment variable `FD_LOG_DIR`.
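For example, the log directory can be redirected by exporting `FD_LOG_DIR` in the environment before the service process starts. A hedged sketch (the directory name is arbitrary):

```python
import os
import tempfile

# Create a custom log directory and point FastDeploy at it via FD_LOG_DIR.
# Any child process launched afterwards inherits this setting.
custom_log_dir = os.path.join(tempfile.gettempdir(), "fastdeploy_logs")
os.makedirs(custom_log_dir, exist_ok=True)
os.environ["FD_LOG_DIR"] = custom_log_dir
```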
## Inference Service Logs
* `backup_env.*.json` : Records environment variables set during instance startup. The number of files matches the number of GPU cards.
* `envlog.*` : Logs environment variables set during instance startup. The number of files matches the number of GPU cards.
* `console.log` : Records model startup time and other information. This log is also printed to the console.
* `data_processor.log` : Logs input/output data encoding and decoding details.
* `fastdeploy.log` : Records configuration information during instance startup, as well as request and response details during runtime.
* `workerlog.*` : Tracks model loading progress and inference operator errors. Each GPU card has a corresponding file.
* `worker_process.log` : Logs engine inference data for each iteration.
* `prefix_cache_manager.log` : Records KV Cache logical index allocation for each request and cache hit status.
* `launch_worker.log` : Logs model startup information and error messages.
* `gpu_worker.log` : Records KV Cache block count information during profiling.
* `gpu_model_runner.log` : Contains model details and loading time.
## Online Inference Client Logs
* `api_server.log` : Logs startup parameters and received request information.
## Scheduler Logs
* `scheduler.log` : Records scheduler information, including node status and request allocation details.
## Speculative Decoding Logs
* `speculate.log` : Contains speculative decoding-related information.
## Prefix Caching Logs
* `cache_queue_manager.log` : Logs startup parameters and received request information.
* `cache_transfer_manager.log` : Logs startup parameters and received request information.
* `launch_cache_manager.log` : Records cache transfer startup parameters and error messages.
## PD Disaggregation Logs
* `cache_messager.log` : Logs transmission protocols and messages used by the P instance.
* `splitwise_connector.log` : Records data received from P/D instances and connection establishment details.
## CudaGraph Logs
* `cudagraph_piecewise_backend.log` : Logs CudaGraph startup and error information.