Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2025-12-24 13:28:13 +08:00

Commit: Sync v2.0 version of code to github repo
docs/usage/code_overview.md
# Code Overview

Below is an overview of the FastDeploy code structure and functionality, organized by directory.

- `custom_ops`: C++ operators used by FastDeploy for large model inference. Operators for different hardware are placed in corresponding subdirectories (e.g., `cpu_ops`, `gpu_ops`). The root-level `setup_*.py` files are used to compile these C++ operators.
- `dockerfiles`: Dockerfiles for building FastDeploy runtime environment images.
- `docs`: Documentation for the FastDeploy codebase.
- `fastdeploy`
  - `agent`: Scripts for launching large model services.
  - `cache_manager`: Cache management module for large models.
  - `engine`: Core engine classes for managing large model execution.
  - `entrypoints`: User-facing APIs for interaction.
  - `input`: Input processing module, covering preprocessing, multimodal input handling, tokenization, etc.
  - `model_executor`
    - `layers`: Layer modules required by large model architectures.
    - `model_runner`: Model inference execution module.
    - `models`: Built-in large model classes in FastDeploy.
    - `ops`: Python-callable operator modules compiled from `custom_ops`, organized by hardware platform.
  - `output`: Post-processing for large model outputs.
  - `platforms`: Platform-specific modules for underlying hardware support.
  - `scheduler`: Request scheduling module for large models.
  - `metrics`: Core component for collecting, managing, and exporting Prometheus metrics, tracking key runtime performance data (e.g., request latency, resource utilization, successful request counts).
  - `splitwise`: Modules related to PD disaggregation deployment.
- `scripts`/`tools`: Utility scripts for FastDeploy operations (e.g., compilation, unit testing, code style fixes).
- `test`: Code for unit testing and validation.
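To illustrate how the per-hardware layout under `model_executor/ops` can be consumed, here is a minimal, hypothetical sketch. The helper name `ops_module_for` and the set of platform names are assumptions for illustration, not FastDeploy API; only `cpu` and `gpu` are taken from the subdirectories named above.

```python
# Hypothetical helper: map a hardware platform name to the dotted path of
# the compiled operator package described above. Purely illustrative.
def ops_module_for(platform: str) -> str:
    supported = {"cpu", "gpu"}  # subdirectory names mentioned in the overview
    if platform not in supported:
        raise ValueError(f"unsupported platform: {platform}")
    return f"fastdeploy.model_executor.ops.{platform}"

print(ops_module_for("gpu"))  # fastdeploy.model_executor.ops.gpu
```

The resulting dotted path could then be handed to `importlib.import_module` in an environment where the operators have been compiled.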
docs/usage/environment_variables.md
# FastDeploy Environment Variables

FastDeploy's environment variables are defined in `fastdeploy/envs.py` at the root of the repository. The excerpt below (with the imports it relies on) documents each variable:

```python
import os
from typing import Any, Callable

environment_variables: dict[str, Callable[[], Any]] = {
    # CUDA architecture versions used when building FastDeploy (string list, e.g. [80,90])
    "FD_BUILDING_ARCS":
    lambda: os.getenv("FD_BUILDING_ARCS", "[]"),

    # Log directory
    "FD_LOG_DIR":
    lambda: os.getenv("FD_LOG_DIR", "log"),

    # Enable debug mode (0 or 1)
    "FD_DEBUG":
    lambda: os.getenv("FD_DEBUG", "0"),

    # FastDeploy log retention days
    "FD_LOG_BACKUP_COUNT":
    lambda: os.getenv("FD_LOG_BACKUP_COUNT", "7"),

    # Model download cache directory
    "FD_MODEL_CACHE":
    lambda: os.getenv("FD_MODEL_CACHE", None),

    # Maximum number of stop sequences
    "FD_MAX_STOP_SEQS_NUM":
    lambda: os.getenv("FD_MAX_STOP_SEQS_NUM", "5"),

    # Maximum length of stop sequences
    "FD_STOP_SEQS_MAX_LEN":
    lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),

    # GPU devices to use (comma-separated string, e.g. 0,1,2)
    "CUDA_VISIBLE_DEVICES":
    lambda: os.getenv("CUDA_VISIBLE_DEVICES", None),

    # Whether to use HuggingFace tokenizer (0 or 1)
    "FD_USE_HF_TOKENIZER":
    lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),

    # ZMQ send high-water mark (HWM) during initialization
    "FD_ZMQ_SNDHWM":
    lambda: os.getenv("FD_ZMQ_SNDHWM", 10000),

    # Directory for caching KV quantization parameters
    "FD_CACHE_PARAMS":
    lambda: os.getenv("FD_CACHE_PARAMS", "none"),

    # Attention backend ("NATIVE_ATTN", "APPEND_ATTN", or "MLA_ATTN")
    "FD_ATTENTION_BACKEND":
    lambda: os.getenv("FD_ATTENTION_BACKEND", "APPEND_ATTN"),

    # Sampling class ("base", "air", or "rejection")
    "FD_SAMPLING_CLASS":
    lambda: os.getenv("FD_SAMPLING_CLASS", "base"),

    # MoE backend ("cutlass", "marlin", or "triton")
    "FD_MOE_BACKEND":
    lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),

    # Triton kernel JIT compilation directory
    "FD_TRITON_KERNEL_CACHE_DIR":
    lambda: os.getenv("FD_TRITON_KERNEL_CACHE_DIR", None),

    # Switch from standalone PD to centralized inference (0 or 1)
    "FD_PD_CHANGEABLE":
    lambda: os.getenv("FD_PD_CHANGEABLE", "1"),
}
```
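Note that the values are zero-argument lambdas: each variable is read from the process environment when it is accessed, not when the dict is defined. A minimal, self-contained sketch of the same pattern, reduced to two representative entries (the helper name `get_env` is an assumption for illustration, not FastDeploy's API):

```python
import os
from typing import Any, Callable

# Same lazy-lookup pattern as envs.py above: each value is a thunk, so the
# environment is consulted when the variable is *used*, not at import time.
environment_variables: dict[str, Callable[[], Any]] = {
    "FD_LOG_DIR": lambda: os.getenv("FD_LOG_DIR", "log"),
    "FD_DEBUG": lambda: os.getenv("FD_DEBUG", "0"),
}

def get_env(name: str) -> Any:
    # Resolve the thunk now; later changes to os.environ are picked up.
    return environment_variables[name]()

os.environ.pop("FD_LOG_DIR", None)  # make the default observable
print(get_env("FD_LOG_DIR"))        # -> log
os.environ["FD_DEBUG"] = "1"
print(get_env("FD_DEBUG"))          # -> 1 (read at call time)
```

The design choice matters for long-running services: a plain `{"FD_DEBUG": os.getenv("FD_DEBUG", "0")}` would freeze the value at import time, whereas the lambda form reflects the environment at each lookup.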
docs/usage/log.md
# Log Description

FastDeploy generates the following log files during deployment. Below is an explanation of each log's purpose.

By default, logs are stored in the `log` directory under the execution path. To specify a custom directory, set the environment variable `FD_LOG_DIR`.
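For instance, a launcher script could set the variable before starting the service. This is a hedged sketch: `/opt/fd_logs` is an arbitrary example path, and the actual launch command is out of scope here.

```python
import os

# FD_LOG_DIR falls back to "log" under the current working directory when unset.
default_dir = os.getenv("FD_LOG_DIR", "log")

# Redirect log files by setting the variable in the environment of the
# process that will launch the FastDeploy service.
os.environ["FD_LOG_DIR"] = "/opt/fd_logs"  # arbitrary example path
```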
## Inference Service Logs

* `backup_env.*.json` : Records environment variables set during instance startup. The number of files matches the number of GPU cards.
* `envlog.*` : Logs environment variables set during instance startup. The number of files matches the number of GPU cards.
* `console.log` : Records model startup time and other information. This log is also printed to the console.
* `data_processor.log` : Logs input/output data encoding and decoding details.
* `fastdeploy.log` : Records configuration information during instance startup, as well as request and response details during runtime.
* `workerlog.*` : Tracks model loading progress and inference operator errors. Each GPU card has a corresponding file.
* `worker_process.log` : Logs engine inference data for each iteration.
* `prefix_cache_manager.log` : Records KV Cache logical index allocation for each request and cache hit status.
* `launch_worker.log` : Logs model startup information and error messages.
* `gpu_worker.log` : Records KV Cache block count information during profiling.
* `gpu_model_runner.log` : Contains model details and loading time.

## Online Inference Client Logs

* `api_server.log` : Logs startup parameters and received request information.

## Scheduler Logs

* `scheduler.log` : Records scheduler information, including node status and request allocation details.

## Speculative Decoding Logs

* `speculate.log` : Contains speculative decoding-related information.

## Prefix Caching Logs

* `cache_queue_manager.log` : Logs startup parameters and received request information.
* `cache_transfer_manager.log` : Logs startup parameters and received request information.
* `launch_cache_manager.log` : Records cache transfer startup parameters and error messages.

## PD Disaggregation Logs

* `cache_messager.log` : Logs transmission protocols and messages used by the P instance.
* `splitwise_connector.log` : Records data received from P/D instances and connection establishment details.

## CudaGraph Logs

* `cudagraph_piecewise_backend.log` : Logs CudaGraph startup and error information.