diff --git a/docs/cli/README.md b/docs/cli/README.md new file mode 100644 index 000000000..6d4266732 --- /dev/null +++ b/docs/cli/README.md @@ -0,0 +1,40 @@ +# FastDeploy CLI User Guide + +## Introduction + +**FastDeploy CLI** is a command-line tool provided by the FastDeploy inference framework, designed for **running, deploying, and testing AI model inference tasks**. It allows developers to quickly perform model loading, API calls, service deployment, performance benchmarking, and environment information collection directly from the command line. + +With FastDeploy CLI, you can: + +* 🚀 **Run and validate model inference**: Generate chat responses or text completions directly in the command line (`chat`, `complete`). +* 🧩 **Deploy models as services**: Start an OpenAI-compatible API service with a single command (`serve`). +* 📊 **Perform performance and evaluation tests**: Conduct latency, throughput, and task benchmarks (`bench`). +* ⚙️ **Collect environment information**: Output system, framework, GPU, and FastDeploy version information (`collect-env`). +* 📁 **Run batch inference tasks**: Supports batch input/output from files or URLs (`run-batch`). +* 🔡 **Manage model tokenizers**: Encode/decode text and tokens, or export vocabulary (`tokenizer`). + +--- + +### View Help Information + +```bash +fastdeploy --help +``` + +### Available Commands + +```bash +fastdeploy {chat, complete, serve, bench, collect-env, run-batch, tokenizer} +``` + +--- + +| Command Name | Description | Detailed Documentation | +| ------------- | ------------------------------------------------------------------------------------------------ | -------------------------------------------------- | +| `chat` | Run interactive chat generation tasks in the command line to verify chat model inference results | [View chat command details](chat.md) | +| `complete` | Perform text completion tasks and test various language model outputs | [View complete command details](complete.md) | +| `serve` | Launch a local inference service compatible with the OpenAI API protocol | [View serve command details](serve.md) | +| `bench` | Evaluate model performance (latency, throughput) and accuracy | [View bench command details](bench.md) | +| `collect-env` | Collect and print system, GPU, dependency, and FastDeploy environment information | [View collect-env command details](collect-env.md) | +| `run-batch` | Run batch inference tasks with file or URL input/output | [View run-batch command details](run-batch.md) | +| `tokenizer` | Encode/decode text and tokens, and export vocabulary | [View tokenizer command details](tokenizer.md) | diff --git a/docs/cli/bench.md b/docs/cli/bench.md new file mode 100644 index 000000000..7561e0c29 --- /dev/null +++ b/docs/cli/bench.md @@ -0,0 +1,119 @@ +# bench: Benchmark Testing + +## 1. 
bench latency: Offline Latency Test
+
+### Parameters
+
+| Parameter | Description | Default |
+| -------------------- | ------------------------------------------- | ------- |
+| --input-len | Input sequence length (tokens) | 32 |
+| --output-len | Output sequence length (tokens) | 128 |
+| --batch-size | Batch size | 8 |
+| --n | Number of sequences generated per prompt | 1 |
+| --use-beam-search | Whether to use beam search | False |
+| --num-iters-warmup | Number of warmup iterations | 10 |
+| --num-iters | Number of actual test iterations | 30 |
+| --profile | Whether to enable performance profiling | False |
+| --output-json | Path to save latency results as a JSON file | None |
+| --disable-detokenize | Whether to disable detokenization | False |
+
+### Example
+
+```
+# Run latency benchmark on the inference engine
+fastdeploy bench latency --model baidu/ERNIE-4.5-0.3B-Paddle
+```
+
+## 2. bench serve: Online Latency and Throughput Test
+
+### Parameters
+
+| Parameter | Description | Default |
+| ----------------- | ------------------------------------- | ---------------------- |
+| --backend | Backend type | "openai-chat" |
+| --base-url | Base URL of the server or API | None |
+| --host | Host address | "127.0.0.1" |
+| --port | Port | 8000 |
+| --endpoint | API endpoint path | "/v1/chat/completions" |
+| --model | Model name | Required |
+| --dataset-name | Dataset name | "sharegpt" |
+| --dataset-path | Path to dataset | None |
+| --num-prompts | Number of prompts to process | 1000 |
+| --request-rate | Requests per second | inf |
+| --max-concurrency | Maximum concurrency | None |
+| --top-p | Sampling top-p (OpenAI backend) | None |
+| --top-k | Sampling top-k (OpenAI backend) | None |
+| --temperature | Sampling temperature (OpenAI backend) | None |
+
+The table above covers the core options; the example below additionally uses reporting flags (`--percentile-metrics`, `--metric-percentiles`, `--save-result`).
+
+### Example
+
+```
+# Run online performance test
+fastdeploy bench serve --backend openai-chat \
+  --model baidu/ERNIE-4.5-0.3B-Paddle \
+  --endpoint /v1/chat/completions \
+  --host 0.0.0.0 \
+  --port 8891 \
+  --dataset-name EBChat \
+  --dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
+  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
+  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
+  --num-prompts 1 \
+  --max-concurrency 1 \
+  --save-result
+```
+
+## 3. bench throughput: Throughput Test
+
+### Parameters
+
+| Parameter | Description | Default |
+| -------------------- | ---------------------------------------- | ------------ |
+| --backend | Inference backend | "fastdeploy" |
+| --dataset-name | Dataset name | "random" |
+| --model | Model name | Required |
+| --input-len | Input sequence length | None |
+| --output-len | Output sequence length | None |
+| --prefix-len | Prefix length | 0 |
+| --n | Number of sequences generated per prompt | 1 |
+| --num-prompts | Number of prompts | 50 |
+| --output-json | Path to save results as a JSON file | None |
+| --disable-detokenize | Whether to disable detokenization | False |
+| --lora-path | Path to LoRA adapter | None |
+
+### Example
+
+```
+# Run throughput benchmark on the inference engine
+fastdeploy bench throughput --model baidu/ERNIE-4.5-0.3B-Paddle \
+--backend fastdeploy-chat \
+--dataset-name EBChat \
+--dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
+--max-model-len 32768
+```
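+
+The latency and throughput subcommands can persist their metrics with `--output-json` for later comparison. A minimal sketch using only the flags documented above (`results.json` is an arbitrary output path):
+
+```bash
+# Sketch: run a random-dataset throughput benchmark and save the metrics to JSON
+fastdeploy bench throughput \
+  --model baidu/ERNIE-4.5-0.3B-Paddle \
+  --dataset-name random \
+  --input-len 128 \
+  --output-len 128 \
+  --num-prompts 50 \
+  --output-json results.json
+```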
+
+## 4. bench eval: Online Task Evaluation
+
+### Parameters
+
+| Parameter | Description | Default |
+| ----------------- | ------------------------------- | ------- |
+| --model, -m | Model name | "hf" |
+| --tasks, -t | List of evaluation tasks | None |
+| --model_args, -a | Model arguments | "" |
+| --num_fewshot, -f | Number of few-shot examples | None |
+| --samples, -E | Number of samples | None |
+| --batch_size, -b | Batch size | 1 |
+| --device | Device | None |
+| --output_path, -o | Output file path | None |
+| --write_out, -w | Whether to write output results | False |
+
+### Example
+
+```
+# Run task evaluation on an online service
+fastdeploy bench eval --model local-completions \
+  --model_args pretrained=./baidu/ERNIE-4.5-0.3B-Paddle,base_url=http://0.0.0.0:8490/v1/completions \
+  --write_out \
+  --tasks ceval-valid_accountant
+```
diff --git a/docs/cli/chat.md b/docs/cli/chat.md
new file mode 100644
index 000000000..d4d3d1dc2
--- /dev/null
+++ b/docs/cli/chat.md
@@ -0,0 +1,26 @@
+# chat: Conversational Generation
+
+`fastdeploy chat` interacts with a running API server to generate chat responses.
+
+## Parameters
+
+| Parameter | Description | Default |
+| --------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
+| --url | URL of the running OpenAI-compatible RESTful API server | [http://localhost:8000/v1](http://localhost:8000/v1) |
+| --model-name | Name of the model to use for prompt completion; defaults to the first model listed in the models API | None |
+| --api-key | API key for OpenAI services; overrides environment variable if provided | None |
+| --system-prompt | Specifies the system prompt used in the chat template | None |
+| -q, --quick | Sends a single prompt as a MESSAGE, prints the response, and exits | None |
+
+## Examples
+
+```
+# Connect directly to a local API
+fastdeploy chat
+
+# Specify an API URL
+fastdeploy chat --url http://{fastdeploy-serve-host}:{fastdeploy-serve-port}/v1
+
+# Send a single quick prompt
+fastdeploy chat --quick "hi"
+```
diff --git a/docs/cli/collect-env.md b/docs/cli/collect-env.md
new file mode 100644
index 000000000..8717233ff
--- /dev/null
+++ b/docs/cli/collect-env.md
@@ -0,0 +1,320 @@
+# collect-env: Environment Information Collector
+
+`collect-env` is used to gather information about the system, GPU, deep learning frameworks, and FastDeploy environment.
+This subcommand requires no additional arguments; simply run it directly.
+
+## Usage
+
+```
+fastdeploy collect-env
+```
+
+## Information Collected
+
+**1. System Information**
+
+* `os`: Operating system
+
+  * Linux: `lsb_release -a` or `cat /etc/*-release`
+  * Windows: `wmic os get Caption`
+  * macOS: `sw_vers -productVersion`
+* `gcc_version`: GCC version, retrieved by `gcc --version`
+* `clang_version`: Clang version, retrieved by `clang --version`
+* `cmake_version`: CMake version, retrieved by `cmake --version`
+* `libc_version`: GNU C library version (Linux only), retrieved by `platform.libc_ver()`
+
+**2. PyTorch Information**
+
+* `torch_version`: PyTorch version
+* `is_debug_build`: Whether it’s a Debug build
+* `cuda_compiled_version`: CUDA version used to compile PyTorch
+* `hip_compiled_version`: HIP version used to compile PyTorch (for AMD GPUs)
+
+**3. Paddle Information**
+
+* `paddle_version`: Paddle version
+* `paddle_compiled_version`: CUDA version used to compile Paddle
+
+**4. 
Python Environment** + +* `python_version`: Python version +* `python_platform`: Detailed platform information + +**5. CUDA / GPU Information** + +* `is_cuda_available`: Whether CUDA is available +* `cuda_runtime_version`: CUDA runtime version +* `cuda_module_loading`: CUDA module loading policy (`CUDA_MODULE_LOADING` environment variable) +* `nvidia_gpu_models`: GPU model(s) +* `nvidia_driver_version`: NVIDIA driver version +* `cudnn_version`: cuDNN version +* `caching_allocator_config`: CUDA caching allocator configuration (`PYTORCH_CUDA_ALLOC_CONF` environment variable) +* `is_xnnpack_available`: Whether XNNPACK is available + +**6. CPU Information** + +* `cpu_info`: Detailed CPU information (retrieved via `lscpu` or Windows system commands) + +**7. Relevant Library Versions** + +* `pip_packages`: Key library versions collected via `python -m pip list --format=freeze` +* `conda_packages`: Key library versions collected via `conda list` + +**8. FastDeploy-Specific Information** + +* `fastdeploy_version`: FastDeploy version (development builds include Git commit hash) +* `fastdeploy_build_flags`: Build flags (e.g., targeted CUDA architectures from `FD_BUILDING_ARCS`) +* `gpu_topo`: GPU topology (retrieved via `nvidia-smi topo -m`) + +**9. Environment Variables** + +* `env_vars`: Environment variables starting with `TORCH`, `CUDA`, `NCCL`, or FastDeploy-specific prefixes + + * Sensitive variables containing `secret`, `token`, etc., are filtered out. + +--- + +## Example Output +``` +============================== + System Info +============================== +OS : Ubuntu 20.04.6 LTS (x86_64) +GCC version : (GCC) 12.2.0 +Clang version : 3.8.0 (tags/RELEASE_380/final) +CMake version : version 3.18.0 +Libc version : glibc-2.31 + +============================== + PyTorch Info +============================== +PyTorch version : 2.5.1+cu118 +Is debug build : False +CUDA used to build PyTorch : 11.8 + +============================== + Paddle Info +============================== +Paddle version : 3.1.0 +CUDA used to build paddle : 12.6 + +============================== + Python Environment +============================== +Python version : 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime) +Python platform : Linux-5.10.0-1.0.0.28-x86_64-with-glibc2.31 + +============================== + CUDA / GPU Info +============================== +Is CUDA available : True +CUDA runtime version : 12.3.103 +CUDA_MODULE_LOADING set to : LAZY +GPU models and configuration : +GPU 0: NVIDIA A100-SXM4-40GB +GPU 1: NVIDIA A100-SXM4-40GB +GPU 2: NVIDIA A100-SXM4-40GB +GPU 3: NVIDIA A100-SXM4-40GB +GPU 4: NVIDIA A100-SXM4-40GB +GPU 5: NVIDIA A100-SXM4-40GB +GPU 6: NVIDIA A100-SXM4-40GB +GPU 7: NVIDIA A100-SXM4-40GB + +Nvidia driver version : 525.125.06 +cuDNN version : Could not collect +Is XNNPACK available : True + +============================== + CPU Info +============================== +Architecture: x86_64 +CPU op-mode(s): 32-bit, 64-bit +Byte Order: Little Endian +Address sizes: 46 bits physical, 48 bits virtual +CPU(s): 160 +On-line CPU(s) list: 0-159 +Thread(s) per core: 2 +Core(s) per socket: 20 +Socket(s): 4 +NUMA node(s): 4 +Vendor ID: GenuineIntel +CPU family: 6 +Model: 85 +Model name: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz +Stepping: 7 +CPU MHz: 3199.750 +CPU max MHz: 3900.0000 +CPU min MHz: 1000.0000 +BogoMIPS: 5000.00 +Virtualization: VT-x +L1d cache: 2.5 MiB +L1i cache: 2.5 MiB +L2 cache: 80 MiB +L3 cache: 110 MiB +NUMA node0 CPU(s): 0-19,80-99 +NUMA node1 CPU(s): 20-39,100-119 +NUMA 
node2 CPU(s): 40-59,120-139 +NUMA node3 CPU(s): 60-79,140-159 +Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled +Vulnerability L1tf: Not affected +Vulnerability Mds: Not affected +Vulnerability Meltdown: Not affected +Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp +Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization +Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling +Vulnerability Srbds: Not affected +Vulnerability Tsx async abort: Mitigation; TSX disabled +Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku avx512_vnni md_clear flush_l1d arch_capabilities + +============================== +Versions of relevant libraries +============================== +[pip3] aiozmq==1.0.0 +[pip3] flake8==7.2.0 +[pip3] numpy==1.26.4 +[pip3] nvidia-cublas-cu11==11.11.3.6 +[pip3] nvidia-cublas-cu12==12.6.4.1 +[pip3] nvidia-cuda-cccl-cu12==12.6.77 +[pip3] nvidia-cuda-cupti-cu11==11.8.87 +[pip3] nvidia-cuda-cupti-cu12==12.6.80 +[pip3] nvidia-cuda-nvrtc-cu11==11.8.89 +[pip3] nvidia-cuda-nvrtc-cu12==12.6.77 +[pip3] nvidia-cuda-runtime-cu11==11.8.89 +[pip3] nvidia-cuda-runtime-cu12==12.6.77 +[pip3] nvidia-cudnn-cu11==9.1.0.70 +[pip3] nvidia-cudnn-cu12==9.5.1.17 +[pip3] nvidia-cufft-cu11==10.9.0.58 +[pip3] nvidia-cufft-cu12==11.3.0.4 +[pip3] nvidia-cufile-cu12==1.11.1.6 +[pip3] nvidia-curand-cu11==10.3.0.86 +[pip3] nvidia-curand-cu12==10.3.7.77 +[pip3] nvidia-cusolver-cu11==11.4.1.48 +[pip3] nvidia-cusolver-cu12==11.7.1.2 +[pip3] nvidia-cusparse-cu11==11.7.5.86 +[pip3] nvidia-cusparse-cu12==12.5.4.2 +[pip3] nvidia-cusparselt-cu12==0.6.3 +[pip3] nvidia-ml-py==12.575.51 +[pip3] nvidia-nccl-cu11==2.21.5 +[pip3] nvidia-nccl-cu12==2.25.1 +[pip3] nvidia-nvjitlink-cu12==12.6.85 +[pip3] nvidia-nvtx-cu11==11.8.86 +[pip3] nvidia-nvtx-cu12==12.6.77 +[pip3] onnx==1.18.0 +[pip3] onnxoptimizer==0.3.13 +[pip3] paddle2onnx==2.0.1 +[pip3] pynvml==12.0.0 +[pip3] pyzmq==26.4.0 +[pip3] torch==2.5.1+cu118 +[pip3] torchaudio==2.5.1+cu118 +[pip3] torchvision==0.20.1+cu118 +[pip3] transformers==4.55.4 +[pip3] triton==3.3.0 +[pip3] use_triton_in_paddle==0.1.0 +[pip3] zmq==0.0.0 +[conda] aiozmq 1.0.0 pypi_0 pypi +[conda] numpy 1.26.4 pypi_0 pypi +[conda] nvidia-cublas-cu11 11.11.3.6 pypi_0 pypi +[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi +[conda] nvidia-cuda-cccl-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cuda-cupti-cu11 11.8.87 pypi_0 pypi +[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi +[conda] nvidia-cuda-nvrtc-cu11 11.8.89 pypi_0 pypi +[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cuda-runtime-cu11 11.8.89 pypi_0 pypi +[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cudnn-cu11 
9.1.0.70 pypi_0 pypi +[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi +[conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi +[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi +[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi +[conda] nvidia-curand-cu11 10.3.0.86 pypi_0 pypi +[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi +[conda] nvidia-cusolver-cu11 11.4.1.48 pypi_0 pypi +[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi +[conda] nvidia-cusparse-cu11 11.7.5.86 pypi_0 pypi +[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi +[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi +[conda] nvidia-ml-py 12.575.51 pypi_0 pypi +[conda] nvidia-nccl-cu11 2.21.5 pypi_0 pypi +[conda] nvidia-nccl-cu12 2.25.1 pypi_0 pypi +[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi +[conda] nvidia-nvtx-cu11 11.8.86 pypi_0 pypi +[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi +[conda] pynvml 12.0.0 pypi_0 pypi +[conda] pyzmq 26.4.0 pypi_0 pypi +[conda] torch 2.5.1+cu118 pypi_0 pypi +[conda] torchaudio 2.5.1+cu118 pypi_0 pypi +[conda] torchvision 0.20.1+cu118 pypi_0 pypi +[conda] transformers 4.55.4 pypi_0 pypi +[conda] triton 3.3.0 pypi_0 pypi +[conda] use-triton-in-paddle 0.1.0 pypi_0 pypi +[conda] zmq 0.0.0 pypi_0 pypi + +============================== + FastDeploy Info +============================== +FastDeply Version : 2.0.0a0 +FastDeply Build Flags: + CUDA Archs: []; +GPU Topology: + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity +GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS 0-19,80-99 0 +GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS 0-19,80-99 0 +GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS NODE PXB 20-39,100-119 1 +GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS NODE PXB 20-39,100-119 1 +GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS 40-59,120-139 2 +GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS 40-59,120-139 2 +GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS 60-79,140-159 3 +GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS 60-79,140-159 3 +NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X SYS SYS +NIC1 SYS SYS NODE NODE SYS SYS SYS SYS SYS X NODE +NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS NODE X + +Legend: + + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks + +NIC Legend: + + NIC0: mlx5_0 + NIC1: mlx5_1 + NIC2: mlx5_2 + +============================== + Environment Variables +============================== +NVIDIA_VISIBLE_DEVICES=GPU-0fe14fa3-b286-3d79-b223-1912257b4d64,GPU-282b567f-d2c4-f472-5c0d-975a7d96e1a7,GPU-a9d7e24d-1bb2-eb83-63fb-40584754f4be,GPU-924f3dc2-1b05-c35d-12f5-53d9458a1bd2,GPU-57591c1d-c444-18b8-c29d-f44cbaae8142,GPU-a28a9121-042a-81cf-d759-83ce1e3b962a,GPU-c124b75e-2768-6b7d-41fa-46dbf0159c87,GPU-b196a47d-c21e-1ec3-8003-5d776173ec7c +NCCL_P2P_DISABLE=0 +NVIDIA_REQUIRE_CUDA=cuda>=12.3 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 
brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536 +NCCL_IB_CUDA_SUPPORT=0 +NVIDIA_LIB=/usr/local/nvidia/lib64 +NCCL_VERSION=2.19.3-1 +NCCL_SOCKET_IFNAME=xgbe1 +NVIDIA_GDRCOPY=enabled +NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH +NVIDIA_DRIVER_CAPABILITIES=compute,utility +NCCL_DEBUG=INFO +NCCL_LIBRARY_PATH=/usr/local/nccl +NVIDIA_VISIBLE_GPUS_UUID=GPU-0fe14fa3-b286-3d79-b223-1912257b4d64,GPU-282b567f-d2c4-f472-5c0d-975a7d96e1a7,GPU-a9d7e24d-1bb2-eb83-63fb-40584754f4be,GPU-924f3dc2-1b05-c35d-12f5-53d9458a1bd2,GPU-57591c1d-c444-18b8-c29d-f44cbaae8142,GPU-a28a9121-042a-81cf-d759-83ce1e3b962a,GPU-c124b75e-2768-6b7d-41fa-46dbf0159c87,GPU-b196a47d-c21e-1ec3-8003-5d776173ec7c +NVIDIA_PRODUCT_NAME=CUDA +NCCL_IB_GID_INDEX=3 +CUDA_VERSION=12.3.1 +NVIDIA_TOOLS=/home/opt/cuda_tools +NCCL_DEBUG_FILE=/root/paddlejob/workspace/log/nccl.%h.%p.log +NCCL_IB_QPS_PER_CONNECTION=2 +NCCL_IB_CONNECT_RETRY_CNT=15 +NCCL_ERROR_FILE=/root/paddlejob/workspace/log/err.%h.%p.log +NCCL_IB_TIMEOUT=22 +CUDNN_VERSION=9.0.0 +NCCL_IB_DISABLE=0 +NVIDIA_VISIBLE_GPUS_SLOT=6,7,0,1,2,3,4,5 +NCCL_IB_ADAPTIVE_ROUTING=1 +OMP_NUM_THREADS=1 +CUDA_MODULE_LOADING=LAZY +``` diff --git a/docs/cli/complete.md b/docs/cli/complete.md new file mode 100644 index 000000000..9a25e7b27 --- /dev/null +++ b/docs/cli/complete.md @@ -0,0 +1,26 @@ +# complete: Text Completion Generation + +`fastdeploy complete` generates text completions based on a given prompt. + +## Parameters + +| Parameter | Description | Default | +| --------------- | ------------------------------------------------------------------ | ---------------------------------------------------- | +| --url | URL of the running OpenAI-compatible RESTful API server | [http://localhost:8000/v1](http://localhost:8000/v1) | +| --model-name | Name of the model used for prompt completion | None | +| --api-key | API key for OpenAI services | None | +| --system-prompt | Specifies the system prompt used in the chat template | None | +| -q, --quick | Sends a single prompt as a MESSAGE, prints the response, and exits | None | + +## Examples + +``` +# Connect directly to a local API +fastdeploy complete + +# Specify an API URL +fastdeploy complete --url http://{fastdeploy-serve-host}:{fastdeploy-serve-port}/v1 + +# Generate a quick completion +fastdeploy complete --quick "The future of AI is" +``` diff --git a/docs/cli/run-batch.md b/docs/cli/run-batch.md new file mode 100644 index 000000000..2e3cbba19 --- /dev/null +++ b/docs/cli/run-batch.md @@ -0,0 +1,42 @@ +# run-batch: Batch Inference + +## Description + +Run inference tasks in batch mode. Supports reading input requests from local files or remote URLs, and outputs results to a file or uploads them to a remote destination. 
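+
+For example, a hypothetical invocation that reads requests over HTTP and uploads results with HTTP PUT (the URLs below are placeholders, not real endpoints):
+
+```bash
+# Sketch: input fetched via HTTP GET, results uploaded via HTTP PUT (placeholder URLs)
+fastdeploy run-batch \
+  -i https://example.com/batch/requests.jsonl \
+  -o https://example.com/batch/results.jsonl \
+  --output-tmp-dir /tmp/fastdeploy-batch \
+  --model baidu/ERNIE-4.5-0.3B-Paddle
+```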
+ +## Usage + +``` +fastdeploy run-batch --model MODEL --input-file INPUT --output-file OUTPUT [parameters] +``` + +## Parameters + +| Parameter | Description | Default | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | +| -i, --input-file | Path or URL of the input file. Supports local file paths or HTTP/HTTPS URLs. If a local file path is provided, each line should contain one request. If a URL is provided, the file must be accessible via HTTP GET. | None | +| -o, --output-file | Path or URL of the output file. Supports local file paths or HTTP/HTTPS URLs. If a local file path is provided, each line will contain one response. If a URL is provided, the file must support HTTP PUT uploads. | None | +| --output-tmp-dir | Temporary directory used to store the output file before uploading it to the output URL. | None | +| --model | Path to the model | None | + +For more details on additional parameters, see the [FastDeploy Parameter Documentation](../parameters.md) + +## Example + +``` +fastdeploy run-batch -i Input.json -o Output.json --model baidu/ERNIE-4.5-0.3B-Paddle +``` + +## Example Input File (Input.json) + +``` +{"custom_id": "req-00001", "method": "POST", "url": "/v1/chat/completions", "body": {"messages": [{"role": "user", "content": "Tell me a fun fact. (id=1)"}], "temperature": 0.7, "max_tokens": 50}} +{"custom_id": "req-00002", "method": "POST", "url": "/v1/chat/completions", "body": {"messages": [{"role": "user", "content": "What's the weather like today? (id=2)"}], "temperature": 0.7, "max_tokens": 50}} +``` + +## Example Output File (Output.json) + +``` +{"id":"fastdeploy-84601f40de3e48aeb3fe4d2ca328c32e","custom_id":"req-00001","response":{"status_code":200,"request_id":"fastdeploy-batch-0c18b71f5349453eaf00ae04659a21a0","body":{"id":"chatcmpl-024e9267-3d44-4594-91da-b5033c856da9","object":"chat.completion","created":1761203881,"model":"/root/PaddlePaddle/ERNIE-4.5-0.3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"Here's a fun fact about a funny animal:\n\n**The Elephant in the Room**\n\nElephants are known for their ability to inflate themselves with air pressure. Imagine a giant elephant standing upright, its trunk filling","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"total_tokens":68,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}},"error":null} +{"id":"fastdeploy-04cdfbd5b51e43408971b16be4439888","custom_id":"req-00002","response":{"status_code":200,"request_id":"fastdeploy-batch-dd7a9bebd2964acba6713c5dcb4b4aa6","body":{"id":"chatcmpl-452e9d0b-6c04-4b6f-9a2c-7d961f2dc605","object":"chat.completion","created":1761203881,"model":"/root/PaddlePaddle/ERNIE-4.5-0.3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"Based on your query 'What's the weather like today? (id=2)', I need to know what kind of information you are looking for. 
Do you want to know about a specific weather condition (e.g., sunny, rainy, cloudy) or the current detailed forecast?","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":20,"total_tokens":70,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}},"error":null}
+```
diff --git a/docs/cli/serve.md b/docs/cli/serve.md
new file mode 100644
index 000000000..754db1553
--- /dev/null
+++ b/docs/cli/serve.md
@@ -0,0 +1,23 @@
+# serve: API Service Deployment
+
+`fastdeploy serve` provides service deployment compatible with the OpenAI API protocol.
+
+## Parameters
+
+The following table lists the available options:
+
+| Option | Description | Default |
+| ---------- | -------------------------------------------------------- | ------- |
+| `--config` | Read CLI options from a configuration file (YAML format) | None |
+
+For more parameter details, see: [FastDeploy Parameter Documentation](../parameters.md)
+
+## Examples
+
+```bash
+# Start the FastDeploy API server
+fastdeploy serve --model baidu/ERNIE-4.5-0.3B-Paddle
+
+# Start the server with a specified port
+fastdeploy serve --model baidu/ERNIE-4.5-0.3B-Paddle --port 8000
+```
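+
+The `--config` option can stand in for repeated command-line flags. A minimal sketch, assuming the YAML keys mirror the CLI option names (`model` and `port` below are assumptions, not a documented schema):
+
+```bash
+# Sketch: write a YAML config whose keys are assumed to mirror the CLI flags,
+# then start the server from it
+cat > config.yaml <<'EOF'
+model: baidu/ERNIE-4.5-0.3B-Paddle
+port: 8000
+EOF
+fastdeploy serve --config config.yaml
+```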
diff --git a/docs/cli/tokenizer.md b/docs/cli/tokenizer.md
new file mode 100644
index 000000000..6572722f2
--- /dev/null
+++ b/docs/cli/tokenizer.md
@@ -0,0 +1,60 @@
+# tokenizer
+
+## Description
+
+The **Tokenizer** subcommand provides encoding and decoding functionality between text and token sequences. It also allows viewing or exporting model vocabulary information. Both text and multimodal models are supported.
+
+## Usage
+
+```
+fastdeploy tokenizer --model MODEL (--encode TEXT | --decode TOKENS | --vocab-size | --info)
+```
+
+## Parameters
+
+| Parameter | Description | Default |
+| ----------------------------- | ------------------------------------------------------------------------------ | ------- |
+| --model, -m | Model path or name | None |
+| --encode, -e | Encode text into a list of tokens | None |
+| --decode, -d | Decode a list of tokens back into text | None |
+| --vocab-size, -vs | Display the vocabulary size | None |
+| --info, -i | Display detailed tokenizer information (special tokens, IDs, max length, etc.) | None |
+| --vocab-export FILE, -ve FILE | Export the vocabulary to a file | None |
+
+## Examples
+
+```
+# 1. Encode text into tokens
+# Convert input text into a token sequence recognizable by the model
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --encode "Hello, world!"
+
+# 2. Decode tokens into text
+# Convert a token sequence back into readable text
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --decode "[1, 2, 3]"
+
+# 3. View vocabulary size
+# Output the total number of tokens in the model’s vocabulary
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-size
+
+# 4. View tokenizer details
+# Includes special symbols, ID mappings, max token length, etc.
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --info
+
+# 5. Export vocabulary to a file
+# Save the tokenizer’s vocabulary to a local file
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-export ./vocab.txt
+
+# 6. Support for multimodal models
+# Decode tokens for a multimodal model
+fastdeploy tokenizer --model baidu/EB-VL-Lite-d --decode "[5300, 96382]"
+
+# 7. Combine multiple functions
+# Encode, decode, view vocabulary, and export vocabulary in a single command
+fastdeploy tokenizer \
+  -m baidu/ERNIE-4.5-0.3B-PT \
+  -e "你好哇" \
+  -d "[5300, 96382]" \
+  -i \
+  -vs \
+  -ve vocab.json
+```
diff --git a/docs/zh/cli/README.md b/docs/zh/cli/README.md
new file mode 100644
index 000000000..12454ea9e
--- /dev/null
+++ b/docs/zh/cli/README.md
@@ -0,0 +1,32 @@
+# FastDeploy CLI 使用说明文档
+
+## 简介
+**FastDeploy CLI** 是 FastDeploy 推理框架提供的命令行工具,用于**运行、部署和测试 AI 模型的推理任务**。它帮助开发者在命令行中快速完成模型加载、接口调用、服务启动、性能评测以及环境信息收集等常见工作。
+
+通过 FastDeploy CLI,您可以:
+
+* 🚀 **运行与验证模型推理**:直接在命令行中进行对话生成或文本补全(`chat`、`complete`)
+* 🧩 **服务化部署模型**:一键启动与 OpenAI 协议兼容的 API 服务(`serve`)
+* 📊 **执行性能与效果评测**:进行延迟、吞吐、任务评估等基准测试(`bench`)
+* ⚙️ **收集运行环境信息**:输出系统、框架、GPU 及 FastDeploy 版本配置(`collect-env`)
+* 📁 **批量运行推理任务**:支持文件或 URL 输入输出的批处理模式(`run-batch`)
+* 🔡 **管理模型的 Tokenizer**:执行文本与 token 的编码、解码及词表导出(`tokenizer`)
+
+### 查看帮助信息
+```
+fastdeploy --help
+```
+### 可用命令
+```
+fastdeploy {chat, complete, serve, bench, collect-env, run-batch, tokenizer}
+```
+
+| 命令名称 | 主要功能说明 | 详细说明链接 |
+| ------------- | -------------------------- | ----------------------------------- |
+| `chat` | 在命令行中进行对话生成任务,用于验证聊天模型推理效果 | [查看 chat 命令说明](chat.md) |
+| `complete` | 进行文本补全任务,支持多种语言模型输出测试 | [查看 complete 命令说明](complete.md) |
+| `serve` | 启动与 OpenAI 协议兼容的本地推理服务 | [查看 serve 命令说明](serve.md) |
+| `bench` | 对模型进行性能(延迟、吞吐)或精度评测 | [查看 bench 命令说明](bench.md) |
+| `collect-env` | 收集并打印系统、GPU、依赖等运行环境信息 | [查看 collect-env 命令说明](collect-env.md) |
+| `run-batch` | 批量执行推理任务,支持文件/URL输入输出 | [查看 run-batch 命令说明](run-batch.md) |
+| `tokenizer` | 执行文本与 token 的编码、解码及词表导出 | [查看 tokenizer 命令说明](tokenizer.md) |
diff --git a/docs/zh/cli/bench.md b/docs/zh/cli/bench.md
new file mode 100644
index 000000000..26cfdeaf8
--- /dev/null
+++ b/docs/zh/cli/bench.md
@@ -0,0 +1,106 @@
+# bench:基准测试
+## 1 bench latency:离线延迟测试
+### 参数
+|参数|说明|默认值|
+|-|-|-|
+|--input-len|输入序列长度(token)|32|
+|--output-len|输出序列长度(token)|128|
+|--batch-size|批量大小|8|
+|--n|每个提示生成序列数|1|
+|--use-beam-search|是否使用束搜索|False|
+|--num-iters-warmup|预热迭代次数|10|
+|--num-iters|实际测试迭代次数|30|
+|--profile|是否进行性能分析|False|
+|--output-json|保存延迟结果 JSON 文件路径|None|
+|--disable-detokenize|是否禁用 detokenization|False|
+
+### 示例
+```
+# 对推理引擎进行延迟测试
+fastdeploy bench latency --model baidu/ERNIE-4.5-0.3B-Paddle
+```
+
+## 2 bench serve:在线延迟与吞吐量测试
+### 参数
+|参数|说明|默认值|
+|-|-|-|
+|--backend|后端类型|"openai-chat"|
+|--base-url|服务器或 API 基础 URL|None|
+|--host|主机地址|"127.0.0.1"|
+|--port|端口|8000|
+|--endpoint|API 路径|"/v1/chat/completions"|
+|--model|模型名称|必需|
+|--dataset-name|数据集名称|"sharegpt"|
+|--dataset-path|数据集路径|None|
+|--num-prompts|处理提示数|1000|
+|--request-rate|每秒请求数|inf|
+|--max-concurrency|最大并发数|None|
+|--top-p|采样 top-p (OpenAI 后端)|None|
+|--top-k|采样 top-k (OpenAI 后端)|None|
+|--temperature|采样温度 (OpenAI 后端)|None|
+
+### 示例
+```
+# 对在线服务进行性能测试
+fastdeploy bench serve --backend openai-chat \
+  --model baidu/ERNIE-4.5-0.3B-Paddle \
+  --endpoint /v1/chat/completions \
+  --host 0.0.0.0 \
+  --port 8891 \
+  --dataset-name EBChat \
+  --dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
+  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
+  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
+  --num-prompts 1 \
+  --max-concurrency 1 \
+  --save-result
+```
+
+## 3 bench throughput:吞吐量测试
+### 参数
+|参数|说明|默认值|
+|-|-|-|
+|--backend|推理后端|"fastdeploy"|
+|--dataset-name|数据集|"random"|
+|--model|模型名称|必需|
+|--input-len|输入序列长度|None|
+|--output-len|输出序列长度|None|
+|--prefix-len|前缀长度|0|
+|--n|每个 prompt 生成序列数|1|
+|--num-prompts|prompt 数量|50|
+|--output-json|保存 JSON 文件路径|None|
+|--disable-detokenize|是否禁用 detokenization|False|
+|--lora-path|LoRA adapter 路径|None|
+
+### 示例
+```
+# 对推理引擎进行吞吐量测试
+fastdeploy bench throughput --model baidu/ERNIE-4.5-0.3B-Paddle \
+--backend fastdeploy-chat \
+--dataset-name EBChat \
+--dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
+--max-model-len 32768
+```

+## 4 bench eval:在线任务效果评估
+### 参数
+|参数|说明|默认值|
+|-|-|-|
+|--model, -m|模型名称|"hf"|
+|--tasks, -t|任务列表|None|
+|--model_args, -a|模型参数|""|
+|--num_fewshot, -f|Few-shot 样本数量|None|
+|--samples, -E|样本数量|None|
+|--batch_size, -b|批量大小|1|
+|--device|设备|None|
+|--output_path, -o|输出路径|None|
+|--write_out, -w|是否写出结果|False|
+
+### 示例
+```
+# 对服务进行相关任务的效果评估
+fastdeploy bench eval --model local-completions \
+  --model_args pretrained=./baidu/ERNIE-4.5-0.3B-Paddle,base_url=http://0.0.0.0:8490/v1/completions \
+  --write_out \
+  --tasks ceval-valid_accountant
+```
diff --git a/docs/zh/cli/chat.md b/docs/zh/cli/chat.md
new file mode 100644
index 000000000..5d055a85c
--- /dev/null
+++ b/docs/zh/cli/chat.md
@@ -0,0 +1,23 @@
+# chat:对话式生成
+`fastdeploy chat` 与正在运行的 API 服务器交互,生成对话。
+
+## 参数
+|参数|说明|默认值|
+|-|-|-|
+|--url|正在运行的 OpenAI 兼容 RESTful API 服务器的 URL|http://localhost:8000/v1|
+|--model-name|提示完成中使用的模型名称,默认为列表模型 API 中的第一个模型|None|
+|--api-key|用于 OpenAI 服务的 API 密钥,提供时会覆盖环境变量|None|
+|--system-prompt|用于在 chat template 中指定 system prompt|None|
+|-q, --quick|以 MESSAGE 形式发送单个提示并打印响应,然后退出|None|
+
+## 示例
+```
+# 直接连接本地主机 API
+fastdeploy chat
+
+# 指定 API URL
+fastdeploy chat --url http://{fastdeploy-serve-host}:{fastdeploy-serve-port}/v1
+
+# 只需一个提示即可快速聊天
+fastdeploy chat --quick "hi"
+```
diff --git a/docs/zh/cli/collect-env.md b/docs/zh/cli/collect-env.md
new file mode 100644
index 000000000..bd56c2b59
--- /dev/null
+++ b/docs/zh/cli/collect-env.md
@@ -0,0 +1,312 @@
+# collect-env:环境信息收集
+`collect-env` 用于收集系统、GPU、深度学习框架及 FastDeploy 的相关环境信息。子命令没有额外参数,直接执行即可。
+
+## 使用方式
+```
+fastdeploy collect-env
+```
+## 收集的信息
+**1. 系统信息**
+
+* `os`:操作系统
+  * Linux:`lsb_release -a` 或 `cat /etc/*-release`
+  * Windows:`wmic os get Caption`
+  * macOS:`sw_vers -productVersion`
+
+* `gcc_version`:GCC 版本,通过 `gcc --version` 获取
+* `clang_version`:Clang 版本,通过 `clang --version` 获取
+* `cmake_version`:CMake 版本,通过 `cmake --version` 获取
+* `libc_version`:GNU C 库版本(仅 Linux),通过 `platform.libc_ver()` 获取
+
+**2. PyTorch 信息**
+
+* `torch_version`:PyTorch 版本
+* `is_debug_build`:是否为 Debug 模式
+* `cuda_compiled_version`:编译 PyTorch 时的 CUDA 版本
+* `hip_compiled_version`:编译 PyTorch 时的 HIP 版本(AMD GPU)
+
+**3. Paddle 信息**
+
+* `paddle_version`:Paddle 版本
+* `paddle_compiled_version`:编译 Paddle 时的 CUDA 版本
+
+**4. Python 环境**
+
+* `python_version`:Python 版本
+* `python_platform`:平台详细信息
+
+**5. CUDA / GPU 信息**
+
+* `is_cuda_available`:CUDA 是否可用
+* `cuda_runtime_version`:CUDA 运行时版本
+* `cuda_module_loading`:CUDA 模块加载策略(环境变量 `CUDA_MODULE_LOADING`)
+* `nvidia_gpu_models`:GPU 型号
+* `nvidia_driver_version`:NVIDIA 驱动版本
+* `cudnn_version`:cuDNN 版本
+* `caching_allocator_config`:CUDA 缓存分配器配置(环境变量 `PYTORCH_CUDA_ALLOC_CONF`)
+* `is_xnnpack_available`:XNNPACK 是否可用
+
+**6. CPU 信息**
+
+* `cpu_info`:CPU 详细信息(通过 `lscpu` 或 Windows 系统命令获取)
+
+**7. 相关库版本**
+
+* `pip_packages`:通过 `python -m pip list --format=freeze` 收集关键库版本
+* `conda_packages`:通过 `conda list` 收集关键库版本
+
+**8. 
FastDeploy 特定信息** + +* `fastdeploy_version`:FastDeploy 版本(开发版包含 Git 提交哈希) +* `fastdeploy_build_flags`:构建标志(显示 `fastdeploy` 针对的 CUDA 架构,环境变量 `FD_BUILDING_ARCS`) +* `gpu_topo`:GPU 拓扑结构(通过 `nvidia-smi topo -m` 获取) + +**9. 环境变量** + +* `env_vars`:收集以 `TORCH`、`CUDA`、`NCCL` 等开头,以及 FastDeploy 自定义的环境变量 + * 会过滤包含 `secret`、`token` 等敏感信息 + +## 输出示例 +``` +============================== + System Info +============================== +OS : Ubuntu 20.04.6 LTS (x86_64) +GCC version : (GCC) 12.2.0 +Clang version : 3.8.0 (tags/RELEASE_380/final) +CMake version : version 3.18.0 +Libc version : glibc-2.31 + +============================== + PyTorch Info +============================== +PyTorch version : 2.5.1+cu118 +Is debug build : False +CUDA used to build PyTorch : 11.8 + +============================== + Paddle Info +============================== +Paddle version : 3.1.0 +CUDA used to build paddle : 12.6 + +============================== + Python Environment +============================== +Python version : 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime) +Python platform : Linux-5.10.0-1.0.0.28-x86_64-with-glibc2.31 + +============================== + CUDA / GPU Info +============================== +Is CUDA available : True +CUDA runtime version : 12.3.103 +CUDA_MODULE_LOADING set to : LAZY +GPU models and configuration : +GPU 0: NVIDIA A100-SXM4-40GB +GPU 1: NVIDIA A100-SXM4-40GB +GPU 2: NVIDIA A100-SXM4-40GB +GPU 3: NVIDIA A100-SXM4-40GB +GPU 4: NVIDIA A100-SXM4-40GB +GPU 5: NVIDIA A100-SXM4-40GB +GPU 6: NVIDIA A100-SXM4-40GB +GPU 7: NVIDIA A100-SXM4-40GB + +Nvidia driver version : 525.125.06 +cuDNN version : Could not collect +Is XNNPACK available : True + +============================== + CPU Info +============================== +Architecture: x86_64 +CPU op-mode(s): 32-bit, 64-bit +Byte Order: Little Endian +Address sizes: 46 bits physical, 48 bits virtual +CPU(s): 160 +On-line CPU(s) list: 0-159 +Thread(s) per core: 2 +Core(s) per socket: 20 +Socket(s): 4 +NUMA node(s): 4 +Vendor ID: GenuineIntel +CPU family: 6 +Model: 85 +Model name: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz +Stepping: 7 +CPU MHz: 3199.750 +CPU max MHz: 3900.0000 +CPU min MHz: 1000.0000 +BogoMIPS: 5000.00 +Virtualization: VT-x +L1d cache: 2.5 MiB +L1i cache: 2.5 MiB +L2 cache: 80 MiB +L3 cache: 110 MiB +NUMA node0 CPU(s): 0-19,80-99 +NUMA node1 CPU(s): 20-39,100-119 +NUMA node2 CPU(s): 40-59,120-139 +NUMA node3 CPU(s): 60-79,140-159 +Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled +Vulnerability L1tf: Not affected +Vulnerability Mds: Not affected +Vulnerability Meltdown: Not affected +Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp +Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization +Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling +Vulnerability Srbds: Not affected +Vulnerability Tsx async abort: Mitigation; TSX disabled +Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept 
vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku avx512_vnni md_clear flush_l1d arch_capabilities + +============================== +Versions of relevant libraries +============================== +[pip3] aiozmq==1.0.0 +[pip3] flake8==7.2.0 +[pip3] numpy==1.26.4 +[pip3] nvidia-cublas-cu11==11.11.3.6 +[pip3] nvidia-cublas-cu12==12.6.4.1 +[pip3] nvidia-cuda-cccl-cu12==12.6.77 +[pip3] nvidia-cuda-cupti-cu11==11.8.87 +[pip3] nvidia-cuda-cupti-cu12==12.6.80 +[pip3] nvidia-cuda-nvrtc-cu11==11.8.89 +[pip3] nvidia-cuda-nvrtc-cu12==12.6.77 +[pip3] nvidia-cuda-runtime-cu11==11.8.89 +[pip3] nvidia-cuda-runtime-cu12==12.6.77 +[pip3] nvidia-cudnn-cu11==9.1.0.70 +[pip3] nvidia-cudnn-cu12==9.5.1.17 +[pip3] nvidia-cufft-cu11==10.9.0.58 +[pip3] nvidia-cufft-cu12==11.3.0.4 +[pip3] nvidia-cufile-cu12==1.11.1.6 +[pip3] nvidia-curand-cu11==10.3.0.86 +[pip3] nvidia-curand-cu12==10.3.7.77 +[pip3] nvidia-cusolver-cu11==11.4.1.48 +[pip3] nvidia-cusolver-cu12==11.7.1.2 +[pip3] nvidia-cusparse-cu11==11.7.5.86 +[pip3] nvidia-cusparse-cu12==12.5.4.2 +[pip3] nvidia-cusparselt-cu12==0.6.3 +[pip3] nvidia-ml-py==12.575.51 +[pip3] nvidia-nccl-cu11==2.21.5 +[pip3] nvidia-nccl-cu12==2.25.1 +[pip3] nvidia-nvjitlink-cu12==12.6.85 +[pip3] nvidia-nvtx-cu11==11.8.86 +[pip3] nvidia-nvtx-cu12==12.6.77 +[pip3] onnx==1.18.0 +[pip3] onnxoptimizer==0.3.13 +[pip3] paddle2onnx==2.0.1 +[pip3] pynvml==12.0.0 +[pip3] pyzmq==26.4.0 +[pip3] torch==2.5.1+cu118 +[pip3] torchaudio==2.5.1+cu118 +[pip3] torchvision==0.20.1+cu118 +[pip3] transformers==4.55.4 +[pip3] triton==3.3.0 +[pip3] use_triton_in_paddle==0.1.0 +[pip3] zmq==0.0.0 +[conda] aiozmq 1.0.0 pypi_0 pypi +[conda] numpy 1.26.4 pypi_0 pypi +[conda] nvidia-cublas-cu11 11.11.3.6 pypi_0 pypi +[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi +[conda] nvidia-cuda-cccl-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cuda-cupti-cu11 11.8.87 pypi_0 pypi +[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi +[conda] nvidia-cuda-nvrtc-cu11 11.8.89 pypi_0 pypi +[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cuda-runtime-cu11 11.8.89 pypi_0 pypi +[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi +[conda] nvidia-cudnn-cu11 9.1.0.70 pypi_0 pypi +[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi +[conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi +[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi +[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi +[conda] nvidia-curand-cu11 10.3.0.86 pypi_0 pypi +[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi +[conda] nvidia-cusolver-cu11 11.4.1.48 pypi_0 pypi +[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi +[conda] nvidia-cusparse-cu11 11.7.5.86 pypi_0 pypi +[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi +[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi +[conda] nvidia-ml-py 12.575.51 pypi_0 pypi +[conda] nvidia-nccl-cu11 2.21.5 pypi_0 pypi +[conda] nvidia-nccl-cu12 2.25.1 pypi_0 pypi +[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi +[conda] nvidia-nvtx-cu11 11.8.86 pypi_0 pypi +[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi +[conda] pynvml 12.0.0 pypi_0 pypi +[conda] pyzmq 26.4.0 pypi_0 pypi +[conda] torch 2.5.1+cu118 pypi_0 pypi +[conda] torchaudio 2.5.1+cu118 pypi_0 pypi +[conda] torchvision 0.20.1+cu118 pypi_0 pypi +[conda] transformers 4.55.4 pypi_0 pypi +[conda] triton 3.3.0 pypi_0 pypi +[conda] use-triton-in-paddle 0.1.0 pypi_0 pypi 
+[conda] zmq 0.0.0 pypi_0 pypi + +============================== + FastDeploy Info +============================== +FastDeply Version : 2.0.0a0 +FastDeply Build Flags: + CUDA Archs: []; +GPU Topology: + GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity +GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS 0-19,80-99 0 +GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS 0-19,80-99 0 +GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS NODE PXB 20-39,100-119 1 +GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS NODE PXB 20-39,100-119 1 +GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS 40-59,120-139 2 +GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS 40-59,120-139 2 +GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS 60-79,140-159 3 +GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS 60-79,140-159 3 +NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X SYS SYS +NIC1 SYS SYS NODE NODE SYS SYS SYS SYS SYS X NODE +NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS NODE X + +Legend: + + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks + +NIC Legend: + + NIC0: mlx5_0 + NIC1: mlx5_1 + NIC2: mlx5_2 + +============================== + Environment Variables +============================== +NVIDIA_VISIBLE_DEVICES=GPU-0fe14fa3-b286-3d79-b223-1912257b4d64,GPU-282b567f-d2c4-f472-5c0d-975a7d96e1a7,GPU-a9d7e24d-1bb2-eb83-63fb-40584754f4be,GPU-924f3dc2-1b05-c35d-12f5-53d9458a1bd2,GPU-57591c1d-c444-18b8-c29d-f44cbaae8142,GPU-a28a9121-042a-81cf-d759-83ce1e3b962a,GPU-c124b75e-2768-6b7d-41fa-46dbf0159c87,GPU-b196a47d-c21e-1ec3-8003-5d776173ec7c +NCCL_P2P_DISABLE=0 +NVIDIA_REQUIRE_CUDA=cuda>=12.3 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536 +NCCL_IB_CUDA_SUPPORT=0 +NVIDIA_LIB=/usr/local/nvidia/lib64 +NCCL_VERSION=2.19.3-1 +NCCL_SOCKET_IFNAME=xgbe1 +NVIDIA_GDRCOPY=enabled +NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH +NVIDIA_DRIVER_CAPABILITIES=compute,utility +NCCL_DEBUG=INFO +NCCL_LIBRARY_PATH=/usr/local/nccl 
+NVIDIA_VISIBLE_GPUS_UUID=GPU-0fe14fa3-b286-3d79-b223-1912257b4d64,GPU-282b567f-d2c4-f472-5c0d-975a7d96e1a7,GPU-a9d7e24d-1bb2-eb83-63fb-40584754f4be,GPU-924f3dc2-1b05-c35d-12f5-53d9458a1bd2,GPU-57591c1d-c444-18b8-c29d-f44cbaae8142,GPU-a28a9121-042a-81cf-d759-83ce1e3b962a,GPU-c124b75e-2768-6b7d-41fa-46dbf0159c87,GPU-b196a47d-c21e-1ec3-8003-5d776173ec7c +NVIDIA_PRODUCT_NAME=CUDA +NCCL_IB_GID_INDEX=3 +CUDA_VERSION=12.3.1 +NVIDIA_TOOLS=/home/opt/cuda_tools +NCCL_DEBUG_FILE=/root/paddlejob/workspace/log/nccl.%h.%p.log +NCCL_IB_QPS_PER_CONNECTION=2 +NCCL_IB_CONNECT_RETRY_CNT=15 +NCCL_ERROR_FILE=/root/paddlejob/workspace/log/err.%h.%p.log +NCCL_IB_TIMEOUT=22 +CUDNN_VERSION=9.0.0 +NCCL_IB_DISABLE=0 +NVIDIA_VISIBLE_GPUS_SLOT=6,7,0,1,2,3,4,5 +NCCL_IB_ADAPTIVE_ROUTING=1 +OMP_NUM_THREADS=1 +CUDA_MODULE_LOADING=LAZY +``` diff --git a/docs/zh/cli/complete.md b/docs/zh/cli/complete.md new file mode 100644 index 000000000..cd4cf89f2 --- /dev/null +++ b/docs/zh/cli/complete.md @@ -0,0 +1,23 @@ +# complete:补全式生成 +`fastdeploy complete` 根据给定提示词生成文本完成。 + +## 参数 +|参数|说明|默认值| +|-|-|-| +|--url|正在运行的 OpenAI-兼容 RESTful API 服务器的 URL|http://localhost:8000/v1| +|--model-name|提示完成中使用的模型名称|None| +|--api-key|用于 OpenAI 服务的 API 密钥|None| +|--system-prompt|在 chat template 中指定 system prompt|None| +|-q, --quick|以 MESSAGE 形式发送单个提示并打印响应|None| + +## 示例 +``` +# 直接连接本地主机 API +fastdeploy complete + +# 指定 API URL +fastdeploy complete --url http://{fastdeploy-serve-host}:{fastdeploy-serve-port}/v1 + +# 快速完成 +fastdeploy complete --quick "The future of AI is" +``` diff --git a/docs/zh/cli/run-batch.md b/docs/zh/cli/run-batch.md new file mode 100644 index 000000000..dc944a10b --- /dev/null +++ b/docs/zh/cli/run-batch.md @@ -0,0 +1,36 @@ +# run-batch:批处理 +## 说明 +批量运行推理任务。支持从本地文件或远程 URL 读取输入请求,并将推理结果输出到文件或上传至远程 URL + +## 用法 +``` + fastdeploy run-batch --model MODEL --input-file INPUT --output-file OUTPUT [参数] +``` + +## 参数 +|参数|说明|默认值| +|-|-|-| +|-i, --input-file|单个输入文件的路径或 URL。目前支持本地文件路径或 HTTP/HTTPS 协议。如果是文件路径,文件里每行一个请求;如果指定了 URL,文件应可通过 HTTP GET 访问。|None| +|-o, --output-file|单个输出文件的路径或 URL。目前支持本地文件路径或网络 URL(HTTP/HTTPS)。如果是输出文件路径,每行一个响应;如果指定了 URL,文件应可通过 HTTP PUT 上传。|None| +|--output-tmp-dir|在将输出文件上传到输出 URL 之前,用于存放输出文件的临时目录。|None| +|--model|模型路径|None| + +更多参数说明见:[FastDeploy 参数文档](../parameters.md) + +## 示例 +``` +fastdeploy run-batch -i Input.json -o Output.json --model baidu/ERNIE-4.5-0.3B-Paddle +``` + +## 输入文件格式示例(Input.json) +``` +{"custom_id": "req-00001", "method": "POST", "url": "/v1/chat/completions", "body": {"messages": [{"role": "user", "content": "Tell me a fun fact. (id=1)"}], "temperature": 0.7, "max_tokens": 50}} +{"custom_id": "req-00002", "method": "POST", "url": "/v1/chat/completions", "body": {"messages": [{"role": "user", "content": "What's the weather like today? (id=2)"}], "temperature": 0.7, "max_tokens": 50}} +``` + +## 输出文件格式示例(Output.json) +``` +{"id":"fastdeploy-84601f40de3e48aeb3fe4d2ca328c32e","custom_id":"req-00001","response":{"status_code":200,"request_id":"fastdeploy-batch-0c18b71f5349453eaf00ae04659a21a0","body":{"id":"chatcmpl-024e9267-3d44-4594-91da-b5033c856da9","object":"chat.completion","created":1761203881,"model":"/root/PaddlePaddle/ERNIE-4.5-0.3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"Here's a fun fact about a funny animal:\n\n**The Elephant in the Room**\n\nElephants are known for their ability to inflate themselves with air pressure. 
Imagine a giant elephant standing upright, its trunk filling","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"total_tokens":68,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}},"error":null}
+{"id":"fastdeploy-04cdfbd5b51e43408971b16be4439888","custom_id":"req-00002","response":{"status_code":200,"request_id":"fastdeploy-batch-dd7a9bebd2964acba6713c5dcb4b4aa6","body":{"id":"chatcmpl-452e9d0b-6c04-4b6f-9a2c-7d961f2dc605","object":"chat.completion","created":1761203881,"model":"/root/PaddlePaddle/ERNIE-4.5-0.3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"根据您提供的查询语句“What's the weather like today? (id=2)”,我需要先了解您想要查询的内容。请问您想查询的是某个特定天气状况(如晴天、下雨、阴天等)还是当前具体","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":20,"total_tokens":70,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}},"error":null}
+```
diff --git a/docs/zh/cli/serve.md b/docs/zh/cli/serve.md
new file mode 100644
index 000000000..e9910fd62
--- /dev/null
+++ b/docs/zh/cli/serve.md
@@ -0,0 +1,20 @@
+# serve:API 服务化
+`fastdeploy serve` 提供与 OpenAI 协议兼容的服务化部署。
+
+## 参数
+可用选项如下表所示:
+
+|选项|说明|默认|
+|-|-|-|
+|--config|从配置文件读取 CLI 选项(YAML 格式)|None|
+
+更多参数说明见:[FastDeploy 参数文档](../parameters.md)
+
+## 示例
+```
+# 启动 FastDeploy API 服务器
+fastdeploy serve --model baidu/ERNIE-4.5-0.3B-Paddle
+
+# 指定端口启动
+fastdeploy serve --model baidu/ERNIE-4.5-0.3B-Paddle --port 8000
+```
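+
+`--config` 选项可用于代替重复的命令行参数。下面是一个最小示意(假设 YAML 键与 CLI 选项名一致;`model`、`port` 为示意字段,并非已写明的配置格式):
+
+```bash
+# 示意:先写入配置文件(假设键名与 CLI 参数一致),再据此启动服务
+cat > config.yaml <<'EOF'
+model: baidu/ERNIE-4.5-0.3B-Paddle
+port: 8000
+EOF
+fastdeploy serve --config config.yaml
+```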
diff --git a/docs/zh/cli/tokenizer.md b/docs/zh/cli/tokenizer.md
new file mode 100644
index 000000000..139a06122
--- /dev/null
+++ b/docs/zh/cli/tokenizer.md
@@ -0,0 +1,57 @@
+# tokenizer
+## 说明
+Tokenizer 子命令提供文本与 token 序列之间的编码与解码功能,并可查看或导出模型的词表信息。支持文本模型与多模态模型。
+
+## 用法
+```
+fastdeploy tokenizer --model MODEL (--encode TEXT | --decode TOKENS | --vocab-size | --info)
+```
+
+## 参数
+|参数|说明|默认值|
+|-|-|-|
+|--model, -m|模型路径或名称|None|
+|--encode, -e|将文本编码为 token 列表|None|
+|--decode, -d|将 token 列表解码为文本|None|
+|--vocab-size, -vs|查看词表大小|None|
+|--info, -i|查看 tokenizer 详细信息(特殊符号、ID、最大长度等)|None|
+|--vocab-export FILE, -ve FILE|导出词表到文件|None|
+
+## 示例
+```
+# 1. 编码文本为 tokens
+# 将输入文本转换为模型可识别的 token 序列
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --encode "Hello, world!"
+
+# 2. 解码 tokens 为文本
+# 将 token 序列转换回可读文本
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --decode "[1, 2, 3]"
+
+# 3. 查看词表大小
+# 输出模型 tokenizer 的总词表数量
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-size
+
+# 4. 查看 tokenizer 详细信息
+# 包括特殊符号、ID 映射、最大长度等信息
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --info
+
+# 5. 导出词表到文件
+# 将 tokenizer 的词表保存到本地文件
+fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-export ./vocab.txt
+
+# 6. 支持多模态模型
+# 对多模态模型进行解码
+fastdeploy tokenizer --model baidu/EB-VL-Lite-d --decode "[5300, 96382]"
+
+# 7. 多功能组合使用
+# 可以同时进行编码、解码、查看词表、导出词表等操作
+fastdeploy tokenizer \
+  -m baidu/ERNIE-4.5-0.3B-PT \
+  -e "你好哇" \
+  -d "[5300, 96382]" \
+  -i \
+  -vs \
+  -ve vocab.json
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index 714a6dc32..4ef31d529 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -64,8 +64,14 @@ plugins:
         Scheduler: 调度器
         Graceful Shutdown: 服务优雅关闭
         Offline Inference: 离线推理
-        Best Practices: 最佳实践
-        ERNIE-4.5-0.3B: ERNIE-4.5-0.3B
+        CLI: CLI 使用说明
+        Chat: Chat命令
+        Complete: Complete命令
+        Server: Server命令
+        Bench: Bench命令
+        Collect Env: Collect Env命令
+        Run Batch: Run Batch命令
+        Tokenizer: Tokenizer命令
         ERNIE-4.5-21B-A3B: ERNIE-4.5-21B-A3B
         ERNIE-4.5-21B-A3B-Thinking: ERNIE-4.5-21B-A3B-Thinking
         ERNIE-4.5-300B-A47B: ERNIE-4.5-300B-A47B
@@ -152,3 +158,12 @@ nav:
     - Log Description: usage/log.md
     - Code Overview: usage/code_overview.md
    - Environment Variables: usage/environment_variables.md
+  - CLI:
+    - Overview: cli/README.md
+    - Chat: cli/chat.md
+    - Complete: cli/complete.md
+    - Server: cli/serve.md
+    - Collect Env: cli/collect-env.md
+    - Bench: cli/bench.md
+    - Run Batch: cli/run-batch.md
+    - Tokenizer: cli/tokenizer.md