# DeepGEMM Pre-compilation Tool

This tool pre-compiles DeepGEMM kernels ahead of time and caches them, so that they are ready before the model is served instead of being compiled on first use.

## Usage

### 1. Using Shell Script (Recommended)
```bash
bash pre_compile.sh \
    [MODEL_PATH] \
    [TP_SIZE] \
    [EP_SIZE] \
    [HAS_SHARED_EXPERTS] \
    [OUTPUT_FILE]
```

The script will:
1. Generate configurations
2. Pre-compile all kernels
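
For scripted use, the invocation above can be assembled programmatically. The following is a minimal Python sketch and not part of the tool: the helper name `shell_script_command` is hypothetical, the example values are placeholders, and rendering `HAS_SHARED_EXPERTS` as the literal string `True` or `False` is an assumption based on the `[True/False]` format shown for `generate_config.py`.

```python
import shlex


def shell_script_command(model_path, tp_size, ep_size, has_shared_experts, output_file):
    """Build the argv list for pre_compile.sh (all arguments are positional)."""
    return [
        "bash",
        "pre_compile.sh",
        str(model_path),
        str(tp_size),
        str(ep_size),
        str(bool(has_shared_experts)),  # assumption: passed as "True"/"False"
        str(output_file),
    ]


# Placeholder values; replace them with your own model path and sizes.
cmd = shell_script_command(
    "/path/to/model", 1, 8, False, "./deep_gemm_pre_compile_config.jsonl"
)
print(shlex.join(cmd))
# To actually run it from the tool directory: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids quoting problems when the model path contains spaces.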

### 2. Alternative: Manual Steps
If you need more control, you can run the steps manually:

#### Generate Configuration
```bash
python generate_config.py \
    --model /path/to/model \
    --tensor-parallel-size [TP_SIZE] \
    --expert-parallel-size [EP_SIZE] \
    --has-shared-experts [True/False] \
    --output [CONFIG_FILE]
```

Arguments:
- `--model`: Path to model directory containing config.json
- `--tensor-parallel-size`: Tensor parallel size (default: 1)
- `--expert-parallel-size`: Expert parallel size (default: 8)
- `--has-shared-experts`: Whether model has shared experts (default: False)
- `--output`: Output config file path (default: ./deep_gemm_pre_compile_config.jsonl)
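
To sanity-check the generated file before compiling, the sketch below counts its entries. It assumes only what the `.jsonl` extension suggests, namely one JSON value per non-empty line. The entry schema is not documented here, so the code does not rely on any field names. `parse_jsonl` and `load_config_entries` are hypothetical helpers.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON value per non-empty line."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return entries


def load_config_entries(path):
    """Read a JSON Lines file from disk and return its parsed entries."""
    with Path(path).open(encoding="utf-8") as f:
        return parse_jsonl(f)


config_path = Path("./deep_gemm_pre_compile_config.jsonl")  # documented default
if config_path.exists():
    print(f"{config_path}: {len(load_config_entries(config_path))} entries")
else:
    print(f"{config_path} not found; run generate_config.py first")
```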

#### Pre-compile Kernels
```bash
python pre_compile.py \
    --config-file [CONFIG_FILE] \
    --expert-parallel-size [EP_SIZE] \
    --num-threads [NUM_THREADS]
```

Arguments:
- `--config-file`: Path to the config file produced by `generate_config.py`
- `--expert-parallel-size`: Expert parallel size (must match the value passed to `generate_config.py`)
- `--num-threads`: Number of compilation threads (default: number of CPU cores)
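
Because `--expert-parallel-size` must be identical in both steps, it helps to derive both commands from a single set of values. The sketch below is one way to do that in Python. `manual_commands` is a hypothetical helper, and falling back to `os.cpu_count()` for `--num-threads` mirrors the documented default rather than the tool's actual implementation.

```python
import os
import shlex


def manual_commands(model, config_file, tp_size=1, ep_size=8,
                    has_shared_experts=False, num_threads=None):
    """Return (generate_cmd, compile_cmd) that share one expert-parallel size."""
    if num_threads is None:
        num_threads = os.cpu_count() or 1  # cpu_count() may return None
    generate_cmd = [
        "python", "generate_config.py",
        "--model", str(model),
        "--tensor-parallel-size", str(tp_size),
        "--expert-parallel-size", str(ep_size),
        "--has-shared-experts", str(bool(has_shared_experts)),
        "--output", str(config_file),
    ]
    compile_cmd = [
        "python", "pre_compile.py",
        "--config-file", str(config_file),       # output of the first command
        "--expert-parallel-size", str(ep_size),  # same value as above
        "--num-threads", str(num_threads),
    ]
    return generate_cmd, compile_cmd


# Placeholder paths; print the commands instead of running them.
for cmd in manual_commands("/path/to/model", "./deep_gemm_pre_compile_config.jsonl"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in config generation stops the run before compilation starts.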

## Environment Variables
- `PRE_COMPILE_LOG_LEVEL`: Log level (`DEBUG`, `INFO`, `WARNING`, or `ERROR`)
- `DG_CACHE_DIR`: Cache directory for compiled kernels (default: ./deep_gemm_cache)
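
When launching the tool from another program, both variables can be set on the child process without modifying the parent environment. This is a minimal sketch, assuming the scripts read the variables at startup; `pre_compile_env` is a hypothetical helper.

```python
import os

LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def pre_compile_env(log_level, cache_dir, base_env=None):
    """Return a copy of the environment with the two documented variables set."""
    level = log_level.upper()
    if level not in LOG_LEVELS:
        raise ValueError(f"log level must be one of {LOG_LEVELS}, got {log_level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["PRE_COMPILE_LOG_LEVEL"] = level
    # Use an absolute path so the cache location does not depend on the
    # working directory of the child process.
    env["DG_CACHE_DIR"] = os.path.abspath(cache_dir)
    return env


env = pre_compile_env("debug", "./deep_gemm_cache")
print(env["PRE_COMPILE_LOG_LEVEL"], env["DG_CACHE_DIR"])
# Pass it along with: subprocess.run(cmd, env=env, check=True)
```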

## Notes
- For the shortest compilation time, set `--num-threads` to the number of available CPU cores (this is also the default)
- Compilation can take a long time; the duration grows with the number of configurations in the config file
- Compiled kernels are cached in `DG_CACHE_DIR`
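
To confirm that a run actually populated the cache, one can count the files under `DG_CACHE_DIR`. The sketch below makes no assumption about the cache layout or file names; it only walks the directory. `cache_summary` is a hypothetical helper, and the fallback path is the documented default.

```python
import os
from pathlib import Path


def cache_summary(cache_dir):
    """Return (file_count, total_bytes) for all regular files under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


cache_dir = os.environ.get("DG_CACHE_DIR", "./deep_gemm_cache")
count, size = cache_summary(cache_dir)
print(f"{cache_dir}: {count} files, {size / 1e6:.1f} MB")
```

A count of zero after a supposedly successful run usually means the tool and the check are looking at different directories, for example because `DG_CACHE_DIR` was set in only one of the two shells.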