<!-- Mirror of https://github.com/PaddlePaddle/FastDeploy.git — synced 2025-12-24 13:28:13 +08:00 -->
# bench: Benchmark Testing

## 1. bench latency: Offline Latency Test

### Parameters

| Parameter            | Description                                 | Default |
| -------------------- | ------------------------------------------- | ------- |
| --input-len          | Input sequence length (tokens)              | 32      |
| --output-len         | Output sequence length (tokens)             | 128     |
| --batch-size         | Batch size                                  | 8       |
| --n                  | Number of sequences generated per prompt    | 1       |
| --use-beam-search    | Whether to use beam search                  | False   |
| --num-iters-warmup   | Number of warmup iterations                 | 10      |
| --num-iters          | Number of actual test iterations            | 30      |
| --profile            | Whether to enable performance profiling     | False   |
| --output-json        | Path to save latency results as a JSON file | None    |
| --disable-detokenize | Whether to disable detokenization           | False   |

### Example

```bash
# Run latency benchmark on the inference engine
fastdeploy bench latency --model baidu/ERNIE-4.5-0.3B-Paddle
```
## 2. bench serve: Online Latency and Throughput Test

### Parameters

| Parameter         | Description                           | Default                |
| ----------------- | ------------------------------------- | ---------------------- |
| --backend         | Backend type                          | "openai-chat"          |
| --base-url        | Base URL of the server or API         | None                   |
| --host            | Host address                          | "127.0.0.1"            |
| --port            | Port                                  | 8000                   |
| --endpoint        | API endpoint path                     | "/v1/chat/completions" |
| --model           | Model name                            | Required               |
| --dataset-name    | Dataset name                          | "sharegpt"             |
| --dataset-path    | Path to dataset                       | None                   |
| --num-prompts     | Number of prompts to process          | 1000                   |
| --request-rate    | Requests per second                   | inf                    |
| --max-concurrency | Maximum concurrency                   | None                   |
| --top-p           | Sampling top-p (OpenAI backend)       | None                   |
| --top-k           | Sampling top-k (OpenAI backend)       | None                   |
| --temperature     | Sampling temperature (OpenAI backend) | None                   |

### Example

```bash
# Run online performance test
fastdeploy bench serve --backend openai-chat \
    --model baidu/ERNIE-4.5-0.3B-Paddle \
    --endpoint /v1/chat/completions \
    --host 0.0.0.0 \
    --port 8891 \
    --dataset-name EBChat \
    --dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
    --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
    --metric-percentiles 80,95,99,99.9,99.95,99.99 \
    --num-prompts 1 \
    --max-concurrency 1 \
    --save-result
```
## 3. bench throughput: Throughput Test

### Parameters

| Parameter            | Description                              | Default      |
| -------------------- | ---------------------------------------- | ------------ |
| --backend            | Inference backend                        | "fastdeploy" |
| --dataset-name       | Dataset name                             | "random"     |
| --model              | Model name                               | Required     |
| --input-len          | Input sequence length                    | None         |
| --output-len         | Output sequence length                   | None         |
| --prefix-len         | Prefix length                            | 0            |
| --n                  | Number of sequences generated per prompt | 1            |
| --num-prompts        | Number of prompts                        | 50           |
| --output-json        | Path to save results as a JSON file      | None         |
| --disable-detokenize | Whether to disable detokenization        | False        |
| --lora-path          | Path to LoRA adapter                     | None         |

### Example

```bash
# Run throughput benchmark on the inference engine
fastdeploy bench throughput --model baidu/ERNIE-4.5-0.3B-Paddle \
    --backend fastdeploy-chat \
    --dataset-name EBChat \
    --dataset-path /datasets/filtered_sharedgpt_2000_input_1136_output_200.json \
    --max-model-len 32768
```
## 4. bench eval: Online Task Evaluation

### Parameters

| Parameter         | Description                     | Default |
| ----------------- | ------------------------------- | ------- |
| --model, -m       | Model name                      | "hf"    |
| --tasks, -t       | List of evaluation tasks        | None    |
| --model_args, -a  | Model arguments                 | ""      |
| --num_fewshot, -f | Number of few-shot examples     | None    |
| --samples, -E     | Number of samples               | None    |
| --batch_size, -b  | Batch size                      | 1       |
| --device          | Device                          | None    |
| --output_path, -o | Output file path                | None    |
| --write_out, -w   | Whether to write output results | False   |

### Example

```bash
# Run task evaluation on an online service
fastdeploy bench eval --model local-completions \
    --model_args pretrained=./baidu/ERNIE-4.5-0.3B-Paddle,base_url=http://0.0.0.0:8490/v1/completions \
    --write_out \
    --tasks ceval-valid_accountant
```