FastDeploy/docs/cli/tokenizer.md
2025-10-28 10:35:11 +08:00

# tokenizer
## Description
The `tokenizer` subcommand encodes text into token sequences and decodes token sequences back into text. It can also display or export the model's vocabulary information. Both text and multimodal models are supported.
## Usage
```
fastdeploy tokenizer --model MODEL [--encode TEXT] [--decode TOKENS] [--vocab-size] [--info] [--vocab-export FILE]
```
## Parameters
| Parameter | Description | Default |
| ----------------------------- | ------------------------------------------------------------------------------ | ------- |
| --model, -m | Model path or name | None |
| --encode, -e | Encode text into a list of tokens | None |
| --decode, -d | Decode a list of tokens back into text | None |
| --vocab-size, -vs | Display the vocabulary size | None |
| --info, -i | Display detailed tokenizer information (special tokens, IDs, max length, etc.) | None |
| --vocab-export FILE, -ve FILE | Export the vocabulary to a file | None |
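The `--decode` examples in this document pass the token list as a single quoted, bracketed string such as `"[1, 2, 3]"`. When the IDs come from another script as separate values, a small helper can assemble that string. The sketch below is one possible approach; `format_token_ids` is a hypothetical name and is not part of FastDeploy.

```shell
# Hypothetical helper (not part of FastDeploy): join token IDs passed as
# separate arguments into the bracketed list form shown in the --decode examples.
format_token_ids() {
    _ids=""
    for _id in "$@"; do
        # Reject anything that is not a non-negative integer.
        case "$_id" in
            ''|*[!0-9]*) echo "not a token ID: $_id" >&2; return 1 ;;
        esac
        # Append with ", " as separator, except before the first ID.
        _ids="${_ids:+$_ids, }$_id"
    done
    printf '[%s]\n' "$_ids"
}

# Usage (requires FastDeploy to be installed):
#   fastdeploy tokenizer -m baidu/ERNIE-4.5-0.3B-Paddle \
#       --decode "$(format_token_ids 5300 96382)"
```

The helper only builds the argument string; it does not check that the IDs exist in the model's vocabulary.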
## Examples
```
# 1. Encode text into tokens
# Convert input text into a token sequence recognizable by the model
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --encode "Hello, world!"
# 2. Decode tokens into text
# Convert a token sequence back into readable text
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --decode "[1, 2, 3]"
# 3. View vocabulary size
# Output the total number of tokens in the model's vocabulary
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-size
# 4. View tokenizer details
# Includes special symbols, ID mappings, max token length, etc.
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --info
# 5. Export vocabulary to a file
# Save the tokenizer's vocabulary to a local file
fastdeploy tokenizer --model baidu/ERNIE-4.5-0.3B-Paddle --vocab-export ./vocab.txt
# 6. Support for multimodal models
# Decode tokens for a multimodal model
fastdeploy tokenizer --model baidu/EB-VL-Lite-d --decode "[5300, 96382]"
# 7. Combine multiple functions
# Encode, decode, view tokenizer info and vocabulary size, and export the vocabulary in a single command
fastdeploy tokenizer \
-m baidu/ERNIE-4.5-0.3B-PT \
-e "你好哇" \
-d "[5300, 96382]" \
-i \
-vs \
-ve vocab.json
```
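Every example above repeats the `--model` argument. For interactive use, a shell wrapper can fix the model once and forward the remaining options unchanged. This is a minimal sketch: the function name `fd_tok` and the variable `FD_TOKENIZER_MODEL` are assumptions made for illustration and are not defined by FastDeploy.

```shell
# Hypothetical convenience wrapper, not part of FastDeploy.
# FD_TOKENIZER_MODEL is an assumed variable name chosen for this sketch;
# it falls back to the model used in the examples above.
FD_TOKENIZER_MODEL="${FD_TOKENIZER_MODEL:-baidu/ERNIE-4.5-0.3B-Paddle}"

fd_tok() {
    # Forward all arguments unchanged to the tokenizer subcommand.
    fastdeploy tokenizer --model "$FD_TOKENIZER_MODEL" "$@"
}

# Usage (requires FastDeploy on PATH):
#   fd_tok --encode "Hello, world!"
#   fd_tok --vocab-size
#   fd_tok --decode "[5300, 96382]"
```

Because the arguments are passed through with `"$@"`, quoted values such as the text for `--encode` reach the CLI as single arguments.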