BPE tokenizer implemented entirely in Zig.
For a complete example of integrating the tokenizer with LLMs, see nnx-lm.
Requires Zig v0.13.0.
git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast
zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
zig build run -- [--model MODEL_NAME] COMMAND INPUT
zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"
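The argument order in the usage line above (optional `--model MODEL_NAME`, then `--encode` or `--decode`, then the input) can be sketched as a small Python helper that only assembles the argument list. The helper name `build_argv` and its parameters are hypothetical and not part of this project. It assumes the binary sits at `zig-out/bin/tokenizer_exe` as in the build step. It does not run the binary or parse its output, since the output format is not documented here.

```python
import shlex
from typing import List, Optional

# Path produced by `zig build exe --release=fast` (taken from the usage line above).
BINARY = "zig-out/bin/tokenizer_exe"


def build_argv(command: str, text: str, model: Optional[str] = None) -> List[str]:
    """Assemble [BINARY, [--model MODEL_NAME], --encode|--decode, INPUT].

    `build_argv` is a hypothetical helper for illustration only.
    """
    if command not in ("encode", "decode"):
        raise ValueError("command must be 'encode' or 'decode'")
    argv = [BINARY]
    if model is not None:
        # The optional model flag comes before the command, as in the usage line.
        argv += ["--model", model]
    argv += ["--" + command, text]
    return argv


if __name__ == "__main__":
    # Print the shell-quoted command lines without executing them.
    print(shlex.join(build_argv("encode", "hello world")))
    print(shlex.join(build_argv("decode", "15339 1917", model="phi-4-4bit")))
```

The resulting list can be passed to `subprocess.run` once the binary has been built.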
The tokenizer is also pip-installable for use from Python:
pip install tokenizerz
python
Usage:
>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
Directory 'Qwen2.5-Coder-0.5B' created successfully.
DL% UL% Dled Uled Xfers Live Total Current Left Speed
100 -- 6866k 0 1 0 0:00:01 0:00:01 --:--:-- 4904k
Download successful.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'
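The session above encodes a string and decodes the tokens back to the same text. A minimal sketch of that round-trip check, written against any object with `encode` and `decode` methods, might look like the following. `ByteTokenizer` is a hypothetical stand-in (one token per UTF-8 byte, not BPE and not part of `tokenizerz`) so that the block runs without the package installed or a model downloaded. `TokenizerLike` and `roundtrips` are likewise illustrative names.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Anything exposing the encode/decode pair used in the session above."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, tokens: List[int]) -> str: ...


class ByteTokenizer:
    """Hypothetical stand-in (NOT tokenizerz): one token per UTF-8 byte."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        return bytes(tokens).decode("utf-8")


def roundtrips(tokenizer: TokenizerLike, text: str) -> bool:
    """Return True if decoding the encoded text gives back the original."""
    return tokenizer.decode(tokenizer.encode(text)) == text


if __name__ == "__main__":
    for sample in ["Hello, world!", "naïve café", "def f(x):\n    return x"]:
        print(repr(sample), roundtrips(ByteTokenizer(), sample))
```

To run the same check against the real package, pass `tokenizerz.Tokenizer()` in place of `ByteTokenizer()`. Whether every input round-trips exactly with a given model is something to verify, not assume.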
Shell:
bpe --encode "hello world"