jaco-bro/tokenizer

Tokenizer

BPE tokenizer implemented entirely in Zig.
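
For readers new to byte-pair encoding, the idea can be sketched in a few lines of Python. This is a toy, character-level illustration and not this repository's Zig implementation: the function names (`train_bpe`, `merge_pair`, `encode_word`) and the tie-breaking rule are assumptions made for the example, and real tokenizers typically work on bytes and load a pretrained merge table instead of training one.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-separated words (toy version)."""
    # Every word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", 2)
print(merges)
print(encode_word("lower", merges))
```

A production tokenizer then maps each resulting symbol to an integer ID from a fixed vocabulary, which is what the `--encode` and `--decode` commands below operate on.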

For a complete example of integrating this tokenizer with LLMs, see nnx-lm.

Requirements

Zig v0.13.0

Install

git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast

Usage

  • zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
  • zig build run -- [--model MODEL_NAME] COMMAND INPUT

Examples:

zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"

Python (optional)

Tokenizer is also pip-installable for use from Python:

pip install tokenizerz
python

Usage:

>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
Directory 'Qwen2.5-Coder-0.5B' created successfully.
DL% UL%  Dled  Uled  Xfers  Live Total     Current  Left    Speed
100 --  6866k     0     1     0   0:00:01  0:00:01 --:--:-- 4904k
Download successful.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'
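
A quick way to sanity-check a tokenizer is to confirm that decoding the encoded IDs returns the original text. The helper below is a minimal sketch that relies only on the `encode` and `decode` methods shown in the session above. `roundtrip` and `ByteTokenizer` are hypothetical names introduced for illustration and are not part of tokenizerz; the stand-in class exists so the snippet runs without downloading a model.

```python
def roundtrip(tokenizer, text):
    """Encode then decode `text`; return (token_ids, decoded_text, matches_original)."""
    ids = list(tokenizer.encode(text))
    decoded = tokenizer.decode(ids)
    return ids, decoded, decoded == text


class ByteTokenizer:
    """Hypothetical stand-in exposing the same encode/decode shape (one ID per UTF-8 byte)."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


ids, decoded, ok = roundtrip(ByteTokenizer(), "Hello, world!")
print(len(ids), decoded, ok)
```

With the package installed, passing `tokenizerz.Tokenizer()` instead of `ByteTokenizer()` should work the same way, assuming its `encode` and `decode` behave as in the session above.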

Shell:

bpe --encode "hello world"