
✨ BPE-Tokenizer for university module Foundational Generative Models.


jonasliendl/bpe_tokenizer


BPE-Tokenizer

This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything in this repo and on a website I'm going to build, where you can also try out different versions of the tokenizer with different methods.

🛣️ Roadmap

Status Feature
Text Loader supporting .TXT files
Training process function
Tokenize text function
Export function for vocabulary
🕣 Website to show off tokenizer
Support Case-Sensitivity
Support UTF-8 text
☑️ Handle unknown characters using the Out-of-Vocabulary method
Train Tokenizer on English and German text
Add token decoding
☑️ Support additional languages like Dutch, Spanish, Polish, French, etc.
☑️ Support additional file formats like CSV or Excel for Text Loader
Convert to library
🕣 Include ability to add special tokens
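
To give a rough idea of what the training, tokenizing, out-of-vocabulary and decoding items on the roadmap involve, here is a minimal sketch of byte pair encoding in Python. It is not the code from this repo: the function names (`train_bpe`, `apply_merge`, `tokenize`, `decode`), the `</w>` end-of-word marker and the `<unk>` token are all assumptions made for illustration, and the sketch splits text on whitespace only.

```python
from collections import Counter

END = "</w>"   # hypothetical end-of-word marker, so decoding can restore spaces
UNK = "<unk>"  # hypothetical out-of-vocabulary token


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules; return (merges, vocab)."""
    # Each word starts as a tuple of single characters plus the end marker.
    words = Counter(tuple(word) + (END,) for word in text.split())
    vocab = {symbol for symbols in words for symbol in symbols}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite every word with the new merge applied.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[apply_merge(symbols, best)] += freq
        words = new_words
    return merges, vocab


def tokenize(text, merges, vocab):
    """Apply the learned merges in order; unknown symbols become UNK."""
    tokens = []
    for word in text.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(s if s in vocab else UNK for s in symbols)
    return tokens


def decode(tokens):
    """Join tokens back into text, turning end markers into spaces."""
    return "".join(tokens).replace(END, " ").strip()


merges, vocab = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(tokenize("lower low", merges, vocab))
print(decode(tokenize("lower low", merges, vocab)))
```

Because the sketch works on Python strings, it is case-sensitive and handles any Unicode text by default. Decoding is lossy wherever a character was replaced by `<unk>`.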
