This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything in this repo and on a website I'm going to build, where you can also test different versions of the tokenizer with different methods.

Status | Feature |
---|---|
✅ | Text Loader supporting .TXT files |
✅ | Training process function |
✅ | Tokenize text function |
✅ | Export function for vocabulary |
🕣 | Website to show off tokenizer |
✅ | Support Case-Sensitivity |
✅ | Support UTF-8 text |
☑️ | Handle unknown characters using the Out-of-Vocabulary method |
✅ | Train Tokenizer on English and German text |
✅ | Add token decoding |
☑️ | Support additional languages like Dutch, Spanish, Polish, French, etc. |
☑️ | Support additional file formats like CSV or Excel for Text Loader |
✅ | Convert to library |
🕣 | Include ability to add special tokens |
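
For readers new to byte-pair encoding, the training, tokenizing, and decoding features listed above can be illustrated with a minimal character-level sketch. This is not the code in this repo: the function names `train_bpe`, `apply_merge`, `tokenize`, and `decode` are hypothetical, and the sketch assumes the whole text is treated as one sequence of characters with no pre-splitting into words and no out-of-vocabulary handling.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules (hypothetical helper, not the repo's API)."""
    symbols = list(text)  # start from single characters; case is preserved
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so further merges add nothing
            break
        merges.append(best)
        symbols = apply_merge(symbols, best)
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in training order, to new text."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def decode(tokens):
    """Concatenating the tokens restores the original string."""
    return "".join(tokens)


merges = train_bpe("low lower lowest", 3)
tokens = tokenize("lowest", merges)
print(tokens)
print(decode(tokens))
```

In this sketch, characters that never appeared in the training text simply stay as single-character tokens, so decoding always round-trips. A tokenizer with a fixed vocabulary would need an explicit unknown token or a byte-level fallback instead.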