This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything here in this repo and on a website I'm going to build, where you can also test different versions of the tokenizer with different methods.

| Status | Feature |
|---|---|
| ✅ | Text Loader supporting .TXT files |
| ✅ | Training process function |
| ✅ | Tokenize text function |
| ✅ | Export function for vocabulary |
| 🕣 | Website to show off tokenizer |
| ✅ | Support Case-Sensitivity |
| ✅ | Support UTF-8 text |
| ☑️ | Handle unknown characters using the Out-of-Vocabulary method |
| ✅ | Train Tokenizer on English and German text |
| ✅ | Add token decoding |
| ☑️ | Support additional languages like Dutch, Spanish, Polish, French, etc. |
| ☑️ | Support additional file formats like CSV or Excel for Text Loader |
| ✅ | Convert to library |
| 🕣 | Include ability to add special tokens |
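
For readers new to byte-pair encoding, here is a minimal sketch of the general idea behind the training and tokenize features listed above. It is not this repository's code: the function names `train_bpe`, `merge_pair`, and `tokenize_word` are hypothetical, and the sketch assumes whitespace-separated words and character-level starting symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)                           # [('l', 'o'), ('lo', 'w')]
print(tokenize_word("lowest", merges))  # ['low', 'e', 's', 't']
```

Because each token is just a concatenation of characters, joining the tokens of a word gives back the original word, which is the basis for decoding in this simplified setting.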