BPE-Tokenizer

This is a simple BPE tokenizer I created for the elective module Foundational Generative Models at my university. I will document everything in this repo and on a website I'm going to build, where you can also try out different versions of the tokenizer with different methods.

🛣️ Roadmap

Status	Feature
✅	Text Loader supporting .TXT files
✅	Training process function
✅	Tokenize text function
✅	Export function for vocabulary
🕣	Website to show off tokenizer
✅	Support Case-Sensitivity
✅	Support UTF-8 text
☑️	Handle unknown characters using the Out-of-Vocabulary method
✅	Train Tokenizer on English and German text
✅	Add token decoding
☑️	Support additional languages like Dutch, Spanish, Polish, French, etc.
☑️	Support additional file formats like CSV or Excel for Text Loader
✅	Convert to library
🕣	Include ability to add special tokens
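
For readers new to byte pair encoding, here is a minimal sketch of the three core steps from the roadmap: training, tokenizing, and decoding. It is not this crate's actual API. The function names (`train`, `encode`, `decode`, `count_pairs`, `merge_pair`) are hypothetical, and the sketch assumes a byte-level variant where ids 0 to 255 are raw UTF-8 bytes and learned merges get ids from 256 upward.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn count_pairs(tokens: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` with `new_id`,
/// scanning from left to right.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges over the UTF-8 bytes of `text`.
/// (Hypothetical helper: the merge at index i produces the id 256 + i.)
fn train(text: &str, num_merges: usize) -> Vec<(u32, u32)> {
    let mut tokens: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        let counts = count_pairs(&tokens);
        // Most frequent pair wins; ties go to the smallest pair so that
        // the result does not depend on HashMap iteration order.
        let best = counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(pair, count)| (*pair, *count));
        match best {
            // Stop early once no pair occurs at least twice.
            Some((pair, count)) if count >= 2 => {
                tokens = merge_pair(&tokens, pair, 256 + step as u32);
                merges.push(pair);
            }
            _ => break,
        }
    }
    merges
}

/// Tokenize `text` by applying the learned merges in training order.
fn encode(text: &str, merges: &[(u32, u32)]) -> Vec<u32> {
    let mut tokens: Vec<u32> = text.bytes().map(u32::from).collect();
    for (i, pair) in merges.iter().enumerate() {
        tokens = merge_pair(&tokens, *pair, 256 + i as u32);
    }
    tokens
}

/// Turn token ids back into text by expanding merges down to raw bytes.
fn decode(tokens: &[u32], merges: &[(u32, u32)]) -> String {
    fn expand(id: u32, merges: &[(u32, u32)], out: &mut Vec<u8>) {
        if id < 256 {
            out.push(id as u8);
        } else {
            let (left, right) = merges[(id - 256) as usize];
            expand(left, merges, out);
            expand(right, merges, out);
        }
    }
    let mut bytes = Vec::new();
    for &token in tokens {
        expand(token, merges, &mut bytes);
    }
    String::from_utf8_lossy(&bytes).into_owned()
}
```

With this sketch, `train("aaabdaaabac", 3)` learns the merges `(97, 97)`, `(97, 98)` and `(256, 257)`, and encoding the same text yields `[258, 100, 258, 97, 99]`. Working on bytes instead of characters means any UTF-8 input can be encoded without an unknown token, which is a different design from the Out-of-Vocabulary handling listed in the roadmap.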

