This repository demonstrates key steps in natural language processing and information retrieval using Turkish text data. It covers text tokenization, TF-IDF vectorization, and sparse matrix generation.
📌 This project was submitted as a midterm replacement assignment by Emirkan Beyaz for the course "BGG - Bilgi Geri Getirimine Giriş (Introduction to Information Retrieval)" taken during the 2023–2024 Spring Semester, taught by Asst. Prof. Tolga Berber.
- Process Turkish textual data.
- Apply BPE-based tokenization.
- Calculate term frequencies and inverse document frequencies.
- Construct a sparse TF-IDF matrix for information retrieval purposes.
- Data Loading
- Text Preprocessing
- Text Fragmentation
- Numerical Encoding
- TF-IDF Calculation
- Normalization
- Sparse Matrix Conversion
- Fullness Rate Calculation
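
The later steps in the list above (numerical encoding through fullness rate) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in this repository. The toy documents, the `tfidf_row` helper, the `tf * log(N / df)` weighting, the L2 normalization, and the list-of-dicts sparse layout are all assumptions chosen for readability. "Fullness rate" is interpreted here as the fraction of non-zero cells in the matrix.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical tokens, not the course dataset).
docs = [
    ["hava", "bugün", "güzel"],
    ["bugün", "maç", "var"],
    ["hava", "hava", "soğuk"],
]

# Numerical encoding: map each distinct token to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Document frequency: number of documents that contain each token.
df = Counter(tok for d in docs for tok in set(d))
n_docs = len(docs)


def tfidf_row(doc):
    """Return one L2-normalized TF-IDF row as a {column_index: weight} dict."""
    counts = Counter(doc)
    row = {
        vocab[tok]: (cnt / len(doc)) * math.log(n_docs / df[tok])
        for tok, cnt in counts.items()
    }
    # Drop exact zeros (tokens that occur in every document) to keep the row sparse.
    row = {j: w for j, w in row.items() if w != 0.0}
    norm = math.sqrt(sum(w * w for w in row.values()))
    return {j: w / norm for j, w in row.items()} if norm else row


# Sparse matrix as a list of row dictionaries; only non-zero cells are stored.
matrix = [tfidf_row(d) for d in docs]

# Fullness rate (assumed definition): stored non-zero cells divided by all cells.
nonzero = sum(len(r) for r in matrix)
fullness = nonzero / (n_docs * len(vocab))
print(f"vocab={len(vocab)} nonzero={nonzero} fullness={fullness:.2%}")
```

On a real corpus the vocabulary is much larger than any single document, so the fullness rate is typically very small. That is the motivation for storing only the non-zero cells.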
- 🧠 BPE-32 Tokenizer (Turkish): Byte-Pair Encoding tokenizer used for efficient subword tokenization, created in class under the guidance of Asst. Prof. Tolga Berber.
- 📰 Turkish News Dataset: A set of Turkish news articles in `.txt` format, provided as course material.
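
For readers unfamiliar with Byte-Pair Encoding, the idea can be illustrated with a toy example: repeatedly merge the most frequent adjacent symbol pair, then replay the learned merges on new words. This sketch does not reproduce the course tokenizer. The function names `learn_bpe`, `merge_pair`, and `encode`, the word list, and the number of merges are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; break ties deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Split `word` into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training words; a real tokenizer is trained on a large corpus.
words = ["evler", "evde", "evden", "okullar", "okulda"]
merges = learn_bpe(words, 4)
print(merges)
print(encode("evlerde", merges))
```

Because merges only join adjacent symbols, concatenating the output tokens always gives back the original word, including words that did not appear in the training list.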
If you use this repository, any part of the code, or the provided materials in your own projects, research, or publications, please cite it or give proper credit as follows:
Emirkan Beyaz, "Text Processing and Information Retrieval Project", GitHub Repository, 2025.
Please cite this repository or mention the author when using the code or project methodology.
The Turkish dataset and BPE tokenizer used in this project were provided by:
Asst. Prof. Tolga Berber, "BPE Tokenizer and Turkish News Dataset",
Karadeniz Technical University, Department of Statistics and Computer Science, 2024.
If you use these resources, please cite or acknowledge Tolga Berber accordingly.
For any questions, suggestions, or issues, feel free to open an issue or contact me via GitHub.