🗞️ Text Processing and Information Retrieval

This repository demonstrates key steps in natural language processing and information retrieval using Turkish text data. It covers text tokenization, TF-IDF vectorization, and sparse matrix generation.

📌 This project was submitted as a midterm replacement assignment by Emirkan Beyaz for the course "BGG - Bilgi Geri Getirimine Giriş (Introduction to Information Retrieval)" taken during the 2023–2024 Spring Semester, taught by Asst. Prof. Tolga Berber.

🎯 Project Objectives

Process Turkish textual data.
Apply BPE-based tokenization.
Calculate term frequencies and inverse document frequencies.
Construct a sparse TF-IDF matrix for information retrieval purposes.
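
The term-frequency and inverse-document-frequency objectives above can be illustrated with a minimal sketch. It is not taken from `main.py`: the function name `tf_idf`, the toy documents, and the weighting scheme (length-normalized term frequency multiplied by `log(N / df)`) are assumptions made for illustration, and the project may use a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per document.

    docs is a list of token lists. Assumed weighting (hypothetical,
    not confirmed by the project): tf = count / document length,
    idf = log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights


# Toy, already-tokenized documents (illustrative only).
docs = [
    ["ekonomi", "haber", "haber"],
    ["spor", "haber"],
    ["ekonomi", "borsa"],
]
weights = tf_idf(docs)
```

With this scheme a token that occurs in every document receives a weight of zero, which is why very common tokens contribute little to retrieval.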

🧩 Project Structure

Data Loading
Text Preprocessing
Text Fragmentation
Numerical Encoding
TF-IDF Calculation
Normalization
Sparse Matrix Conversion
Fullness Rate Calculation (share of non-zero entries in the matrix)
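
The last three steps of the pipeline above (normalization, sparse matrix conversion, fullness rate) might look roughly like the following sketch. It uses only the standard library; the helper names `l2_normalize`, `to_csr`, and `fullness_rate` are hypothetical, L2 normalization is an assumed choice, and the fullness rate is assumed to mean the fraction of non-zero cells. The actual `main.py` may rely on a library such as SciPy instead.

```python
import math


def l2_normalize(row):
    """Scale a {column: weight} row to unit Euclidean length (assumed norm)."""
    norm = math.sqrt(sum(w * w for w in row.values()))
    return {j: w / norm for j, w in row.items()} if norm else dict(row)


def to_csr(rows):
    """Convert a list of {column: weight} rows to CSR-style arrays.

    Returns (data, indices, indptr): non-zero values, their column
    indices, and the offsets where each row starts in data.
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j in sorted(row):
            if row[j] != 0.0:
                data.append(row[j])
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr


def fullness_rate(data, n_rows, n_cols):
    """Fraction of matrix cells that hold a non-zero value."""
    if n_rows == 0 or n_cols == 0:
        return 0.0
    return len(data) / (n_rows * n_cols)


# Toy TF-IDF rows over a 4-token vocabulary (illustrative values only).
rows = [{0: 3.0, 2: 4.0}, {1: 2.0}, {}]
n_cols = 4

normalized = [l2_normalize(r) for r in rows]
data, indices, indptr = to_csr(normalized)
rate = fullness_rate(data, len(rows), n_cols)
```

Storing only the non-zero values keeps memory proportional to the number of token occurrences rather than to documents times vocabulary size, which matters because TF-IDF matrices over news text are typically very sparse.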

🛠️ Tools & Resources Used

🧠 BPE-32 Tokenizer (Turkish)
A Byte-Pair Encoding (BPE) tokenizer used for efficient subword tokenization, created in class under the guidance of Asst. Prof. Tolga Berber.
📰 Turkish News Dataset
A set of Turkish news articles in .txt format provided as course material.
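
For readers unfamiliar with Byte-Pair Encoding, the idea behind the tokenizer can be sketched as follows. This is a toy, character-level illustration only: it does not load or reproduce the BPE-32 tokenizer shipped with this repository, and the names `learn_bpe`, `merge_pair`, and `encode` as well as the sample words are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the first one seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subword tokens by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training words (illustrative only, not from the course dataset).
merges = learn_bpe(["haber", "haberler", "haberci"], num_merges=3)
tokens = encode("haberler", merges)
```

Frequent character sequences end up as single tokens, while rare words fall back to smaller pieces, so the vocabulary stays bounded even for a morphologically rich language such as Turkish.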

📌 Citation / Attribution

If you use this repository, any part of the code, or the provided materials in your own projects, research, or publications, please cite or give proper credit as follows:

🔹 Code & Project

Emirkan Beyaz, "Text Processing and Information Retrieval Project", GitHub Repository, 2025.
Please cite this repository or mention the author when using the code or project methodology.

🔹 Dataset & Tokenizer Attribution

The Turkish dataset and BPE tokenizer used in this project were provided by:

Asst. Prof. Tolga Berber, "BPE Tokenizer and Turkish News Dataset",
Karadeniz Technical University, Department of Statistics and Computer Science, 2024.
If you use these resources, please cite or acknowledge Tolga Berber accordingly.

📬 Contact

For any questions, suggestions, or issues, feel free to open an issue or contact me via GitHub.
