Hords01/Data_Mining

🗞️ Text Processing and Information Retrieval

This repository demonstrates key steps in natural language processing and information retrieval using Turkish text data. It covers text tokenization, TF-IDF vectorization, and sparse matrix generation.

📌 This project was submitted as a midterm replacement assignment by Emirkan Beyaz for the course "BGG - Bilgi Geri Getirimine Giriş (Introduction to Information Retrieval)" taken during the 2023–2024 Spring Semester, taught by Asst. Prof. Tolga Berber.


🎯 Project Objectives

  • Process Turkish textual data.
  • Apply BPE-based tokenization.
  • Calculate term frequencies and inverse document frequencies.
  • Construct a sparse TF-IDF matrix for information retrieval purposes.
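For reference, one common way to define the quantities named in the objectives is given below. The README does not state which TF-IDF variant the project uses, so the relative-frequency TF and the unsmoothed natural-log IDF are assumptions. Here $f_{t,d}$ is the count of token $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this definition, a token that occurs in every document has $\mathrm{idf}(t) = 0$ and contributes nothing to the matrix, which is one reason the resulting TF-IDF matrix tends to be sparse.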

🧩 Project Structure

  1. Data Loading
  2. Text Preprocessing
  3. Text Fragmentation (BPE tokenization)
  4. Numerical Encoding
  5. TF-IDF Calculation
  6. Normalization
  7. Sparse Matrix Conversion
  8. Fullness Rate Calculation
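
Steps 4 to 8 can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the sample documents, the `tfidf_row` helper, the TF and IDF variants, the L2 normalization, the hand-built CSR layout, and the reading of "fullness rate" as the share of non-zero entries are all assumptions.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (stand-ins for tokenizer output).
docs = [
    ["ekonomi", "haber", "ekonomi"],
    ["spor", "haber"],
    ["spor", "maç", "spor", "gol"],
]

# Step 4: numerical encoding - map each distinct token to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Step 5: TF-IDF, assuming tf = count / document length and idf = ln(N / df).
n_docs = len(docs)
df = Counter(tok for d in docs for tok in set(d))
idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}


def tfidf_row(tokens):
    """Return one document as {column index: L2-normalized TF-IDF weight}."""
    counts = Counter(tokens)
    row = {vocab[t]: (c / len(tokens)) * idf[t] for t, c in counts.items()}
    # Step 6: normalization - scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in row.values()))
    if norm == 0.0:
        return {}
    # Zero weights are dropped so that only non-zero entries are stored.
    return {j: v / norm for j, v in row.items() if v != 0.0}


# Step 7: sparse conversion - build CSR arrays (data, indices, indptr) by hand.
data, indices, indptr = [], [], [0]
for tokens in docs:
    row = tfidf_row(tokens)
    for j in sorted(row):
        indices.append(j)
        data.append(row[j])
    indptr.append(len(data))

# Step 8: fullness rate - stored non-zero entries divided by total cells.
fullness = len(data) / (n_docs * len(vocab))
print(f"{len(data)} non-zero values in a {n_docs}x{len(vocab)} matrix")
print(f"fullness rate: {fullness:.4f}")
```

In a real project the same three arrays would usually come from a library sparse type instead of being built by hand. The manual version is shown only to make the storage layout visible: row `i` occupies `data[indptr[i]:indptr[i + 1]]`, with matching column indices in `indices`.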

🛠️ Tools & Resources Used

  • 🧠 BPE-32 Tokenizer (Turkish)
    A Byte-Pair Encoding tokenizer used for efficient subword tokenization, created in class under the guidance of Asst. Prof. Tolga Berber.

  • 📰 Turkish News Dataset
    A set of Turkish news articles in .txt format provided as course material.
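
To illustrate the idea behind Byte-Pair Encoding, the toy sketch below learns merge rules from a handful of words and applies them to a new word. It is not the course's BPE-32 tokenizer, whose vocabulary and interface are not described here. The function names `learn_bpe`, `merge_pair`, and `encode`, the sample words, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Split `word` into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["haber", "haberler", "haberci", "spor"], num_merges=4)
tokens = encode("haberde", merges)
print(tokens)  # ['haber', 'd', 'e']
```

The unseen word "haberde" is split into the learned subword "haber" plus leftover characters, which is the behavior that makes subword tokenization useful for a morphologically rich language such as Turkish.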


📌 Citation / Attribution

If you use this repository, any part of the code, or the provided materials in your own projects, research, or publications, please cite or credit them as follows:

🔹 Code & Project

Emirkan Beyaz, "Text Processing and Information Retrieval Project", GitHub Repository, 2025.
Please cite this repository or mention the author when using the code or project methodology.

🔹 Dataset & Tokenizer Attribution

The Turkish dataset and BPE tokenizer used in this project were provided by:

Asst. Prof. Tolga Berber, "BPE Tokenizer and Turkish News Dataset",
Karadeniz Technical University, Department of Statistics and Computer Science, 2024.
If you use these resources, please cite or acknowledge Tolga Berber accordingly.


📬 Contact

For any questions, suggestions, or issues, feel free to open an issue or contact me via GitHub.