This project performs author identification on a dataset of Turkish newspaper columns written by 18 different authors.
The dataset comprises 630 Turkish newspaper columns, each written by one of the 18 authors. The data is sourced from "news-paper".
The project employs the following machine learning algorithms for author identification:
- BERT
- LSTM
- XGBoost
- Decision Tree
- Naive Bayes
- Random Forest
- KNN (K-Nearest Neighbors)
- Gradient Boosting
- SGD (Stochastic Gradient Descent)
- SVM (Support Vector Machine)
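
As a rough illustration of how the classical algorithms in this list could be set up side by side, here is a minimal sketch using scikit-learn. It assumes TF-IDF features and default-style hyperparameters, neither of which is confirmed by this README. `build_models` and the toy texts are hypothetical, `LinearSVC` stands in for "SVM", and BERT, LSTM, and XGBoost are left out because they need separate libraries.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return one TF-IDF + classifier pipeline per classical algorithm."""
    classifiers = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Naive Bayes": MultinomialNB(),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        # n_neighbors=1 only because the toy dataset below is tiny.
        "KNN": KNeighborsClassifier(n_neighbors=1),
        "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=seed),
        "SGD": SGDClassifier(random_state=seed),
        # Assumption: a linear SVM; the README does not name the kernel.
        "SVM": LinearSVC(random_state=seed),
    }
    return {name: make_pipeline(TfidfVectorizer(), clf) for name, clf in classifiers.items()}


# Toy stand-in for the column texts and their author labels (made up).
texts = [
    "ekonomi ve piyasa üzerine bir yazı",
    "piyasa faiz ve ekonomi haberleri",
    "futbol maçı ve takım üzerine yorum",
    "takım futbol ligi maç sonucu",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

models = build_models()
predictions = {}
for name, model in models.items():
    model.fit(texts, authors)
    predictions[name] = model.predict(["ekonomi ve faiz"])[0]
```

Wrapping the vectorizer and the classifier in one pipeline means each model is fitted and applied on raw text, so the same loop works for every algorithm.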
The project follows these steps:
- Loading and Preprocessing the Dataset
- Implementation and Training of Each Algorithm
- Evaluation of Model Performance
- Selection of the Best Performing Models
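
The evaluation and selection steps can be sketched as a single loop: split the data once, train every candidate, and rank the candidates by test accuracy. This is a sketch under stated assumptions, not the project's actual code. The stratified 80/20 split, the `rank_models` helper, the two example models, and the toy data are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def rank_models(models, texts, labels, test_size=0.2, seed=0):
    """Train every model on one stratified split and rank by test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(x_test))
    # Highest accuracy first, so the best models sit at the front of the list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data standing in for the 630 columns (made up for illustration).
texts = ["ekonomi piyasa faiz"] * 5 + ["futbol takım maç"] * 5
labels = ["author_a"] * 5 + ["author_b"] * 5

candidates = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM": make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0)),
}
ranking = rank_models(candidates, texts, labels)
best_name, best_accuracy = ranking[0]
```

Stratifying the split keeps every author represented in both the training and test portions, which matters when each author contributes only a few dozen columns.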
Of the algorithms evaluated, BERT and SVM achieved the highest accuracy scores. Below are the confusion matrices and accuracy values for these models:
Confusion Matrix:
Accuracy: 0.9793650793650793
Confusion Matrix:
Accuracy: 0.8492063492063492
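
For readers unfamiliar with these two measures, the following sketch shows how a confusion matrix and the accuracy derived from it can be computed in plain Python. The helper names and the three-author toy labels are hypothetical and unrelated to the project's real predictions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true authors and columns are predicted authors."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for three authors, two columns each.
labels = ["author_a", "author_b", "author_c"]
y_true = ["author_a", "author_a", "author_b", "author_b", "author_c", "author_c"]
y_pred = ["author_a", "author_b", "author_b", "author_b", "author_c", "author_a"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)            # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(matrix))  # 0.6666666666666666
```

Off-diagonal cells show which authors are mistaken for one another, which the single accuracy number hides.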
The Turkish stopwords used in this project were obtained from countwordsfree.
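
A minimal sketch of how such a stopword list might be applied during preprocessing is shown below. The stopword set is a small illustrative sample, not the countwordsfree list itself. The Turkish-aware lowercasing step and both helper names are assumptions, since this README does not describe the exact preprocessing.

```python
import re

# Illustrative sample only; the project uses the full countwordsfree list.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da"}


def turkish_lower(text):
    """Lowercase text, mapping I to dotless ı and İ to dotted i as Turkish does."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def remove_stopwords(text, stopwords=TURKISH_STOPWORDS):
    """Lowercase, split into word tokens, and drop the stopwords."""
    tokens = re.findall(r"\w+", turkish_lower(text))
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("Bu yazı İstanbul ve Ankara için IŞIK tutuyor."))
# ['yazı', 'istanbul', 'ankara', 'ışık', 'tutuyor']
```

The explicit replacements come first because Python's built-in `str.lower()` turns `I` into `i` and `İ` into `i` followed by a combining dot, neither of which matches Turkish spelling.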
Special thanks to my friend Levent Demirkaya for their valuable contributions to this project.