This project fine-tunes a pretrained DistilBERT model to check whether a sentence sounds toxic. It trains on a small dataset of online comments and performs binary classification: toxic or not toxic.
```
├── main.py              # Trains the model
├── predict.py           # Runs the trained model
├── download_dataset.py  # Downloads the dataset
└── model/               # Saved model after training (ignored by git)
```
Dataset: Jigsaw Toxic Comment Classification Challenge from Kaggle. Only `train.csv` is needed; `download_dataset.py` downloads it into the project folder.
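The Jigsaw training file labels each comment with six toxicity columns (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`), so a binary target has to be derived from them. A minimal sketch of that step (the column names match the Kaggle dataset; the collapsing rule "toxic if any label is set" is an assumption about how `main.py` prepares its labels):

```python
import pandas as pd

# Tiny stand-in for train.csv from the Jigsaw challenge (same column names).
df = pd.DataFrame({
    "comment_text": ["you are great", "you are an idiot"],
    "toxic": [0, 1], "severe_toxic": [0, 0], "obscene": [0, 1],
    "threat": [0, 0], "insult": [0, 1], "identity_hate": [0, 0],
})

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# A comment counts as toxic if any of the six labels is set.
df["label"] = (df[label_cols].max(axis=1) > 0).astype(int)
print(df["label"].tolist())  # → [0, 1]
```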
- Install the libraries:
  ```
  pip install transformers datasets scikit-learn pandas accelerate kagglehub
  ```
- Download the dataset:
  ```
  python download_dataset.py
  ```
- Train the model:
  ```
  python main.py
  ```
- Check the tone of a sentence:
  ```
  python predict.py
  ```
Then type a sentence in the console and the model will report whether it is toxic.
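Under the hood, `predict.py` presumably tokenizes the input, runs the classifier, and maps the larger logit to a label. That final logits-to-label step can be sketched without loading the model (the label names, index order, and example logits below are illustrative assumptions, not taken from the trained model):

```python
import math

LABELS = ["not toxic", "toxic"]  # assumed index order: 0 = not toxic, 1 = toxic

def logits_to_label(logits):
    # Softmax gives a readable confidence score; argmax picks the label.
    shifted = [x - max(logits) for x in logits]  # subtract max for stability
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    idx = probs.index(max(probs))
    return LABELS[idx], probs[idx]

# Example logits, as a two-class model might produce for a rude sentence.
label, confidence = logits_to_label([-1.2, 2.3])
print(label)  # → toxic
```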