A hate classifier for low-resource Indian languages
This project uses datasets collected from various research papers and AI workshops.
| Column | Description | Format |
|---|---|---|
| UID | Unique identifier that traces each record back to its source dataset and serves as the dataset index. | `<language_code><train/test/val>_<index_number>` |
| text | The text content used by the classifier. | UTF-8 encoded text |
| label_yn | Binary label indicating whether the text is classified as hate or non-hate in its source dataset. | 1 = hate, 0 = non-hate |
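
The schema above can be checked with a few lines of Python. This is a minimal sketch, not code from the project: the helper names `parse_uid` and `validate_row` are hypothetical, and the regular expression assumes that the language code is a run of lowercase letters immediately followed by the split name, which the format column suggests but does not state explicitly.

```python
import re

# Assumed UID layout: lowercase language code, split name, underscore, index.
UID_PATTERN = re.compile(
    r"^(?P<lang>[a-z]+?)(?P<split>train|test|val)_(?P<index>\d+)$"
)


def parse_uid(uid):
    """Split a UID into (language_code, split, index); raise ValueError if malformed."""
    match = UID_PATTERN.match(uid)
    if match is None:
        raise ValueError(f"malformed UID: {uid!r}")
    return match["lang"], match["split"], int(match["index"])


def validate_row(row):
    """Check one record (a dict with UID, text, label_yn) against the schema."""
    parse_uid(row["UID"])
    if not isinstance(row["text"], str) or not row["text"].strip():
        raise ValueError("text must be a non-empty string")
    if int(row["label_yn"]) not in (0, 1):
        raise ValueError("label_yn must be 0 (non-hate) or 1 (hate)")
    return True
```

For example, `parse_uid("hitrain_12")` splits a made-up UID into its language code, split, and index, and `validate_row` raises `ValueError` for any record whose label is not 0 or 1.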
The datasets used for each language are listed below.
- https://github.com/ShareChatAI/MACD/tree/main/dataset
- https://competitions.codalab.org/competitions/26654
- https://github.com/hate-alert/HateCheckHIn/blob/main/monolingual_functionalities.csv
- https://hasocfire.github.io/hasoc/2019/dataset.html
- https://github.com/Kalit31/HASOC-2021/tree/main/data/marathi
- https://github.com/TharinduDR/DeepOffense/tree/master/examples/marathi/data
- https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaHate/2-class
- Please install Poetry
- Run `poetry install` in the project root to install the required dependencies.
- Run `poetry shell` to create a new Poetry shell.
- Run `jupyter notebook` to start the Jupyter Notebook server.