Repository for the final project of "Practical Machine Learning for Natural Language Processing" (University of Vienna, Summer Semester 2024).
This project presents a comparative analysis of modern Natural Language Processing (NLP) methodologies for the task of detecting sexist language in tweets. It evaluates the performance and cost of five distinct approaches:
- Fine-Tuned BERT Model for sequence classification. ('germeval.ipynb')
- Zero-Shot and Few-Shot Prompting using DeepSeek and GPT APIs. ('comparisonAPI.ipynb')
- Retrieval-Augmented Generation (RAG) using a VectorStoreIndex. ('comparisonVectorStoreIndex.ipynb')
- Retrieval-Augmented Generation (RAG) using a KeywordTableIndex. ('comparisonKeywordTableIndex.ipynb')
- Combining RAG pipelines with few-shot prompting. ('comparisonAPIRAG.ipynb')
Detecting sexist language is a key challenge in studying online discourse. Automated methods can help researchers analyze large-scale corpora of social media data to understand the prevalence and nature of sexist rhetoric. However, the nuanced and context-dependent nature of such language makes it a difficult classification problem.
The performance of each method was evaluated and compared on a standardized dataset (with dedicated train, development, and test sets) using the F1-score as the primary metric. The implementation details for each approach are as follows:
- Fine-Tuned BERT (bert-base-uncased): Implemented using the Hugging Face transformers library, optimized for sequence classification on our specific dataset.
- API Prompting (DeepSeek & GPT): Leveraged zero-shot and few-shot prompting strategies to query large language models via their APIs for classification, testing their ability to generalize with little to no task-specific training.
- RAG Pipeline (VectorStoreIndex): Implemented a RAG system using LlamaIndex's VectorStoreIndex to retrieve the most semantically similar examples from the training set to provide context for an LLM's classification decision.
- RAG Pipeline (KeywordTableIndex): Implemented an alternative RAG system using a KeywordTableIndex to retrieve examples based on keyword matching, providing a different retrieval strategy for comparison.
- Combined RAG and API Prompting: Paired the two retrieval strategies above (VectorStoreIndex and KeywordTableIndex) with few-shot prompting of the DeepSeek and GPT APIs, so that retrieved training examples provide the in-context demonstrations.
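To illustrate the few-shot prompting strategy described above, the sketch below assembles a classification prompt with labelled in-context examples before the tweet to classify. The function name, example tweets, and label strings are hypothetical and only illustrate the idea; they are not taken from the project's actual notebooks or dataset.

```python
# Hypothetical sketch of few-shot prompt construction for sexism
# classification. Example tweets and labels are invented for
# illustration; the real prompts live in comparisonAPI.ipynb.

FEW_SHOT_EXAMPLES = [
    ("Women belong in the kitchen, not in the office.", "sexist"),
    ("The weather in Vienna is lovely today.", "not sexist"),
]

def build_few_shot_prompt(tweet: str) -> str:
    """Assemble a prompt that inlines labelled examples, then asks
    the model to label the new tweet."""
    lines = ["Classify the following tweet as 'sexist' or 'not sexist'.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Tweet: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Tweet: {tweet}")
    lines.append("Label:")  # the LLM completes this line with its prediction
    return "\n".join(lines)

prompt = build_few_shot_prompt("Some tweet to classify.")
```

The resulting string would be sent as the user message of an API request; a zero-shot variant simply omits the example pairs.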
The best overall performance was achieved by the API-based approaches (particularly DeepSeek), both with and without RAG. The fine-tuned BERT model achieved middling results, while the pipelines relying on RAG alone yielded the lowest F1-scores.
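For reference, the F1-score used as the primary metric above can be computed for the positive ("sexist") class as follows. This is a minimal plain-Python sketch with hypothetical label strings; the project would more likely use `sklearn.metrics.f1_score` in practice.

```python
def f1_score(y_true, y_pred, positive="sexist"):
    """Binary F1 for the positive class: harmonic mean of
    precision and recall over true/false positives and negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with one true positive, one false positive, and one false negative, precision and recall are both 0.5, giving an F1 of 0.5.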