A comparative analysis of Steam reviews for games with male vs. female protagonists, exploring sentiment, topics, and classification potential.
This project analyzes user reviews from Steam games featuring male and female protagonists to identify differences in sentiment, thematic content, and classification performance.
It asks whether reviews of games with female leads differ significantly from reviews of games with male leads, and what those differences suggest about player reception and potential bias.
- Investigates sentiment and topical content differences in game reviews based on protagonist gender.
- Offers a machine learning pipeline to predict review polarity (positive or negative).
- Aims to surface trends in language, tone, and content focus in gaming discourse.
- Serves as both a learning exercise and an exploration of real-world representation and audience bias in gaming.
- Python 3
- spaCy (NLP preprocessing)
- NLTK (sentiment analysis with VADER)
- scikit-learn (ML models, LDA topic modeling)
- imbalanced-learn (resampling techniques)
- Seaborn / Matplotlib (visualizations)
- Pandas / NumPy
- Jupyter Notebooks
- Preprocesses user reviews using spaCy and stores metadata for easy manipulation
- Performs sentiment analysis using NLTK’s VADER model
- Runs 5-fold cross-validation with SVM, Naive Bayes, and Random Forest classifiers
- Upsamples minority classes to handle extreme class imbalance
- Visualizes sentiment polarity and classification confusion matrices
- Extracts review topics using LDA and compares topic distribution by protagonist gender
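The cross-validation and upsampling steps listed above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the notebook's actual code: it uses `sklearn.utils.resample` in place of imbalanced-learn, `LinearSVC` with TF-IDF features as the SVM setup, and a made-up toy corpus. The helper names `upsample_minority` and `cross_validate_svm` are hypothetical. Upsampling is applied only to the training portion of each fold, so duplicated reviews never appear in the held-out portion.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.utils import resample


def upsample_minority(texts, labels, random_state=0):
    """Resample smaller classes with replacement up to the largest class size."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = int(counts.max())
    out_texts, out_labels = [], []
    for cls in classes:
        cls_texts = texts[labels == cls]
        # Classes already at the target size are kept unchanged.
        if len(cls_texts) < target:
            cls_texts = resample(
                cls_texts, replace=True, n_samples=target, random_state=random_state
            )
        out_texts.extend(cls_texts)
        out_labels.extend([cls] * len(cls_texts))
    return out_texts, np.array(out_labels)


def cross_validate_svm(texts, labels, n_splits=5, random_state=0):
    """Stratified k-fold accuracy, upsampling inside each training fold only."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    accuracies = []
    for train_idx, test_idx in skf.split(texts, labels):
        train_texts, train_labels = upsample_minority(
            texts[train_idx], labels[train_idx], random_state
        )
        # Fit the vectorizer on training text only, then reuse it for the test fold.
        vectorizer = TfidfVectorizer()
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(texts[test_idx])
        model = LinearSVC().fit(X_train, train_labels)
        accuracies.append(model.score(X_test, labels[test_idx]))
    return accuracies


# Toy corpus for illustration only (1 = positive review, 0 = negative review).
toy_texts = [
    "great story and fun combat",
    "loved the characters",
    "beautiful art and music",
    "fun and polished",
] * 5 + [
    "boring and buggy",
    "crashes constantly",
    "terrible controls",
    "waste of money",
    "dull story",
]
toy_labels = [1] * 20 + [0] * 5

fold_accuracies = cross_validate_svm(toy_texts, toy_labels)
print("Mean accuracy:", sum(fold_accuracies) / len(fold_accuracies))
```

Swapping `LinearSVC` for `MultinomialNB` or `RandomForestClassifier` in the same loop would give a comparable setup for the other two classifiers.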
/project-root
│
├── data/ # Pickled preprocessed review bins
├── MLTable.png # Summary results for ML models
├── README.md # This file
├── LICENSE # GPLv3 license
├── requirements.txt # Dependencies needed to run the notebook
├── steamScrape.py # Script to scrape data
├── analysis.html # Rendered notebook output for easy viewing
└── analysis.ipynb # Full analysis pipeline
- SVM Accuracy: ~95% with upsampling
- Statistical Test (Kolmogorov-Smirnov): Significant difference (p < 0.05) between the polarity distributions of male- and female-protagonist reviews
- Topic Modeling: Distinct topic distributions; female-lead games showed more discussion around "vibe"-oriented themes
- Sentiment Distributions: Reviews of female-led games skewed slightly more positive
Note: The datasets are small and imbalanced, which limits generalizability, and genre is a confounding variable.
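A two-sample Kolmogorov-Smirnov comparison like the one reported above could look roughly like this. It is a sketch under assumptions: it uses `scipy.stats.ks_2samp` (SciPy is not listed in the stack, so treat it as an assumed dependency), the helper name `compare_polarity` is hypothetical, and the scores are synthetic stand-ins, not the project's data.

```python
import numpy as np
from scipy.stats import ks_2samp


def compare_polarity(scores_a, scores_b, alpha=0.05):
    """Run a two-sample KS test on two sets of polarity scores."""
    result = ks_2samp(scores_a, scores_b)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "significant": bool(result.pvalue < alpha),
    }


# Synthetic scores clipped to [-1, 1], the range of VADER compound scores.
# These are random numbers for illustration, not results from the notebook.
rng = np.random.default_rng(0)
group_a_scores = np.clip(rng.normal(0.2, 0.4, 500), -1, 1)
group_b_scores = np.clip(rng.normal(0.3, 0.4, 500), -1, 1)

print(compare_polarity(group_a_scores, group_b_scores))
```

The KS test compares whole distributions, so it can flag differences in shape or spread as well as differences in average polarity.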
- How to handle small, imbalanced datasets with upsampling techniques
- How to interpret and visualize sentiment polarity using dictionary-based models
- How to implement and evaluate multiple supervised classifiers on sparse review data
- Challenges of topic modeling with LDA and the value of more modern transformer-based methods
- The importance of dataset scope, control variables (e.g., genre), and reproducibility in social-NLP analysis
git clone https://github.com/yourname/gendered-game-review-analysis.git
cd gendered-game-review-analysis
pip install -r requirements.txt
jupyter notebook analysis.ipynb