An end-to-end NLP project for analyzing sentiment in social media posts (positive / negative / neutral).
Includes data cleaning, text normalization, feature extraction (TF-IDF / embeddings), classical ML baselines (Logistic Regression, Naive Bayes, SVM), and optional deep learning (Bi-LSTM). Provides rigorous evaluation with Accuracy, F1-score, and confusion matrix.
This repository demonstrates a reproducible sentiment analysis pipeline, from raw text to deployable models.
It covers:
- Preprocessing & normalization (lowercasing, punctuation removal, stopwords, lemmatization)
- Feature extraction with TF-IDF (n-grams) or pretrained embeddings
- Model training & comparison (LogReg, NB, SVM, Random Forest)
- (Optional) Bi-LSTM sequence model for improved context handling
- Explainability & error analysis (misclassification review)
- Input: CSV with columns like `id`, `text`, `label` (`pos`/`neg`/`neu` or `1`/`0`).
- Location: place your files under `Dataset/` (e.g., `train.csv`, `test.csv`).
- Class balance: check for imbalance; consider stratified splits and class weights.
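
As a rough illustration of the class-balance check, the sketch below counts labels in a CSV and derives per-class weights with the `n_samples / (n_classes * class_count)` heuristic, the same formula scikit-learn uses for `class_weight="balanced"`. The inline `SAMPLE_CSV` rows and the helper names `label_counts` and `balanced_class_weights` are made up for this example and stand in for a real `Dataset/train.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for Dataset/train.csv (columns: id, text, label).
SAMPLE_CSV = """id,text,label
1,love this phone,pos
2,worst service ever,neg
3,it arrived today,neu
4,really great support,pos
5,so happy with it,pos
6,not bad not good,neu
"""


def label_counts(csv_file):
    """Count how many rows carry each label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


counts = label_counts(io.StringIO(SAMPLE_CSV))
weights = balanced_class_weights(counts)
print(counts)   # rare classes have small counts...
print(weights)  # ...and therefore receive larger weights
```

With a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle, for example `open("Dataset/train.csv", newline="", encoding="utf-8")`.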
If you use a public dataset (e.g., Twitter Sentiment, Sentiment140), cite the source in this README.
- Clean text: lowercase; remove URLs, mentions, and hashtags (optionally keeping the hashtag text); handle emojis
- Tokenization, stopword removal, lemmatization
- TF-IDF with n-grams (1-2 or 1-3) and a `max_features` cap
- (Optional) Emoji/emoticon normalization and slang expansion
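
The cleaning steps listed above could be sketched with the standard library alone. This is an illustrative version, not the notebook's actual code: the function name `clean_post`, the `keep_hashtag_text` flag, and the choice to simply drop emojis and punctuation are assumptions. Stopword removal and lemmatization (for example with `nltk`) are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_post(text, keep_hashtag_text=True):
    """Lowercase a post and strip URLs, mentions, hashtags, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Either keep the word behind the '#' or drop the whole hashtag.
    text = HASHTAG_RE.sub(r" \1 " if keep_hashtag_text else " ", text)
    # Assumption: anything that is not a-z, 0-9, or whitespace is dropped,
    # which also removes emojis instead of mapping them to words.
    text = NON_WORD_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_post("Loving the new update!! @devteam #GreatJob https://example.com/x"))
# loving the new update greatjob
```

If emojis carry sentiment in your data, mapping them to tokens before the `NON_WORD_RE` step would preserve that signal.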
- Logistic Regression (strong linear baseline on TF-IDF)
- Multinomial Naive Bayes (fast baseline for sparse text)
- Linear SVM (robust with TF-IDF)
- Random Forest (tabular baseline)
- (Optional) Bi-LSTM with pretrained word embeddings
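
One possible way to set up three of the classical baselines above on TF-IDF features with scikit-learn is sketched here. The toy `texts`/`labels`, the helper `build_baselines`, and the hyperparameter values are illustrative assumptions, not the settings used in the notebook; Random Forest and the Bi-LSTM are omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only.
texts = [
    "i love this so much", "what a great day", "absolutely fantastic service",
    "i hate this", "terrible and slow", "worst purchase ever",
    "it arrived on monday", "the box is blue", "meeting starts at noon",
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg", "neu", "neu", "neu"]


def build_baselines(ngram_range=(1, 2), max_features=200_000, seed=42):
    """Return one TF-IDF + classifier pipeline per baseline."""
    def tfidf():
        # A fresh vectorizer per pipeline so the models share no state.
        return TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)

    return {
        "logreg": make_pipeline(tfidf(), LogisticRegression(max_iter=1000, random_state=seed)),
        "nb": make_pipeline(tfidf(), MultinomialNB()),
        "svm": make_pipeline(tfidf(), LinearSVC(random_state=seed)),
    }


models = build_baselines()
for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(["what a great purchase"]))
```

Nine sentences are far too few to compare models meaningfully; on real data, score each pipeline on a held-out split or with cross-validation.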
Model | Accuracy | F1 (macro) |
---|---|---|
Logistic Regression | 0.90 | 0.89 |
Multinomial NB | 0.88 | 0.87 |
Linear SVM | 0.92 | 0.91 |
Random Forest | 0.86 | 0.85 |
Bi-LSTM (optional) | 0.93 | 0.92 |
Export:
- Confusion matrix for 3-class sentiment
- Per-class precision/recall/F1
- Top informative features (for linear models)
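
To make the reported metrics concrete, here is a small dependency-free sketch of how a 3-class confusion matrix, per-class precision/recall/F1, and macro F1 are computed. The label lists are invented for the example; in practice `sklearn.metrics.confusion_matrix` and `f1_score(..., average="macro")` provide the same quantities.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label."""
    cm = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum
        actual = sum(cm[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(report):
    """Unweighted mean of the per-class F1 scores."""
    return sum(r["f1"] for r in report.values()) / len(report)


labels = ["neg", "neu", "pos"]
y_true = ["pos", "pos", "neg", "neu", "neg", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neg", "pos"]

cm = confusion_matrix(y_true, y_pred, labels)
report = per_class_report(y_true, y_pred, labels)
print(cm)                          # [[2, 0, 0], [0, 1, 1], [0, 1, 1]]
print(round(macro_f1(report), 3))  # 0.667
```

Macro F1 weights every class equally, so it penalizes a model that ignores a minority class even when overall accuracy looks high.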
Social-Media-Sentiment-Analysis/
├── Dataset/                      # train/test CSVs
├── Notebook/
│   └── Sentiment_Analysis.ipynb  # main notebook
├── src/                          # optional scripts
│   ├── data.py                   # loading/cleaning
│   ├── features.py               # TF-IDF / embeddings
│   ├── train.py                  # training & tuning
│   └── eval.py                   # metrics & plots
├── reports/figures/              # CM, PR curves, feature plots
├── requirements.txt
├── .gitignore
└── README.md
- Clone & install
git clone https://github.com/ziaee-mohammad/Social-Media-Sentiment-Analysis.git
cd Social-Media-Sentiment-Analysis
pip install -r requirements.txt
- Run notebook
jupyter notebook Notebook/Sentiment_Analysis.ipynb
- (Optional) Run scripts
python -m src.train --model "svm" --ngrams 1,2 --max_features 200000
python -m src.eval --report
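
The contents of `src/train.py` are not shown in this README. Purely as a hypothetical sketch, flags such as `--model`, `--ngrams 1,2`, and `--max_features` could be parsed with `argparse` as below; the helper `parse_ngrams`, the default values, and every `--model` choice other than `svm` are assumptions.

```python
import argparse


def parse_ngrams(value):
    """Turn a string such as "1,2" into the tuple (1, 2)."""
    parts = value.split(",")
    if len(parts) != 2:
        raise argparse.ArgumentTypeError("expected MIN,MAX, for example 1,2")
    try:
        low, high = int(parts[0]), int(parts[1])
    except ValueError:
        raise argparse.ArgumentTypeError("MIN and MAX must be integers")
    if low < 1 or high < low:
        raise argparse.ArgumentTypeError("need 1 <= MIN <= MAX")
    return low, high


def build_parser():
    """Build a parser for the illustrative training flags."""
    parser = argparse.ArgumentParser(description="Train a sentiment baseline (illustrative)")
    # Choices other than "svm" are hypothetical names.
    parser.add_argument("--model", choices=["logreg", "nb", "svm", "rf"], default="svm")
    parser.add_argument("--ngrams", type=parse_ngrams, default=(1, 2))
    parser.add_argument("--max_features", type=int, default=200_000)
    return parser


args = build_parser().parse_args(
    ["--model", "svm", "--ngrams", "1,2", "--max_features", "200000"]
)
print(args.model, args.ngrams, args.max_features)  # svm (1, 2) 200000
```

The parsed tuple can be passed directly as `ngram_range` to scikit-learn's `TfidfVectorizer`.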
pandas
numpy
scikit-learn
nltk
matplotlib
seaborn
If using Bi-LSTM: add `torch` (or `tensorflow`) and suitable tokenizers/embeddings.
- Use stratified train/test split
- Keep vectorizer + model in a single `Pipeline` to avoid leakage
- Fix random seeds for reproducibility
- Save vectorizer/model artifacts if you plan to deploy
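
The tips above can be combined in one short sketch: a stratified split, a fixed seed, the vectorizer and classifier wrapped in a single scikit-learn `Pipeline`, and the fitted pipeline saved with `joblib`. The toy data, the `SEED` value, and the file name `sentiment_pipeline.joblib` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

SEED = 42  # fixed seed so the split and the model are reproducible

# Toy data for illustration only: four posts per class.
texts = [
    "i love this", "great job team", "so happy today", "best update ever",
    "i hate this", "awful experience", "so disappointed", "worst update ever",
    "package arrived", "the store opens at nine", "it is tuesday", "see attached file",
]
labels = ["pos"] * 4 + ["neg"] * 4 + ["neu"] * 4

# stratify=labels keeps the class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=SEED
)

# The vectorizer is fitted on training text only because it lives inside
# the pipeline, so no test-set vocabulary or IDF statistics leak in.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=SEED)),
])
pipeline.fit(X_train, y_train)

# Persist vectorizer and model together as one artifact, then reload it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

print(sorted(y_test))
print(list(restored.predict(X_test)))
```

Saving the whole pipeline rather than the classifier alone means the exact vocabulary and IDF weights seen during training are reused at inference time.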
data-science
machine-learning
nlp
sentiment-analysis
text-mining
classification
python
scikit-learn
tf-idf
---
## Author
**Mohammad Ziaee**, Computer Engineer | AI & Data Science
Email: [email protected]
GitHub: https://github.com/ziaee-mohammad
---
## License
MIT: free to use and adapt with attribution.