This project applies machine learning to predict Airbnb listing prices in New York City, following the full ML lifecycle: from data exploration and preprocessing to model building, evaluation, and feature analysis.
The goal of this project is to predict the price of an Airbnb listing based on its attributes (number of rooms, location, reviews, availability, etc.) and identify the most important factors influencing price.
The workflow follows these key steps:
- Data Collection & Cleaning: Handle missing values, outliers, and irrelevant columns.
- Exploratory Data Analysis (EDA): Understand distributions, correlations, and potential outliers.
- Feature Engineering & Preprocessing: Encode categorical variables and scale numerical features.
- Modeling: Build and evaluate two models.
  - Linear Regression (baseline)
  - Random Forest Regressor with GridSearchCV hyperparameter tuning
- Evaluation: Compare models using MAE, RMSE, and R².
- Feature Importance: Identify the top features driving Airbnb pricing.
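
The preprocessing and modeling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the data frame is synthetic, the column subset and the hyperparameter grid are assumptions chosen to keep the example small, and the real pipeline may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned listings table (NOT the real Airbnb data).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bathrooms": rng.integers(1, 4, n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
})
df["price"] = (
    40 * df["accommodates"]
    + 30 * df["bathrooms"]
    - 50 * (df["room_type"] == "Private room")
    + rng.normal(0, 20, n)
)

numeric_cols = ["accommodates", "bathrooms"]
categorical_cols = ["room_type"]

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric_cols + categorical_cols], df["price"], test_size=0.2, random_state=0
)

def make_preprocessor():
    """Scale numeric columns and one-hot encode categorical ones."""
    return ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

# Baseline: linear regression on the preprocessed features.
baseline = Pipeline([("prep", make_preprocessor()), ("model", LinearRegression())])
baseline.fit(X_train, y_train)

# Random forest tuned with a small, assumed grid; scoring is negated RMSE.
forest = Pipeline([
    ("prep", make_preprocessor()),
    ("model", RandomForestRegressor(random_state=0)),
])
param_grid = {"model__n_estimators": [50, 100], "model__max_depth": [None, 5]}
search = GridSearchCV(forest, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

print("baseline test R^2:", round(baseline.score(X_test, y_test), 3))
print("best params:", search.best_params_)
print("best CV RMSE:", round(-search.best_score_, 2))
```

Wrapping the preprocessing inside the `Pipeline` means the scaler and encoder are refit on each cross-validation training fold, which avoids leaking information from the validation fold into the preprocessing step.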

Results:
- Linear Regression (Baseline)
  - RMSE: ~107
  - R²: ~0.40
- Random Forest (Tuned)
  - Best RMSE: ~89
  - Outperforms linear regression and captures non-linear relationships
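
For reference, the three evaluation metrics can be computed by hand as in the sketch below. The function name `regression_metrics` and the toy numbers are illustrative assumptions; they are not the project's data and do not reproduce the scores listed above. MAE and RMSE are expressed in the same units as the target (here, price), while R² is unitless.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Toy values for illustration only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE squares each error before averaging, a handful of very expensive listings can dominate it, which is one reason to report MAE alongside it.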
Top 5 Features Influencing Price:
- `accommodates`
- `bathrooms`
- `neighbourhood_group_cleansed_Manhattan`
- `room_type_Private room`
- `minimum_nights_avg_ntm`
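
Names such as `room_type_Private room` have the shape that one-hot encoding produces: the original column name joined to one of its category values. The sketch below shows one way such a ranking could be obtained from a fitted random forest. It assumes `pd.get_dummies` for the encoding and uses synthetic data with invented coefficients, so the resulting ranking is illustrative and is not the project's result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic listings (NOT the real data); the price formula is made up.
rng = np.random.default_rng(42)
n = 500
raw = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bathrooms": rng.integers(1, 4, n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
    "neighbourhood_group_cleansed": rng.choice(["Manhattan", "Brooklyn", "Queens"], n),
})
price = (
    45 * raw["accommodates"]
    + 25 * raw["bathrooms"]
    + 60 * (raw["neighbourhood_group_cleansed"] == "Manhattan")
    - 40 * (raw["room_type"] == "Private room")
    + rng.normal(0, 15, n)
)

# One-hot encode: each category becomes a column named "<column>_<value>".
X = pd.get_dummies(
    raw, columns=["room_type", "neighbourhood_group_cleansed"], dtype=float
)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, price)

# Impurity-based importances, one per encoded column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top5 = importances.head(5)
print(top5)
```

Impurity-based importances describe how the forest split the training data rather than causal effects, and they can be shared unevenly among correlated columns (for example, the two `room_type_*` dummies here).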

Visualizations:
- Residual Distribution: Shows prediction errors and highlights outliers
- Actual vs Predicted Prices: Evaluates regression performance visually
- Feature Importance Plot: Shows which factors impact price the most
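
The first two diagnostic plots could be produced along these lines. This is a sketch using plain Matplotlib with synthetic `y_test` and `y_pred` arrays standing in for a real model's held-out targets and predictions; the project's own figures may use Seaborn and different styling.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for held-out prices and model predictions.
rng = np.random.default_rng(1)
y_test = rng.uniform(50, 400, 200)
y_pred = y_test + rng.normal(0, 40, 200)
residuals = y_test - y_pred  # positive residual = model under-predicted

fig, (ax_res, ax_fit) = plt.subplots(1, 2, figsize=(10, 4))

# Residual distribution: ideally centered on zero, with long tails flagging outliers.
ax_res.hist(residuals, bins=30)
ax_res.axvline(0, color="black", linestyle="--")
ax_res.set_xlabel("Residual (actual - predicted)")
ax_res.set_ylabel("Count")
ax_res.set_title("Residual Distribution")

# Actual vs predicted: points on the diagonal are perfect predictions.
ax_fit.scatter(y_test, y_pred, alpha=0.5)
limits = [y_test.min(), y_test.max()]
ax_fit.plot(limits, limits, color="red")
ax_fit.set_xlabel("Actual price")
ax_fit.set_ylabel("Predicted price")
ax_fit.set_title("Actual vs Predicted Prices")

fig.tight_layout()
# In a notebook the figure renders inline; in a script, save it with fig.savefig(...).
```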

Tech Stack:
- Python (Pandas, NumPy, Matplotlib, Seaborn)
- Scikit-learn (Pipelines, GridSearchCV, RandomForestRegressor)
- Jupyter Notebook

Future Improvements:
- Handle outliers with a log transformation or robust scaling
- Try Gradient Boosting or XGBoost for better accuracy
- Add engineered features (e.g., price per bedroom, host activity level)
- Perform more advanced cross-validation
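
As a rough illustration of how some of these ideas might fit together, the sketch below trains a gradient boosting model on a log-transformed target and scores it with shuffled 5-fold cross-validation. The data is synthetic and right-skewed by construction, and the choice of `log1p`/`expm1`, `GradientBoostingRegressor`, and five folds are assumptions made for the example, not decisions taken in the project.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features and right-skewed "prices" (NOT the real data).
rng = np.random.default_rng(7)
n = 400
X = rng.uniform(0, 1, size=(n, 3))
y = np.exp(4 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, n))

# Fit on log1p(price); predictions are mapped back to prices with expm1.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Shuffled K-fold gives one RMSE per fold, in original price units.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("mean RMSE:", round(rmse_per_fold.mean(), 2))
```

Because `TransformedTargetRegressor` inverts the transform before scoring, the reported RMSE stays comparable to models trained directly on raw prices.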

Author: Husnain Khaliq

This project demonstrates a complete end-to-end regression pipeline for real-world price prediction.