Predicting Airbnb listing prices in NYC using machine learning with data cleaning, EDA, and Random Forest regression.

huscse/ecornell-project-btt

🏑 Airbnb Price Prediction – NYC

This project applies machine learning to predict Airbnb listing prices in New York City, following the full ML lifecycle: from data exploration and preprocessing to model building, evaluation, and feature analysis.

📌 Project Overview

The goal of this project is to predict the price of an Airbnb listing based on its attributes (number of rooms, location, reviews, availability, etc.) and identify the most important factors influencing price.

The workflow follows these key steps:

  1. Data Collection & Cleaning – Handle missing values, outliers, and irrelevant columns.

  2. Exploratory Data Analysis (EDA) – Understand distributions, correlations, and potential outliers.

  3. Feature Engineering & Preprocessing – Encode categorical variables, scale numerical features.

  4. Modeling – Build and evaluate:

    • Linear Regression (baseline)
    • Random Forest Regressor with GridSearchCV hyperparameter tuning
  5. Evaluation – Compare models using MAE, RMSE, and R².

  6. Feature Importance – Identify top features driving Airbnb pricing.
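Steps 3 and 4 can be sketched as a single scikit-learn pipeline tuned with GridSearchCV. This is a minimal illustration, not the notebook's actual code: the data is synthetic, and the imputation strategy, the hyperparameter grid, and the 3-fold split are all assumptions chosen to keep the example small. Only the column names come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the listings data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bathrooms": rng.choice([1.0, 1.5, 2.0, np.nan], size=n),
    "neighbourhood_group_cleansed": rng.choice(
        ["Manhattan", "Brooklyn", "Queens"], size=n
    ),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
})
# Toy price rule, invented purely so the model has something to learn.
df["price"] = (
    40 * df["accommodates"]
    + 60 * (df["neighbourhood_group_cleansed"] == "Manhattan")
    - 30 * (df["room_type"] == "Private room")
    + rng.normal(0, 20, size=n)
)

numeric_cols = ["accommodates", "bathrooms"]
categorical_cols = ["neighbourhood_group_cleansed", "room_type"]

# Impute and scale numeric columns; one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(random_state=0)),
])

# Hypothetical, deliberately tiny grid.
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [None, 5]}

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="price"), df["price"], test_size=0.2, random_state=0
)

search = GridSearchCV(
    model, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_cv_rmse = -search.best_score_  # the scorer returns negated RMSE
test_predictions = search.predict(X_test)
print(search.best_params_, round(best_cv_rmse, 2))
```

Keeping preprocessing inside the pipeline means the imputer, scaler, and encoder are refit on each cross-validation fold, so no statistics from the validation fold leak into training.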


📊 Results

  • Linear Regression (Baseline)

    • RMSE: ~107
    • R²: ~0.40
  • Random Forest (Tuned)

    • Best RMSE: ~89
    • Lowers RMSE by roughly 17% relative to the linear baseline (~107 to ~89), consistent with the forest capturing non-linear relationships
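For reference, the three metrics used in this comparison can be computed as below. The arrays are made-up toy values, not the project's predictions. RMSE is taken as the square root of `mean_squared_error` so the snippet works across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not results from this project).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
# MAE=16.67  RMSE=19.15  R2=0.945
```

MAE and RMSE are in the same units as price, so an RMSE of ~89 reads as a typical error on the order of $89, weighted toward the largest misses.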

Top 5 Features Influencing Price:

  1. accommodates
  2. bathrooms
  3. neighbourhood_group_cleansed_Manhattan
  4. room_type_Private room
  5. minimum_nights_avg_ntm
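Names such as `room_type_Private room` are one-hot encoded columns: the original categorical column name joined to one of its values. A ranking like the one above can be produced from a fitted forest's `feature_importances_`. The sketch below uses synthetic data and `pd.get_dummies`; whether the notebook used `get_dummies` or `OneHotEncoder` is an assumption, since both yield names of this form.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (NOT the real listings).
rng = np.random.default_rng(42)
n = 400
X_raw = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
})
# Toy target in which accommodates dominates by construction.
y = (
    50 * X_raw["accommodates"]
    - 20 * (X_raw["room_type"] == "Private room")
    + rng.normal(0, 5, size=n)
)

# One-hot encoding creates columns like "room_type_Private room".
X = pd.get_dummies(X_raw, columns=["room_type"])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top_features = importances.head(5)
print(top_features)
```

Impurity-based importances describe how much the forest relied on a feature, not the direction or causal size of its effect on price.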

📈 Visualizations

  • Residual Distribution – Shows prediction errors and highlights outliers
  • Actual vs Predicted Prices – Evaluates regression performance visually
  • Feature Importance Plot – Explains which factors impact price the most
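The first two plots can be reproduced with plain Matplotlib along these lines. The arrays are random placeholders standing in for held-out prices and model predictions, and the figure layout is an assumption rather than the notebook's exact styling.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data (NOT the project's predictions).
rng = np.random.default_rng(1)
y_true = rng.uniform(50, 400, size=200)
y_pred = y_true + rng.normal(0, 40, size=200)
residuals = y_true - y_pred

fig, (ax_res, ax_fit) = plt.subplots(1, 2, figsize=(10, 4))

# Residual distribution: a long tail points to listings the model misses badly.
ax_res.hist(residuals, bins=30)
ax_res.set_xlabel("Residual (actual - predicted)")
ax_res.set_ylabel("Count")
ax_res.set_title("Residual Distribution")

# Actual vs predicted: points on the dashed diagonal are perfect predictions.
ax_fit.scatter(y_true, y_pred, s=10, alpha=0.6)
lims = [y_true.min(), y_true.max()]
ax_fit.plot(lims, lims, linestyle="--", color="black")
ax_fit.set_xlabel("Actual price")
ax_fit.set_ylabel("Predicted price")
ax_fit.set_title("Actual vs Predicted Prices")

fig.tight_layout()
```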

βš™οΈ Tech Stack

  • Python (Pandas, NumPy, Matplotlib, Seaborn)
  • Scikit-learn (Pipelines, GridSearchCV, RandomForestRegressor)
  • Jupyter Notebook

🚀 Future Improvements

  • Handle outliers with log-transformation or robust scaling
  • Try Gradient Boosting or XGBoost and check whether they improve on the Random Forest's RMSE
  • Add engineered features (e.g., price per bedroom, host activity level)
  • Use more rigorous cross-validation (e.g., repeated K-fold, or nested CV to keep hyperparameter tuning separate from evaluation)
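As a starting point for the log-transformation idea, scikit-learn's `TransformedTargetRegressor` can fit a model on `log1p(price)` and convert predictions back to dollars automatically. This is a sketch on synthetic right-skewed data, not something the project has implemented; the forest size and the 5-fold split are arbitrary choices.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic right-skewed "prices" (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(1, 8, size=(300, 2))
y = np.exp(3 + 0.3 * X[:, 0] + rng.normal(0, 0.4, size=300))

# Train on log1p(y); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

# RMSE is scored on the original price scale, so it remains comparable
# to a model trained on the untransformed target.
scores = cross_val_score(
    log_model, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

log_model.fit(X, y)
predictions = log_model.predict(X[:5])
print(rmse_per_fold.round(2), predictions.round(2))
```

Training on the log scale reduces the influence of a few very expensive listings on the fit, which is the same outlier concern raised in the first bullet.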

Author: Husnain Khaliq

This project demonstrates a complete end-to-end regression pipeline for real-world price prediction.
