This project implements a K-Nearest Neighbors (KNN) model to classify customers based on a marketing campaign dataset. The goal is to predict whether customers will purchase a product in the company’s 6th marketing campaign, enabling efficient resource allocation and targeted marketing strategies.
- Project Name: Customer Classification Using KNN
- Objective: Predict purchase decisions (1 for purchase, 0 for decline)
- Data Source: Marketing Campaign Dataset from Kaggle
The company is launching its 6th marketing campaign to promote a new product line and aims to:
- Identify customers likely to buy for focused marketing efforts.
- Exclude unlikely buyers to minimize costs.
- Understand factors driving purchase decisions.
- Task Type: Binary classification
- Input: Customer attributes (e.g., income, past campaign responses, online purchases)
- Output: Binary label: 1 (purchase) or 0 (decline)
- Source: Marketing Campaign dataset from Kaggle
- Size: 2,240 customers from a phone-based pilot campaign
- Labels: 1 (purchase), 0 (decline)
| Field Name | Meaning |
| --- | --- |
| ID | Customer ID |
| Year_birth | Year of birth |
| Education | Education level |
| Marital_status | Marital status |
| Income | Annual household income |
| Kidhome | Number of children in the household |
| Teenhome | Number of teenagers in the household |
| DtCustomer | Date when the customer information was first recorded in the system |
| Recency | Number of days since the last purchase |
| MntWines | Amount spent on wine products in the past 2 years |
| MntFishProducts | Amount spent on fish products in the past 2 years |
| MntMeatProducts | Amount spent on meat products in the past 2 years |
| MntFruits | Amount spent on fruit products in the past 2 years |
| MntSweetProducts | Amount spent on sweet products in the past 2 years |
| MntGoldProds | Amount spent on luxury products in the past 2 years |
| NumDealsPurchases | Number of purchases with promotions |
| NumStorePurchases | Number of in-store purchases |
| NumCatalogPurchases | Number of purchases made via catalog |
| NumWebPurchases | Number of purchases made via the website |
| NumWebVisitsMonth | Number of website visits in the last month |
| AcceptedCmp1 | 1 if the customer agreed to purchase in the first campaign, 0 if declined |
| AcceptedCmp2 | 1 if the customer agreed to purchase in the second campaign, 0 if declined |
| AcceptedCmp3 | 1 if the customer agreed to purchase in the third campaign, 0 if declined |
| AcceptedCmp4 | 1 if the customer agreed to purchase in the fourth campaign, 0 if declined |
| AcceptedCmp5 | 1 if the customer agreed to purchase in the fifth campaign, 0 if declined |
| Complain | 1 if the customer has complained about a product/service in the past 2 years, 0 otherwise |
| Response (target) | 1 if the customer accepted the last campaign, 0 if declined |
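A minimal loading sketch in pandas; the filename and separator are assumptions (some Kaggle exports of this dataset ship tab-separated rather than comma-separated):

```python
import pandas as pd

# Hypothetical local filename; adjust `sep` to match the actual download.
df = pd.read_csv("marketing_campaign.csv", sep="\t")

print(df.shape)                       # expect 2,240 rows
print(df["Response"].value_counts())  # class balance of the binary target
```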
- Splitting: Divided into 80% training (X_train_scaled, y_train) and 20% validation (X_val_scaled, y_val) sets.
- Normalization: Scaled the features so that KNN's Euclidean distance metric is not dominated by large-valued columns such as Income.
- Feature Selection: Retained 30 relevant features based on initial analysis (a preprocessing sketch follows this list).
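A minimal preprocessing sketch under stated assumptions: `df` is the loaded DataFrame, already reduced to the selected features plus the `Response` target; the exact 30-feature selection step from the original analysis is not reproduced here.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed: `df` contains only the selected features plus the target column.
X = df.drop(columns=["Response"])
y = df["Response"]

# 80/20 split; stratify so both sets keep the same class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features: unscaled Euclidean distances are dominated by wide-ranged columns.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_val_scaled = scaler.transform(X_val)          # reuse training statistics
```

Fitting the scaler on the training split alone avoids leaking validation statistics into the model.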
- Purpose: Analyzed the dataset to identify patterns and potential predictors of purchase behavior before modeling.
- Approach: Examined distributions of numerical features (Income, NumWebPurchases, Age) and categorical features (AcceptedCmp1-5, Education) using summary statistics and visualizations.
- Hypotheses:
- AcceptedCmp1-5: Prior campaign acceptance signals higher purchase likelihood.
- Income: Greater income may increase buying potential.
- NumWebPurchases: Frequent online purchases suggest receptiveness.
- Age: Certain age groups may be more inclined to buy.
- Findings:
- Positive AcceptedCmp5 responses strongly correlated with purchases.
- Higher NumWebPurchases linked to increased buying tendency.
- Income showed a moderate positive trend with purchases; Age varied by group (e.g., customers aged 30-50 appeared more likely to buy). A sketch of these EDA checks follows this list.
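A sketch of the kind of checks behind these findings, assuming the `df` DataFrame from the loading step; the reference year used to derive Age is an assumption, as the report does not state one.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Derive Age from the birth-year column (assumed reference year: 2024).
df["Age"] = 2024 - df["Year_birth"]

# Acceptance rate among prior-campaign responders vs. non-responders.
print(df.groupby("AcceptedCmp5")["Response"].mean())
print(df.groupby("AcceptedCmp1")["Response"].mean())

# Income distribution split by the target.
sns.histplot(data=df, x="Income", hue="Response", bins=40)
plt.title("Income distribution by campaign response")
plt.show()
```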
- K-Nearest Neighbors (KNN):
  - Distance metric: Euclidean (metric='euclidean')
  - Weighting: Tested weights='uniform' (all neighbors vote equally) and weights='distance' (closer neighbors count more)
- Preprocessing: Split the data 80/20 and scaled the features as described above.
- Training: Tested k from 1 to 30, focusing on k=9 and k=17 (see the tuning sketch after this list).
- Tuning: Compared training and validation performance, noting overfitting at k=17 with weights='distance'; preferred k=9 with weights='uniform'.
- Feature Analysis: Assessed each feature's impact on validation accuracy.
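A minimal sketch of this tuning loop, assuming the scaled arrays from the preprocessing sketch; the exact grid and reporting in the original notebook may differ.

```python
from sklearn.neighbors import KNeighborsClassifier

# Sweep k and both weighting schemes, tracking train/validation accuracy.
results = {}
for weights in ("uniform", "distance"):
    for k in range(1, 31):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights, metric="euclidean")
        knn.fit(X_train_scaled, y_train)
        results[(weights, k)] = (
            knn.score(X_train_scaled, y_train),  # training accuracy
            knn.score(X_val_scaled, y_val),      # validation accuracy
        )

# Compare the two configurations discussed above.
for cfg in (("uniform", 9), ("distance", 17)):
    train_acc, val_acc = results[cfg]
    print(f"weights={cfg[0]!r}, k={cfg[1]}: train={train_acc:.3f}, val={val_acc:.3f}")
```

Note that with weights='distance', training accuracy is near 100% by construction, since each training point is its own zero-distance neighbor; the train/validation gap is therefore the signal to watch.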
- Python, scikit-learn (KNN, scaling), pandas, numpy (data handling), matplotlib, seaborn (visualization)
- Environment: Jupyter Notebook
- At k=9 with weights='uniform': Training and validation accuracies were close (specific values pending refinement), indicating minimal overfitting.
- At k=17 with weights='distance': 99% training accuracy vs. 87% validation accuracy (a 12-percentage-point gap, indicating overfitting).
Visualization:
- Train vs. Validation Accuracy Plot (at k=9)
- Train vs. Validation Accuracy Plot (k=17, weights='distance'); a sketch for producing these plots follows.
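A sketch for producing these plots from the `results` dictionary built in the tuning loop above (assumed variable names):

```python
import matplotlib.pyplot as plt

ks = range(1, 31)
train_acc = [results[("uniform", k)][0] for k in ks]
val_acc = [results[("uniform", k)][1] for k in ks]

plt.plot(ks, train_acc, label="train accuracy")
plt.plot(ks, val_acc, label="validation accuracy")
plt.axvline(9, linestyle="--", color="gray")  # highlight the preferred k=9
plt.xlabel("k (number of neighbors)")
plt.ylabel("accuracy")
plt.legend()
plt.title("Train vs. validation accuracy (weights='uniform')")
plt.show()
```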
| Feature | Importance Score | Interpretation |
| --- | --- | --- |
| AcceptedCmp5 | 0.008929 | Top predictor (past engagement) |
| AcceptedCmp1 | 0.006696 | Prior campaign response |
| Age | 0.006696 | Demographic factor |
| AcceptedCmp3 | 0.004464 | Additional campaign influence |
| Income | 0.002232 | Financial capacity |
| NumWebPurchases | 0.002232 | Online purchase behavior |
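The report does not state how these scores were computed; since KNN exposes no built-in feature importances, one common approach is permutation importance, sketched here under that assumption (the reported scores are all multiples of 1/448, consistent with accuracy changes measured on the 448-sample validation split).

```python
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

# Refit the preferred configuration from the tuning step.
best_knn = KNeighborsClassifier(n_neighbors=9, weights="uniform", metric="euclidean")
best_knn.fit(X_train_scaled, y_train)

# Permutation importance: drop in validation accuracy when a feature is shuffled.
perm = permutation_importance(
    best_knn, X_val_scaled, y_val, scoring="accuracy", n_repeats=10, random_state=42
)
for idx in perm.importances_mean.argsort()[::-1][:6]:
    print(f"{X.columns[idx]}: {perm.importances_mean[idx]:.6f}")
```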
- Simplicity: Easy to implement and interpret, suitable for small datasets like this (2,240 samples).
- EDA Alignment: Feature importance results matched EDA hypotheses, enhancing reliability.
- Flexibility: Adjustable k and weighting options allow performance tuning.
- Overfitting Risk: Evident at k=17 with weights='distance' (12-percentage-point accuracy gap), requiring careful parameter selection.
- Scalability: Computationally intensive for larger datasets due to distance calculations.
- Sensitivity to Noise: Performance may degrade with irrelevant or noisy features.