stephaniebernhard/binary_classification

Binary Classification Problem

Solving a binary classification problem using MLlib and PySpark

Goal

The goal is to build a binary classifier for a bank's direct marketing campaign (phone calls). The classifier should predict whether a client will subscribe to a term deposit or not.

Dataset

The dataset is the Bank Marketing dataset and can be downloaded from the UCI Machine Learning Repository.

Statistical properties of the numerical variables in the dataset:

| variable | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| age | 41188 | 40.024 | 10.421 | 17 | 98 |
| duration | 41188 | 258.285 | 259.279 | 0 | 4918 |
| campaign | 41188 | 2.568 | 2.770 | 1 | 56 |
| pdays | 41188 | 962.475 | 186.911 | 0 | 999 |
| previous | 41188 | 0.173 | 0.495 | 0 | 7 |
| emp_var_rate | 41188 | 0.082 | 1.571 | -3.4 | 1.4 |
| cons_price_idx | 41188 | 93.576 | 0.579 | 92.201 | 94.767 |
| cons_conf_idx | 41188 | -40.503 | 4.628 | -50.8 | -26.9 |
| euribor3m | 41188 | 3.621 | 1.734 | 0.634 | 5.045 |
| nr_employed | 41188 | 5167.036 | 72.252 | 4963.6 | 5228.1 |
| label | 41188 | 0.113 | 0.316 | 0 | 1 |

Approach

  1. Screen and Prepare Data
  2. Build Pipeline
  3. Use Logistic Regression Model
  4. Use Decision Tree Classifier
  5. Use Random Forest
  6. Use Gradient-Boosted Tree
  7. Perform grid search for parameter optimization

Conclusion

The variables in the dataset are not strongly correlated with one another, which is why I decided to keep most of them for the model.

*(Figure: correlation matrix of the variables)*

For the categorical features (i.e. job, marital, education, housing, etc.), a string indexer together with an appropriate encoder has been used. The encoder transforms the categorical features into vectors for further processing.

Together with the numeric features, all relevant variables have then been combined into a single feature vector using a vector assembler.

The following graph shows the coefficients of the logistic regression fitted on the resulting features:

*(Figure: logistic regression coefficients)*

The precision/recall curve for the logistic regression model is shown in the following graph:

*(Figure: precision/recall curve of the logistic regression model)*

Using the same pipeline, different tree-based models have been tried (decision tree, random forest, gradient-boosted). The gradient-boosted tree performed best, with an AUROC of approx. 0.931 on the given test set.

In order to tune the hyperparameters, a parameter grid has been built and the different trees have been trained and evaluated using a cross-validator. The AUROC improved only slightly, to 0.935, which is not a significant gain.

Overall, the gradient-boosted tree has performed best, followed closely by the logistic regression.
