Solving a binary classification problem using MLlib and PySpark
The goal is to build a binary classifier for a bank's direct marketing campaign (phone calls). The classifier should predict whether a client will subscribe to a deposit or not.
The dataset is from the UCI Machine Learning Repository and can be downloaded there.
Summary statistics of the numerical variables in the dataset:
summary | count | mean | stddev | min | max |
---|---|---|---|---|---|
age | 41188 | 40.02406040594348 | 10.421249980934071 | 17 | 98 |
duration | 41188 | 258.2850101971448 | 259.2792488364657 | 0 | 4918 |
campaign | 41188 | 2.567592502670681 | 2.7700135429023445 | 1 | 56 |
pdays | 41188 | 962.4754540157328 | 186.91090734474153 | 0 | 999 |
previous | 41188 | 0.17296299893172767 | 0.49490107983929027 | 0 | 7 |
emp_var_rate | 41188 | 0.0818855006312578 | 1.5709597405170233 | -3.4 | 1.4 |
cons_price_idx | 41188 | 93.57566436827008 | 0.5788400489541244 | 92.201 | 94.767 |
cons_conf_idx | 41188 | -40.50260027192037 | 4.628197856174547 | -50.8 | -26.9 |
euribor3m | 41188 | 3.621290812858366 | 1.734447404851269 | 0.634 | 5.045 |
nr_employed | 41188 | 5167.035910943202 | 72.25152766826123 | 4963.6 | 5228.1 |
label | 41188 | 0.11265417111780131 | 0.3161734269429653 | 0 | 1 |
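
The `label` row is worth a second look: the mean of a 0/1 column is the share of positive examples. The following short check (plain Python, with the two numbers copied from the table above) is only an illustration and not part of the original analysis; it shows how imbalanced the two classes are.

```python
# Values copied from the `label` row of the summary table.
n_rows = 41188
label_mean = 0.11265417111780131  # mean of a 0/1 column = share of positives

n_positive = round(n_rows * label_mean)
n_negative = n_rows - n_positive

# Accuracy of a trivial model that always predicts "no subscription".
baseline_accuracy = n_negative / n_rows

print(f"positives: {n_positive}, negatives: {n_negative}")
print(f"always-negative baseline accuracy: {baseline_accuracy:.3f}")
```

Only about 11% of the clients subscribed, so a model that always answers "no" already reaches roughly 89% accuracy. Ranking metrics such as AUROC and the precision/recall curve are therefore more informative than plain accuracy for this problem.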
- Screen and Prepare Data
- Build Pipeline
- Use Logistic Regression Model
- Use Decision Tree Classifier
- Use Random Forest
- Use Gradient-Boosted Tree
- Perform grid search for parameter optimization
The variables in the dataset do not significantly correlate, which is why I decided to keep most of them for the model.
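One way such a correlation screen could look is sketched below in plain Python. This is an assumption about the approach, not the code used for this project: the helper names `pearson` and `correlated_pairs`, the 0.9 threshold, and the toy columns are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up toy columns, NOT values from the bank dataset.
toy = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with x
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = correlated_pairs(toy)
print(pairs)
```

Pairs reported by such a screen are candidates for dropping one of the two columns; columns that never appear in a pair can be kept.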
For the categorical features (i.e. job, marital, education, housing, etc.), a string indexer was used together with an appropriate encoder. The encoder transforms the indexed categorical features into vectors for further processing.
The encoded categorical features and the numeric features were then combined into a single feature vector using a vector assembler.
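To make the three steps concrete, here is a plain-Python sketch of what they compute, assuming the encoder is a one-hot encoder. It does not use Spark: the functions `fit_string_index`, `one_hot` and `assemble` are hypothetical stand-ins that mimic the default behaviour of Spark ML's `StringIndexer` (most frequent category gets index 0), `OneHotEncoder` (the last category is dropped and encoded as all zeros) and `VectorAssembler`. The rows are made up.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(row, categorical_cols, numeric_cols, indexers):
    """Concatenate the encoded categorical columns and the numeric columns."""
    features = []
    for col in categorical_cols:
        mapping = indexers[col]
        features.extend(one_hot(mapping[row[col]], len(mapping)))
    features.extend(float(row[col]) for col in numeric_cols)
    return features


# Made-up rows, NOT taken from the bank dataset.
rows = [
    {"marital": "married", "housing": "yes", "age": 41, "campaign": 2},
    {"marital": "single", "housing": "no", "age": 29, "campaign": 1},
    {"marital": "married", "housing": "yes", "age": 55, "campaign": 4},
    {"marital": "divorced", "housing": "yes", "age": 38, "campaign": 1},
    {"marital": "single", "housing": "no", "age": 33, "campaign": 3},
    {"marital": "married", "housing": "yes", "age": 47, "campaign": 2},
]
categorical_cols = ["marital", "housing"]
numeric_cols = ["age", "campaign"]

# "Fit" one indexer per categorical column, then transform every row.
indexers = {c: fit_string_index([r[c] for r in rows]) for c in categorical_cols}
feature_vectors = [assemble(r, categorical_cols, numeric_cols, indexers) for r in rows]

for vec in feature_vectors:
    print(vec)
```

Spark stores the encoded columns as sparse vectors rather than Python lists, but the layout of the assembled feature vector is the same: one block per categorical column, followed by the numeric values.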
The following graph shows the coefficients of the logistic regression fitted on the resulting features:
The precision/recall curve for the logistic regression model is shown in the following graph:
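For reference, the points of such a curve can be computed from the predicted probabilities and the true labels as sketched below. This is a minimal plain-Python illustration; the function name `precision_recall_points` and the toy scores are made up and do not come from the model above.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for each distinct score.

    A row counts as predicted positive when its score is >= threshold.
    Assumes labels are 0/1 and contain at least one positive.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct score, after all rows tied at that score.
        last_of_score = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_score:
            points.append((score, tp / (tp + fp), tp / total_pos))
    return points


# Made-up example: six predictions with their true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for threshold, precision, recall in precision_recall_points(labels, scores):
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold can only raise recall, while precision usually falls; the curve shows this trade-off across all thresholds.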
Using the same pipeline, different tree-based models were tried (decision tree, random forest, gradient-boosted tree). The gradient-boosted tree performed best, with an AUROC of approx. 0.931 on the given test set.
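AUROC (area under the ROC curve) can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. It is meant for small inputs only, and the function name `auroc` and the toy data are made up; in Spark ML, `BinaryClassificationEvaluator` reports this metric by default (`areaUnderROC`).

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(n_pos * n_neg), so it is only suitable for small inputs.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: six predictions with their true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUROC = {auroc(labels, scores):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the 0.931 reported above means the model orders a positive above a negative in about 93% of such pairs.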
To tune the parameters, a parameter grid was built, and the resulting candidate models were trained and evaluated using a cross validator. This raised the AUROC only slightly, to 0.935, which is a marginal gain.
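The mechanics of such a search, and its cost, can be sketched in plain Python as follows. This is not the Spark code used here: `build_param_grid`, `k_fold_splits` and `grid_search` are hypothetical helpers, the grid values are placeholders, and the scoring function is a stand-in that trains nothing. `maxDepth` and `maxIter` are real `GBTClassifier` parameter names, but whether they were the ones tuned is an assumption.

```python
from itertools import product


def build_param_grid(options):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(options)
    return [dict(zip(names, combo))
            for combo in product(*(options[name] for name in names))]


def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]  # every k-th row, offset by the fold number
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def grid_search(options, n_rows, k, fit_and_score):
    """Return (best_params, best_mean_score) over all grid points."""
    best_params, best_score = None, float("-inf")
    for params in build_param_grid(options):
        scores = [fit_and_score(params, train, validation)
                  for train, validation in k_fold_splits(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid; the values are placeholders, not the ones used here.
options = {"maxDepth": [2, 4, 6], "maxIter": [10, 20]}
grid = build_param_grid(options)

calls = []


def placeholder_fit_and_score(params, train, validation):
    """Stand-in for 'fit a model on train, return its AUROC on validation'."""
    calls.append((params, len(train), len(validation)))
    return 0.0  # no real model is trained in this sketch


grid_search(options, n_rows=12, k=3, fit_and_score=placeholder_fit_and_score)
print(len(grid), "grid points,", len(calls), "model fits")
```

The number of model fits is the grid size times the number of folds, which is why even a small grid is expensive for gradient-boosted trees. Spark's `CrossValidator` assigns rows to folds randomly rather than by stride, uses three folds by default, and refits the best parameter combination on the full training data afterwards.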
Overall, the gradient-boosted tree performed best, closely followed by the logistic regression.