Binary Classification Problem

Solving a binary classification problem using MLlib and PySpark

Goal

The goal is to build a binary classifier in a direct marketing campaign (phone calls) of a bank. The classification should predict whether a client will subsribe to a deposit or not.

Dataset

The dataset is from UCI Machine Learning and can be downloaded here

Statistical properties of numerical variables in dataset:

	0	1	2	3	4
summary	count	mean	stddev	min	max
age	41188	40.02406040594348	10.421249980934071	17	98
duration	41188	258.2850101971448	259.2792488364657	0	4918
campaign	41188	2.567592502670681	2.7700135429023445	1	56
pdays	41188	962.4754540157328	186.91090734474153	0	999
previous	41188	0.17296299893172767	0.49490107983929027	0	7
emp_var_rate	41188	0.0818855006312578	1.5709597405170233	-3.4	1.4
cons_price_idx	41188	93.57566436827008	0.5788400489541244	92.201	94.767
cons_conf_idx	41188	-40.50260027192037	4.628197856174547	-50.8	-26.9
euribor3m	41188	3.621290812858366	1.734447404851269	0.634	5.045
nr_employed	41188	5167.035910943202	72.25152766826123	4963.6	5228.1
label	41188	0.11265417111780131	0.3161734269429653	0	1

Approach

Screen and Prepare Data
Build Pipeline
Use Logisitc Regression Model
Use Decision Tree Classifier
Use Random Forest
Use Gradient-Boosted Tree
Perform grid search for parameter optimization

Conclusion

The variables in the dataset do not significantly correlate, which is why I decided to keep most of them for the model.

For the categorical features (i.e. job, martial, education, housing, etc.), a string indexer together with an appropriate encoder have been used. The Encoder transofrms the categorical features into vectors for further processing.

Together with the numeric features, all the relevant variables have been transformed into a feature vector using a vector assembler.

The following graph shows the coefficients of the linear regression performed on the resulting features:

The precision/recall curve for the logistic regression model is shown in the following graph:

Using the same pipeline, different tree-based models have been tries (decision tree, random forest, gradient-boosted). The gradient-boosted tree has beformed best with a AUROC of approx. 0.931 on the given test set.

In order to optimize/tune the parameters used, a parameter grid has been built and different trees have been trained and evaluated using a cross validator. The AUROC could slightly be improved to 0.935 which is not really significant.

Overall, the gradient-boosted tree has performed best, directly followed by the logistic regression.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
out		out
.gitignore		.gitignore
README.md		README.md
binary_classification.ipynb		binary_classification.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Binary Classification Problem

Goal

Dataset

Approach

Conclusion

About

Uh oh!

Releases

Packages

Languages

stephaniebernhard/binary_classification

Folders and files

Latest commit

History

Repository files navigation

Binary Classification Problem

Goal

Dataset

Approach

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages