
Coding a Random Forest from scratch using NumPy

1. Overview:

In this repo, I coded a Random Forest (RF) model from scratch (using only NumPy), achieving 88% accuracy on the UCI Census Income dataset.

The goal is to use this RF model to predict whether an individual's annual income is >= 50,000. This makes it a binary classification problem with two labels: 0 if an individual's income is < 50,000, and 1 if it is >= 50,000.
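As a quick illustration of that target encoding (a minimal sketch only; the actual preprocessing lives in src/datapipeline.py, and the values below are made up):

import numpy as np

# Illustrative example: encode annual income into the two labels described above.
incomes = np.array([23_000, 75_000, 49_999, 120_000])
labels = (incomes >= 50_000).astype(int)  # 1 if income >= 50,000, else 0
print(labels)  # [0 1 0 1]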

2. Directory Structure:

rf_from_scratch
│   .gitignore
│   model_train_eval.ipynb
│   README.md
│   requirements.txt
│
├───assets
│       decision_tree.png
│
├───data
│       data.csv
│
└───src
        datapipeline.py
        decision_tree.py
        random_forest.py
        __init__.py

The tree above shows the directory structure of this repo. The model files are located in src/ as decision_tree.py and random_forest.py. src/datapipeline.py is a data cleaning pipeline that processes the data and performs the train/test split.
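For reference, a NumPy-only train/test split might look like the minimal sketch below; the function name and split ratio are illustrative assumptions, not the exact interface of src/datapipeline.py.

import numpy as np

def train_test_split(X, y, test_size=0.2, seed=42):
    # Shuffle row indices, then slice into train and test partitions.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_size))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]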

Typically, we do not upload data files to a git repo, but for the purposes of replication and demonstration of the working model, I have included the dataset I used from the UCI Machine Learning Repository in the data/ folder.

If you would like to see the coded RF put to use, the key file is model_train_eval.ipynb, located at the repository root.

3. Coding the Model:

A Random Forest is a multitude of Decision Trees, so to build a Random Forest we first need to code a Decision Tree.

3.1 Decision Tree:

[Figure: decision tree diagram (assets/decision_tree.png)]

My decision tree classifier was built from scratch using Python and NumPy. It recursively splits the dataset based on feature thresholds that minimize Gini impurity, creating nodes that store feature indices, thresholds, and class labels. The tree grows until a stopping criterion (max depth or pure nodes) is met. The fit method selects the best splits, while the predict method traverses the tree to classify new data based on the learned splits. The model falls back to a majority-class rule when no further split reduces impurity.
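The core of the tree is the split search. Below is a minimal sketch of Gini impurity and best-split selection; the names are illustrative, and the full implementation (recursion, stopping criteria, prediction) is in src/decision_tree.py.

import numpy as np

def gini(y):
    # Gini impurity of a label array: 1 minus the sum of squared class proportions.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every (feature, threshold) pair and keep the lowest weighted Gini impurity.
    best = {"feature": None, "threshold": None, "impurity": np.inf}
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if weighted < best["impurity"]:
                best = {"feature": feature, "threshold": threshold, "impurity": weighted}
    return best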

3.2 Random Forest:

To create a Random Forest, we first define the number of Decision Trees in the forest. For each tree, I draw a sample from X_train using bootstrap aggregation, or "bagging" (just a fancy term for sampling with replacement), and I also sample a random subset of features from X_train. How many times do I draw? As many times as there are trees in my Random Forest.
Model training happens by fitting a Decision Tree to each bootstrapped sample.
For inference (now using the X_test data), we use a simple averaging mechanism: each tree casts a "vote" for every data point (each row of X_test), and the votes are averaged across the trees and thresholded to give the majority class.
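A condensed sketch of that training and inference loop is shown below; the class and method names are illustrative assumptions, and the actual model (including per-tree feature sampling) is in src/random_forest.py.

import numpy as np

class RandomForestSketch:
    # Illustrative only; feature subsampling is omitted for brevity.
    def __init__(self, n_trees, tree_factory, seed=0):
        self.n_trees = n_trees
        self.tree_factory = tree_factory  # callable that returns a fresh decision tree
        self.trees = []
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        for _ in range(self.n_trees):
            # Bootstrap sample: draw len(X) rows with replacement.
            idx = self.rng.integers(0, len(X), size=len(X))
            tree = self.tree_factory()
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)

    def predict(self, X):
        # Average the trees' votes and threshold at 0.5
        # (equivalent to a majority vote for binary labels).
        votes = np.array([tree.predict(X) for tree in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)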

4. Evaluation:

I evaluated the Decision Tree and Random Forest models in the notebook model_train_eval.ipynb.

Decision Tree: 82% accuracy
Random Forest: 88% accuracy
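Accuracy here is simply the fraction of correct test-set predictions; a small illustrative check (the arrays below are made-up values, not the actual results):

import numpy as np

y_test = np.array([0, 1, 1, 0, 1])    # illustrative true labels
y_pred = np.array([0, 1, 0, 0, 1])    # illustrative model predictions
accuracy = np.mean(y_pred == y_test)  # fraction of correct predictions
print(f"accuracy = {accuracy:.0%}")   # 80%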

5. Replication:

To replicate my experiment, do the following:

$ git clone <repository_url>

Create a new virtual environment and activate it. I used Python 3.9 for this project.

$ conda create --name <env_name> python=3.9

$ conda activate <env_name>

Then install the required dependencies:

$ pip install -r requirements.txt
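Finally, open the evaluation notebook (assuming Jupyter is available in your environment; install it separately if it is not listed in requirements.txt):

$ jupyter notebook model_train_eval.ipynb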
