Contributions & Team Members

Sri Sethu Madhavan S
Harshavardhan M V
Shrihari A

Introduction

Phishing has emerged as a major problem in the rapidly changing digital landscape as a result of the rise in internet usage. Phishing is the act of deceiving individuals into clicking on fraudulent links that take them to malicious websites, allowing confidential information to be collected without the user's awareness. Phishing URLs are difficult to recognize using pattern matching because they frequently use strategies like misspelled words, phony subdomains, or character changes to evade detection. Phishing URLs could seem safe at first, but after a while they might lead people to unsafe websites. In this work, phishing URLs are detected by feature extraction and then applied to specific algorithms using machine learning methods like Random Forests, Decision Trees, Light GBM, Logistic Regression, and Support Vector Machine (SVM).

Dataset Description

Dataset for Phishing URL: Phishing URLs were collected from a service called Phish Tank, an open-source service.
Dataset for Benign URL: University of New Brunswick and the number of legitimate URLs in that dataset is around 35000.

Architecture

Methodology

The methodology used in the study involves the application of machine learning algorithms for detecting malicious URLs, data preprocessing, feature extraction using lexical domain-based features, and the application of machine learning models to the extracted features.

Features

The Initial Features extracted from the URL are

Have_IP
Have_At
URL_Length
URL_Depth
Redirection
https_Domain
Prefix/Suffix
Domain_Age
Domain_End
Subdomain_Count
DNS_Record
Web_Traffic
Tiny_URL
Statistical Report

Boruta Feature Selection Algorithm

The Boruta feature selection algorithm identifies important features by leveraging a random forest classifier. It starts by creating shadow features, which are shuffled duplicates of the original features. The dataset, now containing both original and shadow features, is used to train a random forest model. The importance scores of the original features are then compared to the highest importance scores of the shadow features. If an original feature's importance score is significantly higher than the shadow features, it is deemed important; if lower, it is considered unimportant. Features with unclear importance are tentatively kept for further testing, ensuring a robust selection of relevant features.

Machine Learning Algorithms

The Machine Learning Algorithms used are

Decision Tree Classifier
Random Forest Classifier
Logistic Regression
LightGBM
Support Vector Machine

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Dataset		Dataset
Feature_Extraction.ipynb		Feature_Extraction.ipynb
Machine Learning based URL classification.ipynb		Machine Learning based URL classification.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Contributions & Team Members

Introduction

Dataset Description

Architecture

Methodology

Features

Boruta Feature Selection Algorithm

Machine Learning Algorithms

Results

About

Uh oh!

Releases

Packages

Languages

TheRootDaemon/url-classification

Folders and files

Latest commit

History

Repository files navigation

Contributions & Team Members

Introduction

Dataset Description

Architecture

Methodology

Features

Boruta Feature Selection Algorithm

Machine Learning Algorithms

Results

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages