- Sri Sethu Madhavan S
- Harshavardhan M V
- Shrihari A
Phishing has emerged as a major problem in the rapidly changing digital landscape as a result of the rise in internet usage. Phishing is the act of deceiving individuals into clicking on fraudulent links that take them to malicious websites, allowing confidential information to be collected without the user's awareness. Phishing URLs are difficult to recognize using pattern matching because they frequently use strategies like misspelled words, phony subdomains, or character changes to evade detection. Phishing URLs could seem safe at first, but after a while they might lead people to unsafe websites. In this work, phishing URLs are detected by feature extraction and then applied to specific algorithms using machine learning methods like Random Forests, Decision Trees, Light GBM, Logistic Regression, and Support Vector Machine (SVM).
- Dataset for Phishing URL: Phishing URLs were collected from a service called Phish Tank, an open-source service.
- Dataset for Benign URL: University of New Brunswick and the number of legitimate URLs in that dataset is around 35000.
The methodology used in the study involves the application of machine learning algorithms for detecting malicious URLs, data preprocessing, feature extraction using lexical domain-based features, and the application of machine learning models to the extracted features.
The Initial Features extracted from the URL are
- Have_IP
- Have_At
- URL_Length
- URL_Depth
- Redirection
- https_Domain
- Prefix/Suffix
- Domain_Age
- Domain_End
- Subdomain_Count
- DNS_Record
- Web_Traffic
- Tiny_URL
- Statistical Report
The Boruta feature selection algorithm identifies important features by leveraging a random forest classifier. It starts by creating shadow features, which are shuffled duplicates of the original features. The dataset, now containing both original and shadow features, is used to train a random forest model. The importance scores of the original features are then compared to the highest importance scores of the shadow features. If an original feature's importance score is significantly higher than the shadow features, it is deemed important; if lower, it is considered unimportant. Features with unclear importance are tentatively kept for further testing, ensuring a robust selection of relevant features.
The Machine Learning Algorithms used are
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
- LightGBM
- Support Vector Machine