PhiUSIIL Phishing URL Dataset Analysis

Introduction

This repository contains a data mining project that analyzes the PhiUSIIL Phishing URL Dataset. The dataset is a large collection of legitimate and phishing URLs, with features extracted from both the webpage source code and the URL itself.

Dataset Description

The PhiUSIIL Phishing URL Dataset consists of:

Legitimate URLs: 134,850 instances
Phishing URLs: 100,945 instances
Features: Extracted from the source code of the webpage and URL. Includes features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb.
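
As a quick sanity check, the class balance follows directly from the two counts above. The short calculation below is only an illustration of that arithmetic; the variable names are placeholders, not names taken from the notebook.

```python
# Class counts as stated in the dataset description above.
legitimate = 134_850
phishing = 100_945

total = legitimate + phishing
legit_share = legitimate / total
phish_share = phishing / total

print(f"total URLs:       {total}")
print(f"legitimate share: {legit_share:.1%}")
print(f"phishing share:   {phish_share:.1%}")
```

The classes are moderately imbalanced (roughly 57% legitimate to 43% phishing), which is one reason to look at precision, recall, and F1-score alongside accuracy.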

For more details, refer to the dataset documentation.

Analysis Overview

The analysis workflow consists of the following steps:

Data Cleaning and Preprocessing:
- Reading the dataset
- Dropping irrelevant columns (e.g., 'FILENAME', 'URL', 'Domain', 'Title')
- Handling categorical variables
- Removing highly correlated features
- Profiling the dataset for exploratory analysis (Profiling Report)
Random Forest Algorithm:
- Training a Random Forest classifier
- Evaluating performance using accuracy, precision, recall, F1-score, confusion matrix, and ROC curve
- Visualizing a decision tree from the forest and feature importances
Comparison with Other Algorithms:
- Implementing K-Nearest Neighbors (KNN), Naive Bayes, and Decision Trees
- Training classifiers, evaluating performance, and visualizing results
- Comparing accuracy, precision, recall, and F1-score across algorithms
Additional Analysis:
- Conducting a Chi-square test for feature importance assessment
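
The data-side steps above (dropping identifier columns, encoding categorical variables, removing highly correlated features, and the Chi-square test) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `preprocess` and `chi_square_scores`, the target column name `label`, the 0.9 correlation threshold, and the demo column names are all assumptions, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2


def preprocess(df, drop_cols=("FILENAME", "URL", "Domain", "Title"),
               target="label", corr_threshold=0.9):
    """Drop identifier columns, encode categoricals, prune correlated features."""
    # Drop identifier-like columns; ignore any that are absent.
    df = df.drop(columns=list(drop_cols), errors="ignore")
    y = df[target]
    X = df.drop(columns=[target])

    # Encode the remaining non-numeric columns as integer category codes.
    for col in X.select_dtypes(exclude=["number", "bool"]).columns:
        X[col] = X[col].astype("category").cat.codes

    # Keep only the upper triangle of |corr| so each feature pair is seen once,
    # then drop the later feature of every pair above the threshold.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return X.drop(columns=to_drop), y, to_drop


def chi_square_scores(X, y):
    """Rank features by their Chi-square statistic against the target."""
    # chi2 requires non-negative inputs; shifting each column to start at 0
    # is an assumption made for this sketch.
    X_nonneg = X - X.min()
    scores, p_values = chi2(X_nonneg, y)
    table = pd.DataFrame({"chi2": scores, "p_value": p_values}, index=X.columns)
    return table.sort_values("chi2", ascending=False)


# Synthetic stand-in data; column names here are placeholders.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "URL": [f"http://example{i}.test" for i in range(n)],
    "TLD": rng.choice(["com", "org", "net"], size=n),
    "URLLength": rng.integers(10, 100, size=n),
    "label": rng.integers(0, 2, size=n),
})
demo["NoOfLettersInURL"] = demo["URLLength"] - 5  # perfectly correlated on purpose

X_clean, y, to_drop = preprocess(demo)
scores = chi_square_scores(X_clean, y)
print("dropped as highly correlated:", to_drop)
print(scores)
```

Dropping the later feature of each correlated pair is one simple convention; which member of a pair to keep is a judgment call and may differ from what the notebook does.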

Usage

To replicate the analysis, follow these steps:

Clone the repository:

git clone https://github.com/elaaatif/DATA-MINING-PhiUSIIL-Phishing-URL

Install the required packages:

pip install scikit-learn pandas numpy matplotlib seaborn jupyter

Open the Jupyter Notebook from the repository folder and run all cells:

jupyter notebook "Phishing_URL_(Website)_DATA_MINING_PROJECT.ipynb"

Running all cells executes the data preprocessing steps, trains and evaluates the Random Forest classifier, and compares the results with the other algorithms.
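
For orientation, the Random Forest training and evaluation step might look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the preprocessed dataset, and the 80/20 split, `n_estimators=100`, and the fixed random seed are assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # positive-class probability for ROC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
cm = confusion_matrix(y_test, y_pred)

# Impurity-based feature importances, highest first.
importances = clf.feature_importances_
top_features = importances.argsort()[::-1][:5]

for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
print("confusion matrix:\n", cm)
print("top feature indices:", top_features)
```

All metrics are computed on the held-out test split, so they reflect performance on data the model did not see during training.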

Interpretation of Results

The analysis demonstrates the effectiveness of different machine learning algorithms in classifying legitimate and phishing URLs. Here are some key findings:

Random Forest: Achieved high accuracy and balanced precision and recall values.
K-Nearest Neighbors (KNN): Also performed well, with high accuracy and balanced precision and recall values.
Naive Bayes: Demonstrated strong performance, albeit with slightly lower accuracy compared to Random Forest and KNN.
Decision Trees: Achieved perfect accuracy, precision, recall, and F1-score on this dataset.

The comparison highlights the strengths and weaknesses of each algorithm, providing valuable insights for selecting the most suitable approach for phishing URL detection tasks.
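
A side-by-side comparison like the one described above can be produced by fitting each classifier on the same split and collecting the metrics into one table. The following is a hedged sketch on synthetic data; the default hyperparameters, the choice of `GaussianNB` as the Naive Bayes variant, and the split are assumptions, not the project's confirmed configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # Every model sees the same training and test split.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

Using one shared split keeps the numbers comparable across algorithms; cross-validation would give more stable estimates at the cost of extra run time.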

The following image shows the F1-score, precision, and recall of K-Nearest Neighbors (KNN), Naive Bayes, and Decision Trees compared with Random Forest.

