Phishing Detection Using GraphSAGE with Causal Sampling (GNNs)

This repository presents a complete workflow for phishing detection leveraging GraphSAGE, a type of Graph Neural Network (GNN), with temporal modeling, causal sampling, and robustness testing.

🧠 Overview

Phishing attacks often involve subtle patterns that can be better detected using relational and temporal data. This project converts phishing datasets into graphs and applies a GNN model that:

Respects causal constraints in message passing.
Incorporates temporal windowing for realistic data flow.
Tests robustness through noise injection.

🛠 Tech Stack

Programming Language: Python
Graph Processing: PyTorch Geometric
Machine Learning: PyTorch, Scikit-learn
Data Handling: pandas, numpy
Visualization: matplotlib

📊 Workflow Summary

1. Data Preprocessing

Load and clean phishing data from phish.xlsx
One-hot encode categorical features
Scale numerical features
Combine features for each URL
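
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the column names (`protocol`, `url_length`, `num_dots`, `label`) are hypothetical because the schema of phish.xlsx is not documented here, a small in-memory frame stands in for the spreadsheet, and `StandardScaler` is an assumed choice of scaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for pd.read_excel("phish.xlsx"); real column names may differ.
df = pd.DataFrame({
    "protocol": ["http", "https", "http", None],
    "url_length": [54, 23, 112, 40],
    "num_dots": [3, 1, 6, 2],
    "label": [1, 0, 1, 0],
})

def preprocess(frame, categorical, numerical, target):
    """Clean, one-hot encode, scale, and combine features into one matrix."""
    # Simple cleaning step: drop rows with missing values.
    frame = frame.dropna().reset_index(drop=True)
    # One-hot encode the categorical columns.
    onehot = pd.get_dummies(frame[categorical]).to_numpy(dtype=np.float32)
    # Scale numerical columns to zero mean and unit variance (assumed scaler).
    scaled = StandardScaler().fit_transform(frame[numerical]).astype(np.float32)
    # Combine into one feature row per URL.
    features = np.hstack([onehot, scaled])
    labels = frame[target].to_numpy(dtype=np.int64)
    return features, labels

X, y = preprocess(df, ["protocol"], ["url_length", "num_dots"], "label")
```

Here `X` holds one combined feature row per remaining URL and `y` holds the matching labels.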

2. Graph Construction

Create a similarity graph using cosine similarity
Connect each node to k=5 nearest neighbors
Partition data into time windows of 10 samples
Generate PyG Data objects for each time window
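
One way to picture the graph construction above is the NumPy sketch below. It assumes rows are already in chronological order and that the k-nearest-neighbour graph is built inside each window; the source does not say whether the graph is built per window or once globally, and the helper names `knn_edges` and `time_windows` are illustrative only.

```python
import numpy as np

def knn_edges(x, k=5):
    """Return a (2, E) edge array linking each node to its k most cosine-similar nodes."""
    unit = x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)
    sim = unit @ unit.T                       # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)            # exclude self-matches
    k = min(k, len(x) - 1)                    # small windows cannot supply k neighbours
    nbrs = np.argsort(-sim, axis=1)[:, :k]    # top-k neighbours per row
    targets = np.repeat(np.arange(len(x)), k)
    # Row 0: source (the neighbour), row 1: target (the node receiving messages).
    return np.stack([nbrs.ravel(), targets])

def time_windows(x, y, size=10, k=5):
    """Split chronologically ordered rows into consecutive windows with their own graphs."""
    for start in range(0, len(x), size):
        xs, ys = x[start:start + size], y[start:start + size]
        if len(xs) > 1:                       # a single node has no neighbours
            yield {"x": xs, "y": ys, "edge_index": knn_edges(xs, k)}

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8)).astype(np.float32)   # synthetic features
y = rng.integers(0, 2, size=25)                   # synthetic labels
windows = list(time_windows(X, y))
```

Each dictionary carries the fields a PyG `Data(x=..., edge_index=..., y=...)` object expects once the arrays are converted to tensors.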

3. Causal GraphSAGE Model

Custom model using SAGEConv, BatchNorm, Dropout
Enforces causal message passing (no future info leakage)
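
The source does not spell out how causality is enforced. One plausible reading, sketched below as an assumption, is to drop every edge whose source node is newer than its target before the edges reach `SAGEConv`, so a node only aggregates from nodes seen at or before its own time. The `timestamps` array and the `causal_edges` helper are hypothetical.

```python
import numpy as np

def causal_edges(edge_index, timestamps):
    """Keep only edges whose source node is not newer than its target node."""
    src, dst = edge_index
    keep = timestamps[src] <= timestamps[dst]
    return edge_index[:, keep]

# Row 0 holds sources, row 1 holds targets (PyG's default message flow).
edge_index = np.array([[0, 1, 2, 3, 1],
                       [1, 0, 1, 2, 3]])
timestamps = np.array([0, 1, 2, 3])   # hypothetical per-node arrival order
filtered = causal_edges(edge_index, timestamps)
```

In this toy example only the edges 0→1 and 1→3 survive; the three edges that would carry information backwards in time are removed. The same boolean-mask indexing applies to a `torch` tensor `edge_index`.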

4. Noise Injection for Robustness

Add Gaussian noise to node features
Randomly flip labels to simulate real-world inconsistencies
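
A minimal sketch of the two noise types follows. The noise scale `sigma` and the flip probability `flip_rate` are placeholder values; the source does not state the settings actually used.

```python
import numpy as np

def inject_noise(x, y, sigma=0.1, flip_rate=0.05, seed=0):
    """Add Gaussian feature noise and randomly flip a fraction of binary labels."""
    rng = np.random.default_rng(seed)
    x_noisy = x + rng.normal(0.0, sigma, size=x.shape)   # perturb every feature
    flip = rng.random(len(y)) < flip_rate                 # choose labels to flip
    y_noisy = np.where(flip, 1 - y, y)                    # 0 <-> 1 where flipped
    return x_noisy, y_noisy, flip

x = np.zeros((200, 4))
y = np.zeros(200, dtype=np.int64)
x_noisy, y_noisy, flip = inject_noise(x, y)
```

Returning the `flip` mask makes it easy to check afterwards how many labels were corrupted.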

5. Training

Trained with Binary Cross-Entropy loss and Adam optimizer
Evaluated using AUC-ROC score and ROC curve visualization
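
The training setup described above might look like the sketch below. To stay self-contained it uses synthetic data and a single linear layer as a stand-in for the GraphSAGE model; the learning rate and epoch count are assumptions, and `BCEWithLogitsLoss` is used here as the numerically stable form of binary cross-entropy on raw logits.

```python
import torch
from torch import nn
from sklearn.metrics import roc_auc_score

torch.manual_seed(0)
x = torch.randn(200, 8)                 # synthetic node features
y = (x[:, 0] > 0).float()               # synthetic binary labels

model = nn.Linear(8, 1)                 # stand-in for the GraphSAGE model
criterion = nn.BCEWithLogitsLoss()      # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for epoch in range(100):
    optimizer.zero_grad()
    logits = model(x).squeeze(-1)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Score with AUC-ROC; a real evaluation would use held-out data, not the training set.
with torch.no_grad():
    probs = torch.sigmoid(model(x).squeeze(-1))
auc = roc_auc_score(y.numpy(), probs.numpy())
```

For the ROC curve plot, `sklearn.metrics.roc_curve` returns the false positive rates, true positive rates, and thresholds that can be passed to matplotlib.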

📈 Evaluation

The model achieved the following results on phishing detection:

Metric	Value
Accuracy	86.36%
Precision	86.32%
Recall	86.36%
F1-Score	86.14%
AUC-ROC	0.9023
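
Metrics like those in the table can be computed with scikit-learn, as in the sketch below. The averaging mode is an assumption: the source does not state it, although reported recall equal to accuracy is what class-weighted averaging produces. The labels and predictions are toy values, not the project's data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy predictions

accuracy = accuracy_score(y_true, y_pred)
# Weighted averaging weights each class's score by its support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
```

With weighted averaging, recall always equals accuracy, because each class's recall is weighted by its share of the true labels.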

Visualizations

Training Loss and Accuracy Over Epochs (Causal GraphSAGE): Visualizes the convergence of the model during causal training, showing decreasing loss and increasing accuracy over epochs.
Confusion Matrix: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives from the final evaluation, illustrating the model's classification accuracy for each class.
ROC Curve: Illustrates the model's trade-off between True Positive Rate and False Positive Rate across various classification thresholds, with the AUC-ROC score quantifying overall performance.
Training Loss - Phishing Noise Training: Depicts the loss reduction during the training phase where noise was intentionally injected, demonstrating the model's ability to learn effectively despite data imperfections.
Overall Training Loss/Accuracy: Shows the general learning progression of the model, likely from an initial training phase, with loss decreasing and accuracy increasing.
Visual Interface: A dashboard that displays the data fed to the global (fusion classifier) and attack-specific models, showing class probabilities, a graph plot, accuracy metrics, the confidence scores of both models, and the probable reason behind each model's classification.

Dependencies

The project relies on the following key libraries:

Python 3.x
torch (PyTorch)
torch-geometric (PyG)
torch-scatter
pandas
numpy
scikit-learn
matplotlib
gradio

# Clone the repository and enter it
git clone https://github.com/spk-22/Phish-Guard
cd Phish-Guard

# Install dependencies
pip install -r requirements.txt
# (Or manually install: torch, torch-geometric, scikit-learn, pandas, numpy, matplotlib)
# Ensure torch-geometric, torch-scatter, and torch-sparse versions are compatible with your PyTorch version.

# Run the main pipeline script
python phish.py

# Launch the web dashboard
streamlit run web_app.py

🔍 Use Case

This pipeline is aimed at cybersecurity researchers and engineers who want to detect phishing attempts using relational and temporal patterns in their data. The AUC-ROC score of 0.9023, obtained even with training on noisy data, indicates that the model separates phishing URLs from legitimate ones well across classification thresholds.
