Identifying Cyberattacks using Machine Learning Techniques

This repository contains the code and supporting material for the undergraduate dissertation titled "Identifying Cyberattacks using Machine Learning Techniques". The project applies Artificial Neural Networks (ANNs) and variable importance techniques to HIKARI-2022, a labelled cyberattack dataset, with a focus on classification performance and interpretability.

Project Structure

├── Logistic Regression/
│   ├── Logistic Regression Model.py
│   ├── Classification Reports
│   ├── Confusion Matrices
│   └── Metrics Summary
├── Parameter Optimisation/
│   ├── 0. Raw Data/
│   ├── 1. Preprocessing/
│   ├── 2-7. Hyperparameter Tuning Experiments/
│   ├── 8. Baseline vs Optimised Model/
│   └── optimised_model.keras
└── Variable Importance/
    ├── Variable Importance.py
    ├── SHAP & PFI Graphs
    └── Results Tables

External Data Files

Due to GitHub file size limitations, the following data files are hosted externally:

ALLFLOWMETER_HIKARI2022.csv (119 MB)

This file contains the raw data described in the Dataset section below. It is required for preprocessing: download it and place it in Parameter Optimisation/0. Raw Data/.

X_train.csv (404 MB)
X_val.csv (101 MB)
train_data_with_smotenc_tracking.csv (172 MB)

These files are required for training and evaluation. Download them as needed and place them in Parameter Optimisation/1. Preprocessing/SMOTENC/.
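Because the scripts expect these files in specific folders, a quick check before running anything can save a failed run. The following is a minimal sketch, not part of the repository: the helper name `find_missing_files` is hypothetical, and the paths are simply the folder and file names listed above, assumed to be relative to the repository root.

```python
from pathlib import Path

# Expected locations, taken from the folder names used in this README.
RAW_DIR = Path("Parameter Optimisation") / "0. Raw Data"
SMOTENC_DIR = Path("Parameter Optimisation") / "1. Preprocessing" / "SMOTENC"

EXPECTED_FILES = [
    RAW_DIR / "ALLFLOWMETER_HIKARI2022.csv",
    SMOTENC_DIR / "X_train.csv",
    SMOTENC_DIR / "X_val.csv",
    SMOTENC_DIR / "train_data_with_smotenc_tracking.csv",
]


def find_missing_files(repo_root="."):
    """Return the expected data files that are not present under repo_root.

    Hypothetical helper for illustration; it is not shipped with the project.
    """
    root = Path(repo_root)
    return [path for path in EXPECTED_FILES if not (root / path).is_file()]


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing data files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All external data files are in place.")
```

Run it from the repository root; any path it prints still needs to be downloaded.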

Dataset

This project uses the HIKARI-2022 dataset for evaluating cyberattack detection models. The dataset was proposed by:

Ferriyan, A., Thamrin, A. H., Takeda, K., & Murai, J. (2021).
Generating Network Intrusion Detection Dataset Based on Real and Encrypted Synthetic Attack Traffic.
Applied Sciences, 11(17), 7868. https://doi.org/10.3390/app11177868

The dataset is publicly available for download via Zenodo.

Models and Methods

Artificial Neural Networks (ANNs)
- Built with Keras and TensorFlow
- Extensive hyperparameter tuning (learning rate, layers, dropout, etc.) with outputs for each stage
- Final model saved as optimised_model.keras

Logistic Regression
- Used as a baseline model
- Includes classification reports, confusion matrices, and metrics

Variable Importance
- Analysed using:
  - Permutation Feature Importance (PFI)
  - SHAP values using DeepSHAP explainer
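To make the idea behind Permutation Feature Importance concrete, here is a library-free sketch of the general technique. It is not the code in Variable Importance.py: the function names, the toy model, and the toy data are all assumptions made for illustration. The importance of a feature is taken as the mean drop in accuracy when that feature's column is shuffled, which breaks its link to the label.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving every other feature untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Toy stand-in for a trained classifier: it only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


# Synthetic data (an assumption for the demo): two random features per row.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
print(scores)
```

In this toy setup the second feature is ignored by the model, so shuffling it cannot change any prediction and its importance is exactly zero, while shuffling the first feature degrades accuracy noticeably.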

Output Highlights

- Performance comparisons between baseline and optimised models
- Evaluation metrics: F1-score, Precision, Accuracy, ROC-AUC, PR-AUC
- Clear visualisations of feature contributions using DeepSHAP and PFI
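For reference, the threshold-based metrics listed above can be derived from the four confusion-matrix counts. The sketch below is illustrative only and is not taken from the repository: the function name and the example labels are assumptions. Recall is included because F1 is the harmonic mean of precision and recall. ROC-AUC and PR-AUC are left out because they need predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = attack, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels there are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, giving an accuracy of 0.75 and precision, recall and F1 of 2/3 each.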

Environment Setup

This project was developed with Python 3.11 using Spyder as the IDE of choice. Required packages include:

numpy, pandas, scikit-learn, tensorflow, keras, shap, imbalanced-learn, matplotlib, tqdm, h5py
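Two of these packages are imported under a different name from the one used to install them (scikit-learn as `sklearn`, imbalanced-learn as `imblearn`). The following sketch, which is not part of the repository and uses a hypothetical helper name, checks which of the listed packages are importable in the current environment using only the standard library.

```python
from importlib.util import find_spec

# Install name -> import name. The two differ for scikit-learn and imbalanced-learn.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "tensorflow": "tensorflow",
    "keras": "keras",
    "shap": "shap",
    "imbalanced-learn": "imblearn",
    "matplotlib": "matplotlib",
    "tqdm": "tqdm",
    "h5py": "h5py",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose import module cannot be found."""
    return [
        install
        for install, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that each package can be found; it does not verify versions or compatibility with Python 3.11.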
