This repository contains the code and supporting material for the undergraduate dissertation titled "Identifying Cyberattacks using Machine Learning Techniques". The project applies Artificial Neural Networks (ANNs) and variable importance techniques to the labelled HIKARI-2022 cyberattack dataset, with a focus on classification performance and interpretability.
- 📁 Project Structure
- 📂 External Data Files
- 📂 Dataset
- 🧠 Models and Methods
- 📊 Output Highlights
- 🧪 Environment Setup
├── Logistic Regression/
│   ├── Logistic Regression Model.py
│   ├── Classification Reports
│   ├── Confusion Matrices
│   └── Metrics Summary
├── Parameter Optimisation/
│   ├── 0. Raw Data/
│   ├── 1. Preprocessing/
│   ├── 2-7. Hyperparameter Tuning Experiments/
│   ├── 8. Baseline vs Optimised Model/
│   └── optimised_model.keras
└── Variable Importance/
    ├── Variable Importance.py
    ├── SHAP & PFI Graphs
    └── Results Tables
Due to GitHub file size limitations, the following data files are hosted externally:
- `ALLFLOWMETER_HIKARI2022.csv` (119 MB): the raw data, as explained below. Download it and place it in `Parameter Optimisation/0. Raw Data/`; it is required for preprocessing.
- `X_train.csv` (404 MB)
- `X_val.csv` (101 MB)
- `train_data_with_smotenc_tracking.csv` (172 MB)

The last three files (`X_train.csv`, `X_val.csv`, and `train_data_with_smotenc_tracking.csv`) are required for training and evaluation. Download them and place them in `Parameter Optimisation/1. Preprocessing/SMOTENC/` as needed.
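The placement described above can be checked before running any script. The following is a minimal sketch, not part of the repository: the function name `find_missing_files` and the assumption that it is run from the repository root are illustrative only.

```python
from pathlib import Path

# Relative locations taken from the list above.
# Assumption: paths are resolved against the repository root.
EXPECTED_FILES = [
    "Parameter Optimisation/0. Raw Data/ALLFLOWMETER_HIKARI2022.csv",
    "Parameter Optimisation/1. Preprocessing/SMOTENC/X_train.csv",
    "Parameter Optimisation/1. Preprocessing/SMOTENC/X_val.csv",
    "Parameter Optimisation/1. Preprocessing/SMOTENC/train_data_with_smotenc_tracking.csv",
]


def find_missing_files(repo_root="."):
    """Return the expected data files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing data files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All external data files are in place.")
```

An empty result means all four externally hosted files were found in the expected folders.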
This project uses the HIKARI-2022 dataset for evaluating cyberattack detection models. The dataset was proposed by:
Ferriyan, A., Thamrin, A. H., Takeda, K., & Murai, J. (2021).
Generating Network Intrusion Detection Dataset Based on Real and Encrypted Synthetic Attack Traffic.
Applied Sciences, 11(17), 7868. https://doi.org/10.3390/app11177868
The dataset is publicly available for download via Zenodo.
- Artificial Neural Networks (ANNs)
  - Built with Keras and TensorFlow
  - Extensive hyperparameter tuning (learning rate, layers, dropout, etc.) with outputs for each stage
  - Final model saved as `optimised_model.keras`
- Logistic Regression
  - Used as a baseline model
  - Includes classification reports, confusion matrices, and metrics
- Variable Importance
  - Analysed using:
    - Permutation Feature Importance (PFI)
    - SHAP values using DeepSHAP explainer
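To illustrate the idea behind Permutation Feature Importance, here is a dependency-free sketch. It is not the code in `Variable Importance.py`: the function names, the toy classifier, and the random data are all hypothetical, and accuracy is used as the score purely for simplicity. For scikit-learn estimators, `sklearn.inspection.permutation_importance` offers a ready-made alternative.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    `predict` is any callable mapping a list of rows to a list of labels;
    `X` is a list of rows, each row a list of feature values.
    """
    rng = random.Random(seed)
    baseline = accuracy(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, breaking its link with the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in for a trained classifier: it only looks at feature 0.
def toy_predict(X):
    return [1 if row[0] > 0.5 else 0 for row in X]


# Synthetic data (not HIKARI-2022): the label depends on feature 0 only.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

scores = permutation_importance(toy_predict, X, y)
print(scores)
```

In this toy setup, shuffling feature 0 hurts accuracy while shuffling feature 1 changes nothing, so feature 0 receives the larger importance score.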
- Performance comparisons between baseline and optimised models
- Evaluation metrics: F1-score, Precision, Accuracy, ROC-AUC, PR-AUC
- Clear visualisations of feature contributions using DeepSHAP and PFI
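As a reminder of how the threshold-based metrics listed above relate to a binary confusion matrix, the sketch below computes accuracy, precision, recall, and F1-score from raw counts. The function name and the counts are made up for illustration and are not results from this project; ROC-AUC and PR-AUC are omitted because they require predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (not taken from the dissertation's results).
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced dataset, accuracy can stay high even when precision or recall for the attack class is poor, which is one reason to report F1 and the AUC metrics alongside it.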
This project was developed with Python 3.11 using Spyder as the IDE of choice. Required packages include:
- numpy, pandas, scikit-learn, tensorflow, keras, shap, imbalanced-learn, matplotlib, tqdm, h5py
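A quick way to confirm that the packages above are installed is sketched below. It is not part of the repository, and `missing_packages` is a hypothetical helper name. Note that two packages are imported under a different name from the one used to install them: `scikit-learn` is imported as `sklearn`, and `imbalanced-learn` as `imblearn`.

```python
from importlib.util import find_spec

# Maps each package name (as installed) to the module name used in `import`.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "tensorflow": "tensorflow",
    "keras": "keras",
    "shap": "shap",
    "imbalanced-learn": "imblearn",
    "matplotlib": "matplotlib",
    "tqdm": "tqdm",
    "h5py": "h5py",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

The check only tests that each module can be located; it does not verify package versions.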