This project implements a Network Intrusion Detection System (NIDS) using Python and machine learning techniques. It uses a dataset from Kaggle, together with GNS3, PyShark, and the Simple Mail Transfer Protocol (SMTP), to identify malicious activity in network traffic. The primary objective is to classify network activity as normal or malicious based on features in the network traffic data. Please note that this project is currently under development.
- Data Source: Network Traffic Data for Malicious Activity Detection
- Framework: Jupyter Notebook
- Main Libraries:
  - pandas, numpy: Data manipulation and analysis
  - matplotlib: Data visualization
  - scikit-learn: Machine learning models and metrics
The dataset used for this project is sourced from Kaggle and contains various attributes related to network traffic, such as Time, Source, Destination, Protocol, and more. Each record in the dataset is labeled to indicate whether it is malicious activity (bad_packet).
- Time: Timestamp of the packet capture.
- Source: The source IP address or identifier of the packet.
- Destination: The destination IP address or identifier.
- Protocol: Protocol type (e.g., ARP, TCP).
- Length: Length of the packet.
- bad_packet: Label indicating if the packet is malicious (1) or normal (0).
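The column layout above can be illustrated with a small hand-made frame. This is only a sketch: the rows, the IP addresses, and the choice of a float for Time are assumptions made for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical rows that follow the column layout described above.
# Values (addresses, timestamps, lengths) are invented for illustration.
sample = pd.DataFrame({
    "Time": [0.000, 0.012, 0.034],
    "Source": ["192.168.1.10", "192.168.1.20", "192.168.1.10"],
    "Destination": ["192.168.1.1", "192.168.1.1", "192.168.1.255"],
    "Protocol": ["TCP", "ARP", "TCP"],
    "Length": [60, 42, 1514],
    "bad_packet": [0, 0, 1],
})

# Check that every documented column is present.
expected_columns = ["Time", "Source", "Destination", "Protocol", "Length", "bad_packet"]
missing = [c for c in expected_columns if c not in sample.columns]

# The label should only ever take the values 0 (normal) or 1 (malicious).
label_values = set(sample["bad_packet"].unique())

print("missing columns:", missing)
print("label values:", label_values)
```

A check like this can be run on the real CSV right after loading to catch renamed columns or unexpected label values early.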
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split  # needed for the X_train / X_test split used below
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
- Load the dataset, dropping unnecessary columns (Source Port, Destination Port) to simplify analysis.
- Basic data cleaning and feature engineering, such as one-hot encoding categorical variables.
path = "/path/to/network_anomaly.csv"
data = pd.read_csv(path, low_memory=False)
data = data.drop(columns=['Source Port', 'Destination Port'])
- Visualize distribution of malicious and non-malicious traffic.
- Generate histograms and distribution plots to understand network traffic patterns.
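As a rough sketch of the two plots described above, the snippet below draws a bar chart of the label counts and per-class histograms of packet length. The small frame is a synthetic stand-in, not the Kaggle data, and the choice of Length as the plotted feature is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the loaded dataset (values invented for illustration).
data = pd.DataFrame({
    "Length": [60, 42, 1514, 60, 1500, 1514, 54, 66],
    "bad_packet": [0, 0, 1, 0, 1, 1, 0, 0],
})

# Count normal (0) versus malicious (1) packets.
counts = data["bad_packet"].value_counts().sort_index()

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: class distribution.
ax_bar.bar([str(label) for label in counts.index], counts.values)
ax_bar.set_xlabel("bad_packet")
ax_bar.set_ylabel("Number of packets")
ax_bar.set_title("Malicious vs. normal traffic")

# Right panel: packet length histogram, one series per class.
for label, group in data.groupby("bad_packet"):
    ax_hist.hist(group["Length"], bins=10, alpha=0.6, label=f"bad_packet={label}")
ax_hist.set_xlabel("Length")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Packet length by class")
ax_hist.legend()

fig.tight_layout()
# In Jupyter the figure renders at the end of the cell; in a script, call plt.show().
```

A strongly imbalanced bar chart is worth noting before training, since it affects how accuracy should be read.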
- Categorical columns (e.g., Protocol) are transformed using OneHotEncoder to prepare them for model input.
column_transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), ['Protocol'])
    ],
    remainder='passthrough'
)
X = column_transformer.fit_transform(data.drop('bad_packet', axis=1))
y = data['bad_packet']
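The training and evaluation snippets further down use X_train, X_test, y_train, and y_test, but the split itself is not shown here. A minimal sketch of one way to produce them with scikit-learn's train_test_split follows. The 70/30 ratio, the stratification, and the tiny synthetic frame (only Protocol and Length, so that every passthrough column is numeric) are assumptions, not the notebook's actual settings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned dataset (values invented for illustration).
data = pd.DataFrame({
    "Protocol": ["TCP", "ARP", "TCP", "UDP", "TCP", "ARP", "UDP", "TCP", "TCP", "ARP"],
    "Length": [60, 42, 1514, 90, 1500, 42, 1514, 54, 66, 60],
    "bad_packet": [0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Same encoding step as above: one-hot encode Protocol, pass the rest through.
column_transformer = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Protocol"])],
    remainder="passthrough",
)
X = column_transformer.fit_transform(data.drop("bad_packet", axis=1))
y = data["bad_packet"]

# Hold out 30% of the rows for testing (assumed ratio); stratify keeps the
# malicious/normal proportion similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing models against each other.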
Three machine learning models are trained and evaluated:
- Logistic Regression
- Random Forest Classifier
- Support Vector Classifier (SVC, as imported above)
Example of training the Random Forest Classifier:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
The models are evaluated using classification metrics, focusing on accuracy, precision, recall, and F1 score.
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))
- Visualizations such as confusion matrices and ROC curves are included to better understand model performance.
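The confusion matrix and ROC curve mentioned above can be computed with scikit-learn's metrics module. The snippet below is a sketch on synthetic data: the single Length feature and the rule used to generate labels are invented for illustration, so the numbers it produces say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one numeric feature and a rule-based label (assumption).
rng = np.random.default_rng(42)
length = rng.integers(40, 1500, size=200)
X = length.reshape(-1, 1).astype(float)
y = (length > 1000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Hard predictions feed the confusion matrix; class-1 probabilities feed the ROC curve.
y_pred = rf_model.predict(X_test)
y_score = rf_model.predict_proba(X_test)[:, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# False/true positive rates across thresholds, and the area under that curve.
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(cm)
print("ROC AUC:", auc)
```

The fpr and tpr arrays can be passed straight to plt.plot(fpr, tpr) to draw the curve.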
- Clone the Repository and open the notebook in Jupyter:
git clone https://github.com/SparshLadani/Network-Intrusion-Detection-System.git
cd Network-Intrusion-Detection-System
jupyter notebook
- Install Dependencies:
pip install pandas numpy matplotlib scikit-learn
- Run the Notebook:
  - Open Network_Intrusion_Detection_System.ipynb.
  - Execute each cell sequentially to load data, train models, and evaluate the results.
After running the models, the Random Forest Classifier and Logistic Regression both achieved 100% accuracy on the dataset, making them the preferred choices for deployment in a real-time intrusion detection system. Detailed classification metrics are available in the notebook.
This Network Intrusion Detection System demonstrates the use of machine learning techniques for identifying malicious network traffic. It provides a foundation for further enhancement with more complex models, real-time data processing, and integration into network monitoring setups.
- Dataset by Advait Menon, available on Kaggle.