Pred-Sus-Act focuses on detecting network anomalies using machine learning techniques. It consists of three main parts:
- Exploratory Data Analysis (EDA): Loading and preparing the dataset for analysis
- Feature Selection: Using LASSO regression to identify the most important features for anomaly detection
- Model Training: Implementing and evaluating various machine learning models for classification
`LASSO_feature_selection.ipynb`:
- Performs feature selection using LASSO regression
- Identifies the most significant features for anomaly detection
- Saves selected features to `data/original/LASSO_selected_features.csv`
Key steps (illustrated in the sketch after this list):
- Data loading and preprocessing
- Feature engineering (numeric and categorical features)
- LASSO regression with alpha parameter tuning
- Feature importance visualization
- Saving selected features
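The notebook remains the source of truth for this step; the sketch below only illustrates the general shape of the LASSO selection. The input path, the 0/1-encoded `label` column, and the `alpha` value are assumptions, not the notebook's actual choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical input path and a 0/1-encoded "label" column.
df = pd.read_csv("data/original/dataset.csv")
X, y = df.drop(columns=["label"]), df["label"]

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Scale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# alpha is illustrative; the notebook tunes it (see the alpha-sweep sketch below).
model = Pipeline([("prep", preprocess), ("lasso", Lasso(alpha=0.01, max_iter=10_000))])
model.fit(X, y)

# Keep features with non-zero coefficients and persist them.
feature_names = model.named_steps["prep"].get_feature_names_out()
coefs = pd.Series(model.named_steps["lasso"].coef_, index=feature_names)
selected = coefs[coefs != 0].sort_values(key=abs, ascending=False)
selected.to_csv("data/original/LASSO_selected_features.csv")
```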
`EDA.ipynb`:
- Loads the dataset and performs initial data exploration
- Uses ANOVA for feature selection
- Visualizes the distribution of features and target variable
- Handles unbalanced classes
- Applies dimensionality reduction techniques (PCA, t-SNE, UMAP) for visualization
- Applies `SMOTEN` as the best-performing technique for oversampling the minority class (see the sketch after this list)
- Visualizes the impact of different oversampling techniques on the dataset
- Saves the processed dataset to `data/resampled/ANOVA_selected_features.csv`
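A rough illustration of the ANOVA selection and SMOTEN oversampling described above. The input path, `label` column, and `k` are assumptions; the actual preprocessing and parameter choices live in `EDA.ipynb`, and SMOTEN treats features as nominal, so the notebook's handling of numeric columns may differ.

```python
import pandas as pd
from imblearn.over_sampling import SMOTEN
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical input path and "label" column; k is illustrative.
df = pd.read_csv("data/original/dataset.csv")
X = df.drop(columns=["label"]).select_dtypes(include="number")  # ANOVA F-test needs numeric inputs
y = df["label"]

# ANOVA F-test scores each feature against the class label.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = pd.DataFrame(
    selector.fit_transform(X, y), columns=selector.get_feature_names_out()
)

# SMOTEN oversamples the minority class (features are treated as nominal).
X_res, y_res = SMOTEN(random_state=42).fit_resample(X_selected, y)

resampled = X_res.copy()
resampled["label"] = y_res
resampled.to_csv("data/resampled/ANOVA_selected_features.csv", index=False)
```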
`Models.ipynb`:
- Implements and evaluates four machine learning models:
  - Support Vector Machine (SVM) Classifier
  - Decision Tree Classifier
  - Random Forest Classifier
  - Logistic Regression
- Includes an ensemble stacking classifier
- Generates performance metrics and saves reports to an SQLite database
Key components (a train/test-split and tuning sketch follows this list):
- Data loading and train-test split
- Model fine-tuning with hyperparameter optimization
- Performance metric generation and visualization
- Model comparison and ensemble learning
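As a hedged illustration of these components, the sketch below loads the resampled dataset (hypothetical `label` column), makes a stratified train/test split, and tunes one model with `GridSearchCV`; the grid and scoring are illustrative, not the notebook's actual settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical "label" column in the resampled dataset produced by EDA.ipynb.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grid; the notebook's grids and scoring may differ per model.
grid = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))
```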
Feature selection:
- Uses ANOVA for feature selection
- Uses LASSO regression for feature selection
- Evaluates feature importance based on coefficients
- Handles both numeric and categorical features
- Visualizes the impact of different alpha values on model performance (an alpha-sweep sketch follows)
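The alpha sweep might look roughly like the following; the path, `label` column, and alpha range are assumptions, and in practice the features would typically be scaled before fitting.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Hypothetical path and 0/1-encoded "label" column.
df = pd.read_csv("data/original/dataset.csv")
X = pd.get_dummies(df.drop(columns=["label"]))  # quick one-hot for categoricals
y = df["label"]

alphas = np.logspace(-4, 0, 20)
scores, n_selected = [], []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10_000)
    scores.append(cross_val_score(lasso, X, y, cv=5).mean())
    n_selected.append(np.count_nonzero(lasso.fit(X, y).coef_))

# Larger alphas shrink more coefficients to zero: fewer features, possibly worse fit.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.semilogx(alphas, scores)
ax1.set(xlabel="alpha", ylabel="mean CV R^2", title="Fit quality vs. alpha")
ax2.semilogx(alphas, n_selected)
ax2.set(xlabel="alpha", ylabel="non-zero coefficients", title="Sparsity vs. alpha")
plt.tight_layout()
plt.show()
```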
- SVM Classifier (sketch below)
  - Linear kernel implementation
  - C parameter tuning for optimal recall
  - Achieves high accuracy in anomaly detection
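A minimal sketch of such an SVM step, assuming the same hypothetical resampled CSV and `label` column as above; the C grid and recall-based scoring are illustrative.

```python
import pandas as pd
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same hypothetical resampled CSV and "label" column as above.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)

# Linear-kernel SVM with C swept using recall as the selection metric.
svm = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="recall_macro",
    cv=5,
).fit(X_train, y_train)

print("best C:", svm.best_params_["C"])
print("test recall:", recall_score(y_test, svm.predict(X_test), average="macro"))
```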
- Decision Tree Classifier (sketch below)
  - Uses entropy as the splitting criterion
  - Visualizes the decision tree structure
  - Provides interpretable classification rules
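A minimal decision-tree sketch under the same data assumptions; `max_depth` is capped here only to keep the plotted tree readable.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

# Same hypothetical resampled CSV and "label" column as above.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)

# Entropy (information gain) as the splitting criterion; depth capped for readability.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# The fitted tree can be rendered graphically or as plain-text rules.
plot_tree(tree, feature_names=list(X_train.columns), filled=True)
plt.show()
print(export_text(tree, feature_names=list(X_train.columns)))
```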
- Random Forest Classifier (sketch below)
  - Ensemble of decision trees
  - Uses entropy for node splitting
  - Handles high-dimensional feature spaces effectively
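A minimal random-forest sketch under the same assumptions; `n_estimators` is illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same hypothetical resampled CSV and "label" column as above.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)

# An ensemble of entropy-split trees; n_estimators is illustrative.
forest = RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=42)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Per-feature importances show which selected features drive the predictions.
print(pd.Series(forest.feature_importances_, index=X_train.columns).nlargest(10))
```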
- Logistic Regression (sketch below)
  - Linear classification model
  - Regularization parameter (C) tuning
  - Efficient for binary classification tasks
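A minimal logistic-regression sketch under the same assumptions, with C tuned over an illustrative grid.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same hypothetical resampled CSV and "label" column as above.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)

# Scaling helps the solver converge; C is the inverse regularization strength.
logreg = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
).fit(X_train, y_train)

print("best params:", logreg.best_params_)
print("test accuracy:", logreg.score(X_test, y_test))
```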
- Stacking Ensemble (sketch below)
  - Combines predictions from all base models
  - Uses Random Forest as the final estimator
  - Potentially improves overall performance
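A minimal stacking sketch under the same assumptions, wiring the base classifiers into a `StackingClassifier` with a Random Forest meta-learner; the base-model settings are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same hypothetical resampled CSV and "label" column as above.
df = pd.read_csv("data/resampled/ANOVA_selected_features.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)

# Base models feed their predictions to a Random Forest meta-learner.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear")),
        ("tree", DecisionTreeClassifier(criterion="entropy", random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=RandomForestClassifier(criterion="entropy", random_state=42),
    cv=5,
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```

With `cv=5`, the meta-learner is trained on out-of-fold predictions of the base models, which limits leakage from the training data.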
- Implements comprehensive metric reporting (sketched after this list):
  - Precision, recall, and F1-score for each class
  - Accuracy, macro and weighted averages
  - Visualizations of model performance
- Uses an SQLite database (`models_reports.db`) to store:
  - Test metadata (model names, dataset versions)
  - Model parameters
  - Performance metrics
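A rough sketch of how a classification report could be written to `models_reports.db`; the table schema, labels, and parameter values here are placeholders, not the notebook's actual schema.

```python
import json
import sqlite3
from sklearn.metrics import classification_report

# Placeholder labels/predictions; in the notebook these come from the fitted models.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]
report = classification_report(y_true, y_pred, output_dict=True)

# Hypothetical single-table schema; the real tables are defined in Models.ipynb.
conn = sqlite3.connect("models_reports.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS reports "
    "(model TEXT, dataset_version TEXT, params TEXT, metrics TEXT)"
)
conn.execute(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    (
        "random_forest",
        "resampled-v1",
        json.dumps({"criterion": "entropy", "n_estimators": 200}),
        json.dumps(report),
    ),
)
conn.commit()
conn.close()
```

Serializing parameters and metrics as JSON keeps the schema flexible while still allowing SQL queries over model name and dataset version.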
To run the project:
- Clone the repository and navigate to the project directory
- Activate the Poetry environment
- Install dependencies using `poetry install`
- Run the Jupyter notebooks in order:
  1. `LASSO_feature_selection.ipynb`
  2. `EDA.ipynb`
  3. `Models.ipynb`
- See `pyproject.toml` for project dependencies
Future work:
- Complete hyperparameter tuning for Random Forest
- Experiment with additional models (e.g., neural networks)
- Implement more sophisticated feature engineering
- Add cross-validation for more robust evaluation
This project provides a comprehensive framework for network anomaly detection, from feature selection to model evaluation, with a focus on interpretability and performance.