CyberGuardAI

CyberGuardAI is an intelligent cybersecurity log analysis system that uses transformer-based machine learning models to detect suspicious and malicious activities in system logs. The system combines the power of BERT models with rule-based pattern matching to provide highly accurate classifications of security events.

Background

In today's cybersecurity landscape, organizations face an overwhelming volume of log data from various systems. Manual analysis of these logs is time-consuming and error-prone. CyberGuardAI addresses this challenge by providing an automated system that can:

Process large volumes of log data efficiently
Classify logs as benign, suspicious, or malicious
Provide a simple API for integration with existing security systems
Deploy easily in containerized environments

The system uses a hybrid approach that combines the flexibility of machine learning with the reliability of rule-based pattern matching, aiming for high accuracy with few false positives.

The underlying model is a transformer-based foundation model fine-tuned for log analysis and cybersecurity incident identification.

Features

Intelligent Log Classification: Categorizes logs as benign, suspicious, or malicious
Hybrid Detection System: Combines machine learning with rule-based pattern matching
REST API: Simple HTTP API for easy integration
Interactive UI Demo: Web interface for visualizing log analysis results
Docker Support: Ready for containerized deployment
Customizable Rules: Easily extend the pattern matching rules for specific use cases
Robust Error Handling: User-friendly error messages for API clients
Scalable Architecture: Designed for processing large volumes of log data

Architecture

CyberGuardAI follows a modular architecture with the following components:

Data Processing Module: Handles log preprocessing, tokenization, and feature extraction
Model Module: Implements the BERT-based neural network for log classification
Inference Module: Combines model predictions with rule-based pattern matching
API Module: Provides a REST API for interacting with the system

The system uses a hybrid approach for classification:

Rule-Based Component: Fast pattern matching for known attack patterns
ML Component: BERT-based deep learning for novel and complex patterns

System Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                          CyberGuardAI System                            │
└───────────────────────────────┬─────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌────────┐ │
│  │             │     │             │     │             │     │        │ │
│  │    Data     │     │    Model    │     │  Inference  │     │  API   │ │
│  │  Processing ├────►│   Training  ├────►│   Engine    ├────►│ Server │ │
│  │   Module    │     │   Module    │     │             │     │        │ │
│  │             │     │             │     │             │     │        │ │
│  └─────┬───────┘     └─────────────┘     └──────┬──────┘     └────┬───┘ │
│        │                                        │                 │     │
│        ▼                                        ▼                 ▼     │
│  ┌─────────────┐                        ┌──────────────┐    ┌──────────┐│
│  │ Raw Logs &  │                        │  Prediction  │    │  REST    ││
│  │ Sample Data │                        │   Logic      │    │  API     ││
│  │ Generation  │                        │              │    │ Endpoints││
│  └─────────────┘                        └──────┬───────┘    └──────────┘│
│                                                │                        │
└────────────────────────────────────────────────┼────────────────────────┘
                                                 │
                                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                             UI Layer                                    │
│                                                                         │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    │
│  │                 │     │                 │     │                 │    │
│  │  Log Input &    │     │  Prediction     │     │  Statistics     │    │
│  │  Sample Display │     │  Visualization  │     │  Dashboard      │    │
│  │                 │     │                 │     │                 │    │
│  └─────────────────┘     └─────────────────┘     └─────────────────┘    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Model Architecture

CyberGuardAI uses a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for log classification. The model architecture includes:

BERT Base Layer: Pre-trained BERT model that understands contextual relationships in text
Classification Head: Custom layers added on top of BERT for the specific task of log classification
Output Layer: Final layer with softmax activation to produce classification probabilities

┌───────────────────────────────────────────────────────────────┐
│                    CyberGuardAI Model                         │
└───────────────────────────────┬───────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  ┌───────────────────────────────────────────────────────┐    │
│  │                   BERT Base Model                     │    │
│  │                                                       │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │    │
│  │  │ Transformer │  │ Transformer │  │ Transformer │    │    │
│  │  │   Layer 1   │─►│   Layer 2   │─►│   Layer N   │    │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘    │    │
│  │                                                       │    │
│  └───────────────────────────┬───────────────────────────┘    │
│                              │                                │
│                              ▼                                │
│  ┌───────────────────────────────────────────────────────┐    │
│  │                 Classification Head                   │    │
│  │                                                       │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │    │
│  │  │   Linear    │  │  Dropout    │  │   Linear    │    │    │
│  │  │   Layer     │─►│   Layer     │─►│   Layer     │    │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘    │    │
│  │                                                       │    │
│  └───────────────────────────┬───────────────────────────┘    │
│                              │                                │
│                              ▼                                │
│  ┌───────────────────────────────────────────────────────┐    │
│  │                    Output Layer                       │    │
│  │                                                       │    │
│  │  ┌─────────────────────────────────────────────────┐  │    │
│  │  │              Softmax Activation                 │  │    │
│  │  │                                                 │  │    │
│  │  │     ┌─────────┐    ┌──────────┐    ┌─────────┐  │  │    │
│  │  │     │ Benign  │    │Suspicious│    │Malicious│  │  │    │
│  │  │     │  Class  │    │  Class   │    │  Class  │  │  │    │
│  │  │     └─────────┘    └──────────┘    └─────────┘  │  │    │
│  │  │                                                 │  │    │
│  │  └─────────────────────────────────────────────────┘  │    │
│  │                                                       │    │
│  └───────────────────────────────────────────────────────┘    │
│                                                               │
└───────────────────────────────────────────────────────────────┘
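
To make the output layer concrete, here is a minimal sketch of how three raw scores (logits) could be turned into class probabilities with softmax and mapped to a label. It uses only the standard library. The function names and example logits are illustrative assumptions, not the project's code; the label order follows the encoding given in this README (benign=0, suspicious=1, malicious=2).

```python
import math

# Label order follows the README's encoding: benign=0, suspicious=1, malicious=2.
LABELS = ["benign", "suspicious", "malicious"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, confidence) for one log entry's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    # Hypothetical logits for a single log entry.
    label, confidence = predict_label([0.2, 2.5, 0.1])
    print(label, round(confidence, 3))
```

The confidence returned here is simply the largest softmax probability; how the project uses that value downstream is not shown in this sketch.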

Inference System

The inference system employs a hybrid approach:

Pattern Recognition:
- Benign patterns: Successful operations, routine activities
- Suspicious patterns: Failed login attempts, unusual access patterns
- Malicious patterns: Attack signatures, exploitation attempts
Web Attack Detection:
- XSS detection: Identifies script tags, alert() functions, and JavaScript injection
- SQL Injection: Detects SQL commands and syntax in unexpected contexts
- Command Injection: Identifies shell commands and suspicious character sequences
- Directory Traversal: Detects path manipulation attempts (../../../etc/passwd)
- CSRF: Identifies cross-site request forgery patterns
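
As an illustration of the web attack categories listed above, the sketch below matches a log line against a few regular expressions. The patterns and names are hypothetical examples, not the project's actual rule set, and they are deliberately small. CSRF is left out because detecting it usually needs request context (such as tokens or the Referer header) rather than a single string match.

```python
import re

# Hypothetical signatures for illustration; the project's real rules may differ.
WEB_ATTACK_PATTERNS = {
    "xss": re.compile(r"<\s*script\b|alert\s*\(|javascript\s*:", re.IGNORECASE),
    "sql_injection": re.compile(
        r"\bunion\s+select\b|\bor\s+1\s*=\s*1\b|;\s*drop\s+table\b", re.IGNORECASE
    ),
    "command_injection": re.compile(
        r"[;&|]\s*(?:cat|rm|wget|curl|nc)\b|\$\(|`", re.IGNORECASE
    ),
    "directory_traversal": re.compile(r"(?:\.\./){2,}|(?:\.\.\\){2,}"),
}


def detect_web_attacks(log_line):
    """Return the names of all signature families that match the log line."""
    return [
        name
        for name, pattern in WEB_ATTACK_PATTERNS.items()
        if pattern.search(log_line)
    ]


if __name__ == "__main__":
    print(detect_web_attacks("GET /download?file=../../../etc/passwd"))
```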

┌────────────────────────────────────────────────────────────────┐
│                 Inference System Workflow                      │
└────────────────────────────────┬───────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│                        Input Log Entry                         │
│                                                                │
└────────────────────────────────┬───────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                       Preprocessing                            │
│                                                                │
│  ┌──────────────────┐    ┌─────────────────┐                   │
│  │ Truncation       │    │ Normalization   │                   │
│  │ (if > 1000 chars)│    │                 │                   │
│  └──────────────────┘    └─────────────────┘                   │
│                                                                │
└────────────────────────────────┬───────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                      Pattern Matching                          │
│                                                                │
│  ┌─────────────────┐    ┌─────────────────┐    ┌──────────────┐│
│  │ Benign          │    │ Suspicious      │    │ Malicious    ││
│  │ Patterns        │    │ Patterns        │    │ Patterns     ││
│  └─────────────────┘    └─────────────────┘    └──────────────┘│
│                                                                │
└──────────────┬─────────────────────────────────┬───────────────┘
               │                                 │
               │ Pattern Found                   │ No Strong Match
               ▼                                 ▼
┌─────────────────────────────┐    ┌────────────────────────────┐
│                             │    │                            │
│    Return Classification    │    │     BERT Model Analysis    │
│    Based on Pattern         │    │                            │
│                             │    └──────────────┬─────────────┘
└─────────────────────────────┘                   │
                                                  ▼
                                    ┌────────────────────────────┐
                                    │                            │
                                    │  Return Classification     │
                                    │  Based on Model Prediction │
                                    │                            │
                                    └────────────────────────────┘
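
The workflow in the diagram can be sketched in a few lines: preprocess, try the rules, and fall back to the model only when no rule matches. This is a simplified sketch under stated assumptions. The rule patterns, the whitespace normalization, the position of the benign rule, and the `model_predict` callable are all hypothetical; the 1000-character limit and the suspicious-before-malicious order come from the Log Analysis Methodology section of this README.

```python
import re

MAX_CHARS = 1000  # README: long logs are truncated to the last 1000 characters

# Hypothetical rules. Suspicious is checked before malicious, as the README
# states; placing benign last is an assumption.
RULES = [
    ("suspicious", re.compile(r"failed login|unusual access", re.IGNORECASE)),
    ("malicious", re.compile(r"<script|union\s+select|\.\./\.\./", re.IGNORECASE)),
    ("benign", re.compile(r"login successful|backup completed", re.IGNORECASE)),
]


def preprocess(log_line):
    """Keep the last MAX_CHARS characters, then collapse runs of whitespace."""
    return " ".join(log_line[-MAX_CHARS:].split())


def classify(log_line, model_predict):
    """Return (label, source): a rule hit if any, else the model's prediction."""
    text = preprocess(log_line)
    for label, pattern in RULES:
        if pattern.search(text):
            return label, "rule"
    # No strong match: defer to the model (any callable mapping str -> label).
    return model_predict(text), "model"


if __name__ == "__main__":
    # Stand-in for the BERT model so the sketch runs without any ML dependency.
    print(classify("Failed login attempt from 192.168.1.100", lambda text: "benign"))
```

Passing the model in as a callable keeps the rule path testable on its own and makes it cheap to swap in the real classifier later.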

Data Flow Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Log Sources    │────►│  Preprocessing  │────►│  Feature        │
│  (CSV/Generated)│     │  Pipeline       │     │  Extraction     │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Prediction     │◄────│  Inference      │◄────│  Model Training │
│  Results        │     │  Engine         │     │  & Evaluation   │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │
        ▼
┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │
│  API Response   │────►│  UI             │
│  Generation     │     │  Visualization  │
│                 │     │                 │
└─────────────────┘     └─────────────────┘

Combining rule-based matching with the BERT model is intended to give reliable detection of known threats while retaining the ability to flag novel attacks.

Installation

Prerequisites

Python 3.12+
PyTorch 2.0+
Docker (optional, for containerized deployment)

Local Installation

Clone the repository:

git clone https://github.com/arifazim/CyberGuardAI.git
cd CyberGuardAI

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install --upgrade pip "setuptools>=68.0.0"
pip install -r requirements.txt
pip install -e .
```
Note: For Python 3.12 compatibility, ensure you're using setuptools>=68.0.0 and PyYAML>=6.0.1.

Usage

Data Preprocessing

Before training the model, you need to preprocess your log data:

python scripts/preprocess_data.py

This script will:

Create sample data if no input data exists
Clean and normalize log text
Split data into training and validation sets
Convert labels to numerical format
Save processed data to the configured output path

The preprocessing steps include:

Loading raw log data from CSV or generating sample data if none exists
Cleaning text by removing special characters and normalizing whitespace
Encoding labels (benign=0, suspicious=1, malicious=2)
Balancing the dataset to ensure equal representation of classes
Saving the processed data for training
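
The cleaning, label encoding, and balancing steps above might look roughly like the following. This is a standard-library sketch, not the contents of `scripts/preprocess_data.py`. The README does not say which characters count as "special" or how balancing is done, so the allow-list in `clean_text` and the downsampling strategy in `balance` are assumptions.

```python
import random
import re
from collections import defaultdict

# Label encoding as stated in the README.
LABEL_TO_ID = {"benign": 0, "suspicious": 1, "malicious": 2}


def clean_text(text):
    """Replace characters outside a small allow-list, then collapse whitespace."""
    # Assumption: letters, digits, and . : / _ - are kept; everything else is dropped.
    kept = re.sub(r"[^A-Za-z0-9\s.:/_-]", " ", text)
    return " ".join(kept.split())


def balance(rows, seed=42):
    """Downsample every class to the size of the smallest one (an assumption)."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def preprocess(rows):
    """rows: iterable of (raw_text, label_name) -> list of (clean_text, label_id)."""
    cleaned = [(clean_text(text), LABEL_TO_ID[label]) for text, label in rows]
    return balance(cleaned)
```

For example, `clean_text("user   login\tsuccessful!!")` returns `"user login successful"`.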

Training the Model

To train the model on your preprocessed data:

python src/train.py

The training process will:

Load preprocessed data from the configured path
Initialize the BERT tokenizer and model
Tokenize logs using BERT tokenizer with padding and truncation
Split data into training and validation sets
Train the model for the configured number of epochs (default: 5)
Save the trained model to the configured path (data_processing/models/CybergGuard_model)

Training parameters are configurable in config/config.yaml:

Batch size: Controls memory usage and training speed
Learning rate: Affects how quickly the model learns
Number of epochs: Controls how many times the model sees the entire dataset
Model name: The base BERT model to use (default: bert-base-uncased)

The model is trained using AdamW optimizer with a learning rate scheduler to improve convergence.
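
The README does not say which scheduler is used. A common choice when fine-tuning BERT with AdamW is linear warmup followed by linear decay, and the sketch below shows only the arithmetic of that schedule. The function names and the `warmup_steps` parameter are hypothetical; the defaults (batch size 16, 5 epochs, learning rate 2e-5) mirror the values in `config/config.yaml`.

```python
import math


def total_training_steps(num_samples, batch_size=16, epochs=5):
    """Optimizer steps over the whole run, assuming one step per batch."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_schedule(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a 0-indexed step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_span = max(1, total_steps - warmup_steps)
    return base_lr * max(0, total_steps - step) / decay_span


if __name__ == "__main__":
    steps = total_training_steps(1000)
    for step in (0, steps // 2, steps):
        print(step, linear_schedule(step, steps))
```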

Running the API

To start the API server locally:

python -m src.api

The API will be available at http://localhost:8000.

The API provides endpoints for:

Predicting the classification of log entries
Health check to verify the API is running

The API includes robust error handling for:

Invalid JSON format
Missing 'logs' field in request
Empty log lists
Internal server errors
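
The first three checks can be sketched as a small, framework-independent function. This is not the code in `src/api.py`; the function name and return shape are hypothetical, and HTTP status codes are omitted because the README does not state them. The error strings match the ones documented in the API Reference section.

```python
import json


def validate_predict_body(raw_body):
    """Return (logs, error_detail); exactly one of the two is None."""
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        return None, "Invalid JSON format"
    if not isinstance(payload, dict) or "logs" not in payload:
        return None, "Missing 'logs' field in request"
    if not payload["logs"]:
        return None, "Log list cannot be empty"
    return payload["logs"], None


if __name__ == "__main__":
    print(validate_predict_body('{"logs": []}'))
```

Checking in this order means a client sees the most basic problem first: malformed JSON, then a missing field, then an empty list.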

Running the UI Demo

CyberGuardAI includes a web-based UI for demonstrating the log analysis capabilities:

Install UI dependencies:
```
pip install -r ui/requirements.txt
```
Start the UI server (make sure the API is running first):
```
python ui/app.py
```
Open your browser and navigate to http://localhost:5001

The UI provides:

A clean interface for entering log entries
Side-by-side display of logs and their predictions
Sample logs for quick demonstration, including:
- Benign logs (successful logins, updates, backups)
- Suspicious logs (failed login attempts, unusual access patterns)
- Malicious logs (XSS attacks, SQL injection, command injection)
Statistics dashboard showing counts by classification category
API status indicator to show connection status
Responsive design for both desktop and mobile devices

The UI is implemented as a Flask application that serves as a proxy to the CyberGuardAI API, helping to avoid CORS issues.

Docker Deployment

Build the Docker image:
```
docker build -t cyberguardai:latest .
```

Run the container:

docker run -p 8000:8000 cyberguardai:latest

The API will be available at http://localhost:8000.

The Dockerfile:

Uses a Python base image
Installs all dependencies
Installs the project as a package
Exposes port 8000
Sets the entry point to run the API

API Reference

POST /predict

Classifies log entries as benign, suspicious, or malicious.

Request:

{
  "logs": ["user login successful", "failed login attempt from 192.168.1.100"]
}

Response:

{
  "predictions": ["benign", "suspicious"]
}

Error Responses:

Missing logs field:

{
  "detail": "Missing 'logs' field in request"
}

Empty logs list:

{
  "detail": "Log list cannot be empty"
}

Invalid JSON:
```
{
  "detail": "Invalid JSON format"
}
```
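
A client for this endpoint needs nothing beyond the standard library. The sketch below assumes the API is running at the default address from this README; the helper names `build_predict_request` and `predict` are hypothetical and not part of the project.

```python
import json
import urllib.request

# Default address from this README; adjust if the API runs elsewhere.
BASE_URL = "http://localhost:8000"


def build_predict_request(logs, base_url=BASE_URL):
    """Build (but do not send) a POST /predict request for a list of log lines."""
    body = json.dumps({"logs": list(logs)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(logs, base_url=BASE_URL, timeout=10):
    """Send the request and return the 'predictions' list from the response."""
    request = build_predict_request(logs, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["predictions"]


if __name__ == "__main__":
    # Requires a running API server.
    print(predict(["user login successful", "failed login attempt from 192.168.1.100"]))
```

On an error response, `urllib.request.urlopen` raises `urllib.error.HTTPError`; the `detail` message can then be read from the exception's body.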

Log Analysis Methodology

CyberGuardAI analyzes logs through a multi-step process:

Pattern Recognition:
- Benign patterns: Successful operations, routine activities
- Suspicious patterns: Failed login attempts, unusual access patterns
- Malicious patterns: Attack signatures, exploitation attempts
Web Attack Detection:
- XSS detection: Identifies script tags, alert() functions, and JavaScript injection
- SQL Injection: Detects SQL commands and syntax in unexpected contexts
- Command Injection: Identifies shell commands and suspicious character sequences
- Directory Traversal: Detects path manipulation attempts (../../../etc/passwd)
- CSRF: Identifies cross-site request forgery patterns
Special Case Handling:
- Long logs are truncated to the last 1000 characters
- Suspicious pattern checking occurs before malicious pattern checking
- Case-insensitive matching is used for attack signatures
Machine Learning Analysis:
- BERT model analyzes the semantic meaning of log entries
- Contextual understanding helps identify novel or complex threats
- Confidence scores determine final classification

This hybrid approach ensures high accuracy for known threats while maintaining the ability to detect novel attacks.

Configuration

The system is configured using a YAML file located at config/config.yaml. Key configuration options include:

model:
  name: "bert-base-uncased"
  max_length: 512
  num_labels: 3  # Benign, Suspicious, Malicious
training:
  batch_size: 16
  epochs: 5
  learning_rate: 2e-5
data:
  input_path: "data_processing/raw/logs.csv"
  processed_path: "data_processing/processed/processed_logs.csv"
  model_path: "data_processing/models/CybergGuard_model"
api:
  host: "0.0.0.0"
  port: 8000
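
Once the YAML file is parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), it can help to convert the sections into typed objects. The sketch below assumes the dictionary is already loaded; the class and function names are hypothetical and not part of the project. Note that PyYAML follows the YAML 1.1 float rule, which expects a decimal point, so a value written as `2e-5` may load as a string and is coerced explicitly here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    max_length: int
    num_labels: int


@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int
    epochs: int
    learning_rate: float


def parse_config(raw):
    """Build typed sections from the dict a YAML loader would return."""
    model = ModelConfig(**raw["model"])
    section = raw["training"]
    training = TrainingConfig(
        batch_size=int(section["batch_size"]),
        epochs=int(section["epochs"]),
        # "2e-5" may arrive as a string from the YAML loader, so coerce it.
        learning_rate=float(section["learning_rate"]),
    )
    return model, training


if __name__ == "__main__":
    raw = {
        "model": {"name": "bert-base-uncased", "max_length": 512, "num_labels": 3},
        "training": {"batch_size": 16, "epochs": 5, "learning_rate": "2e-5"},
    }
    print(parse_config(raw))
```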

Project Structure

CyberGuardAI/
├── config/
│   └── config.yaml         # Configuration file
├── data_processing/
│   ├── models/             # Trained model files
│   ├── processed/          # Processed data
│   └── raw/                # Raw input data
├── scripts/
│   ├── preprocess_data.py  # Data preprocessing script
│   └── generate_data.py    # Sample data generation
├── src/
│   ├── __init__.py
│   ├── api.py              # FastAPI implementation
│   ├── data_processing.py  # Data processing utilities
│   ├── inference.py        # Inference logic
│   ├── model.py            # Model definition
│   └── train.py            # Training script
├── tests/
│   ├── test_data_processing.py
│   └── test_inference.py
├── ui/
│   ├── static/             # UI static assets
│   ├── templates/          # UI HTML templates
│   ├── app.py              # UI server
│   └── requirements.txt    # UI dependencies
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── setup.py                # Package setup
└── README.md               # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
