CyberGuardAI

CyberGuardAI is an intelligent cybersecurity log analysis system that uses transformer-based machine learning models to detect suspicious and malicious activities in system logs. The system combines the power of BERT models with rule-based pattern matching to provide highly accurate classifications of security events.

Background

In today's cybersecurity landscape, organizations face an overwhelming volume of log data from various systems. Manual analysis of these logs is time-consuming and error-prone. CyberGuardAI addresses this challenge by providing an automated system that can:

  1. Process large volumes of log data efficiently
  2. Classify logs as benign, suspicious, or malicious
  3. Provide a simple API for integration with existing security systems
  4. Deploy easily in containerized environments

The system uses a hybrid approach that combines the flexibility of a transformer-based model fine-tuned for log analysis with the reliability of rule-based pattern matching, delivering high accuracy with a low false-positive rate.

Features

  • Intelligent Log Classification: Categorizes logs as benign, suspicious, or malicious
  • Hybrid Detection System: Combines machine learning with rule-based pattern matching
  • REST API: Simple HTTP API for easy integration
  • Interactive UI Demo: Web interface for visualizing log analysis results
  • Docker Support: Ready for containerized deployment
  • Customizable Rules: Easily extend the pattern matching rules for specific use cases
  • Robust Error Handling: User-friendly error messages for API clients
  • Scalable Architecture: Designed for processing large volumes of log data

Architecture

CyberGuardAI follows a modular architecture with the following components:

  1. Data Processing Module: Handles log preprocessing, tokenization, and feature extraction
  2. Model Module: Implements the BERT-based neural network for log classification
  3. Inference Module: Combines model predictions with rule-based pattern matching
  4. API Module: Provides a REST API for interacting with the system

The system uses a hybrid approach for classification:

  • Rule-Based Component: Fast pattern matching for known attack patterns
  • ML Component: BERT-based deep learning for novel and complex patterns

System Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          CyberGuardAI System                            β”‚
└───────────────────────────────┬─────────────────────────────────────────┘
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚             β”‚     β”‚             β”‚     β”‚             β”‚     β”‚        β”‚ β”‚
β”‚  β”‚    Data     β”‚     β”‚    Model    β”‚     β”‚  Inference  β”‚     β”‚  API   β”‚ β”‚
β”‚  β”‚  Processing β”œβ”€β”€β”€β”€β–Ίβ”‚   Training  β”œβ”€β”€β”€β”€β–Ίβ”‚   Engine    β”œβ”€β”€β”€β”€β–Ίβ”‚ Server β”‚ β”‚
β”‚  β”‚   Module    β”‚     β”‚   Module    β”‚     β”‚             β”‚     β”‚        β”‚ β”‚
β”‚  β”‚             β”‚     β”‚             β”‚     β”‚             β”‚     β”‚        β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β”‚
β”‚        β”‚                                        β”‚                 β”‚     β”‚
β”‚        β–Ό                                        β–Ό                 β–Ό     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚ Raw Logs &  β”‚                        β”‚  Prediction  β”‚    β”‚  REST    β”‚β”‚
β”‚  β”‚ Sample Data β”‚                        β”‚   Logic      β”‚    β”‚  API     β”‚β”‚
β”‚  β”‚ Generation  β”‚                        β”‚              β”‚    β”‚ Endpointsβ”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                                β”‚                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                 β”‚
                                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                             UI Layer                                    β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚    β”‚
β”‚  β”‚  Log Input &    β”‚     β”‚  Prediction     β”‚     β”‚  Statistics     β”‚    β”‚
β”‚  β”‚  Sample Display β”‚     β”‚  Visualization  β”‚     β”‚  Dashboard      β”‚    β”‚
β”‚  β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Architecture

CyberGuardAI uses a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for log classification. The model architecture includes:

  1. BERT Base Layer: Pre-trained BERT model that understands contextual relationships in text
  2. Classification Head: Custom layers added on top of BERT for the specific task of log classification
  3. Output Layer: Final layer with softmax activation to produce classification probabilities

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CyberGuardAI Model                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                   BERT Base Model                     β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚ Transformer β”‚  β”‚ Transformer β”‚  β”‚ Transformer β”‚    β”‚    β”‚
β”‚  β”‚  β”‚   Layer 1   │─►│   Layer 2   │─►│   Layer N   β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                              β”‚                                β”‚
β”‚                              β–Ό                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                 Classification Head                   β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚   Linear    β”‚  β”‚  Dropout    β”‚  β”‚   Linear    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚   Layer     │─►│   Layer     │─►│   Layer     β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                              β”‚                                β”‚
β”‚                              β–Ό                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                    Output Layer                       β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚              Softmax Activation                 β”‚  β”‚    β”‚
β”‚  β”‚  β”‚                                                 β”‚  β”‚    β”‚
β”‚  β”‚  β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚     β”‚ Benign  β”‚    β”‚Suspiciousβ”‚    β”‚Maliciousβ”‚  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚     β”‚  Class  β”‚    β”‚  Class   β”‚    β”‚  Class  β”‚  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚                                                 β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β”‚                                                       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
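
The following sketch mirrors this stack in PyTorch using the Hugging Face transformers library. It is a minimal illustration, not the repository's src/model.py: the head sizes, dropout rate, and use of the [CLS] token are assumptions.

import torch
import torch.nn as nn
from transformers import BertModel

class LogClassifierSketch(nn.Module):
    """BERT encoder plus a small classification head, as in the diagram above."""

    def __init__(self, model_name: str = "bert-base-uncased",
                 num_labels: int = 3, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        # Classification head: Linear -> Dropout -> Linear
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls)              # softmax is applied at inference time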

Inference System

The inference system employs a hybrid approach: each log entry is first checked against rule-based patterns for benign activity (successful operations, routine events), suspicious activity (failed logins, unusual access), and malicious activity (attack signatures, exploitation attempts), including dedicated web attack detectors for XSS, SQL injection, command injection, directory traversal, and CSRF. Only entries without a strong pattern match are passed to the BERT model. The rules are described in full under Log Analysis Methodology below.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Inference System Workflow                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                β”‚
β”‚                        Input Log Entry                         β”‚
β”‚                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Preprocessing                            β”‚
β”‚                                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚
β”‚  β”‚ Truncation       β”‚    β”‚ Normalization   β”‚                   β”‚
β”‚  β”‚ (if > 1000 chars)β”‚    β”‚                 β”‚                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
β”‚                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Pattern Matching                          β”‚
β”‚                                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚ Benign          β”‚    β”‚ Suspicious      β”‚    β”‚ Malicious    β”‚β”‚
β”‚  β”‚ Patterns        β”‚    β”‚ Patterns        β”‚    β”‚ Patterns     β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚                                 β”‚
               β”‚ Pattern Found                   β”‚ No Strong Match
               β–Ό                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                             β”‚    β”‚                            β”‚
β”‚    Return Classification    β”‚    β”‚     BERT Model Analysis    β”‚
β”‚    Based on Pattern         β”‚    β”‚                            β”‚
β”‚                             β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚
                                                  β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚                            β”‚
                                    β”‚  Return Classification     β”‚
                                    β”‚  Based on Model Prediction β”‚
                                    β”‚                            β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
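
A minimal sketch of this workflow, assuming a classifier like the one sketched under Model Architecture; the pattern tables below are illustrative stand-ins for the actual rules in src/inference.py:

import re
import torch

LABELS = ["benign", "suspicious", "malicious"]

# Illustrative pattern tables; the real rules live in src/inference.py.
PATTERNS = {
    "suspicious": [r"failed login", r"unauthorized access"],
    "malicious": [r"<script", r"union\s+select", r"\.\./\.\./"],
    "benign": [r"login successful", r"backup completed"],
}

def classify(log, model, tokenizer):
    text = log[-1000:].lower()  # keep the last 1000 chars; match case-insensitively
    # Rule-based pass first (suspicious patterns are checked before malicious)
    for label in ("suspicious", "malicious", "benign"):
        if any(re.search(p, text) for p in PATTERNS[label]):
            return label
    # No strong pattern match: fall back to the BERT model
    enc = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
    return LABELS[int(logits.argmax(dim=-1))]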

Data Flow Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β”‚  Log Sources    │────►│  Preprocessing  │────►│  Feature        β”‚
β”‚  (CSV/Generated)β”‚     β”‚  Pipeline       β”‚     β”‚  Extraction     β”‚
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                         β”‚
                                                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β”‚  Prediction     │◄────│  Inference      │◄────│  Model Training β”‚
β”‚  Results        β”‚     β”‚  Engine         β”‚     β”‚  & Evaluation   β”‚
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚     β”‚                 β”‚
β”‚  API Response   │────►│  UI             β”‚
β”‚  Generation     β”‚     β”‚  Visualization  β”‚
β”‚                 β”‚     β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installation

Prerequisites

  • Python 3.12+
  • PyTorch 2.0+
  • Docker (optional, for containerized deployment)

Local Installation

  1. Clone the repository:

    git clone https://github.com/arifazim/CyberGuardAI.git
    cd CyberGuardAI
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install --upgrade pip "setuptools>=68.0.0"
    pip install -r requirements.txt
    pip install -e .

    Note: For Python 3.12 compatibility, ensure you're using setuptools>=68.0.0 and PyYAML>=6.0.1.

Usage

Data Preprocessing

Before training the model, you need to preprocess your log data:

python scripts/preprocess_data.py

This script will:

  • Create sample data if no input data exists
  • Clean and normalize log text
  • Split data into training and validation sets
  • Convert labels to numerical format
  • Save processed data to the configured output path

The preprocessing steps include:

  1. Loading raw log data from CSV or generating sample data if none exists
  2. Cleaning text by removing special characters and normalizing whitespace
  3. Encoding labels (benign=0, suspicious=1, malicious=2)
  4. Balancing the dataset to ensure equal representation of classes
  5. Saving the processed data for training
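
A minimal sketch of the cleaning and label-encoding steps (2 and 3 above); the CSV column names log and label are assumptions:

import re
import pandas as pd

LABEL_MAP = {"benign": 0, "suspicious": 1, "malicious": 2}

def clean_text(text: str) -> str:
    """Drop special characters and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s./:_-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("data_processing/raw/logs.csv")  # assumed columns: log, label
df["log"] = df["log"].astype(str).map(clean_text)
df["label"] = df["label"].map(LABEL_MAP)  # benign=0, suspicious=1, malicious=2
df.to_csv("data_processing/processed/processed_logs.csv", index=False)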

Training the Model

To train the model on your preprocessed data:

python src/train.py

The training process will:

  1. Load preprocessed data from the configured path
  2. Initialize the BERT tokenizer and model
  3. Tokenize logs using BERT tokenizer with padding and truncation
  4. Split data into training and validation sets
  5. Train the model for the configured number of epochs (default: 5)
  6. Save the trained model to the configured path (data_processing/models/CybergGuard_model)

Training parameters are configurable in config/config.yaml:

  • Batch size: Controls memory usage and training speed
  • Learning rate: Affects how quickly the model learns
  • Number of epochs: Controls how many times the model sees the entire dataset
  • Model name: The base BERT model to use (default: bert-base-uncased)

The model is trained using AdamW optimizer with a learning rate scheduler to improve convergence.
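
A minimal sketch of that setup, with hyperparameters mirroring the defaults in config/config.yaml (the warmup length is an assumption):

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, epochs=5, learning_rate=2e-5):
    """Fine-tune with AdamW and a linear learning-rate schedule."""
    optimizer = AdamW(model.parameters(), lr=learning_rate)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,  # warmup length is an assumption
        num_training_steps=len(train_loader) * epochs,
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(input_ids, attention_mask), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()  # decay the learning rate after each batch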

Running the API

To start the API server locally:

python -m src.api

The API will be available at http://localhost:8000.

The API provides endpoints for:

  • Predicting the classification of log entries
  • Health check to verify the API is running

The API includes robust error handling for:

  • Invalid JSON format
  • Missing 'logs' field in request
  • Empty log lists
  • Internal server errors

Running the UI Demo

CyberGuardAI includes a web-based UI for demonstrating the log analysis capabilities:

  1. Install UI dependencies:

    pip install -r ui/requirements.txt
  2. Start the UI server (make sure the API is running first):

    python ui/app.py
  3. Open your browser and navigate to http://localhost:5001

The UI provides:

  • A clean interface for entering log entries
  • Side-by-side display of logs and their predictions
  • Sample logs for quick demonstration, including:
    • Benign logs (successful logins, updates, backups)
    • Suspicious logs (failed login attempts, unusual access patterns)
    • Malicious logs (XSS attacks, SQL injection, command injection)
  • Statistics dashboard showing counts by classification category
  • API status indicator to show connection status
  • Responsive design for both desktop and mobile devices

The UI is implemented as a Flask application that serves as a proxy to the CyberGuardAI API, helping to avoid CORS issues.
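
A minimal sketch of such a proxy route; the upstream address matches the API defaults above, and the route path is an assumption:

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
API_URL = "http://localhost:8000"  # CyberGuardAI API default address

@app.route("/predict", methods=["POST"])
def proxy_predict():
    # Forward the browser's request server-side, so the browser never
    # talks to the API directly and no CORS headers are required.
    upstream = requests.post(f"{API_URL}/predict", json=request.get_json())
    return jsonify(upstream.json()), upstream.status_code

if __name__ == "__main__":
    app.run(port=5001)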

UI demo screenshots: the log analysis interface in action.

Docker Deployment

  1. Build the Docker image:

    docker build -t cyberguardai:latest .
  2. Run the container:

    docker run -p 8000:8000 cyberguardai:latest

The API will be available at http://localhost:8000.

The Dockerfile:

  • Uses a Python base image
  • Installs all dependencies
  • Installs the project as a package
  • Exposes port 8000
  • Sets the entry point to run the API

API Reference

POST /predict

Classifies log entries as benign, suspicious, or malicious.

Request:

{
  "logs": ["user login successful", "failed login attempt from 192.168.1.100"]
}

Response:

{
  "predictions": ["benign", "suspicious"]
}

Error Responses:

  • Missing logs field:

    {
      "detail": "Missing 'logs' field in request"
    }
  • Empty logs list:

    {
      "detail": "Log list cannot be empty"
    }
  • Invalid JSON:

    {
      "detail": "Invalid JSON format"
    }
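
The endpoint can also be exercised from the shell:

curl -s -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"logs": ["user login successful", "failed login attempt from 192.168.1.100"]}'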

Log Analysis Methodology

CyberGuardAI analyzes logs through a multi-step process:

  1. Pattern Recognition:

    • Benign patterns: Successful operations, routine activities
    • Suspicious patterns: Failed login attempts, unusual access patterns
    • Malicious patterns: Attack signatures, exploitation attempts
  2. Web Attack Detection:

    • XSS detection: Identifies script tags, alert() functions, and JavaScript injection
    • SQL Injection: Detects SQL commands and syntax in unexpected contexts
    • Command Injection: Identifies shell commands and suspicious character sequences
    • Directory Traversal: Detects path manipulation attempts (../../../etc/passwd)
    • CSRF: Identifies cross-site request forgery patterns
  3. Special Case Handling:

    • Long logs are truncated to the last 1000 characters
    • Suspicious pattern checking occurs before malicious pattern checking
    • Case-insensitive matching is used for attack signatures
  4. Machine Learning Analysis:

    • BERT model analyzes the semantic meaning of log entries
    • Contextual understanding helps identify novel or complex threats
    • Confidence scores determine final classification

This hybrid approach ensures high accuracy for known threats while maintaining the ability to detect novel attacks.

Configuration

The system is configured using a YAML file located at config/config.yaml. Key configuration options include:

model:
  name: "bert-base-uncased"
  max_length: 512
  num_labels: 3  # Benign, Suspicious, Malicious
training:
  batch_size: 16
  epochs: 5
  learning_rate: 2e-5
data:
  input_path: "data_processing/raw/logs.csv"
  processed_path: "data_processing/processed/processed_logs.csv"
  model_path: "data_processing/models/CybergGuard_model"
api:
  host: "0.0.0.0"
  port: 8000
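
These values can be loaded anywhere in the codebase with PyYAML, for example:

import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

model_name = config["model"]["name"]           # "bert-base-uncased"
batch_size = config["training"]["batch_size"]  # 16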

Project Structure

CyberGuardAI/
β”œβ”€β”€ config/
β”‚   └── config.yaml         # Configuration file
β”œβ”€β”€ data_processing/
β”‚   β”œβ”€β”€ models/             # Trained model files
β”‚   β”œβ”€β”€ processed/          # Processed data
β”‚   └── raw/                # Raw input data
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ preprocess_data.py  # Data preprocessing script
β”‚   └── generate_data.py    # Sample data generation
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ api.py              # FastAPI implementation
β”‚   β”œβ”€β”€ data_processing.py  # Data processing utilities
β”‚   β”œβ”€β”€ inference.py        # Inference logic
β”‚   β”œβ”€β”€ model.py            # Model definition
β”‚   └── train.py            # Training script
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_data_processing.py
β”‚   └── test_inference.py
β”œβ”€β”€ ui/
β”‚   β”œβ”€β”€ static/             # UI static assets
β”‚   β”œβ”€β”€ templates/          # UI HTML templates
β”‚   β”œβ”€β”€ app.py              # UI server
β”‚   └── requirements.txt    # UI dependencies
β”œβ”€β”€ Dockerfile              # Docker configuration
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ setup.py                # Package setup
└── README.md               # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
