CyberGuardAI is a cybersecurity log analysis system that uses transformer-based machine learning models to detect suspicious and malicious activity in system logs. It combines a BERT model with rule-based pattern matching to classify security events.
- Background
- Features
- Architecture
- Installation
- Usage
- API Reference
- Configuration
- Project Structure
- Contributing
- License
In today's cybersecurity landscape, organizations face an overwhelming volume of log data from various systems. Manual analysis of these logs is time-consuming and error-prone. CyberGuardAI addresses this challenge by providing an automated system that can:
- Process large volumes of log data efficiently
- Classify logs as benign, suspicious, or malicious
- Provide a simple API for integration with existing security systems
- Deploy easily in containerized environments
The system uses a hybrid approach that combines the flexibility of machine learning with the reliability of rule-based pattern matching, aiming for high accuracy with few false positives.
CyberGuardAI identifies cybersecurity incidents in logs using a transformer-based foundation model fine-tuned for log analysis, with the goal of high accuracy and few false positives.
- Intelligent Log Classification: Categorizes logs as benign, suspicious, or malicious
- Hybrid Detection System: Combines machine learning with rule-based pattern matching
- REST API: Simple HTTP API for easy integration
- Interactive UI Demo: Web interface for visualizing log analysis results
- Docker Support: Ready for containerized deployment
- Customizable Rules: Easily extend the pattern matching rules for specific use cases
- Robust Error Handling: User-friendly error messages for API clients
- Scalable Architecture: Designed for processing large volumes of log data
CyberGuardAI follows a modular architecture with the following components:
- Data Processing Module: Handles log preprocessing, tokenization, and feature extraction
- Model Module: Implements the BERT-based neural network for log classification
- Inference Module: Combines model predictions with rule-based pattern matching
- API Module: Provides a REST API for interacting with the system
The system uses a hybrid approach for classification:
- Rule-Based Component: Fast pattern matching for known attack patterns
- ML Component: BERT-based deep learning for novel and complex patterns
System architecture diagram:
- Core pipeline: Data Processing Module -> Model Training Module -> Inference Engine -> API Server
- Data Processing Module -> Raw Logs & Sample Data Generation
- Inference Engine -> Prediction Logic
- API Server -> REST API Endpoints
- Prediction Logic -> UI Layer
- UI Layer: Log Input & Sample Display, Prediction Visualization, Statistics Dashboard
CyberGuardAI uses a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model for log classification. The model architecture includes:
- BERT Base Layer: Pre-trained BERT model that understands contextual relationships in text
- Classification Head: Custom layers added on top of BERT for the specific task of log classification
- Output Layer: Final layer with softmax activation to produce classification probabilities
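To make the output layer concrete, here is a minimal sketch of how softmax turns three raw scores into class probabilities and a predicted label. It uses only the standard library. The `LABELS` order follows the label encoding used in preprocessing (benign=0, suspicious=1, malicious=2); the function names and the sample scores are illustrative and are not taken from the project code.

```python
import math

# Order assumed from the preprocessing label encoding: 0, 1, 2.
LABELS = ["benign", "suspicious", "malicious"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits):
    """Return the most probable label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical scores for one log entry.
label, confidence = pick_label([0.2, 2.5, 0.1])
print(label, round(confidence, 3))
```

The returned probability can serve as the confidence score for the prediction.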
Model architecture diagram:
- BERT Base Model: Transformer Layer 1 -> Transformer Layer 2 -> Transformer Layer N
- Classification Head: Linear Layer -> Dropout Layer -> Linear Layer
- Output Layer: Softmax Activation over the Benign, Suspicious, and Malicious classes
- Flow: BERT Base Model -> Classification Head -> Output Layer
The inference system employs a hybrid approach:
- Pattern Recognition:
  - Benign patterns: Successful operations, routine activities
  - Suspicious patterns: Failed login attempts, unusual access patterns
  - Malicious patterns: Attack signatures, exploitation attempts
- Web Attack Detection:
  - XSS detection: Identifies script tags, alert() functions, and JavaScript injection
  - SQL Injection: Detects SQL commands and syntax in unexpected contexts
  - Command Injection: Identifies shell commands and suspicious character sequences
  - Directory Traversal: Detects path manipulation attempts (../../../etc/passwd)
  - CSRF: Identifies cross-site request forgery patterns
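As an illustration of what such signatures might look like, the sketch below expresses four of the web attack categories above as regular expressions. Every pattern here is an assumption made for the example; the project's actual rule set is not shown in this README, and real signatures would need to be broader and better tested. CSRF is left out because a one-line signature for it would be a guess.

```python
import re

# Illustrative signatures only; not the project's real rules.
WEB_ATTACK_PATTERNS = {
    "xss": re.compile(r"<\s*script\b|alert\s*\(|javascript:", re.IGNORECASE),
    "sql_injection": re.compile(
        r"\bunion\s+select\b|\bor\s+1\s*=\s*1\b|;\s*drop\s+table\b", re.IGNORECASE
    ),
    "command_injection": re.compile(
        r"[;&|]\s*(?:cat|ls|rm|wget|curl|nc)\b", re.IGNORECASE
    ),
    "directory_traversal": re.compile(r"(?:\.\./){2,}|\.\.\\"),
}


def detect_web_attacks(log):
    """Return the name of every category whose signature appears in the log."""
    return [
        name
        for name, pattern in WEB_ATTACK_PATTERNS.items()
        if pattern.search(log)
    ]


# Hypothetical log lines.
print(detect_web_attacks("GET /download?file=../../../etc/passwd"))
print(detect_web_attacks("user login successful"))
```

Patterns this loose will produce false positives (for example, `&ls=` in an ordinary query string), which is one reason to combine rules with a model rather than rely on them alone.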
Inference system workflow diagram:
- Input Log Entry
- Preprocessing: Truncation (if > 1000 chars) and Normalization
- Pattern Matching: Benign Patterns, Suspicious Patterns, Malicious Patterns
  - Pattern Found -> Return Classification Based on Pattern
  - No Strong Match -> BERT Model Analysis -> Return Classification Based on Model Prediction
Data flow diagram:
Log Sources (CSV/Generated) -> Preprocessing Pipeline -> Feature Extraction -> Model Training & Evaluation -> Inference Engine -> Prediction Results -> API Response Generation -> UI Visualization
This hybrid approach pairs fast, precise rules for known threats with a model that can still flag novel attacks.
- Python 3.12+
- PyTorch 2.0+
- Docker (optional, for containerized deployment)
- Clone the repository:
  git clone https://github.com/arifazim/CyberGuardAI.git
  cd CyberGuardAI
- Create a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies (the quotes keep the shell from treating >= as a redirect):
  pip install --upgrade pip "setuptools>=68.0.0"
  pip install -r requirements.txt
  pip install -e .
Note: For Python 3.12 compatibility, ensure you're using setuptools>=68.0.0 and PyYAML>=6.0.1.
Before training the model, you need to preprocess your log data:
python scripts/preprocess_data.py
This script will:
- Create sample data if no input data exists
- Clean and normalize log text
- Split data into training and validation sets
- Convert labels to numerical format
- Save processed data to the configured output path
The preprocessing steps include:
- Loading raw log data from CSV or generating sample data if none exists
- Cleaning text by removing special characters and normalizing whitespace
- Encoding labels (benign=0, suspicious=1, malicious=2)
- Balancing the dataset to ensure equal representation of classes
- Saving the processed data for training
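The cleaning, encoding, and balancing steps above can be sketched as follows. This is a standard-library approximation, not the contents of `scripts/preprocess_data.py`: the character allow-list in `clean_log` and the choice to balance by downsampling to the smallest class are both assumptions. Only the label mapping comes from the list above.

```python
import random
import re
from collections import defaultdict

# Mapping stated in the preprocessing steps.
LABEL_TO_ID = {"benign": 0, "suspicious": 1, "malicious": 2}


def clean_log(text):
    """Drop characters outside a small allow-list, then collapse whitespace."""
    # Assumed allow-list. Attack payloads rely on punctuation, so a real
    # pipeline may need to keep more characters than this.
    text = re.sub(r"[^\w\s./:=<>()'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def balance(rows, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def preprocess(rows):
    """rows: iterable of (raw_text, label_name) pairs."""
    encoded = [(clean_log(text), LABEL_TO_ID[label]) for text, label in rows]
    return balance(encoded)
```

Downsampling discards data from the larger classes; oversampling or class weights are alternatives when the smallest class is very small.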
To train the model on your preprocessed data:
python src/train.py
The training process will:
- Load preprocessed data from the configured path
- Initialize the BERT tokenizer and model
- Tokenize logs using BERT tokenizer with padding and truncation
- Split data into training and validation sets
- Train the model for the configured number of epochs (default: 5)
- Save the trained model to the configured path (data_processing/models/CybergGuard_model)
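To show what "padding and truncation" means for the tokenization step above, here is a small sketch that works on lists of token ids. It is a simplification: a real BERT tokenizer also handles special tokens and produces the ids itself, and the project presumably delegates all of this to the tokenizer. The function names are hypothetical; the pad id of 0 matches the `[PAD]` token in the `bert-base-uncased` vocabulary.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation drops the overflow
    padding = [pad_id] * (max_length - len(kept))   # padding fills the remainder
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


def batch_encode(sequences, max_length):
    """Apply pad_or_truncate to every sequence so the batch is rectangular."""
    return [pad_or_truncate(seq, max_length) for seq in sequences]


# Hypothetical token ids, with a tiny max_length for readability.
print(batch_encode([[5, 6, 7], [1, 2, 3, 4, 5, 6]], max_length=4))
```

With the configured `max_length` of 512, very long log entries lose everything past the limit, so it matters which end of the log is kept.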
Training parameters are configurable in config/config.yaml:
- Batch size: Controls memory usage and training speed
- Learning rate: Affects how quickly the model learns
- Number of epochs: Controls how many times the model sees the entire dataset
- Model name: The base BERT model to use (default: bert-base-uncased)
The model is trained using AdamW optimizer with a learning rate scheduler to improve convergence.
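The README does not say which scheduler is used. As one common possibility when fine-tuning BERT, the sketch below computes a learning rate with linear warmup followed by linear decay. The function name, the warmup length, and the schedule shape are all assumptions; only the base rate of 2e-5 comes from the configuration.

```python
def linear_schedule(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over warmup_steps, then falls
    linearly back to 0 by total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 100 optimizer steps with 10 warmup steps.
for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule(step, total_steps=100, warmup_steps=10))
```

Warmup avoids large updates to the pre-trained weights in the first steps, and the decay lets training settle as it nears the end.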
To start the API server locally:
python -m src.api
The API will be available at http://localhost:8000.
The API provides endpoints for:
- Predicting the classification of log entries
- Health check to verify the API is running
The API includes robust error handling for:
- Invalid JSON format
- Missing 'logs' field in request
- Empty log lists
- Internal server errors
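The first three checks could be ordered roughly as in the sketch below, which validates a raw request body with the standard library. The `detail` strings match the documented error responses; the function name and the HTTP status codes are assumptions, and the real API may validate requests differently.

```python
import json


def validate_predict_body(raw_body):
    """Return (status_code, payload) for a raw request body.

    Status codes are assumed; only the detail messages are documented.
    """
    try:
        body = json.loads(raw_body)
    except (ValueError, TypeError):  # malformed JSON or a non-text body
        return 400, {"detail": "Invalid JSON format"}
    if not isinstance(body, dict) or "logs" not in body:
        return 400, {"detail": "Missing 'logs' field in request"}
    if not body["logs"]:
        return 400, {"detail": "Log list cannot be empty"}
    return 200, {"logs": body["logs"]}


# Hypothetical request bodies.
print(validate_predict_body('{"logs": []}'))
print(validate_predict_body('{"logs": ["user login successful"]}'))
```

Checking in this order means each request gets the most specific message that applies to it.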
CyberGuardAI includes a web-based UI for demonstrating the log analysis capabilities:
- Install UI dependencies:
  pip install -r ui/requirements.txt
- Start the UI server (make sure the API is running first):
  python ui/app.py
- Open your browser and navigate to http://localhost:5001
The UI provides:
- A clean interface for entering log entries
- Side-by-side display of logs and their predictions
- Sample logs for quick demonstration, including:
- Benign logs (successful logins, updates, backups)
- Suspicious logs (failed login attempts, unusual access patterns)
- Malicious logs (XSS attacks, SQL injection, command injection)
- Statistics dashboard showing counts by classification category
- API status indicator to show connection status
- Responsive design for both desktop and mobile devices
The UI is implemented as a Flask application that serves as a proxy to the CyberGuardAI API, helping to avoid CORS issues.
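The counts behind a statistics dashboard like the one listed above can be derived directly from the list of predictions. A minimal sketch, with hypothetical names, that reports zero for categories that did not occur:

```python
from collections import Counter

CATEGORIES = ("benign", "suspicious", "malicious")


def summarize(predictions):
    """Count predictions per category, including categories with no hits."""
    counts = Counter(predictions)
    return {category: counts.get(category, 0) for category in CATEGORIES}


# Hypothetical predictions for three log entries.
print(summarize(["benign", "suspicious", "benign"]))
```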
- Build the Docker image:
  docker build -t cyberguardai:latest .
- Run the container:
  docker run -p 8000:8000 cyberguardai:latest
The API will be available at http://localhost:8000.
The Dockerfile:
- Uses a Python base image
- Installs all dependencies
- Installs the project as a package
- Exposes port 8000
- Sets the entry point to run the API
Classifies log entries as benign, suspicious, or malicious.
Request:
{
"logs": ["user login successful", "failed login attempt from 192.168.1.100"]
}
Response:
{
"predictions": ["benign", "suspicious"]
}
Error Responses:
- Missing logs field:
  { "detail": "Missing 'logs' field in request" }
- Empty logs list:
  { "detail": "Log list cannot be empty" }
- Invalid JSON:
  { "detail": "Invalid JSON format" }
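A client for this endpoint might look like the following sketch, which uses only the standard library. The request and response shapes follow the examples above. The `/predict` path is a placeholder assumption, since the route is not named here; check `src/api.py` for the real one. The helper names are hypothetical as well.

```python
import json
import urllib.request


def build_predict_request(logs, base_url="http://localhost:8000", path="/predict"):
    """Build a POST request carrying {"logs": [...]} as JSON.

    The default path is an assumed placeholder, not a documented route.
    """
    payload = json.dumps({"logs": list(logs)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(logs, **kwargs):
    """Send the request and return the "predictions" list from the response."""
    request = build_predict_request(logs, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["predictions"]
```

With the API running locally, `classify(["user login successful"])` would return a list of labels. For error responses, `urlopen` raises `urllib.error.HTTPError`, whose body carries the `detail` message.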
CyberGuardAI analyzes logs through a multi-step process:
- Pattern Recognition:
  - Benign patterns: Successful operations, routine activities
  - Suspicious patterns: Failed login attempts, unusual access patterns
  - Malicious patterns: Attack signatures, exploitation attempts
- Web Attack Detection:
  - XSS detection: Identifies script tags, alert() functions, and JavaScript injection
  - SQL Injection: Detects SQL commands and syntax in unexpected contexts
  - Command Injection: Identifies shell commands and suspicious character sequences
  - Directory Traversal: Detects path manipulation attempts (../../../etc/passwd)
  - CSRF: Identifies cross-site request forgery patterns
- Special Case Handling:
  - Long logs are truncated to the last 1000 characters
  - Suspicious pattern checking occurs before malicious pattern checking
  - Case-insensitive matching is used for attack signatures
- Machine Learning Analysis:
  - BERT model analyzes the semantic meaning of log entries
  - Contextual understanding helps identify novel or complex threats
  - Confidence scores determine final classification
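Put together, the steps above suggest a control flow like the sketch below: truncate, try the rules, and fall back to the model only when no rule fires. The pattern lists are placeholders and `model_predict` is a hypothetical callable standing in for the BERT model; neither comes from `src/inference.py`. Checking benign patterns last is an assumption, since only the suspicious-before-malicious order is stated.

```python
import re

MAX_CHARS = 1000  # long logs keep only their last 1000 characters

# Placeholder signatures; the real rule set is not shown in this README.
SUSPICIOUS = [r"failed login", r"unusual access"]
MALICIOUS = [r"<script", r"union\s+select", r"\.\./\.\./"]
BENIGN = [r"login successful", r"backup completed"]


def match_any(patterns, text):
    """True if any pattern occurs in the text, ignoring case."""
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def classify_log(log, model_predict):
    """Apply the rules first; defer to the model when no rule matches."""
    text = log[-MAX_CHARS:]
    # Suspicious is checked before malicious, as described above.
    ordered_rules = (
        ("suspicious", SUSPICIOUS),
        ("malicious", MALICIOUS),
        ("benign", BENIGN),
    )
    for label, patterns in ordered_rules:
        if match_any(patterns, text):
            return label
    return model_predict(text)


# A stub in place of the model, for demonstration only.
print(classify_log("kernel: eth0 link up", lambda text: "benign"))
```

One consequence of this ordering is that a log matching both a suspicious and a malicious pattern is labeled suspicious. Keeping only the last 1000 characters also means a signature near the start of a very long log is never seen by the rules.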
The system is configured using a YAML file located at config/config.yaml. Key configuration options include:
model:
  name: "bert-base-uncased"
  max_length: 512
  num_labels: 3  # Benign, Suspicious, Malicious
training:
  batch_size: 16
  epochs: 5
  learning_rate: 2e-5
data:
  input_path: "data_processing/raw/logs.csv"
  processed_path: "data_processing/processed/processed_logs.csv"
  model_path: "data_processing/models/CybergGuard_model"
api:
  host: "0.0.0.0"
  port: 8000
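After the file is loaded into a dictionary, a quick structural check can catch missing or mistyped values before training starts. The sketch below is not part of the project. It assumes the four sections nest as `model`, `training`, `data`, and `api`, with `model_path` under `data`, and it expects a dict that has already been parsed (for example by a YAML loader).

```python
# Expected sections, keys, and types, taken from the options listed above.
REQUIRED = {
    "model": {"name": str, "max_length": int, "num_labels": int},
    "training": {"batch_size": int, "epochs": int, "learning_rate": float},
    "data": {"input_path": str, "processed_path": str, "model_path": str},
    "api": {"host": str, "port": int},
}


def validate_config(cfg):
    """Return a list of problems found in an already-parsed config dict."""
    problems = []
    for section, fields in REQUIRED.items():
        section_cfg = cfg.get(section)
        if not isinstance(section_cfg, dict):
            problems.append(f"missing section: {section}")
            continue
        for key, expected in fields.items():
            if not isinstance(section_cfg.get(key), expected):
                problems.append(f"{section}.{key} should be {expected.__name__}")
    return problems
```

A check like this is worth having for `learning_rate` in particular: some YAML loaders, PyYAML among them, read a value written as `2e-5` (no decimal point) as a string rather than a float, so it may need an explicit `float(...)` conversion or to be written as `2.0e-5`.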
CyberGuardAI/
├── config/
│   └── config.yaml              # Configuration file
├── data_processing/
│   ├── models/                  # Trained model files
│   ├── processed/               # Processed data
│   └── raw/                     # Raw input data
├── scripts/
│   ├── preprocess_data.py       # Data preprocessing script
│   └── generate_data.py         # Sample data generation
├── src/
│   ├── __init__.py
│   ├── api.py                   # FastAPI implementation
│   ├── data_processing.py       # Data processing utilities
│   ├── inference.py             # Inference logic
│   ├── model.py                 # Model definition
│   └── train.py                 # Training script
├── tests/
│   ├── test_data_processing.py
│   └── test_inference.py
├── ui/
│   ├── static/                  # UI static assets
│   ├── templates/               # UI HTML templates
│   ├── app.py                   # UI server
│   └── requirements.txt         # UI dependencies
├── Dockerfile                   # Docker configuration
├── requirements.txt             # Python dependencies
├── setup.py                     # Package setup
└── README.md                    # This file
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.