A data mining project for detecting anomalies in SNCB's AR41 diesel train cooling systems to enhance operational efficiency and ensure safety.

isakovaad/Data-Mining-SNCB

SNCB AR41 Anomaly Detection

A comprehensive data mining project for detecting anomalies in SNCB's AR41 diesel train cooling systems to enhance operational efficiency and ensure safety.

Project Overview

This project analyzes sensor data from SNCB's AR41 diesel trains to identify and predict anomalies in their cooling systems. The goal is to distinguish between normal operational patterns, sensor noise, and critical system failures that could indicate potential safety issues.

Team Members

  • Muhammad Qasim Khan - 000583564
  • Hareem Raza - 000583268
  • Dilbar Isakova - 000583178
  • Aryan Gupta - 000585712

Course: INFOH423 - Data Mining

Project Goals

  • Identify Anomalies: Detect sensor errors and differentiate them from noise
  • Predict Issues: Forecast potential system failures before they occur
  • Categorize Problems: Distinguish between:
    • Noise-related anomalies
    • Single-engine deviations (Potential Risk)
    • Dual-engine issues (Potential Hazard)

Project Structure

├── altitude.ipynb                # Elevation data processing
├── convert_to_parquet.py         # Data format conversion utility
├── elevation.ipynb               # Elevation analysis
├── group_by.ipynb                # Data grouping operations
├── step1-EDA.ipynb               # Exploratory Data Analysis
├── step2-data-preparation.ipynb  # Data cleaning and preparation
├── train-speed.ipynb             # Speed calculation and analysis
└── README.md                     # Project documentation

Data Sources & Integration

The project integrates multiple data sources to provide comprehensive analysis:

1. Train Sensor Data

  • PC1 & PC2 Engine Data: RPM, Oil Pressure, Air Temperature, Water Temperature, Oil Temperature
  • GPS Data: Latitude, longitude, timestamps
  • Operational Data: Speed calculations, journey identification

2. External Data Sources

  • Weather Data: OpenWeatherMap API for temperature, humidity, weather conditions
  • Elevation Data: Open-Elevation API for altitude information
  • Station Data: Belgian train stations from iRail/stations repository
  • Workshop Data: SNCB service station locations

Data Processing Pipeline

Data Cleaning & Preparation

  1. Timestamp Conversion: UTC to Belgian time with DST handling
  2. NULL Value Handling:
    • Isolated nulls: Linear interpolation
    • Consecutive nulls: Treated as train stops or maintenance periods
  3. Outlier Removal: GPS errors, temperature overflows, erroneous RPM values
  4. Journey Identification: Automatic segmentation of train trips
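
The first two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebooks' actual code: the function names are hypothetical, and "isolated" is assumed to mean a null with valid neighbours on both sides.

```python
import numpy as np
import pandas as pd

def to_belgian_time(utc_series: pd.Series) -> pd.Series:
    """Parse UTC timestamps and convert them to Europe/Brussels (DST-aware)."""
    return pd.to_datetime(utc_series, utc=True).dt.tz_convert("Europe/Brussels")

def fill_isolated_nulls(values: pd.Series) -> pd.Series:
    """Linearly interpolate single nulls; leave longer null runs untouched."""
    is_null = values.isna()
    # Give every run of consecutive null / non-null samples its own id.
    run_id = is_null.astype(int).diff().ne(0).cumsum()
    run_len = run_id.map(run_id.value_counts())
    isolated = is_null & (run_len == 1)
    interpolated = values.interpolate(method="linear", limit_area="inside")
    # Keep original values everywhere except at isolated nulls.
    return values.where(~isolated, interpolated)
```

Calling `interpolate(limit=1)` alone would also fill the first sample of a long gap, which is why the run length is checked explicitly here.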

Feature Engineering

  • Speed Calculation: Haversine formula for GPS-based speed
  • Rolling Averages: Temporal features for pattern recognition
  • Threshold Monitoring: Temperature and pressure limit tracking
  • Anomaly Labeling: Potential Risk/Hazard classification
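
As an illustration of the GPS-based speed calculation, the sketch below applies the Haversine formula to two consecutive fixes. The function names and the use of timestamps in seconds are assumptions made for this example; the project's own implementation lives in train-speed.ipynb and may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def speed_kmh(lat1, lon1, t1_s, lat2, lon2, t2_s):
    """Average speed between two GPS fixes; timestamps are in seconds."""
    elapsed_h = (t2_s - t1_s) / 3600.0
    if elapsed_h <= 0:
        return float("nan")  # duplicate or out-of-order timestamps
    return haversine_km(lat1, lon1, lat2, lon2) / elapsed_h
```

Speeds computed this way are sensitive to GPS jitter over short intervals, so implausibly high values are better treated as sensor errors than as real motion.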

Exploratory Data Analysis Findings

Key Observations

  • Temperature Overflows: Sensor values of ~33,000 and ~66,000 during cold weather
  • GPS Sensor Issues: Erratic speed readings exceeding the 120 km/h maximum
  • Engine Correlations: Strong correlation between RPM and oil pressure
  • Operational Patterns: Clear day/night usage patterns with maintenance windows

Anomaly Categories

  1. Sensor Errors: Temperature overflows, GPS inaccuracies
  2. Potential Risks: Single engine approaching threshold values
  3. Potential Hazards: Both engines exceeding safe operating limits
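
A rule-based labeling of the last two categories could look like the sketch below. It uses the temperature thresholds listed under Technical Insights (air 65°C, water 100°C, oil 115°C) and simplifies "approaching" to "exceeding"; the dictionary keys and function names are hypothetical.

```python
# Thresholds in degrees Celsius, taken from the Technical Insights section.
THRESHOLDS_C = {"air": 65.0, "water": 100.0, "oil": 115.0}

def engine_over_limit(temps):
    """True if any temperature of one engine exceeds its threshold."""
    return any(temps[name] > limit for name, limit in THRESHOLDS_C.items())

def label_reading(pc1_temps, pc2_temps):
    """Label one timestamp from the temperatures of both engines."""
    engines_over = engine_over_limit(pc1_temps) + engine_over_limit(pc2_temps)
    return ("Normal", "Potential Risk", "Potential Hazard")[engines_over]

ok = {"air": 40.0, "water": 85.0, "oil": 95.0}
hot = {"air": 40.0, "water": 104.0, "oil": 95.0}
print(label_reading(ok, ok))    # Normal
print(label_reading(hot, ok))   # Potential Risk
print(label_reading(hot, hot))  # Potential Hazard
```

Sensor errors such as overflow values would need to be filtered out before this step, otherwise they would be labeled as hazards.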

Machine Learning Models

1. Isolation Forest

  • Purpose: Unsupervised anomaly detection
  • Performance: Silhouette Score: 0.54
  • Strengths: Effective at identifying outliers in high-dimensional data
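
For orientation, a minimal scikit-learn sketch of this kind of detector is shown below. The data is synthetic and the hyperparameters (`n_estimators`, `contamination`) are placeholder assumptions, not the project's tuned values. The last line shows one possible way to obtain a silhouette score from the inlier/anomaly split.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for scaled sensor features (not project data).
normal = rng.normal(0.0, 1.0, size=(300, 4))
outliers = rng.normal(8.0, 0.5, size=(3, 4))
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)        # 1 = inlier, -1 = anomaly
scores = model.decision_function(X)  # lower values are more anomalous

# Silhouette score of the two groups (inliers vs. flagged anomalies).
sil = silhouette_score(X, labels)
```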

2. One-Class SVM (RBF Kernel)

  • Purpose: Boundary-based anomaly detection
  • Performance: Silhouette Score: 0.43
  • Strengths: Good at defining normal operation boundaries
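
The sketch below shows how a boundary of normal operation might be learned with scikit-learn's `OneClassSVM`. The two synthetic columns stand in for water and oil temperature, and `gamma="scale"` and `nu=0.05` are illustrative assumptions rather than the tuned parameters.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic "normal operation" readings: [water temp, oil temp] in Celsius.
train = rng.normal(loc=[80.0, 90.0], scale=[3.0, 4.0], size=(500, 2))

# Scaling matters for the RBF kernel, so it is part of the pipeline.
detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
detector.fit(train)

new_readings = np.array([[80.0, 90.0], [110.0, 125.0]])
pred = detector.predict(new_readings)  # 1 = inside boundary, -1 = outside
```

Here `nu` is an upper bound on the fraction of training points treated as outliers, so it plays a role similar to the contamination level of an Isolation Forest.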

3. VAR (Vector Autoregression) Model

  • Purpose: Time series forecasting and anomaly prediction
  • Features: Rolling forecast with real-time adaptation
  • Methodology: Residual-based anomaly detection using statistical thresholds
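
To make the residual-based idea concrete, the sketch below fits a first-order VAR by ordinary least squares with plain NumPy and flags samples whose one-step-ahead forecast error is large. This is a simplified stand-in: the project's notebooks presumably use statsmodels with automatic lag selection, and the lag order of 1, the function names, and the z-score threshold of 3 are assumptions made for this example.

```python
import numpy as np

def fit_var1(train):
    """Least-squares fit of y_t = c + A @ y_{t-1} + e_t.

    Returns the intercept c, the coefficient matrix A, and the
    per-variable standard deviation of the training residuals.
    """
    X = np.hstack([np.ones((len(train) - 1, 1)), train[:-1]])
    Y = train[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    return coef[0], coef[1:].T, resid.std(axis=0, ddof=1)

def flag_anomalies(series, c, A, resid_std, z_threshold=3.0):
    """Indices of samples whose one-step-ahead residual exceeds the threshold."""
    forecasts = c + series[:-1] @ A.T
    z = np.abs(series[1:] - forecasts) / resid_std
    # +1 because the first forecast targets the sample at index 1.
    return np.flatnonzero((z > z_threshold).any(axis=1)) + 1
```

Each forecast uses the actual previous observation, which mimics a rolling forecast; the coefficients stay fixed here, whereas a real-time adaptive version would refit them periodically.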

Performance Evaluation

Metrics Used

  • Silhouette Scores: Cluster validation
  • Anomaly Scores: Deviation measurement
  • Ground Truth Comparison: Validation against known anomalies
  • Statistical Tests: Durbin-Watson test for residual analysis
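
The Durbin-Watson statistic mentioned above is simple to compute by hand, which helps when interpreting it. The sketch below is a NumPy version of the standard formula; statsmodels also ships an implementation as `statsmodels.stats.stattools.durbin_watson`.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares.

    Values near 2 suggest no first-order autocorrelation, values toward 0
    positive autocorrelation, and values toward 4 negative autocorrelation.
    """
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))
```

For forecast residuals, a value far from 2 indicates that the model has left temporal structure unexplained, for example because the lag order is too low.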

Results

  • Potential Hazard Detection: 86.2% (Isolation Forest), 91.4% (SVM)
  • Potential Risk Detection: 69.5% (Isolation Forest), 72.53% (SVM)

Usage Instructions

Prerequisites

pip install pandas numpy matplotlib seaborn scikit-learn statsmodels

Running the Analysis

  1. Exploratory Data Analysis:

    jupyter notebook step1-EDA.ipynb

  2. Data Preparation:

    jupyter notebook step2-data-preparation.ipynb

  3. Feature Engineering:

    jupyter notebook train-speed.ipynb
    jupyter notebook elevation.ipynb
    jupyter notebook altitude.ipynb

Data Format Conversion

python convert_to_parquet.py

Key Findings & Recommendations

For SNCB Operations:

  1. Sensor Maintenance: Implement designated overflow signals and investigate the root cause of the temperature sensor overflows
  2. Early Warning Systems: Deploy predictive alerts for high-temperature scenarios
  3. GPS System Updates: Inspect and upgrade GPS systems for improved accuracy
  4. Calibration Programs: Regular RPM sensor calibration and maintenance

Technical Insights:

  • Temperature Thresholds: Air (65°C), Water (100°C), Oil (115°C)
  • Critical Patterns: Both engines exceeding thresholds simultaneously
  • Seasonal Effects: Higher anomaly rates during winter months
  • Operational Windows: Clear distinction between service and maintenance periods

Technical Specifications

Data Processing:

  • Sampling Rate: 30-second intervals for time series analysis
  • Missing Data: <5% after cleaning procedures
  • Journey Segmentation: Automatic based on speed and time thresholds
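
One possible reading of speed- and time-based segmentation is sketched below: a new journey starts after a long gap between samples, or when the train moves again after a long stop. The function name and all three threshold values are hypothetical and would need to be replaced by the ones used in the notebooks.

```python
def segment_journeys(timestamps_s, speeds_kmh,
                     max_gap_s=1800.0, stop_speed_kmh=1.0, min_stop_s=600.0):
    """Return one journey id per sample (timestamps in seconds, ascending)."""
    ids = []
    journey = 0
    stop_start = None  # time at which the current stop began, if any
    prev_t = None
    for t, v in zip(timestamps_s, speeds_kmh):
        if prev_t is not None and t - prev_t > max_gap_s:
            journey += 1       # long gap in the data: start a new journey
            stop_start = None
        if v <= stop_speed_kmh:
            if stop_start is None:
                stop_start = t
        else:
            if stop_start is not None and t - stop_start >= min_stop_s:
                journey += 1   # moving again after a long stop
            stop_start = None
        ids.append(journey)
        prev_t = t
    return ids

print(segment_journeys([0, 30, 60, 90, 690, 720], [50, 60, 0, 0, 0, 40]))
# [0, 0, 0, 0, 0, 1]
```

Samples recorded during a stop stay attached to the preceding journey in this sketch; dropping them or assigning them to a separate "idle" state is an equally reasonable choice.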

Model Parameters:

  • Isolation Forest: Optimized contamination levels and tree counts
  • SVM: RBF kernel with tuned gamma and nu parameters
  • VAR Model: Automatic lag selection using AIC/BIC criteria

Dashboard & Visualization

The project includes an interactive dashboard for real-time monitoring and historical analysis of train anomalies.

Contributing

This project was developed as part of an academic course. For questions or collaboration opportunities, please contact the team members listed above.

License

This project is developed for educational purposes as part of the INFOH423 Data Mining course.

Acknowledgments

  • SNCB for providing the train sensor data
  • OpenWeatherMap for weather data API
  • Open-Elevation for elevation data
  • iRail Community for Belgian train station data
