A comprehensive data mining project for detecting anomalies in SNCB's AR41 diesel train cooling systems to enhance operational efficiency and ensure safety.
This project analyzes sensor data from SNCB's AR41 diesel trains to identify and predict anomalies in their cooling systems. The goal is to distinguish between normal operational patterns, sensor noise, and critical system failures that could indicate potential safety issues.
- Muhammad Qasim Khan - 000583564
- Hareem Raza - 000583268
- Dilbar Isakova - 000583178
- Aryan Gupta - 000585712
Course: INFOH423 - Data Mining
- Identify Anomalies: Detect sensor errors and differentiate them from noise
- Predict Issues: Forecast potential system failures before they occur
- Categorize Problems: Distinguish between:
  - Noise-related anomalies
  - Single-engine deviations (Potential Risk)
  - Dual-engine issues (Potential Hazard)
├── altitude.ipynb # Elevation data processing
├── convert_to_parquet.py # Data format conversion utility
├── elevation.ipynb # Elevation analysis
├── group_by.ipynb # Data grouping operations
├── step1-EDA.ipynb # Exploratory Data Analysis
├── step2-data-preparation.ipynb # Data cleaning and preparation
├── train-speed.ipynb # Speed calculation and analysis
└── README.md # Project documentation
The project integrates multiple data sources to provide comprehensive analysis:
- PC1 & PC2 Engine Data: RPM, Oil Pressure, Air Temperature, Water Temperature, Oil Temperature
- GPS Data: Latitude, longitude, timestamps
- Operational Data: Speed calculations, journey identification
- Weather Data: OpenWeatherMap API for temperature, humidity, weather conditions
- Elevation Data: Open-Elevation API for altitude information
- Station Data: Belgian train stations from iRail/stations repository
- Workshop Data: SNCB service station locations
- Timestamp Conversion: UTC to Belgian time with DST handling
- NULL Value Handling:
  - Isolated nulls: Linear interpolation
  - Continuous nulls: Identified as train stops/maintenance
- Outlier Removal: GPS errors, temperature overflows, erroneous RPM values
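The timestamp and null-handling steps above can be sketched with pandas. This is a minimal illustration, not the notebooks' actual code: the function names are hypothetical, and "isolated" is assumed here to mean a run of exactly one missing value. Longer runs are left as NaN so they can still be treated as stops or maintenance.

```python
import pandas as pd


def to_belgian_time(ts_utc: pd.Series) -> pd.Series:
    """Parse UTC timestamps and convert them to Belgian local time.

    The "Europe/Brussels" zone applies the CET/CEST (DST) switch automatically.
    """
    return pd.to_datetime(ts_utc, utc=True).dt.tz_convert("Europe/Brussels")


def interpolate_isolated_nulls(s: pd.Series) -> pd.Series:
    """Linearly interpolate only nulls that form a run of length one."""
    is_null = s.isna()
    # Give every run of consecutive equal null/non-null flags its own id.
    run_id = is_null.ne(is_null.shift()).cumsum()
    # Count the nulls in each run (non-null runs count zero).
    run_len = is_null.astype(int).groupby(run_id).transform("sum")
    isolated = is_null & (run_len == 1)
    interpolated = s.interpolate(method="linear", limit_area="inside")
    # Keep original values everywhere except at isolated nulls.
    return s.where(~isolated, interpolated)
```

Calling `interpolate_isolated_nulls` on each sensor column fills single dropouts while preserving the longer gaps for later inspection.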
- Journey Identification: Automatic segmentation of train trips
- Speed Calculation: Haversine formula for GPS-based speed
- Rolling Averages: Temporal features for pattern recognition
- Threshold Monitoring: Temperature and pressure limit tracking
- Anomaly Labeling: Potential Risk/Hazard classification
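The speed and rolling-average features above can be sketched as follows. The column names (`timestamp`, `lat`, `lon`, `speed_kmh`), the window size, and the Earth radius constant are assumptions for illustration and may differ from what `train-speed.ipynb` uses.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_speed(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Add GPS-based speed (km/h) and its rolling mean to a single train's data."""
    out = df.sort_values("timestamp").copy()
    # Distance from the previous GPS fix to the current one.
    dist_km = haversine_km(out["lat"].shift(), out["lon"].shift(),
                           out["lat"], out["lon"])
    hours = out["timestamp"].diff().dt.total_seconds() / 3600.0
    # Zero or negative time steps yield NaN instead of an infinite speed.
    out["speed_kmh"] = dist_km / hours.where(hours > 0)
    out["speed_kmh_roll"] = out["speed_kmh"].rolling(window, min_periods=1).mean()
    return out
```

The first row of each train has no previous fix, so its speed is NaN by construction.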
- Temperature Overflows: Sensor values of ~33,000 and ~66,000 during cold weather
- GPS Sensor Issues: Erratic speed readings exceeding the 120 km/h maximum
- Engine Correlations: Strong correlation between RPM and oil pressure
- Operational Patterns: Clear day/night usage patterns with maintenance windows
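One way to act on the overflow and GPS findings above is to blank out physically implausible readings before modelling. The sketch below is an assumption-laden illustration: the column names and the 1000 °C plausibility ceiling are hypothetical, chosen only because they sit far below the overflow values of roughly 33,000 and 66,000. The 120 km/h limit is the maximum stated in this README.

```python
import pandas as pd


def mask_sensor_errors(df: pd.DataFrame, temp_cols: list,
                       speed_col: str = "speed_kmh",
                       max_temp_c: float = 1000.0,     # hypothetical ceiling
                       max_speed_kmh: float = 120.0) -> pd.DataFrame:
    """Return a copy where implausible temperatures and speeds are set to NaN."""
    out = df.copy()
    # where() keeps values that satisfy the condition and inserts NaN elsewhere.
    out[temp_cols] = out[temp_cols].where(out[temp_cols] <= max_temp_c)
    out[speed_col] = out[speed_col].where(out[speed_col] <= max_speed_kmh)
    return out
```

Masking rather than dropping rows keeps the time index regular, which matters for the time series steps later on.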
- Sensor Errors: Temperature overflows, GPS inaccuracies
- Potential Risks: Single engine approaching threshold values
- Potential Hazards: Both engines exceeding safe operating limits
- Purpose (Isolation Forest): Unsupervised anomaly detection
- Performance: Silhouette Score: 0.54
- Strengths: Effective at identifying outliers in high-dimensional data
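Assuming this first model is the Isolation Forest named elsewhere in this README, a minimal scikit-learn sketch might look like the following. The contamination level, tree count, and scaling step are illustrative assumptions, not the project's tuned settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def fit_isolation_forest(X, contamination=0.05, n_estimators=200, seed=0):
    """Fit an Isolation Forest and report labels, anomaly scores, silhouette."""
    X_scaled = StandardScaler().fit_transform(X)
    model = IsolationForest(n_estimators=n_estimators,
                            contamination=contamination,
                            random_state=seed)
    labels = model.fit_predict(X_scaled)      # +1 = normal, -1 = anomaly
    scores = -model.score_samples(X_scaled)   # higher = more anomalous
    # Silhouette treats the normal/anomaly split as a two-cluster partition.
    silhouette = silhouette_score(X_scaled, labels)
    return labels, scores, silhouette
```

The silhouette computation requires both labels to be present, which holds whenever the contamination setting flags at least one point and leaves at least one unflagged.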
- Purpose (SVM): Boundary-based anomaly detection
- Performance: Silhouette Score: 0.43
- Strengths: Good at defining normal operation boundaries
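A boundary-based detector of this kind can be sketched with scikit-learn's `OneClassSVM`. That the project uses this particular class is an assumption, inferred from the RBF kernel and the `nu` and `gamma` parameters mentioned elsewhere in this README. The parameter values below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def fit_one_class_svm(X_normal, X_new, nu=0.05, gamma="scale"):
    """Learn a boundary around normal operation and classify new samples."""
    scaler = StandardScaler().fit(X_normal)
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
    model.fit(scaler.transform(X_normal))
    # predict() returns +1 inside the learned boundary and -1 outside it.
    return model.predict(scaler.transform(X_new))
```

Here `nu` is an upper bound on the fraction of training points allowed to fall outside the boundary, so it plays a role similar to a contamination rate.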
- Purpose (VAR model): Time series forecasting and anomaly prediction
- Features: Rolling forecast with real-time adaptation
- Methodology: Residual-based anomaly detection using statistical thresholds
- Silhouette Scores: Cluster validation
- Anomaly Scores: Deviation measurement
- Ground Truth Comparison: Validation against known anomalies
- Statistical Tests: Durbin-Watson test for residual analysis
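For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. statsmodels provides `durbin_watson` in `statsmodels.stats.stattools`; the NumPy version below only spells out the formula.

```python
import numpy as np


def durbin_watson_stat(residuals) -> float:
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))
```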
- Potential Hazard Detection: 86.2% (Isolation Forest), 91.4% (SVM)
- Potential Risk Detection: 69.5% (Isolation Forest), 72.53% (SVM)
pip install pandas numpy matplotlib seaborn scikit-learn statsmodels
- Data Preparation:
  jupyter notebook step2-data-preparation.ipynb
- Exploratory Data Analysis:
  jupyter notebook step1-EDA.ipynb
- Feature Engineering:
  jupyter notebook train-speed.ipynb
  jupyter notebook elevation.ipynb
  jupyter notebook altitude.ipynb
  python convert_to_parquet.py
- Sensor Maintenance: Implement designated overflow signals and investigate the root cause of the temperature sensor overflows
- Early Warning Systems: Deploy predictive alerts for high-temperature scenarios
- GPS System Updates: Inspect and upgrade GPS systems for improved accuracy
- Calibration Programs: Regular RPM sensor calibration and maintenance
- Temperature Thresholds: Air (65°C), Water (100°C), Oil (115°C)
- Critical Patterns: Both engines exceeding thresholds simultaneously
- Seasonal Effects: Higher anomaly rates during winter months
- Operational Windows: Clear distinction between service and maintenance periods
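The thresholds and the single-engine versus dual-engine distinction above suggest a simple labeling rule, sketched below. The column naming scheme (`air_temp_pc1`, `oil_temp_pc2`, and so on) is hypothetical, and the rule treats an engine as over the limit only when a reading strictly exceeds its threshold. The project's actual labeling may also consider values that are merely approaching a threshold.

```python
import pandas as pd

# Limits in degrees Celsius, as listed in this README.
THRESHOLDS_C = {"air_temp": 65.0, "water_temp": 100.0, "oil_temp": 115.0}


def label_anomalies(df: pd.DataFrame, engines=("pc1", "pc2")) -> pd.Series:
    """Label each row Normal, Potential Risk (one engine), or Potential Hazard (both)."""
    engines_over = 0
    for engine in engines:
        # True where any of this engine's temperatures exceeds its limit.
        over = pd.concat(
            [df[f"{signal}_{engine}"] > limit
             for signal, limit in THRESHOLDS_C.items()],
            axis=1,
        ).any(axis=1)
        engines_over = engines_over + over.astype(int)
    return engines_over.map({0: "Normal", 1: "Potential Risk", 2: "Potential Hazard"})
```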
- Sampling Rate: 30-second intervals for time series analysis
- Missing Data: <5% after cleaning procedures
- Journey Segmentation: Automatic based on speed and time thresholds
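Journey segmentation from speed and time thresholds can be sketched as below. The specific rule is an assumption: records below a minimum speed are treated as stationary, and a new journey starts whenever consecutive moving records are separated by more than a maximum gap. The threshold values and column names are placeholders.

```python
import pandas as pd


def segment_journeys(df: pd.DataFrame,
                     min_speed_kmh: float = 1.0,
                     max_gap: pd.Timedelta = pd.Timedelta(minutes=30)) -> pd.DataFrame:
    """Keep moving records and assign an incrementing journey_id."""
    moving = df.loc[df["speed_kmh"] >= min_speed_kmh].sort_values("timestamp")
    # A gap longer than max_gap between moving records marks a new journey.
    new_journey = moving["timestamp"].diff() > max_gap
    return moving.assign(journey_id=new_journey.cumsum())
```

The function expects the records of a single train; with several trains in one table it can be applied per vehicle through a group-by.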
- Isolation Forest: Optimized contamination levels and tree counts
- SVM: RBF kernel with tuned gamma and nu parameters
- VAR Model: Automatic lag selection using AIC/BIC criteria
The project includes an interactive dashboard for real-time monitoring and historical analysis of train anomalies.
- Link: https://public.tableau.com/app/profile/dilbar.isakova/viz/project_17028532561000/Data-Mining
The dashboard provides:
- Geographic visualization of anomaly locations
- Time series plots with forecasting
- Statistical summaries and trend analysis
- Alert systems for critical conditions
This project was developed as part of an academic course. For questions or collaboration opportunities, please contact the team members listed above.
This project was developed for educational purposes as part of the INFOH423 Data Mining course.
- SNCB for providing the train sensor data
- OpenWeatherMap for weather data API
- Open-Elevation for elevation data
- iRail Community for Belgian train station data