This project develops and compares various machine learning models to predict potential machine failures based on sensor data. Leveraging a robust, automated pipeline, the solution aims to proactively identify equipment malfunctions, minimize costly downtime, and optimize maintenance schedules.
- Automated Data Pipeline: End-to-end process from data acquisition to model training and evaluation, managed by Python scripts and a `Makefile`.
- Comprehensive Feature Engineering: Includes handling categorical data, log transformations for skewed numerical features, and intelligent feature selection.
- Diverse Model Experimentation: Explores Random Forest, Logistic Regression, Gaussian Naive Bayes, Bernoulli Naive Bayes, and Categorical Naive Bayes classifiers.
- Hyperparameter Optimization: Utilizes `GridSearchCV` with `roc_auc` scoring to find optimal model configurations.
- Reproducible Results: All model training and output generation can be replicated with simple `make` commands.
- Actionable Insights: Provides key performance metrics, confusion matrices, and feature importance analysis to guide maintenance strategies.
The project utilizes the Machine Failure Prediction using Sensor Data dataset from Kaggle. This dataset contains various sensor readings and operational parameters, with a binary target variable indicating whether a machine failure occurred.
The repository is designed for clarity, modularity, and reproducibility:
```text
machine-failure-prediction/
├── data/                  # Stores the raw data (data.csv)
├── outputs/               # Contains generated model metrics, results, and intermediate files
│   └── model_X/           # Separate directory for each model's outputs
├── src/
│   ├── config.py          # Centralized constants (e.g., RANDOM_STATE)
│   ├── main.py            # Orchestrates model training and data generation
│   ├── model_1/           # Specific implementation for Model 1 (Random Forest)
│   │   ├── model_train.py # Defines the model and its GridSearchCV parameters
│   │   ├── pipeline.py    # Assembles the full preprocessing and model pipeline
│   │   ├── selection.py   # Handles data loading and train/test splitting
│   │   └── treatment.py   # Contains feature engineering and preprocessing steps
│   ├── model_2/           # Model 2 implementation (Random Forest, removed categories)
│   └── ... (model_3/ to model_9/)  # Implementations for other experimental models
├── .gitignore             # Specifies files/directories to be ignored by Git
├── create_all.bash        # Creates all models and fills the outputs folder (invoked via make all_models)
├── get_data.py            # Script to fetch the dataset from Kaggle
├── LICENSE                # Project license (GPL3)
├── Makefile               # Automation for running models and generating outputs
├── README.md              # This file
└── requirements.txt       # Python dependencies
```
This project follows a systematic approach to machine failure prediction:
- Data Acquisition: The `get_data.py` script automatically fetches the `data.csv` file from its Kaggle source and places it in the `data/` directory.
- Data Preprocessing & Feature Engineering:
  - `selection.py`: Loads the dataset, defines the features (`FIELDS`) and target (`TARGET`), and performs a stratified train-test split to ensure representative data distribution.
  - `treatment.py`: Applies crucial transformations:
    - Log Transformation: `footfall` data is transformed using a log function (`LogCpTransformer`) to normalize its distribution.
    - Categorical Handling: Integer-based categorical variables (`tempMode`, `AQ`, `USS`, `CS`, `VOC`, `IP`) are explicitly converted to object type, then one-hot encoded to create boolean features.
- Model Experimentation & Training (`model_train.py`): The project systematically explores various machine learning models:
  - Models 1, 2, 3: Utilize a Random Forest Classifier with `n_estimators` and `min_samples_leaf` tuned via `GridSearchCV` on the `roc_auc` score. Model 2 explored removing less relevant categories, and Model 3 tested omitting the `footfall` log transformation; neither significantly improved performance.
  - Models 4, 5, 6: Employ Logistic Regression, tuning `penalty` (`l1`, `l2`, `elasticnet`) and `max_iter`. Model 5 tested dropping the `footfall` log transformation, which was slightly detrimental. Model 6 applied a square root transformation to `RP` and removed less relevant categories.
  - Model 7: Uses Gaussian Naive Bayes.
  - Model 8: Employs Bernoulli Naive Bayes, tuning `alpha` and `binarize`, and specifically removes non-categorical data.
  - Model 9: Uses Categorical Naive Bayes, tuning `alpha`, also removing non-categorical data.
- Pipeline Orchestration (`pipeline.py`): Each model version defines its own `PIPELINE` using `sklearn.pipeline.Pipeline`, chaining together the preprocessing steps (`footfall` transformation, categorical conversion, one-hot encoding) with the chosen model and its `GridSearchCV` configuration (see the sketch after this list).
- Automated Execution (`main.py` & `Makefile`):
  - `main.py` is the central script that loads the specified model's pipeline, trains it, makes predictions, and generates all evaluation metrics and result files (e.g., `model_metrics.json`, `confusion_matrix.csv`, `feature_importance.json`, `real_proba_table.csv`) into the `outputs/` directory.
  - The `Makefile` automates the entire process:
    - `make model MODEL=X`: Runs and generates outputs for a specific model (e.g., `make model MODEL=1`).
    - `make all_models`: Iterates through all implemented models (1-9) and generates their respective outputs.
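For orientation, the sketch below condenses these steps into a single scikit-learn pipeline. It is a minimal illustration, not the project's verbatim code: the target column name, the parameter grid, and the step layout are assumptions, and the real implementations live in `src/model_X/`.

```python
# Minimal sketch of the workflow above (illustrative names and grids, not the project's exact code).
import pandas as pd
from feature_engine.transformation import LogCpTransformer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

RANDOM_STATE = 42                       # the project keeps this in src/config.py
TARGET = "fail"                         # illustrative target column name
CATEGORICAL = ["tempMode", "AQ", "USS", "CS", "VOC", "IP"]

# selection.py: load the data and make a stratified split so both sets keep the original failure rate.
df = pd.read_csv("data/data.csv")
X, y = df.drop(columns=[TARGET]), df[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# treatment.py: cast integer-coded categories to object, log-transform the skewed
# footfall column, then one-hot encode the categoricals.
to_object = FunctionTransformer(
    lambda d: d.astype({col: object for col in CATEGORICAL})
)
log_footfall = LogCpTransformer(variables=["footfall"], C="auto")
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
    remainder="passthrough",
)

# model_train.py: tune the classifier on ROC AUC with GridSearchCV (grid values are assumptions).
search = GridSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid={"n_estimators": [100, 300, 500], "min_samples_leaf": [1, 2, 5]},
    scoring="roc_auc",
    cv=5,
)

# pipeline.py: chain preprocessing and the tuned model into a single PIPELINE object.
PIPELINE = Pipeline([
    ("to_object", to_object),
    ("log_footfall", log_footfall),
    ("encode", encode),
    ("model", search),
])

PIPELINE.fit(X_train, y_train)          # main.py fits the pipeline and then evaluates it
```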
The experiments demonstrated that Model 1 (Random Forest) achieved the best overall performance, particularly in terms of predicting machine failures with high confidence.
| Metric    | Training Set | Test Set |
|-----------|--------------|----------|
| Accuracy  | 92.45%       | 92.59%   |
| AUC       | 0.976        | 0.975    |
| Recall    | 91.40%       | 93.67%   |
| Precision | 90.54%       | 89.16%   |
| F1-Score  | 90.97%       | 91.36%   |
The high test recall (93.67%) is particularly critical for a machine failure prediction system, as it indicates the model's strong ability to correctly identify actual failures, minimizing missed opportunities for proactive maintenance.
Test Confusion Matrix (Model 1):
|                | Predicted No Fail | Predicted Fail |
|----------------|-------------------|----------------|
| Actual No Fail | 101               | 9              |
| Actual Fail    | 5                 | 74             |
This confusion matrix visually confirms the model's effectiveness, showing a low number of false negatives (missed failures).
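Metrics and confusion matrices like those above can be reproduced with standard scikit-learn utilities. The snippet below is a sketch that reuses `PIPELINE`, `X_test`, and `y_test` from the workflow sketch earlier; the project's actual `main.py` may organize this differently.

```python
# Sketch of the evaluation step, reusing PIPELINE, X_test, y_test from the workflow sketch above.
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)

y_pred = PIPELINE.predict(X_test)
y_proba = PIPELINE.predict_proba(X_test)[:, 1]   # predicted probability of failure

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Rows are actual classes, columns are predicted classes (No Fail, Fail).
cm = confusion_matrix(y_test, y_pred)
```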
Key Feature Importance (Model 1):
The Random Forest model identified crucial sensor readings influencing failure predictions:
- `VOC` (Volatile Organic Compound): 54.3% importance
- `AQ` (Air Quality): 22.7% importance
- `USS` (Ultrasonic Sensor): 11.8% importance
- ...and other features.
These insights can direct engineering teams to focus on specific sensor data for early anomaly detection and root cause analysis.
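A sketch of how per-feature importances can be pulled from the fitted pipeline is shown below (again assuming the earlier workflow sketch; importances of one-hot columns can be summed per original sensor to obtain aggregate figures like those above).

```python
# Map encoded feature names to the tuned Random Forest's importances
# (assumes PIPELINE from the workflow sketch above; the output path follows the repo layout).
import json

best_rf = PIPELINE.named_steps["model"].best_estimator_        # estimator chosen by GridSearchCV
feature_names = PIPELINE.named_steps["encode"].get_feature_names_out()

importances = dict(zip(feature_names, best_rf.feature_importances_.tolist()))
with open("outputs/model_1/feature_importance.json", "w") as fh:
    json.dump(importances, fh, indent=2)
```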
Probability Threshold Analysis (Model 1):
The analysis of predicted probabilities (`real_proba_table.csv`, `cumulative_proportions.json`, `cumulative_findings.json`) allows for informed decision-making on operational thresholds. For instance, considering the top 50% of predictions (by probability of failure) allows us to capture 100% of actual failures. This provides a clear trade-off between the number of predicted failures and the coverage of real failures.
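The underlying sort-and-accumulate logic is straightforward. The sketch below assumes `real_proba_table.csv` holds one row per test example with a true-label column and a predicted-probability column; the actual column names may differ.

```python
# Sketch of the cumulative-capture analysis; column names "real" and "proba"
# are assumptions about real_proba_table.csv's layout.
import pandas as pd

table = pd.read_csv("outputs/model_1/real_proba_table.csv")
table = table.sort_values("proba", ascending=False).reset_index(drop=True)

# Cumulative share of all true failures captured as we walk down the probability ranking.
table["captured"] = table["real"].cumsum() / table["real"].sum()

# Coverage of real failures when acting only on the top 50% of predictions.
top_half = table.head(len(table) // 2)
print(f"Top 50% of predictions capture "
      f"{top_half['real'].sum() / table['real'].sum():.0%} of actual failures")
```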
While Model 1 (Random Forest) emerged as the top performer, the experimentation with other models provided valuable insights:
- Model 2 (Random Forest - removed categories): Showed a slight decrease in performance compared to Model 1, indicating that the initially included categories, even if less impactful individually, contribute positively to the overall model.
- Model 3 (Random Forest - no log footfall): Performance was almost identical to Model 1, suggesting that the log transformation of `footfall` did not significantly alter the Random Forest's predictive capability in this specific scenario.
- Models 4 & 5 (Logistic Regression): Performed robustly with test AUCs of 0.968 and 0.973 respectively, demonstrating good generalized linear model capabilities for this problem. Model 5's slight drop when `footfall` wasn't logged highlights the sensitivity of linear models to feature distributions.
- Model 6 (Logistic Regression - sqrt RP & removed categories): Achieved a very high test recall and precision (both 92.4%), making it a strong contender, potentially balancing missed failures with false alarms very well.
- Model 7 (Gaussian Naive Bayes): Showed a lower accuracy and AUC compared to tree-based and linear models (test AUC 0.954), but maintained a very high recall (93.67%), suggesting it is good at catching failures but at the cost of more false positives.
- Model 8 (Bernoulli Naive Bayes): Performed reasonably well (test AUC 0.969), showing good precision (0.90) and recall (0.91), especially considering its simpler nature and handling of binary data.
- Model 9 (Categorical Naive Bayes): Exhibited solid performance (test AUC 0.969), with balanced precision and recall, demonstrating its effectiveness when dealing solely with categorical features.
The table below summarizes test-set performance across all nine models:
| Model   | Accuracy | AUC    | Recall | Precision | F1-Score |
|---------|----------|--------|--------|-----------|----------|
| Model 1 | 0.9259   | 0.9755 | 0.9367 | 0.8916    | 0.9136   |
| Model 2 | 0.9206   | 0.9709 | 0.9367 | 0.8809    | 0.9080   |
| Model 3 | 0.9259   | 0.9754 | 0.9367 | 0.8916    | 0.9136   |
| Model 4 | 0.9206   | 0.9680 | 0.9241 | 0.8902    | 0.9068   |
| Model 5 | 0.9153   | 0.9733 | 0.9241 | 0.8795    | 0.9012   |
| Model 6 | 0.9365   | 0.9701 | 0.9241 | 0.9241    | 0.9241   |
| Model 7 | 0.8942   | 0.9545 | 0.9367 | 0.8315    | 0.8810   |
| Model 8 | 0.9206   | 0.9697 | 0.9114 | 0.9000    | 0.9057   |
| Model 9 | 0.9153   | 0.9696 | 0.8987 | 0.8987    | 0.8987   |
Overall, Model 1 (Random Forest) and Model 6 (Logistic Regression) proved to be the most effective, striking a strong balance between identifying true failures and maintaining prediction accuracy. The other models provided valuable benchmarks and showed the impact of different feature engineering and model choices.
To replicate the project's environment, download the data, train the models, and generate all output files:
- Clone the repository:

  ```bash
  git clone https://github.com/EdmilsonRodrigues/machine-failure-prediction.git
  cd machine-failure-prediction
  ```

- Set up the environment and fetch data:

  Preferably use conda; activate your environment and run:

  ```bash
  pip install -r requirements.txt
  ```

- Generate all model outputs:

  ```bash
  make all_models
  ```

  This command iterates through all defined models (1-9), trains each, and saves their respective metrics, confusion matrices, feature importances, and probability analyses into the `outputs/` directory.

- Generate output for a specific model:

  ```bash
  make model MODEL=1
  ```

  Replace `1` with any model number (1-9) to run just that specific model.
This project's modular design makes it easy to experiment with new models or variations. If you wish to introduce a new model (e.g., Model 10), simply:
- Copy an existing model directory:

  ```bash
  cp -r src/model_1 src/model_10
  ```

  (Or copy any other `model_X` directory that is closest to your new model's structure.)

- Modify the files within `src/model_10/`:
  - Adjust `model_train.py` to define your new model and its `GridSearchCV` parameters.
  - Update `treatment.py` and `selection.py` if your new model requires different preprocessing or feature selection.
  - Ensure `pipeline.py` correctly chains all your new steps.

- Run your new model:

  ```bash
  make model MODEL=10
  ```

  This will generate all the outputs for your new model in `outputs/model_10/`. You can then include its results in your `results_analysis.ipynb` for comparison!
- Explore advanced time-series models (e.g., LSTMs, Transformers) to leverage the sequential nature of sensor data more effectively.
- Add richer visualizations in a Jupyter notebook to make the data easier to explore.
- Investigate anomaly detection techniques to identify unusual sensor patterns that might precede failures without relying solely on historical failure labels.
- Integrate the prediction system with real-time data streams and develop a monitoring dashboard for operational teams.
- Conduct a deeper cost-benefit analysis of prediction accuracy vs. false positives/negatives in a real-world industrial setting.
This project is licensed under the GPL3 License. See the `LICENSE` file for details.