This project develops and compares various machine learning models to predict potential machine failures based on sensor data. Leveraging a robust, automated pipeline, the solution aims to proactively identify equipment malfunctions, minimize costly downtime, and optimize maintenance schedules.
- Automated Data Pipeline: End-to-end process from data acquisition to model training and evaluation, managed by Python scripts and a `Makefile`.
- Comprehensive Feature Engineering: Includes handling categorical data, log transformations for skewed numerical features, and intelligent feature selection.
- Diverse Model Experimentation: Explores Random Forest, Logistic Regression, Gaussian Naive Bayes, Bernoulli Naive Bayes, and Categorical Naive Bayes classifiers.
- Hyperparameter Optimization: Utilizes `GridSearchCV` with `roc_auc` scoring to find optimal model configurations.
- Reproducible Results: All model training and output generation can be replicated with simple `make` commands.
- Actionable Insights: Provides key performance metrics, confusion matrices, and feature importance analysis to guide maintenance strategies.
The project utilizes the Machine Failure Prediction using Sensor Data dataset from Kaggle. This dataset contains various sensor readings and operational parameters, with a binary target variable indicating whether a machine failure occurred.
The repository is designed for clarity, modularity, and reproducibility:
```text
machine-failure-prediction/
├── data/                  # Stores the raw data (data.csv)
├── outputs/               # Contains generated model metrics, results, and intermediate files
│   └── model_X/           # Separate directory for each model's outputs
├── src/
│   ├── config.py          # Centralized constants (e.g., RANDOM_STATE)
│   ├── main.py            # Orchestrates model training and data generation
│   ├── model_1/           # Specific implementation for Model 1 (Random Forest)
│   │   ├── model_train.py # Defines the model and its GridSearchCV parameters
│   │   ├── pipeline.py    # Assembles the full preprocessing and model pipeline
│   │   ├── selection.py   # Handles data loading and train/test splitting
│   │   └── treatment.py   # Contains feature engineering and preprocessing steps
│   ├── model_2/           # Model 2 implementation (Random Forest, removed categories)
│   └── ... (model_3/ to model_9/)  # Implementations for other experimental models
├── .gitignore             # Specifies files/directories to be ignored by Git
├── create_all.bash        # Creates all models and fills the outputs folder (invoked via make all_models)
├── get_data.py            # Script to fetch the dataset from Kaggle
├── LICENSE                # Project license (GPL3)
├── Makefile               # Automation for running models and generating outputs
├── README.md              # This file
└── requirements.txt       # Python dependencies
```
This project follows a systematic approach to machine failure prediction:
- Data Acquisition: The `get_data.py` script automatically fetches the `data.csv` file from its Kaggle source and places it in the `data/` directory.
- Data Preprocessing & Feature Engineering:
  - `selection.py`: Loads the dataset, defines the features (`FIELDS`) and target (`TARGET`), and performs a stratified train-test split to ensure representative data distribution.
  - `treatment.py`: Applies crucial transformations:
    - Log Transformation: `footfall` data is transformed using a log function (`LogCpTransformer`) to normalize its distribution.
    - Categorical Handling: Integer-based categorical variables (`tempMode`, `AQ`, `USS`, `CS`, `VOC`, `IP`) are explicitly converted to object type, then one-hot encoded to create boolean features.
- Model Experimentation & Training (`model_train.py`): The project systematically explores various machine learning models:
  - Models 1, 2, 3: Utilize a Random Forest Classifier with `n_estimators` and `min_samples_leaf` tuned via `GridSearchCV` on the `roc_auc` score. Model 2 explored removing less relevant categories, and Model 3 tested omitting the `footfall` log transformation; neither significantly improved performance.
  - Models 4, 5, 6: Employ Logistic Regression, tuning `penalty` (`l1`, `l2`, `elasticnet`) and `max_iter`. Model 5 tested dropping the `footfall` log transformation, which was slightly detrimental. Model 6 applied a square root transformation to `RP` and removed less relevant categories.
  - Model 7: Uses Gaussian Naive Bayes.
  - Model 8: Employs Bernoulli Naive Bayes, tuning `alpha` and `binarize`, and specifically removes non-categorical data.
  - Model 9: Uses Categorical Naive Bayes, tuning `alpha`, also removing non-categorical data.
- Pipeline Orchestration (`pipeline.py`): Each model version defines its own `PIPELINE` using `sklearn.pipeline.Pipeline`, chaining together the preprocessing steps (`footfall` transformation, categorical conversion, one-hot encoding) with the chosen model and its `GridSearchCV` configuration (see the sketch after this list).
- Automated Execution (`main.py` & `Makefile`):
  - `main.py` is the central script that loads the specified model's pipeline, trains it, makes predictions, and generates all evaluation metrics and result files (e.g., `model_metrics.json`, `confusion_matrix.csv`, `feature_importance.json`, `real_proba_table.csv`) into the `outputs/` directory.
  - The `Makefile` automates the entire process:
    - `make model MODEL=X`: Runs and generates outputs for a specific model (e.g., `make model MODEL=1`).
    - `make all_models`: Iterates through all implemented models (1-9) and generates their respective outputs.
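For orientation, the sketch below condenses these steps into a single scikit-learn pipeline. It is a minimal illustration, not the project's verbatim code: the target column name, the parameter grid, and the step layout are assumptions, and the real implementations live in `src/model_X/`.

```python
# Minimal sketch of the workflow above (illustrative names and grids, not the project's exact code).
import pandas as pd
from feature_engine.transformation import LogCpTransformer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

RANDOM_STATE = 42                       # the project keeps this in src/config.py
TARGET = "fail"                         # illustrative target column name
CATEGORICAL = ["tempMode", "AQ", "USS", "CS", "VOC", "IP"]

# selection.py: load the data and make a stratified split so both sets keep the original failure rate.
df = pd.read_csv("data/data.csv")
X, y = df.drop(columns=[TARGET]), df[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# treatment.py: cast integer-coded categories to object, log-transform the skewed
# footfall column, then one-hot encode the categoricals.
to_object = FunctionTransformer(
    lambda d: d.astype({col: object for col in CATEGORICAL})
)
log_footfall = LogCpTransformer(variables=["footfall"], C="auto")
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
    remainder="passthrough",
)

# model_train.py: tune the classifier on ROC AUC with GridSearchCV (grid values are assumptions).
search = GridSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid={"n_estimators": [100, 300, 500], "min_samples_leaf": [1, 2, 5]},
    scoring="roc_auc",
    cv=5,
)

# pipeline.py: chain preprocessing and the tuned model into a single PIPELINE object.
PIPELINE = Pipeline([
    ("to_object", to_object),
    ("log_footfall", log_footfall),
    ("encode", encode),
    ("model", search),
])

PIPELINE.fit(X_train, y_train)          # main.py fits the pipeline and then evaluates it
```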
The experiments demonstrated that Model 1 (Random Forest) achieved the best overall performance, particularly in terms of predicting machine failures with high confidence.
| Metric    | Training Set | Test Set |
|-----------|--------------|----------|
| Accuracy  | 92.45%       | 92.59%   |
| AUC       | 0.976        | 0.975    |
| Recall    | 91.40%       | 93.67%   |
| Precision | 90.54%       | 89.16%   |
| F1-Score  | 90.97%       | 91.36%   |
The high test recall (93.67%) is particularly critical for a machine failure prediction system, as it indicates the model's strong ability to correctly identify actual failures, minimizing missed opportunities for proactive maintenance.
Test Confusion Matrix (Model 1):
|                | Predicted No Fail | Predicted Fail |
|----------------|-------------------|----------------|
| Actual No Fail | 101               | 9              |
| Actual Fail    | 5                 | 74             |
This confusion matrix visually confirms the model's effectiveness, showing a low number of false negatives (missed failures).
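Metrics and confusion matrices like those above can be reproduced with standard scikit-learn utilities. The snippet below is a sketch that reuses `PIPELINE`, `X_test`, and `y_test` from the workflow sketch earlier; the project's actual `main.py` may organize this differently.

```python
# Sketch of the evaluation step, reusing PIPELINE, X_test, y_test from the workflow sketch above.
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)

y_pred = PIPELINE.predict(X_test)
y_proba = PIPELINE.predict_proba(X_test)[:, 1]   # predicted probability of failure

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Rows are actual classes, columns are predicted classes (No Fail, Fail).
cm = confusion_matrix(y_test, y_pred)
```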
Key Feature Importance (Model 1):
The Random Forest model identified crucial sensor readings influencing failure predictions:
- `VOC` (Volatile Organic Compound): 54.3% importance
- `AQ` (Air Quality): 22.7% importance
- `USS` (Ultrasonic Sensor): 11.8% importance
- ...and other features.
These insights can direct engineering teams to focus on specific sensor data for early anomaly detection and root cause analysis.
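A sketch of how per-feature importances can be pulled from the fitted pipeline is shown below (again assuming the earlier workflow sketch; importances of one-hot columns can be summed per original sensor to obtain aggregate figures like those above).

```python
# Map encoded feature names to the tuned Random Forest's importances
# (assumes PIPELINE from the workflow sketch above; the output path follows the repo layout).
import json

best_rf = PIPELINE.named_steps["model"].best_estimator_        # estimator chosen by GridSearchCV
feature_names = PIPELINE.named_steps["encode"].get_feature_names_out()

importances = dict(zip(feature_names, best_rf.feature_importances_.tolist()))
with open("outputs/model_1/feature_importance.json", "w") as fh:
    json.dump(importances, fh, indent=2)
```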
Probability Threshold Analysis (Model 1):
The analysis of predicted probabilities (`real_proba_table.csv`, `cumulative_proportions.json`, `cumulative_findings.json`) allows for informed decision-making on operational thresholds. For instance, considering the top 50% of predictions (by probability of failure) allows us to capture 100% of actual failures. This provides a clear trade-off between the number of predicted failures and the coverage of real failures.
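The underlying sort-and-accumulate logic is straightforward. The sketch below assumes `real_proba_table.csv` holds one row per test example with a true-label column and a predicted-probability column; the actual column names may differ.

```python
# Sketch of the cumulative-capture analysis; column names "real" and "proba"
# are assumptions about real_proba_table.csv's layout.
import pandas as pd

table = pd.read_csv("outputs/model_1/real_proba_table.csv")
table = table.sort_values("proba", ascending=False).reset_index(drop=True)

# Cumulative share of all true failures captured as we walk down the probability ranking.
table["captured"] = table["real"].cumsum() / table["real"].sum()

# Coverage of real failures when acting only on the top 50% of predictions.
top_half = table.head(len(table) // 2)
print(f"Top 50% of predictions capture "
      f"{top_half['real'].sum() / table['real'].sum():.0%} of actual failures")
```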
While Model 1 (Random Forest) emerged as the top performer, the experimentation with other models provided valuable insights:
- Model 2 (Random Forest - removed categories): Showed a slight decrease in performance compared to Model 1, indicating that the initially included categories, even if less impactful individually, contribute positively to the overall model.
- Model 3 (Random Forest - no log footfall): Performance was almost identical to Model 1, suggesting that the log transformation of `footfall` did not significantly alter the Random Forest's predictive capability in this specific scenario.
- Models 4 & 5 (Logistic Regression): Performed robustly with test AUCs of 0.968 and 0.973 respectively, demonstrating good generalized linear model capabilities for this problem. Model 5's slight drop when `footfall` wasn't logged highlights the sensitivity of linear models to feature distributions.
- Model 6 (Logistic Regression - sqrt RP & removed categories): Achieved a very high test recall and precision (both 92.4%), making it a strong contender, potentially balancing missed failures with false alarms very well.
- Model 7 (Gaussian Naive Bayes): Showed a lower accuracy and AUC compared to tree-based and linear models (test AUC 0.954), but maintained a very high recall (93.67%), suggesting it is good at catching failures but at the cost of more false positives.
- Model 8 (Bernoulli Naive Bayes): Performed reasonably well (test AUC 0.969), showing good precision (0.90) and recall (0.91), especially considering its simpler nature and handling of binary data.
- Model 9 (Categorical Naive Bayes): Exhibited solid performance (test AUC 0.969), with balanced precision and recall, demonstrating its effectiveness when dealing solely with categorical features.
The table below summarizes test-set performance across all nine models:
| Model   | Accuracy | AUC    | Recall | Precision | F1-Score |
|---------|----------|--------|--------|-----------|----------|
| Model 1 | 0.9259   | 0.9755 | 0.9367 | 0.8916    | 0.9136   |
| Model 2 | 0.9206   | 0.9709 | 0.9367 | 0.8809    | 0.9080   |
| Model 3 | 0.9259   | 0.9754 | 0.9367 | 0.8916    | 0.9136   |
| Model 4 | 0.9206   | 0.9680 | 0.9241 | 0.8902    | 0.9068   |
| Model 5 | 0.9153   | 0.9733 | 0.9241 | 0.8795    | 0.9012   |
| Model 6 | 0.9365   | 0.9701 | 0.9241 | 0.9241    | 0.9241   |
| Model 7 | 0.8942   | 0.9545 | 0.9367 | 0.8315    | 0.8810   |
| Model 8 | 0.9206   | 0.9697 | 0.9114 | 0.9000    | 0.9057   |
| Model 9 | 0.9153   | 0.9696 | 0.8987 | 0.8987    | 0.8987   |
Overall, Model 1 (Random Forest) and Model 6 (Logistic Regression) proved to be the most effective, striking a strong balance between identifying true failures and maintaining prediction accuracy. The other models provided valuable benchmarks and showed the impact of different feature engineering and model choices.
To replicate the project's environment, download the data, train the models, and generate all output files:
- Clone the repository:

  ```bash
  git clone https://github.com/EdmilsonRodrigues/machine-failure-prediction.git
  cd machine-failure-prediction
  ```

- Set up the environment and fetch data:

  Preferably use conda; activate your environment and run:

  ```bash
  pip install -r requirements.txt
  ```

- Generate all model outputs:

  ```bash
  make all_models
  ```

  This command iterates through all defined models (1-9), trains each, and saves their respective metrics, confusion matrices, feature importances, and probability analyses into the `outputs/` directory.

- Generate output for a specific model:

  ```bash
  make model MODEL=1
  ```

  Replace `1` with any model number (1-9) to run just that specific model.
This project's modular design makes it easy to experiment with new models or variations. If you wish to introduce a new model (e.g., Model 10), simply:
- Copy an existing model directory:

  ```bash
  cp -r src/model_1 src/model_10
  ```

  (Or copy any other `model_X` directory that is closest to your new model's structure.)

- Modify the files within `src/model_10/`:
  - Adjust `model_train.py` to define your new model and its `GridSearchCV` parameters.
  - Update `treatment.py` and `selection.py` if your new model requires different preprocessing or feature selection.
  - Ensure `pipeline.py` correctly chains all your new steps.

- Run your new model:

  ```bash
  make model MODEL=10
  ```

  This will generate all the outputs for your new model in `outputs/model_10/`. You can then include its results in your `results_analysis.ipynb` for comparison!
- Explore advanced time-series models (e.g., LSTMs, Transformers) to leverage the sequential nature of sensor data more effectively.
- Add richer visualizations in a Jupyter notebook to make the data easier to explore.
- Investigate anomaly detection techniques to identify unusual sensor patterns that might precede failures without relying solely on historical failure labels.
- Integrate the prediction system with real-time data streams and develop a monitoring dashboard for operational teams.
- Conduct a deeper cost-benefit analysis of prediction accuracy vs. false positives/negatives in a real-world industrial setting.
This project is licensed under the GPL3 License. See the `LICENSE` file for details.