Assignment for UECS3483 Data Mining
This repository presents a comparative machine learning study using different algorithms for heart disease prediction based on patient medical records. It includes code for data preprocessing, model training, overfitting evaluation, performance evaluation, and final model saving using joblib.
The objective of this project is to compare multiple machine learning algorithms and identify the best-performing model for predicting heart disease. During data preprocessing, two versions of the dataset are kept, one with outliers and one without, to assess the impact of outliers on model performance. To evaluate and reduce the risk of overfitting, cross-validation and accuracy scores are applied. The models are evaluated using performance metrics including precision, recall, F1-score, accuracy, AUC score, and confusion matrices. The model with the highest AUC score is selected as the final model and saved for future use.
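
As an illustration of the overfitting check described above, the following minimal sketch compares accuracy on the training data with 3-fold cross-validated accuracy. It is not the notebook's code: the synthetic data from `make_classification`, the choice of `RandomForestClassifier`, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart disease data (13 features, binary target).
X, y = make_classification(
    n_samples=1000, n_features=13, n_informative=6, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Mean accuracy across 3 folds of the training data.
cv_accuracy = cross_val_score(
    model, X_train, y_train, cv=3, scoring="accuracy"
).mean()

# Accuracy on the same data the model was fitted on.
model.fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)

# A large positive gap suggests the model is memorising the training data.
gap = train_accuracy - cv_accuracy
print(f"train={train_accuracy:.3f}  cv={cv_accuracy:.3f}  gap={gap:.3f}")
```

A small gap between the two scores is the signal of interest here; what counts as "small" is a judgment call that depends on the dataset.
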
Tool/Library | Description |
---|---|
Google Colab | Cloud-based platform for running Python code. |
Python | Programming language used for data preprocessing, model building, and saving. |
pandas | For handling structured data and dataframes. |
numpy | For numerical operations. |
matplotlib | For plotting data and creating visualizations. |
seaborn | For advanced statistical data visualization. |
scikit-learn | For encoding, scaling, splitting data, and preprocessing pipelines. |
scipy.stats | For statistical analysis. |
statsmodels | Used to compute Variance Inflation Factor (VIF) to detect multicollinearity. |
scikit-learn | Provides tools for preprocessing, traditional machine learning models, cross-validation, evaluation metrics, pipelines, and hyperparameter tuning. |
scikit-learn.metrics | For performance evaluation metrics like accuracy, precision, recall, F1-score, ROC curve, AUC, and confusion matrix. |
scikit-learn.model_selection | For train-test splitting, cross-validation, and hyperparameter tuning. |
scikit-learn.ensemble | Includes ensemble models like Random Forest, AdaBoost, Gradient Boosting, and Voting Classifier. |
scikit-learn.linear_model | For logistic regression. |
scikit-learn.tree | For decision tree models. |
scikit-learn.svm | For support vector machines (SVC). |
scikit-learn.naive_bayes | For Naive Bayes classifier. |
scikit-learn.neural_network | For multilayer perceptron (MLP) classifier. |
scikit-learn.neighbors | For k-nearest neighbors (KNN) classifier. |
xgboost | For gradient boosting models. |
lightgbm | For efficient gradient boosting on decision trees. |
catboost | For categorical boosting. |
joblib | For saving the final trained model. |
warnings | Used to suppress warnings for cleaner output. |
- AdaBoost
- CatBoost
- Decision Tree
- Gradient Boosting
- k-Nearest Neighbors (KNN)
- LightGBM
- Logistic Regression
- Multilayer Perceptron (MLP)
- Naïve Bayes
- Random Forest
- Support Vector Machine (SVM)
- Voting Classifier
- XGBoost
- Voting (Random Forest, LightGBM, XGBoost)
- Data collection
- Total of 1027 records
- Total of 14 features
- 13 independent features and 1 dependent feature
- 8 categorical features (7 nominal features and 2 ordinal features) and 5 numerical features (4 ratio features and 1 interval feature)
- Exploratory Data Analysis (EDA) on Original Dataset
- Data splitting
- 70% Training data
- 15% Validation data
- 15% Testing data
- Data preprocessing
- Handling duplicates
- Removing duplicated records
- Handling Missing values
- Removing null values
- Finding the best parameters with RandomizedSearchCV
- Imputing missing values using the Random Forest algorithm
- Handling Outliers
- Analyzing skewness in numerical features
- Applying Yeo-Johnson transformation (for absolute skewness > 1)
- Identifying outliers using Z-score (±3 standard deviations)
- Capping outliers using 5th and 95th percentiles
- Saving both original (with outliers) and capped (without outliers) datasets
- Handling duplicates
- Feature scaling
- Applying standardization (z-score scaling) to numerical features
- Feature Importance Visualization
- Using Random Forest to visualize feature importance on both datasets (with and without outliers)
- Categorical features encoding
- One-hot encoding for nominal categorical features
- Ordinal encoding for ordinal categorical features
- Dataset Selection (With vs. Without Outliers)
- Selecting 3 base algorithms
- Performing hyperparameter tuning using HalvingRandomSearchCV
- Evaluating risk of overfitting by 3-fold cross-validation
- Comparing performance metrics (precision, recall, F1, accuracy, AUC) and confusion matrices
- Selecting the better-performing dataset for final modeling
- Final Model Training
- Training the remaining algorithms using the selected dataset
- Performing hyperparameter tuning using HalvingRandomSearchCV and applying the best parameters on these models
- Evaluating risk of overfitting by 3-fold cross-validation
- Evaluating performance metrics (precision, recall, F1, accuracy, AUC) and confusion matrices
- Building a voting model from the top 3 models based on validation performance
- Evaluating risk of overfitting by 3-fold cross-validation
- Evaluating performance metrics (precision, recall, F1, accuracy, AUC) and confusion matrices
- Final Evaluation on Testing Set
- Comparing all models using ROC curves and AUC scores
- Selecting the model with the highest AUC as the final model
- Model Saving
- Saving the best model using joblib for future application
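
The core of the workflow above can be condensed into a short sketch: a 70/15/15 split, percentile capping, tuning with `HalvingRandomSearchCV`, selection by test AUC, and saving with joblib. This is only an illustration under stated assumptions. The data is synthetic, the column names, candidate models, search space, and file name are hypothetical, and fitting the capping limits on the training split only is a choice made here rather than something the notebooks are known to do.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# HalvingRandomSearchCV is experimental; this import must come first.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import HalvingRandomSearchCV, train_test_split

# Synthetic stand-in for the numerical features; column names are hypothetical.
X_arr, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
X = pd.DataFrame(X_arr, columns=[f"num_{i}" for i in range(5)])

# 70% training, then the remaining 30% split evenly into validation and testing.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=1
)
# X_val / y_val would be used to compare models before the final test evaluation.

# Cap each column at its 5th/95th percentiles (limits learned from training data).
lower, upper = X_train.quantile(0.05), X_train.quantile(0.95)
X_train_cap = X_train.clip(lower=lower, upper=upper, axis=1)
X_test_cap = X_test.clip(lower=lower, upper=upper, axis=1)

# Tune one candidate with successive halving and 3-fold cross-validation.
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    n_candidates=6,
    min_resources=100,
    cv=3,
    scoring="roc_auc",
    random_state=1,
)
search.fit(X_train_cap, y_train)

candidates = {
    "logreg": LogisticRegression(max_iter=1000).fit(X_train_cap, y_train),
    "rf_tuned": search.best_estimator_,
}

# Pick the candidate with the highest AUC on the testing set.
test_auc = {
    name: roc_auc_score(y_test, m.predict_proba(X_test_cap)[:, 1])
    for name, m in candidates.items()
}
best_name = max(test_auc, key=test_auc.get)

# Persist the selected model and load it back to confirm the round trip.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(candidates[best_name], model_path)
restored = joblib.load(model_path)
```

A model loaded with `joblib.load` expects input that has gone through the same preprocessing as the training data, so the capping limits (and any scaler or encoder) need to be saved alongside it or wrapped in a pipeline.
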
- Part 1: https://colab.research.google.com/drive/1NDyaXNd7FHiQRihu1Ll3bN3VJFJu_IzL?usp=sharing
- Part 2 & 3: https://colab.research.google.com/drive/1gHpx92KScLWY7mMaqZQ8aqTz0gIWCRwP?usp=sharing
- [@Yu-2008](https://github.com/Yu-2008)
- [@Cammy276](https://github.com/Cammy276)
- [@LIOWKEHAN](https://github.com/LIOWKEHAN)
Contributions are always welcome!
To get started:
- Fork the repository to your GitHub account.
- Create a new branch for your feature or fix:
git checkout -b your-feature-name
- Make your changes and commit them with a clear message:
git commit -m "Add: Description of your change"
- Push your branch to your forked repository:
git push origin your-feature-name
- Open a Pull Request from your branch to the main project.
Feel free to open an issue first if you'd like to discuss your idea before implementing it.