This repository contains the code and resources for the research paper "Explorative Approach with Bayesian Networks for Metabolic Syndrome." The project uses Bayesian networks to model the probabilistic relationships between various medical factors associated with metabolic syndrome.
- Project Overview
- Dataset
- Methodology
- Code Structure and Usage
- Setup and Installation
- Corrections and Improvements
- Results
- Contributing
- License
Metabolic syndrome is a cluster of conditions that occur together, increasing the risk of heart disease, stroke, and type 2 diabetes. This project uses an explorative approach with Bayesian networks to investigate the probabilistic dependencies among these conditions. By modeling these relationships, we can gain a better understanding of the syndrome's underlying mechanisms.
The dataset used in this project is data.csv, which contains the following variables:
- DBP: Diastolic Blood Pressure
- SBP: Systolic Blood Pressure
- FastingBloodSugar: Fasting Blood Sugar
- Triglycerides: Triglyceride levels
- HDLcholestrol: High-Density Lipoprotein (HDL) Cholesterol
- LDLcholestrol: Low-Density Lipoprotein (LDL) Cholesterol
- IDFMetsynd: Diagnostic variable for metabolic syndrome based on International Diabetes Federation criteria.
- ATPMetssynd: Diagnostic variable for metabolic syndrome based on ATP III criteria.
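As a quick sanity check on the variables listed above, the following is a minimal sketch that reads a CSV and verifies that the expected columns are present. The helper name `load_metsyn` is hypothetical and not part of the repository, and the two demonstration rows (including the 0/1 coding of the diagnostic variables) are made-up values, not study data.

```r
# Column names taken from the variable list in this README.
expected_cols <- c("DBP", "SBP", "FastingBloodSugar", "Triglycerides",
                   "HDLcholestrol", "LDLcholestrol",
                   "IDFMetsynd", "ATPMetssynd")

# Hypothetical helper: read the CSV and stop early if a column is missing.
load_metsyn <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  absent <- setdiff(expected_cols, names(df))
  if (length(absent) > 0) {
    stop("Missing columns: ", paste(absent, collapse = ", "))
  }
  df[, expected_cols]
}

# Demonstration on a small synthetic file (assumed values and coding).
demo_path <- tempfile(fileext = ".csv")
demo <- data.frame(
  DBP = c(80, 95), SBP = c(120, 150),
  FastingBloodSugar = c(90, 130), Triglycerides = c(110, 210),
  HDLcholestrol = c(55, 35), LDLcholestrol = c(100, 160),
  IDFMetsynd = c(0, 1), ATPMetssynd = c(0, 1)
)
write.csv(demo, demo_path, row.names = FALSE)

metsyn <- load_metsyn(demo_path)
str(metsyn)
```

With the real dataset, the call would presumably be `load_metsyn("data.csv")`, run from the directory that holds the file.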
The project explores three types of Bayesian Networks to model the data:
- Gaussian Bayesian Networks: For continuous data.
- Multinomial Bayesian Networks: For discrete data.
- Hybrid Bayesian Networks: For a mix of continuous and discrete data.
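To illustrate how the three network types relate to the data type, here is a minimal sketch with bnlearn on synthetic data. The generated values and the 130 cut-off used to create the toy diagnostic label are assumptions for illustration only. In bnlearn, a mixed data frame is fitted as a conditional Gaussian network, in which discrete nodes cannot have continuous parents.

```r
library(bnlearn)

set.seed(1)
n <- 200
sbp <- rnorm(n, mean = 125, sd = 15)
dbp <- 0.5 * sbp + rnorm(n, mean = 20, sd = 8)

# Gaussian network: every column is numeric.
continuous <- data.frame(SBP = sbp, DBP = dbp)
gauss_fit <- bn.fit(hc(continuous), continuous)

# Multinomial network: every column is a factor.
# Quantile discretization into 3 levels is an arbitrary choice here.
discrete <- discretize(continuous, method = "quantile", breaks = 3)
multi_fit <- bn.fit(hc(discrete), discrete)

# Hybrid network: one factor column alongside numeric columns.
# The label below is a noisy toy rule, not a clinical definition.
label <- factor(ifelse(sbp + rnorm(n, sd = 10) > 130, "yes", "no"))
mixed <- data.frame(ATPMetssynd = label, SBP = sbp, DBP = dbp)
hybrid_fit <- bn.fit(hc(mixed), mixed)

# Inspect which kind of fitted network each call produced.
class(gauss_fit)
class(multi_fit)
class(hybrid_fit)
```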
Three different approaches for learning the network structure are implemented:
- Score-based: These algorithms use a scoring function to find the best network structure (e.g., Hill-Climbing).
- Constraint-based: These algorithms use conditional independence tests to learn the network structure (e.g., Grow-Shrink).
- Hybrid: These algorithms combine elements of both score-based and constraint-based methods (e.g., RSMAX2).
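The three approaches above map onto bnlearn functions `hc()`, `gs()`, and `rsmax2()`. The sketch below runs each with default settings on a small synthetic data set; the data and the dependencies built into it are assumptions for illustration, and the arguments used in main_file.Rmd may differ.

```r
library(bnlearn)

set.seed(42)
n <- 500
fbs <- rnorm(n, mean = 100, sd = 15)
tg  <- 1.2 * fbs + rnorm(n, mean = 30, sd = 25)
hdl <- 80 - 0.15 * tg + rnorm(n, sd = 8)
toy <- data.frame(FastingBloodSugar = fbs,
                  Triglycerides = tg,
                  HDLcholestrol = hdl)

# Score-based: greedy search over structures guided by a network score.
dag_score <- hc(toy)

# Constraint-based: conditional independence tests; the result may
# contain undirected arcs when a direction cannot be identified.
dag_constraint <- gs(toy)

# Hybrid: a constraint-based restrict phase followed by a
# score-based maximize phase.
dag_hybrid <- rsmax2(toy)

# Compare the arc sets found by each algorithm.
learned <- list(score = dag_score,
                constraint = dag_constraint,
                hybrid = dag_hybrid)
lapply(learned, arcs)
```

Comparing the arc sets side by side is one simple way to see where the algorithms agree on a dependency and where they do not.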
The R code for this project is in the main_file.Rmd file. It includes the following steps:
- Data Loading and Preprocessing: The data.csv file is loaded, and missing values are removed.
- Data Discretization: Both unsupervised and manual methods are used to discretize the data for multinomial and hybrid models.
- Model Building: Various Bayesian network models are built using different structure learning algorithms.
- Model Evaluation: The models are evaluated using scoring functions and k-fold cross-validation.
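The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the code from main_file.Rmd: the generated values, the cut-points of 100 and 150, the level labels, and the choice of 5 folds are all assumptions.

```r
library(bnlearn)

set.seed(7)
n <- 300
raw <- data.frame(
  FastingBloodSugar = rnorm(n, mean = 100, sd = 15),
  Triglycerides = rnorm(n, mean = 150, sd = 40)
)
# Toy diagnostic label that depends on both measurements plus noise.
risk <- raw$FastingBloodSugar + 0.3 * raw$Triglycerides + rnorm(n, sd = 10)
raw$ATPMetssynd <- factor(ifelse(risk > 150, "yes", "no"))
raw$Triglycerides[c(3, 17)] <- NA  # simulate missing values

# Step 1: remove rows with missing values.
clean <- na.omit(raw)

# Step 2a: unsupervised discretization (quantile bins, 3 levels).
num_cols <- c("FastingBloodSugar", "Triglycerides")
unsup <- discretize(clean[, num_cols], method = "quantile", breaks = 3)
unsup$ATPMetssynd <- clean$ATPMetssynd

# Step 2b: manual discretization with illustrative cut-points.
manual <- data.frame(
  FastingBloodSugar = cut(clean$FastingBloodSugar, c(-Inf, 100, Inf),
                          labels = c("normal", "high"), right = FALSE),
  Triglycerides = cut(clean$Triglycerides, c(-Inf, 150, Inf),
                      labels = c("normal", "high"), right = FALSE),
  ATPMetssynd = clean$ATPMetssynd
)

# Step 3: learn a structure (hill-climbing shown as one option).
dag <- hc(manual)

# Step 4: evaluate with a network score and k-fold cross-validation
# using classification error on the diagnostic variable.
bic <- score(dag, manual, type = "bic")
cv <- bn.cv(manual, "hc", k = 5, loss = "pred",
            loss.args = list(target = "ATPMetssynd"))
bic
cv
```

The same evaluation can be repeated with `unsup` in place of `manual` to compare the two discretization strategies under identical settings.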
To run the R code, you need to have R and RStudio installed. You will also need to install the following packages:
install.packages(c("bnlearn", "deal", "gRain", "gRbase", "gRim"))
The data.csv file must be in the same directory as the R Markdown file.
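Before knitting the document, a short check like the one below can confirm that the setup is complete. It is a sketch only: it reports missing packages rather than installing them, and it assumes the working directory is the one that contains the R Markdown file.

```r
# Packages named in the installation instructions of this README.
required <- c("bnlearn", "deal", "gRain", "gRbase", "gRim")

# TRUE for each package whose namespace can be loaded.
available <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!available]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}

# The data file is expected next to the R Markdown file.
if (!file.exists("data.csv")) {
  message("data.csv not found in ", getwd())
}
```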
The following improvements have been made to the original R code:
- File Paths: The original code used hardcoded file paths. The corrected version assumes the data file is in the same directory as the script.
- Code Clarity: Added comments to the code to explain the different steps and methodologies.
- Code Organization: Removed unused code and organized the script to be more readable and maintainable.
The analysis suggests that a Gaussian Bayesian Network learned with the hybrid structure learning approach and fitted with manually discretized data provides the most accurate and interpretable model of the relationships between the factors contributing to metabolic syndrome. The model achieves an accuracy of approximately 91% in predicting the ATPMetssynd diagnostic variable.
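For readers who want to see how a prediction accuracy of this kind can be computed, here is a minimal sketch with bnlearn on synthetic discrete data. It does not reproduce the reported figure: the data, the fixed network structure, the probabilities, and the 70/30 split are all assumptions for illustration.

```r
library(bnlearn)

set.seed(123)
n <- 400
lv <- c("normal", "high")
fbs <- factor(sample(lv, n, replace = TRUE), levels = lv)
tg  <- factor(sample(lv, n, replace = TRUE), levels = lv)

# Toy rule: the diagnosis is more likely when both factors are high.
p_yes <- ifelse(fbs == "high" & tg == "high", 0.9,
                ifelse(fbs == "high" | tg == "high", 0.4, 0.05))
atp <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))
toy <- data.frame(FastingBloodSugar = fbs, Triglycerides = tg,
                  ATPMetssynd = atp)

# Hold out 30% of the rows for testing.
train_idx <- sample(n, size = 0.7 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fixed structure for the sketch: both factors are parents of the target.
dag <- model2network(paste0(
  "[FastingBloodSugar][Triglycerides]",
  "[ATPMetssynd|FastingBloodSugar:Triglycerides]"
))
fitted <- bn.fit(dag, train)

# Predict the diagnostic variable and compare with the observed values.
pred <- predict(fitted, node = "ATPMetssynd", data = test)
confusion <- table(predicted = pred, observed = test$ATPMetssynd)
accuracy <- mean(pred == test$ATPMetssynd)

confusion
accuracy
```

Accuracy here is the share of held-out rows whose predicted label matches the observed one; a cross-validated estimate would average this quantity over several folds.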
Contributions to this project are welcome. Please feel free to open an issue or submit a pull request with any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for more details.