This project develops a machine learning pipeline to predict obesity levels based on lifestyle and demographic factors. The dataset includes 1900 entries with 17 features, capturing factors such as diet, physical activity, and family history. The target variable, NObeyesdad, categorizes individuals into seven obesity levels: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.
The project focuses on comprehensive data preprocessing, exploratory data analysis (EDA), feature engineering, and model evaluation. A custom Gaussian Naive Bayes classifier is implemented alongside other models (Random Forest, SVC, KNN, LightGBM, Decision Tree) to explore relationships between lifestyle factors and obesity. Key insights highlight the influence of diet, physical activity, and genetic factors on obesity levels, providing valuable implications for healthcare applications.
The dataset is divided into training (train_dataset.csv) and testing (test_dataset.csv) sets, each with 1900 rows and 17 columns. Features include:
- Demographic Features: Gender, Age, Height, Weight
- Lifestyle Features:
  - FAVC (frequent high-calorie food consumption, yes/no)
  - FCVC (frequency of vegetable consumption, scale 1-3)
  - NCP (number of main meals per day)
  - CAEC (frequency of food consumption between meals: Never, Sometimes, Frequently, Always)
  - CH2O (daily water intake, scale 1-3)
  - FAF (physical activity frequency, scale 0-3)
  - TUE (time using technology, scale 0-3)
  - CALC (alcohol consumption frequency: Never, Sometimes, Frequently, Always)
  - MTRANS (transportation mode: Automobile, Bike, Motorbike, Public Transportation, Walking)
- Other Features: family_history_with_overweight, SMOKE, SCC (calorie monitoring, yes/no)
- Target Variable: NObeyesdad (7 obesity levels)
- Missing Values:
  - FCVC: 12 missing values imputed with mean (~2.42)
  - CALC: 28 missing values imputed with mode ("Sometimes")
- Class Distribution: Imbalanced, with Obesity Type III (~20%) most frequent and Insufficient Weight (~10%) least frequent.
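The missing-value counts and class shares listed above can be checked with a couple of pandas calls. The following is a minimal sketch on an invented toy frame (not the real train_dataset.csv); the rows and label strings are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for train_dataset.csv; all values are invented.
df = pd.DataFrame({
    "FCVC": [2.0, None, 3.0, 1.0, 2.0],
    "CALC": ["Sometimes", "Never", None, "Sometimes", "Frequently"],
    "NObeyesdad": [
        "Obesity Type III", "Normal Weight", "Obesity Type III",
        "Insufficient Weight", "Obesity Type III",
    ],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Share of each target class, sorted from most to least frequent.
class_share = df["NObeyesdad"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

On the real data, the same two calls would be expected to report the counts and proportions described in the list above.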
To run this project, ensure Python 3.6+ is installed. Clone the repository and install dependencies:
git clone https://github.com/your-username/obesity-prediction.git
cd obesity-prediction
pip install -r requirements.txt
Key dependencies (listed in requirements.txt):
numpy
pandas
matplotlib
seaborn
scikit-learn
statsmodels
lightgbm
Install them using:
pip install numpy pandas matplotlib seaborn scikit-learn statsmodels lightgbm
- Prepare Data: Place train_dataset.csv and test_dataset.csv in the project directory.
- Run Notebook: Open Obesity_Prediction_Final.ipynb in Jupyter Notebook or JupyterLab to execute preprocessing, EDA, model training, and evaluation.
- Custom Naive Bayes: The notebook includes a custom Gaussian Naive Bayes implementation using 7 features: BMI, family_history_with_overweight, Age, FAF, CH2O, NCP, CAEC.
- Visualizations and Evaluation: Explore visualizations and model performance metrics (precision, recall, F1-score, confusion matrix).
Run the notebook with:
jupyter notebook Obesity_Prediction_Final.ipynb
- Data Loading: Loaded training and testing datasets using pandas.read_csv.
- Missing Value Handling:
  - FCVC: Imputed with mean (~2.42) to preserve the feature mean.
  - CALC: Imputed with mode ("Sometimes") for categorical consistency.
- Encoding Categorical Variables:
  - Binary (e.g., Gender, FAVC): Encoded with LabelEncoder (0/1).
  - Ordinal (e.g., CAEC, CALC): Mapped to numerical values (e.g., Never: 0, Always: 3).
  - Nominal (MTRANS): One-hot encoded into 5 binary columns.
  - Target (NObeyesdad): Encoded with LabelEncoder (0-6).
- Scaling Numerical Features: Standardized with StandardScaler and rounded to reduce noise.
- Feature Engineering: Created BMI feature using Weight / (Height^2).
- Feature Selection:
  - Forward Selection: Identified 9 key features: Weight, family_history_with_overweight, Age, CAEC, FCVC, FAF, Height, NCP, Gender.
  - Backward Elimination: Retained 13 features, excluding SMOKE, TUE, SCC.
  - Naive Bayes used 7 features for simplicity: BMI, family_history_with_overweight, Age, FAF, CH2O, NCP, CAEC.
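The imputation, encoding, and BMI steps above might look roughly like the sketch below. It runs on an invented toy frame rather than the real CSV files. The intermediate ordinal codes (Sometimes: 1, Frequently: 2) and the choice to compute BMI from unscaled height and weight are assumptions; the notebook may differ.

```python
import pandas as pd

# Toy rows standing in for train_dataset.csv; all values are invented.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [1.80, 1.62, 1.70, 1.75],   # metres
    "Weight": [95.0, 52.0, 70.0, 110.0],  # kilograms
    "FCVC": [2.0, None, 3.0, 1.0],
    "CALC": ["Sometimes", None, "Never", "Sometimes"],
    "CAEC": ["Sometimes", "Frequently", "Always", "Never"],
    "MTRANS": ["Automobile", "Walking", "Public Transportation", "Bike"],
})

# Impute: mean for numeric FCVC, mode for categorical CALC.
df["FCVC"] = df["FCVC"].fillna(df["FCVC"].mean())
df["CALC"] = df["CALC"].fillna(df["CALC"].mode()[0])

# Ordinal mapping; only Never: 0 and Always: 3 come from the README,
# the codes 1 and 2 are assumed.
frequency_order = {"Never": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
for col in ("CAEC", "CALC"):
    df[col] = df[col].map(frequency_order)

# Binary encoding; matches LabelEncoder's alphabetical order (Female=0, Male=1).
df["Gender"] = (df["Gender"] == "Male").astype(int)

# BMI from raw height and weight (assumed to happen before standardization).
df["BMI"] = df["Weight"] / df["Height"] ** 2

# One-hot encode the nominal transport column.
df = pd.get_dummies(df, columns=["MTRANS"], dtype=int)

print(df.head())
```

Computing BMI before scaling is a deliberate choice in this sketch: the ratio is only interpretable as BMI when height and weight are still in their original units.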
Visualizations provided insights into data distributions and relationships:
- Obesity Distribution Bar Plot: Revealed class imbalance, with Obesity Type III most prevalent.
- Target Distribution with Gender: Showed similar patterns across genders, with slight differences (e.g., more males in Obesity Type II).
- Gender Distribution: Balanced (~950 males, ~950 females).
- Categorical Variable Count Plots: Highlighted dominance of Public Transportation (~1200) and family history of overweight (~1500).
- Vegetable Consumption vs. Obesity (Violin Plot): Low FCVC (<=1) correlated with higher obesity levels; high FCVC (>2) linked to Normal/Insufficient Weight.
Six models were trained and tuned using GridSearchCV:
- Random Forest: Tuned n_estimators and max_depth for robust ensemble predictions.
- SVC: Optimized C and kernel to capture non-linear relationships.
- KNN: Adjusted n_neighbors for local pattern recognition.
- LightGBM: Tuned n_estimators, learning_rate, and max_depth for efficient gradient boosting.
- Decision Tree: Optimized max_depth, min_samples_split, and min_samples_leaf for interpretability.
- Naive Bayes: Custom Gaussian implementation for numerical features.
The NaiveBayes class is tailored for numerical features:
- Training: Computes class priors and feature-wise mean/variance.
- Prediction: Sums log-probabilities (log prior plus per-feature Gaussian log-likelihoods) for numerical stability and selects the class with the highest total.
- Features: Uses 7 features (BMI, family_history_with_overweight, Age, FAF, CH2O, NCP, CAEC) for simplicity and interpretability.
- Performance: High precision for extreme classes (e.g., Obesity Type III) but lower recall for overlapping classes (e.g., Overweight Level I/II).
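The training and prediction steps described above can be sketched as follows. This is an illustrative Gaussian Naive Bayes written for this README, not the notebook's actual NaiveBayes class. The class name, the variance smoothing constant, and the two-feature toy data are assumptions.

```python
import numpy as np


class GaussianNaiveBayesSketch:
    """Illustrative Gaussian Naive Bayes; not the notebook's actual class."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Class priors, stored in log space.
        self.log_priors_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        # Per-class, per-feature mean and variance.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small constant avoids division by zero for constant features (assumed).
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def _joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        # Broadcast to shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * self.vars_)[None, :, :]
                          + diff ** 2 / self.vars_[None, :, :])
        # Independence assumption: sum the per-feature log densities.
        return self.log_priors_[None, :] + log_pdf.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]


# Toy data: column 0 plays the role of BMI, column 1 a binary flag.
X = np.array([[18.0, 0.0], [19.0, 0.0], [20.0, 1.0],
              [40.0, 1.0], [41.0, 1.0], [43.0, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNaiveBayesSketch().fit(X, y)
pred = model.predict([[19.5, 0.0], [42.0, 1.0]])
print(pred)
```

Working in log space turns the product of many small densities into a sum, which avoids floating-point underflow when several features are combined.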
Models were evaluated on a 20% test set using precision, recall, F1-score (macro-averaged), and confusion matrices:
- Random Forest: Balanced performance, interpretable feature importance, but computationally intensive.
- SVC: Effective for non-linear patterns, though slower to train.
- KNN: Simple but sensitive to imbalanced data and scaling.
- LightGBM: Robust to imbalanced data, fewer misclassifications for minority classes.
- Decision Tree: Highly interpretable but prone to overfitting.
- Naive Bayes: Efficient and precise for extreme classes, limited by feature independence assumption.
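To make the macro-averaged metrics concrete, the sketch below computes them with scikit-learn on a tiny invented set of labels and predictions (three classes, not the project's seven). It is only meant to show how the numbers are derived.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Invented labels and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Macro averaging: compute each metric per class, then take the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(round(precision, 3), round(recall, 3), round(f1, 3))
print(cm)
```

Because macro averaging gives every class the same weight regardless of its size, a model that neglects a minority class is penalized more visibly than under plain accuracy, which matters for an imbalanced target.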
- Lifestyle Factors: Low vegetable consumption (FCVC <= 1), frequent high-calorie food intake (FAVC, ~1600 individuals), and low physical activity (FAF) strongly correlate with higher obesity levels.
- Genetic Influence: Family history of overweight (~1500 individuals) is a significant predictor.
- Transportation: Public Transportation dominates (~1200), while Walking (~200) and Bike (~50) correlate with lower obesity levels.
- Top Model: LightGBM excelled in handling imbalanced data and categorical features.
- Key Features: Weight, family_history_with_overweight, and BMI were critical predictors.
- Healthcare Implications: The analysis underscores the importance of diet, physical activity, and genetic factors in obesity, informing targeted interventions.