This Project is part of a MATH 444 class at CSUN - Project 2
This project analyzes global life expectancy data using various statistical and machine learning techniques. Our goal is to identify key socio-economic and health factors influencing life expectancy and build accurate models to predict it.
- Brandon Ismalej
- Jittapatana (Patrick) Prayoonpruk
- Mishek Sambahangphe
- Source: World Health Organization (WHO) - Global Health Observatory (GHO)
- Period: 2000–2015
- Countries: 193
- Size: 2938 rows, 22 columns
- Target Variable: Life Expectancy (continuous)
- Health Indicators: BMI, HIV/AIDS, Immunization rates
- Socioeconomic Indicators: GDP, Population, Schooling, Income composition of resources
- Demographic Data: Adult Mortality, Infant Deaths, Under-five Deaths
-
Training Set: Data from 2014
-
Testing Set: Data from 2015
-
Missing Value Handling:
- Imputed using
missForest
(non-parametric Random Forest-based method) - 10 iterations, 100 trees per forest
- Imputed using
-
Post-Imputation Size:
- Training: 183×22
- Testing: 183×22
Selected Predictors:
- Income Composition of Resources
- HIV/AIDS
- Adult Mortality
- Total Expenditure
- Applied to the target variable to stabilize variance
- Optimal λ (lambda): 1.0303
Model | Adjusted R² | RMSE | Key Features |
---|---|---|---|
Linear Regression (Baseline) | 0.796 | — | Adult Mortality, HIV/AIDS, Income Composition |
Stepwise Regression | 0.8666 | — | + Total Expenditure |
Box-Cox Transformed Regression | 0.8664 | 2.99 | Same as stepwise |
Generalized Least Squares (GLS) | — | 2.63 | Improved residual handling |
CatBoost with PCA | 0.8849 | 2.75 | Dimensionality reduction (90% variance kept) |
CatBoost without PCA | — | 1.92 | Best performance overall |
- Algorithm: Gradient Boosting (handles categorical & continuous)
- Feature Selection: All-Subset Regression
- PCA: Used in one version to reduce dimensionality
- Performance: CatBoost without PCA achieved the lowest RMSE and highest R² across all models
- Applied before CatBoost (in one model)
- Retained 90% of original variance
- However, the non-PCA CatBoost model outperformed it
- Generalized Least Squares was the best linear model
- CatBoost without PCA delivered the most accurate life expectancy predictions
- PCA, while helpful for dimensionality reduction, did not enhance CatBoost performance in this case
life_expectancy_model.ipynb
: Jupyter notebook with all code and analysisdata/
: Cleaned and imputed datasetmodels/
: Trained model outputs and diagnosticsplots/
: Visualizations for EDA and model diagnostics
- Include more recent data post-2015
- Experiment with ensemble stacking and deep learning methods
- Explore regional and continent-based subgroup modeling
- Add interpretable ML tools (e.g., SHAP, LIME) for feature impact explanations