Author: Daniel S. Demoz
🌐 CLICK HERE TO ACCESS THE DASHBOARD
No downloads required - works instantly in any web browser!
- 📈 Overview: Dataset statistics and diabetes distribution
- 🔍 EDA: Interactive data exploration with filtering
- 🤖 Models: Machine learning model performance comparison
- 🔮 Prediction: Personalized diabetes risk assessment
- Real-time visualizations with Plotly charts
- Interactive filtering and data exploration
- Personalized risk prediction based on health factors
- Professional design with responsive layout
- No installation required - works in any browser
This comprehensive dashboard analyzes diabetes risk factors using two major health survey datasets from North America, enabling cross-country comparison and validation of diabetes prediction models.
Cross-Country Validation: By using datasets from both the United States and Canada, this project provides:
- Validation of risk factors across different healthcare systems
- Comparison of model performance in different populations
- Insights into cultural and systemic differences in health data collection
- Robustness testing of machine learning models across borders
- Source: CDC BRFSS
- Records: 253,680 individuals
- Coverage: All 50 US states, DC, and territories
- Features: BMI, blood pressure, cholesterol, smoking, physical activity, diet, demographics
- Source: Statistics Canada
- Records: 108,252 individuals
- Coverage: All Canadian provinces and territories
- Features: Self-reported and adjusted BMI, detailed smoking status, WHO physical activity guidelines, fruit/vegetable consumption
Harmonization Process:
- Column Mapping: Aligned similar health indicators across datasets
- Value Recoding: Standardized categorical variables (diabetes, high BP, cholesterol)
- Missing Data Handling: Different strategies for each dataset's unique coding
- Feature Engineering: Created comparable variables for cross-dataset analysis
Key Differences Handled:
- BMI: BRFSS (calculated from self-reported height and weight) vs CCHS (self-reported + adjusted)
- Smoking: BRFSS (binary) vs CCHS (detailed categories)
- Physical Activity: BRFSS (binary) vs CCHS (minutes + WHO guidelines)
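The harmonization steps and differences listed above could look roughly like this in Python. This is a minimal sketch, not the project's actual pipeline: every column name, the yes/no coding, the smoking categories, and the BMI missing-value cut-off are hypothetical placeholders, so check each survey's codebook for the real ones.

```python
# Hypothetical source-to-common column mappings (illustration only).
BRFSS_COLUMNS = {"us_diabetes": "diabetes", "us_high_bp": "high_bp",
                 "us_bmi": "bmi", "us_smoker": "smoker"}
CCHS_COLUMNS = {"ca_diabetes": "diabetes", "ca_high_bp": "high_bp",
                "ca_bmi_adjusted": "bmi", "ca_smoking_status": "smoker"}

# Assumed CCHS-style coding: 1 = yes, 2 = no; any other code is treated as missing.
CCHS_YES_NO = {1: 1, 2: 0}
# Assumed detailed smoking categories collapsed to a binary flag.
CCHS_SMOKER = {1: 1, 2: 1, 3: 0}  # daily, occasional -> 1; not at all -> 0
# Assumed convention: BMI values at or above this number are special missing codes.
BMI_MISSING_FLOOR = 999


def harmonize_brfss(row):
    """Rename columns; values are assumed to be 0/1 flags and numeric BMI already."""
    return {new: row[old] for old, new in BRFSS_COLUMNS.items()}


def harmonize_cchs(row):
    """Rename columns and recode values onto the same scale as the BRFSS rows."""
    out = {}
    for old, new in CCHS_COLUMNS.items():
        value = row[old]
        if new == "bmi":
            out[new] = None if value >= BMI_MISSING_FLOOR else float(value)
        elif new == "smoker":
            out[new] = CCHS_SMOKER.get(value)  # unmapped codes become None
        else:
            out[new] = CCHS_YES_NO.get(value)  # special codes become None
    return out


us = harmonize_brfss({"us_diabetes": 1, "us_high_bp": 1, "us_bmi": 31.0, "us_smoker": 0})
ca = harmonize_cchs({"ca_diabetes": 2, "ca_high_bp": 1,
                     "ca_bmi_adjusted": 27.4, "ca_smoking_status": 9})
print(us)
print(ca)
```

Both rows end up with the same keys and value scales, and the special code `9` in the Canadian smoking column becomes `None` rather than being mistaken for a category.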
| Model | 🇺🇸 BRFSS (US) | 🇨🇦 CCHS (Canada) | Performance Gap (US - Canada) |
|---|---|---|---|
| Logistic Regression | 0.8149 | 0.7729 | +0.0420 |
| Random Forest | 0.7913 | 0.7055 | +0.0858 |
| XGBoost | 0.8206 | 0.7315 | +0.0891 |
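The gap column is the US score minus the Canadian score. A short Python check of that arithmetic, using only the values from the table above (the relative gap printed alongside is an extra illustration, not a figure reported by the project):

```python
# (US score, Canada score) per model, copied from the table.
scores = {
    "Logistic Regression": (0.8149, 0.7729),
    "Random Forest": (0.7913, 0.7055),
    "XGBoost": (0.8206, 0.7315),
}

# Absolute gap, rounded to the table's four decimal places.
gaps = {model: round(us - ca, 4) for model, (us, ca) in scores.items()}

for model, gap in gaps.items():
    us_score = scores[model][0]
    print(f"{model}: gap {gap:+.4f} ({gap / us_score:.1%} of the US score)")

# The model whose score changes least between the two datasets.
most_consistent = min(gaps, key=gaps.get)
print("Smallest gap:", most_consistent)
```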
Consistent Risk Factors Across Countries:
- High Blood Pressure: Most important predictor in both datasets
- BMI: Strong predictor in both the US (calculated from self-reported height and weight) and Canada (self-reported and adjusted)
- High Cholesterol: Significant risk factor in both populations
Model Performance Insights:
- Logistic Regression: Most consistent performance across countries (smallest gap, 0.0420)
- Tree-based Models: Better performance on US data, likely due to data granularity differences
- Overall: All three models score higher on US data, with gaps ranging from 0.0420 to 0.0891
| Aspect | 🇺🇸 BRFSS (US) | 🇨🇦 CCHS (Canada) |
|---|---|---|
| Sample Size | 253,680 | 108,252 |
| Data Collection | Phone surveys | Mixed methods |
| BMI Measurement | Calculated from self-reported height and weight | Self-reported + adjusted |
| Smoking Detail | Binary (yes/no) | Detailed categories |
| Physical Activity | Binary engagement | Minutes + WHO guidelines |
| Missing Data | Minimal | Coded as special values |
Built with:
- HTML5, CSS3, JavaScript
- Plotly.js for interactive visualizations
- Responsive design for all devices
Data Sources:
- US: BRFSS 2015 (253,680 records)
- Canada: CCHS (108,252 records)
Models Used:
- Logistic Regression, Random Forest, XGBoost
- Cross-validated on both datasets
- SMOTE for class imbalance handling
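SMOTE balances classes by creating synthetic minority-class rows, each interpolated between a real minority row and one of its nearest minority neighbours. The pure-Python sketch below illustrates only that interpolation idea on toy data. The function name `smote_like`, its parameters, and the sample points are made up for this example; a real pipeline would more likely use a library implementation such as `imblearn.over_sampling.SMOTE`, applied to the training split only.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: (BMI, age) pairs invented for illustration.
minority = [(31.0, 55.0), (34.5, 61.0), (29.0, 48.0), (36.0, 66.0)]
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points), "minority rows after oversampling")
```

Because every synthetic point lies on a segment between two real minority points, it stays inside the range the minority class already covers.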
Healthcare System Comparison:
- US: Market-based healthcare system
- Canada: Universal healthcare system
- Impact: Different access patterns may affect health outcomes and data quality
Methodological Validation:
- Cross-country validation: Models trained on one country's data are tested on the other's
- Feature consistency: Same risk factors identified across different populations
- Robustness: Models perform well despite data collection differences
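Testing a model trained on one country against the other can be organised as a small train/test matrix. The sketch below is illustrative only: `transfer_matrix`, the single-feature threshold "model", the accuracy metric, and the toy rows are all assumptions standing in for the project's real models, metric, and data.

```python
def accuracy(threshold, rows):
    """Share of rows where (bmi >= threshold) matches the diabetes label."""
    hits = sum((bmi >= threshold) == bool(label) for bmi, label in rows)
    return hits / len(rows)


def fit_threshold(rows):
    """Toy 'model': the BMI cut-off that maximises training accuracy."""
    candidates = sorted({bmi for bmi, _ in rows})
    return max(candidates, key=lambda c: accuracy(c, rows))


def transfer_matrix(datasets, fit, score):
    """Train on each dataset and score on every dataset, including itself."""
    return {
        (train, test): score(fit(datasets[train]), datasets[test])
        for train in datasets
        for test in datasets
    }


# Invented (BMI, diabetes) rows, not real survey data.
datasets = {
    "US": [(22, 0), (25, 0), (29, 0), (31, 1), (35, 1), (38, 1)],
    "CA": [(21, 0), (24, 0), (27, 1), (30, 1), (33, 1), (26, 0)],
}
results = transfer_matrix(datasets, fit_threshold, accuracy)
for (train, test), acc in results.items():
    print(f"train={train} test={test} accuracy={acc:.2f}")
```

The diagonal entries (same country for training and testing) are in-sample scores and therefore optimistic; the off-diagonal entries are the cross-country results.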
Public Health Insights:
- Universal risk factors: BMI, blood pressure, cholesterol are consistent predictors
- Cultural differences: Smoking and physical activity patterns vary between countries
- Data quality: Different survey methodologies provide complementary insights
Primary Sources:
- BRFSS 2015: CDC Behavioral Risk Factor Surveillance System
- CCHS: Statistics Canada - Canadian Community Health Survey
Repository: GitHub
This dashboard is for educational and research purposes. Medical decisions should not be based solely on this tool. Please consult healthcare professionals for medical advice.