- Overview
- Why This Project Matters
- Dataset Information
- Technologies Used
- Machine Learning Models
- Model Performance
- Web Application Features
- Installation & Setup
- Usage
- Project Structure
- Model Outputs & Visualizations
- Deployment
- Future Enhancements
- Contributing
The Diabetes Prognosis and Risk Assessment System is a comprehensive machine learning application that predicts diabetes risk levels based on 21 health indicators. This project demonstrates a complete ML pipeline from data analysis and model development to production deployment via a user-friendly web interface.
The system classifies patients into three risk categories:
- No Risk (Class 0): Patient shows no significant diabetes risk factors
- Mild Risk (Class 1): Patient has moderate diabetes risk factors requiring monitoring
- Severe Risk (Class 2): Patient has high diabetes risk factors requiring immediate attention
- Early Detection: Identifies diabetes risk before clinical symptoms appear
- Preventive Care: Enables proactive healthcare interventions
- Cost Reduction: Reduces healthcare costs through early prevention
- Population Health: Supports public health screening programs
- Complete ML Pipeline: Demonstrates end-to-end machine learning workflow
- Model Comparison: Evaluates multiple algorithms for optimal performance
- Production Ready: Deployed web application with professional UI/UX
- Scalable Architecture: Modular design for easy maintenance and updates
- File: diabetes_012_health_indicators.xls
- Size: 22.9 MB
- Format: Excel spreadsheet with comprehensive health indicators
- HighBP: High Blood Pressure (0 = No, 1 = Yes)
- HighChol: High Cholesterol (0 = No, 1 = Yes)
- Stroke: History of Stroke (0 = No, 1 = Yes)
- HeartDiseaseorAttack: Heart Disease or Attack History (0 = No, 1 = Yes)
- Smoker: Smoking Status (0 = No, 1 = Yes)
- PhysActivity: Physical Activity (0 = No, 1 = Yes)
- Fruits: Regular Fruit Consumption (0 = No, 1 = Yes)
- Veggies: Regular Vegetable Consumption (0 = No, 1 = Yes)
- HvyAlcoholConsump: Heavy Alcohol Consumption (0 = No, 1 = Yes)
- BMI: Body Mass Index (Continuous variable)
- MentHlth: Mental Health Issues (0-30 days)
- PhysHlth: Physical Health Issues (0-30 days)
- GenHlth: General Health (1-5 scale: Excellent to Poor)
- CholCheck: Cholesterol Check (0 = No, 1 = Yes)
- AnyHealthcare: Any Healthcare Coverage (0 = No, 1 = Yes)
- NoDocbcCost: No Doctor due to Cost (0 = No, 1 = Yes)
- DiffWalk: Difficulty Walking (0 = No, 1 = Yes)
- Sex: Gender (0 = Female, 1 = Male)
- Age: Age category (13-level scale, from 1 = 18-24 to 13 = 80 or older)
- Education: Education Level (1-6 scale)
- Income: Income Level (1-8 scale)
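The ranges listed above can be turned into a simple input check before a record reaches the model. The following is a minimal sketch, not the application's actual validation code: `validate_indicators`, `RANGES`, and the example record are hypothetical names and values, and BMI and Age are only checked for being positive in this sketch.

```python
# Binary indicators from the list above (0 = No, 1 = Yes; Sex is 0/1 as well).
BINARY = ["HighBP", "HighChol", "Stroke", "HeartDiseaseorAttack", "Smoker",
          "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump",
          "CholCheck", "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex"]
RANGES = {name: (0, 1) for name in BINARY}
RANGES.update({"MentHlth": (0, 30), "PhysHlth": (0, 30), "GenHlth": (1, 5),
               "Education": (1, 6), "Income": (1, 8)})
# Assumption: this sketch only requires a positive value for these two.
POSITIVE_ONLY = ["BMI", "Age"]


def validate_indicators(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif not low <= record[name] <= high:
            problems.append(f"{name}: {record[name]} outside {low}-{high}")
    for name in POSITIVE_ONLY:
        if name not in record:
            problems.append(f"{name}: missing")
        elif record[name] <= 0:
            problems.append(f"{name}: must be positive")
    return problems


# Hypothetical patient record, used only to exercise the checks.
example = {name: 0 for name in BINARY}
example.update({"MentHlth": 2, "PhysHlth": 0, "GenHlth": 3, "Education": 5,
                "Income": 9, "BMI": 27.5, "Age": 9})
print(validate_indicators(example))  # ['Income: 9 outside 1-8']
```

Returning a list of problems rather than raising on the first one lets a form show every invalid field at once.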
# Machine Learning & Data Science
scikit-learn==1.2.2 # ML algorithms and preprocessing
pandas==1.5.3 # Data manipulation and analysis
numpy==1.24.3 # Numerical computing
joblib==1.2.0 # Model serialization
# Visualization
matplotlib==3.7.1 # Static plotting
seaborn==0.12.2 # Statistical data visualization
# Web Application
streamlit==1.37.0 # Web app framework
- Python: 3.12
- IDE: PyCharm/IntelliJ IDEA
- Version Control: Git
- Deployment: Heroku (Production Ready)
Status: ✅ Currently Deployed
# Model Configuration
from sklearn.linear_model import LogisticRegression

LogisticRegression(
    max_iter=2000,
    solver='lbfgs',
    random_state=42
)
Key Features:
- Linear decision boundaries for multi-class classification
- Standard scaling preprocessing
- Coefficient-based feature importance
- ROC curve analysis
- Confusion matrix evaluation
Why Chosen for Production:
- High interpretability for healthcare decisions
- Stable performance across different datasets
- Fast inference time for real-time predictions
- Well-understood by medical professionals
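Standard scaling matters for the coefficient-based feature importance mentioned above, because coefficients are only comparable when the features share a scale. As a rough illustration of what the scaling step does to one column, here is a standard-library sketch. The function name and the BMI values are hypothetical, and the project itself stores a fitted scaler in `scaler.joblib` rather than using this code.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Rescale a column to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical BMI values, for illustration only.
bmi = [20.0, 25.0, 30.0]
scaled = standard_scale(bmi)
print([round(z, 3) for z in scaled])  # [-1.225, 0.0, 1.225]
```

At prediction time the mean and standard deviation learned from the training data must be reused, which is why the scaler is saved next to the model.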
Status: 🧪 Experimental Model
# Model Configuration
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
Key Features:
- Ensemble learning with 100 decision trees
- Built-in feature importance ranking
- Less prone to overfitting than a single decision tree
- Handles non-linear relationships
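Status: 🧪 Advanced Experimental Model
# Model Configuration
from sklearn.ensemble import GradientBoostingClassifier

GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
    validation_fraction=0.1,  # hold out 10% of the training data for early stopping
    n_iter_no_change=10       # stop after 10 iterations without improvement
)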
Key Features:
- Sequential learning for improved accuracy
- Hyperparameter tuning with RandomizedSearchCV
- Early stopping to prevent overfitting
- Advanced uncertainty detection
- Enhanced prediction confidence analysis
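The early stopping configured above (`validation_fraction=0.1`, `n_iter_no_change=10`) can be pictured as a patience counter on the validation loss. The sketch below is a simplified illustration of that idea, not scikit-learn's exact internal rule. `stopping_iteration` and the loss values are hypothetical, and the `tol` default mirrors the library's documented default of `1e-4`.

```python
def stopping_iteration(validation_losses, patience=10, tol=1e-4):
    """Return the 1-based iteration at which training would stop."""
    best = float("inf")
    stale = 0  # consecutive iterations without a meaningful improvement
    for iteration, loss in enumerate(validation_losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return iteration
    # Never ran out of patience: all boosting stages are used.
    return len(validation_losses)


# Hypothetical validation losses: improvement stalls after the third stage.
losses = [1.0, 0.8, 0.7] + [0.7] * 10
print(stopping_iteration(losses))  # 13
```

With `n_estimators=100`, a rule like this means the fitted model may contain fewer than 100 trees.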
Model | Accuracy | Status | Key Strengths |
---|---|---|---|
Logistic Regression | 84.83% | 🟢 Production | Interpretable, Fast, Reliable |
Random Forest | 84.11% | 🟡 Research | Robust, Feature Importance |
Gradient Boosting | ~85% | 🟡 Research | High Accuracy, Advanced Features |
Accuracy: 84.83%
Overall Performance: 0.85
Classification Report:
Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
0 | 0.85 | 0.95 | 0.90 | 35346 |
1 | 0.42 | 0.21 | 0.28 | 4631 |
2 | 0.74 | 0.64 | 0.69 | 10759 |
Accuracy | | | 0.85 | 50736 |
Macro avg | 0.67 | 0.60 | 0.62 | 50736 |
Weighted avg | 0.81 | 0.85 | 0.82 | 50736 |
- Class 0 (No Risk): Excellent precision (0.85) and recall (0.95)
- Class 1 (Mild Risk): Challenging to predict due to subtle symptoms and few examples (4,631 of 50,736), with low precision (0.42) and recall (0.21)
- Class 2 (Severe Risk): Good balance of precision (0.74) and recall (0.64)
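The gap between the headline accuracy and the macro average comes from class imbalance: the macro average weights all three classes equally, so the weak Mild Risk scores pull it down. A small sketch of that calculation, using the per-class values from the classification report above (the `macro_average` helper is a hypothetical name):

```python
# Per-class scores copied from the classification report above.
report = {
    "No Risk":     {"precision": 0.85, "recall": 0.95, "f1": 0.90},
    "Mild Risk":   {"precision": 0.42, "recall": 0.21, "f1": 0.28},
    "Severe Risk": {"precision": 0.74, "recall": 0.64, "f1": 0.69},
}


def macro_average(report, metric):
    """Unweighted mean of one metric across all classes."""
    return sum(scores[metric] for scores in report.values()) / len(report)


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_average(report, metric), 2))
```

The results reproduce the macro avg row of the report (0.67, 0.60, 0.62), which makes it the more informative number when the minority classes matter.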
- Modern Healthcare UI: Professional color scheme with medical theme
- Responsive Layout: Three-column design optimized for various screen sizes
- Interactive Forms: User-friendly input controls with validation
- Real-time Feedback: Instant prediction results with visual indicators
- 21 comprehensive health indicator inputs
- Dropdown selections for categorical variables
- Numeric inputs with validation ranges
- User-friendly labels and descriptions
# Prediction Process
input_data → preprocessing → scaling → model_prediction → risk_classification
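The flow above might look roughly like the following in code. This is a sketch under stated assumptions rather than the contents of `app.py`: `load_artifacts`, `predict_risk`, and `feature_order` are hypothetical names, the file names come from the project structure, and the model's classes are assumed to be ordered 0, 1, 2.

```python
RISK_LABELS = {0: "No Risk", 1: "Mild Risk", 2: "Severe Risk"}


def load_artifacts(model_path="logistic_regression_model.joblib",
                   scaler_path="scaler.joblib"):
    """Load the saved model and scaler (requires joblib and scikit-learn)."""
    import joblib  # imported lazily so the rest of the sketch runs without it
    return joblib.load(model_path), joblib.load(scaler_path)


def predict_risk(record, feature_order, model, scaler):
    """input_data -> preprocessing -> scaling -> prediction -> risk class."""
    # Preprocessing: arrange the inputs in the column order used in training.
    row = [[float(record[name]) for name in feature_order]]
    # Scaling: apply the scaler that was fitted on the training data.
    scaled = scaler.transform(row)
    # Prediction: one probability per class, assumed ordered as 0, 1, 2.
    probabilities = [float(p) for p in model.predict_proba(scaled)[0]]
    # Risk classification: the class with the highest probability wins.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return RISK_LABELS[best], dict(zip(RISK_LABELS.values(), probabilities))
```

With the real artifacts this would be called as `predict_risk(record, feature_order, *load_artifacts())`, where `feature_order` must match the column order the model was trained on.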
- Prediction Probabilities: Bar chart showing confidence levels
- Feature Importance: Interactive chart of decision factors
- Risk Assessment: Color-coded risk level indicators
/* Custom Healthcare Theme */
- Primary Color: #4CAF50 (Medical Green)
- Background: #f0f2f6 (Clean White-Gray)
- Accent Colors: Professional medical palette
- Typography: Clear, readable medical fonts
Python 3.12+
pip (Python package manager)
Git (for version control)
git clone https://github.com/your-username/predictive-model-for-assessing-the-risk-of-disease.git
cd predictive-model-for-assessing-the-risk-of-disease
# Windows
python -m venv venv
venv\Scripts\activate
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Check if all models are present
ls *.joblib
# Expected output:
# logistic_regression_model.joblib
# scaler.joblib
streamlit run app.py
The application will be available at: http://localhost:8501
- Launch the web application
- Enter the patient's health indicators
- Click "Predict Health Risk"
- Review the risk classification and probability scores
- Use the feature importance chart to understand key risk factors
- No Risk: Routine monitoring, lifestyle counseling
- Mild Risk: Increased monitoring, preventive interventions
- Severe Risk: Immediate clinical evaluation, intensive management
# Explore Jupyter notebooks
jupyter notebook "Health Risk Prediction Model_Logistic Regression.ipynb"
jupyter notebook "Health Risk Prediction Model(Random Forest Classifier) (1).ipynb"
jupyter notebook "Health Risk Prediction Model( Gradient Boosting Classifier) (1).ipynb"
- Compare accuracy scores across different algorithms
- Analyze feature importance variations
- Evaluate ROC curves and confusion matrices
predictive-model-for-assessing-the-risk-of-disease/
├── app.py # Streamlit web application
├── diabetes_012_health_indicators.xls # Training dataset
├── Health Risk Prediction Model_Logistic Regression.ipynb
├── Health Risk Prediction Model( Gradient Boosting Classifier) (1).ipynb
├── Health Risk Prediction Model(Random Forest Classifier) (1).ipynb
├── logistic_regression_model.joblib # Trained production model
├── scaler.joblib # Data preprocessing scaler
├── Procfile # Heroku deployment configuration
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .git/ # Git version control
└── .idea/ # IDE configuration files
Actual \ Predicted | No Risk | Mild Risk | Severe Risk |
---|---|---|---|
No Risk | 33,579 | 892 | 875 |
Mild Risk | 971 | 983 | 2,677 |
Severe Risk | 1,905 | 1,355 | 7,499 |
Interpretation:
- True Negatives (No Risk): 33,579 of 35,346 correctly identified
- True Positives (Severe Risk): 7,499 of 10,759 correctly identified
- False Positives: 1,767 No Risk patients (about 5%) were flagged as Mild or Severe Risk
- False Negatives: 3,260 Severe Risk patients (about 30%) were classified as lower risk
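To read a multi-class confusion matrix, it helps to normalize each row by the number of patients in that actual class. The sketch below does this for the matrix above. `row_breakdown` is a hypothetical helper and the counts are copied from the matrix as given.

```python
LABELS = ["No Risk", "Mild Risk", "Severe Risk"]
# Rows are actual classes, columns are predicted classes.
MATRIX = [
    [33579, 892, 875],
    [971, 983, 2677],
    [1905, 1355, 7499],
]


def row_breakdown(matrix, labels):
    """For each actual class, return the share of patients per predicted class."""
    breakdown = {}
    for label, row in zip(labels, matrix):
        total = sum(row)
        breakdown[label] = {pred: count / total
                            for pred, count in zip(labels, row)}
    return breakdown


shares = row_breakdown(MATRIX, LABELS)
healthy_flagged = 1 - shares["No Risk"]["No Risk"]
severe_missed = 1 - shares["Severe Risk"]["Severe Risk"]
print(f"No Risk patients flagged as at risk: {healthy_flagged:.1%}")
print(f"Severe Risk patients predicted in a lower class: {severe_missed:.1%}")
```

In a screening setting the second figure is usually the more important one, since a missed high-risk patient costs more than an unnecessary follow-up.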
# ROC Curve Performance
AUC Score: ~0.85 (Excellent discrimination ability)
Clinical Significance:
- AUC > 0.8: Excellent predictive performance
- Sensitivity: Strong for No Risk (recall 0.95), moderate for Severe Risk (0.64), and weak for Mild Risk (0.21)
- High Specificity: Good at correctly identifying low-risk patients
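An AUC of about 0.85 can be read as follows: a randomly chosen patient from a class receives a higher score for that class than a randomly chosen patient outside it about 85% of the time. The sketch below computes a one-vs-rest AUC directly from that definition. The data is hypothetical, and this document does not state how the multi-class AUC was actually computed in the notebooks.

```python
def one_vs_rest_auc(labels, scores, positive_class):
    """AUC as the chance that a positive case outscores a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == positive_class]
    negatives = [s for y, s in zip(labels, scores) if y != positive_class]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical data: true classes and the model's probability for class 2.
true_classes = [2, 0, 2, 1]
severe_scores = [0.9, 0.8, 0.7, 0.1]
print(one_vs_rest_auc(true_classes, severe_scores, positive_class=2))  # 0.75
```

The pairwise loop is quadratic and meant only to make the definition concrete; a library routine would be the practical choice on the full test set.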
- General Health (GenHlth) - Overall health assessment
- BMI - Body Mass Index (obesity indicator)
- High Blood Pressure (HighBP) - Cardiovascular risk factor
- Age - Age-related diabetes risk
- High Cholesterol (HighChol) - Metabolic risk factor
- Physical Health (PhysHlth) - Physical health issues
- Income - Socioeconomic health determinant
- Heart Disease or Attack - Cardiovascular comorbidity
- Education - Health literacy indicator
- Difficulty Walking (DiffWalk) - Physical mobility indicator
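# Accuracy Comparison Bar Chart (values in percent)
import matplotlib.pyplot as plt
models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting']
accuracy = [84.83, 84.11, 85.0]  # Gradient Boosting value is approximate (~85%)
plt.bar(models, accuracy)
plt.show()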
The model outputs probability scores for each risk class:
- Class 0 (No Risk): Probability of no diabetes risk
- Class 1 (Mild Risk): Probability of mild diabetes risk
- Class 2 (Severe Risk): Probability of severe diabetes risk
Example Output:
Patient Risk Assessment:
├── No Risk: 15%
├── Mild Risk: 25%
└── Severe Risk: 60% ← Final Prediction
# Procfile
web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0
# Runtime Requirements
Python 3.12
Streamlit 1.37.0
All dependencies in requirements.txt
# Login to Heroku
heroku login
# Create new app
heroku create diabetes-risk-predictor
# Deploy
git push heroku main
# Open application
heroku open
# Set production configurations
heroku config:set ENVIRONMENT=production
heroku config:set MODEL_PATH=logistic_regression_model.joblib
heroku config:set SCALER_PATH=scaler.joblib
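Config variables set this way reach the application as environment variables. One way the app could read them, with local fallbacks, is sketched below. `load_settings` and the `development` default are assumptions for illustration, not code taken from `app.py`.

```python
import os

# Fallbacks for local runs; the file names match the project structure.
DEFAULTS = {
    "ENVIRONMENT": "development",  # assumed default, not from the project
    "MODEL_PATH": "logistic_regression_model.joblib",
    "SCALER_PATH": "scaler.joblib",
}


def load_settings(env=None):
    """Read deployment settings, falling back to local defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


settings = load_settings()
print(settings["MODEL_PATH"])
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.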
# Run locally with hot reload
streamlit run app.py --server.runOnSave true
- Deep Learning Models: Neural networks for complex pattern recognition
- Ensemble Methods: Combine multiple models for improved accuracy
- Real-time Learning: Update models with new patient data
- Explainable AI: Advanced model interpretability features
- Multi-language Support: Localization for global healthcare
- Mobile Application: Native iOS/Android apps
- EMR Integration: Electronic Medical Record system connectivity
- API Development: RESTful API for third-party integrations
- Usage Analytics: Track application usage patterns
- Model Performance Monitoring: Continuous model evaluation
- A/B Testing: Compare different model versions
- Patient Feedback System: Collect outcome data for improvement
- HIPAA Compliance: Healthcare data privacy standards
- Data Encryption: End-to-end data protection
- User Authentication: Secure login system
- Audit Logging: Comprehensive activity tracking
- Fork the Repository
- Create Feature Branch: git checkout -b feature/new-model
- Make Changes: Implement new features or improvements
- Test Thoroughly: Ensure all tests pass
- Submit Pull Request: Detailed description of changes
- Model Development: New algorithms and improvements
- UI/UX Enhancement: Better user interface design
- Documentation: Improve project documentation
- Testing: Add comprehensive test coverage
- Performance Optimization: Speed and efficiency improvements
# Follow PEP 8 style guidelines
# Use type hints for function parameters
# Include docstrings for all functions
# Write unit tests for new features
# Update documentation for changes
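A contribution that follows these guidelines might look like the sketch below: a typed function with a docstring and a matching unit test. `risk_label` and `RiskLabelTest` are hypothetical names chosen for illustration. The class-to-label mapping is the one described in the overview.

```python
import unittest

RISK_LABELS: dict[int, str] = {0: "No Risk", 1: "Mild Risk", 2: "Severe Risk"}


def risk_label(class_id: int) -> str:
    """Return the display name for a predicted risk class.

    Raises:
        ValueError: if class_id is not 0, 1, or 2.
    """
    if class_id not in RISK_LABELS:
        raise ValueError(f"unknown risk class: {class_id}")
    return RISK_LABELS[class_id]


class RiskLabelTest(unittest.TestCase):
    """Unit tests covering the valid classes and the error path."""

    def test_known_classes(self) -> None:
        self.assertEqual(risk_label(0), "No Risk")
        self.assertEqual(risk_label(2), "Severe Risk")

    def test_unknown_class(self) -> None:
        with self.assertRaises(ValueError):
            risk_label(3)
```

Saved in a test module, this can be run with `python -m unittest`.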
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset Source: CDC Behavioral Risk Factor Surveillance System
- Machine Learning Libraries: scikit-learn, pandas, numpy
- Web Framework: Streamlit for rapid application development
- Visualization: matplotlib and seaborn for data visualization
- Healthcare Community: For feedback and validation
- Project Maintainer: Sabareesh S P
- Email: sabareeshsp7@gmail.com
✅ Production Ready: Deployed web application with professional UI
✅ High Accuracy: 84.83% accuracy in diabetes risk prediction
✅ Comprehensive: 21 health indicators for thorough assessment
✅ Scalable: Modular architecture for easy maintenance
✅ Well Documented: Complete documentation and code comments
✅ Multiple Models: Comparison of different ML algorithms
✅ Healthcare Focused: Designed for medical professionals
✅ Open Source: MIT license for community contribution
This project demonstrates the power of machine learning in healthcare, providing an accessible tool for diabetes risk assessment that can potentially save lives through early detection and intervention.