oxayavongsa/aai-540-mlops-final-group-4

🫀 Aorta Guard: Cardiovascular Disease Detection Pipeline

This repository contains the machine learning pipeline for detecting cardiovascular disease using clinical and lifestyle indicators. Developed as part of the AAI-540 MLOps course, the project includes data ingestion, cleaning, feature engineering, model training, batch inference, feature store setup, and SageMaker + CloudWatch monitoring.


🎯 Objective

Dataset Source: Cardiovascular Disease Dataset on Kaggle

Cardiovascular disease (CVD) remains the leading cause of death globally, yet many clinical interventions are reactive rather than preventive. This project uses machine learning to proactively identify at-risk individuals from routine health metrics.

We aim to shift from reactive care to proactive prevention using accessible, structured clinical data.


📁 Project Structure for relevant files

├── ci_cd/
│   └── ci_cd_complete.ipynb
│
├── cloudwatch_and_monitoring/
│   ├── cardio_cloudwatch_and_data_reports.ipynb
│   └── cardio_mointoring_endpoint_scheduling.ipynb
│
├── data_assets/
│   ├── cloudwatch_files/
│   │   ├── constraints.json
│   │   └── statistics.json
│   ├── data_splits/                                        # Data Split 40/10/40/10
│   │   ├── cardio_prod_split40.csv
│   │   ├── cardio_test_split10.csv
│   │   ├── cardio_train_split40.csv
│   │   └── cardio_val_split10.csv
│   ├── logistic/
│   │   ├── inference.py
│   │   ├── logistic_model.pkl
│   │   └── logistic_model.tar.gz
│   ├── random_forest/
│   │   ├── final_rf_model.tar.gz
│   │   └── inference_rf.py
│   ├── cardio_cleaned.csv
│   ├── cardio_engineered.csv                               # Cleaned & Engineered Dataset
│   └── cardio_train.csv                                    # Original Dataset
│
├── feature_store/
│   └── cardio_engineered_feature_store_setup.ipynb
│
├── image_output/                                           # All Visuals used for this project
│
├── notebooks_pipeline/
│   ├── Models/                                             # Logistic Baseline and Random Forest
│   ├── cardio_final_model.ipynb                            # Completed Final Notebook
│   └── cardio_inference_transform_both_models.ipynb        # Batch Transform Job for both models
│
├── requirements.txt
├── MIT License
└── README.md

🧪 Setup & Dependencies

Requirements

To install the full environment:

pip install -r requirements.txt

📊 Dataset Summary

The following datasets were used throughout the Aorta Guard machine learning pipeline, including training, validation, testing, production inference, and monitoring. All files are organized under data_assets/:

| File Path | Label | Shape | Description |
|---|---|---|---|
| data_splits/cardio_train_split40.csv | Training Split (40%) | (27,355, 24) | Used to train both baseline and optimized models |
| data_splits/cardio_val_split10.csv | Validation Split (10%) | (6,838, 24) | Used to validate and tune hyperparameters |
| data_splits/cardio_test_split10.csv | Test Split (10%) | (6,838, 24) | Used to evaluate final model performance |
| data_splits/cardio_prod_split40.csv | Production Reserve (40%) | (27,354, 24) | Held-out dataset for production inference and monitoring |
| data_splits/cardio_prod_no_label.csv | Prod No-Label (Logistic/RF) | (27,354, 23) | Inference-ready dataset for production use (labels removed) |
| data_splits/cardio_prod_split40_cat.csv | Production – Categorical Only | (27,354, 23) | Categorical columns subset (used for drift/bias monitoring) |
| data_splits/cardio_prod_split40_num.csv | Production – Numeric Only | (27,354, 23) | Numerical columns subset (used for drift/bias monitoring) |
| data_splits/cardio_prod_split40_no_label.csv | Production – No Label Split | (27,354, 23) | Alternate no-label version used in monitoring and transform jobs |
| data_splits/cardio_column_mapping.json | Column Index Mapping | n/a | JSON file mapping index to column names for interpretability |
| cloudwatch_files/statistics.json | Baseline Statistics | n/a | Generated by SageMaker Model Monitor for schema and distribution baseline |
| cloudwatch_files/constraints.json | Data Constraints | n/a | Defines schema expectations and quality rules for monitoring |
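As a rough illustration of the 40/10/10/40 partitioning described above, the sketch below shuffles row indices and cuts them into train, validation, test, and production partitions. The function name `split_indices`, the seed, and the rounding rule are assumptions made for this example; the notebooks may split differently, so partition sizes can differ by a row from the shapes in the table.

```python
import random


def split_indices(n_rows, seed=42):
    """Shuffle row indices and cut them into 40/10/10/40 partitions.

    Hypothetical helper: the seed and rounding rule are assumptions,
    not the project's actual splitting code.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_rows * 40 // 100
    n_val = n_rows * 10 // 100
    n_test = n_rows * 10 // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    prod = indices[n_train + n_val + n_test:]  # remainder goes to production
    return {"train": train, "val": val, "test": test, "prod": prod}


if __name__ == "__main__":
    parts = split_indices(100)
    print({name: len(rows) for name, rows in parts.items()})
```

Each index list can then be used to select rows from cardio_engineered.csv before writing the four split files.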

📊 Visual Insights Summary

  • Top Feature Importances: The Random Forest model ranked systolic_bp, chol_bmi_ratio, and bmi as the most important predictors of cardiovascular disease. This confirms the critical role of circulatory and metabolic health markers in early detection. (Figure: Final Model Feature Importance)
  • Cardio Outcome by Age Group: Risk increases significantly in individuals in their 50s and 60s. The highest number of positive cases is concentrated in the 50s group, emphasizing the need for proactive screening in middle age. (Figure: cardio_outcome_by_age_group)
  • BMI Distribution: Higher BMI is associated with greater cardiovascular risk, especially among those classified as overweight or obese.
  • Blood Pressure Categories: Stage 1 and Stage 2 hypertension are more common in cardio-positive individuals, linking elevated blood pressure to disease risk.
  • BMI Category Trends: Obese and overweight individuals showed greater prevalence of disease, highlighting BMI's diagnostic relevance.
  • Cholesterol/BMI Ratio: This ratio is slightly elevated in cardio-positive cases, suggesting metabolic imbalance or lipid-related risk.
  • Pulse Pressure Observations: Higher pulse pressure ranges were observed in the cardio-positive group, suggesting greater vascular strain.
  • Pairplot Observations: In BMI vs. chol_bmi_ratio, a clear inverse relationship and cluster separation between classes appear, hinting at decision boundaries the model may exploit.

🧠 Feature Engineering

This pipeline engineered and transformed features from the cleaned cardiovascular dataset to enhance model accuracy and interpretability. The process included outlier removal, derived metrics, and binning based on clinical relevance.

  • Input Sources: Clinical, lifestyle, and demographic indicators.
  • Engineered Features:
    • bmi: Body Mass Index (from height and weight)
    • pulse_pressure: Difference between systolic and diastolic BP
    • age_years: Converted from days to years
    • chol_bmi_ratio: Cholesterol level divided by BMI
    • age_gluc_interaction: Interaction between age and glucose level
    • lifestyle_score: Composite score from smoking, alcohol, and physical activity
  • Categorical Binning:
    • bp_category: Hypertension stage (normal, stage1, stage2)
    • bmi_category: Weight classification (underweight, normal, overweight, obese)
    • age_group: Age buckets (30s, 40s, 50s, 60s)
  • Final Feature Count: 24 columns (23 input features plus the target label, matching the dataset shapes listed above)
  • Final Output File: cardio_engineered.csv stored in data_assets/

These engineered features enabled both linear (Logistic Regression) and non-linear (Random Forest) models to capture medical patterns that would be missed with raw data alone.
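The derived metrics and bins listed above can be sketched for a single record as follows. This is a minimal illustration, not the project's code: it assumes the raw Kaggle column names and units (`age` in days, `height` in cm, `weight` in kg, `ap_hi`/`ap_lo` for systolic/diastolic pressure, `cholesterol`, `gluc`), and the blood-pressure and BMI cut-offs are common clinical thresholds chosen for the example. `lifestyle_score` is omitted because the README does not give its formula, and any scaling applied to `chol_bmi_ratio` in the real pipeline is unknown.

```python
def engineer_features(row):
    """Derive engineered features from one raw record (a dict).

    Column names, units, and category thresholds are assumptions
    for illustration only.
    """
    height_m = row["height"] / 100.0            # assumed: height in cm
    bmi = row["weight"] / (height_m ** 2)       # assumed: weight in kg
    age_years = int(row["age"] / 365.25)        # assumed: age in days
    sys_bp, dia_bp = row["ap_hi"], row["ap_lo"]

    # Assumed hypertension staging thresholds.
    if sys_bp >= 140 or dia_bp >= 90:
        bp_category = "stage2"
    elif sys_bp >= 130 or dia_bp >= 80:
        bp_category = "stage1"
    else:
        bp_category = "normal"

    # Assumed standard BMI classes.
    if bmi < 18.5:
        bmi_category = "underweight"
    elif bmi < 25:
        bmi_category = "normal"
    elif bmi < 30:
        bmi_category = "overweight"
    else:
        bmi_category = "obese"

    return {
        "bmi": bmi,
        "pulse_pressure": sys_bp - dia_bp,
        "age_years": age_years,
        "chol_bmi_ratio": row["cholesterol"] / bmi,
        "age_gluc_interaction": age_years * row["gluc"],
        "bp_category": bp_category,
        "bmi_category": bmi_category,
        "age_group": f"{(age_years // 10) * 10}s",
    }


if __name__ == "__main__":
    sample = {"age": 18500, "height": 168, "weight": 62.0,
              "ap_hi": 110, "ap_lo": 80, "cholesterol": 1, "gluc": 1}
    print(engineer_features(sample))
```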


🔬 Model Overview

We developed and evaluated three versions of our cardiovascular disease prediction models:

⚙️ Baseline Model: Logistic Regression

  • Algorithm: Logistic Regression (Scikit-learn)
  • Hyperparameters: max_iter=1000, random_state=42
  • Training Set: 40% of the cleaned and engineered dataset
  • Validation Accuracy: 73%
  • Validation AUC: 0.791
  • Inference Files:
    • logistic_model.pkl
    • inference.py
    • logistic_model.tar.gz
  • Batch Input: cardio_prod_no_label.csv
  • Stored In: data_assets/logistic/

🌲 Initial Model: Random Forest

  • Algorithm: Random Forest Classifier (Scikit-learn)
  • Hyperparameters: n_estimators=100, random_state=42
  • Training Set: 40% of the engineered dataset
  • Validation Accuracy: 73%
  • Validation AUC: 0.797
  • Inference Files:
    • final_rf_model.joblib
    • inference_rf.py
    • final_rf_model.tar.gz
  • Batch Input: cardio_prod_no_label_rf.csv
  • Stored In: data_assets/random_forest/

🏁 Final Model: Random Forest (Tuned)

  • Improved Hyperparameters: Tuned with RandomizedSearchCV
  • Performance: Slight AUC improvement over baseline
  • Reason for Selection: Better feature importance explanations and flexible deployment
  • Deployment: Used in real-time monitoring endpoint (cardio-logistic-monitor-endpoint)
  • Monitoring: Integrated with SageMaker Model Monitor and CloudWatch Dashboards
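The three model variants above could be trained and compared along the following lines. This is a hedged sketch on synthetic data from `make_classification`, not the project's notebooks: the `max_iter`, `n_estimators`, and `random_state` values come from the lists above, while the search space, `n_iter`, and `cv` settings passed to `RandomizedSearchCV` are hypothetical because the README does not state them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def evaluate(model, X_val, y_val):
    """Return validation accuracy and ROC AUC for a fitted classifier."""
    proba = model.predict_proba(X_val)[:, 1]
    return {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "auc": roc_auc_score(y_val, proba),
    }


def demo():
    # Synthetic stand-in for the engineered cardio features (assumption).
    X, y = make_classification(n_samples=600, n_features=10, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Hyperparameters stated in the README.
    baseline = LogisticRegression(max_iter=1000, random_state=42)
    baseline.fit(X_train, y_train)
    initial_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    initial_rf.fit(X_train, y_train)

    # Hypothetical search space; the real tuning grid is not documented.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions={
            "n_estimators": [50, 100, 200],
            "max_depth": [None, 5, 10],
            "min_samples_leaf": [1, 2, 5],
        },
        n_iter=5,
        scoring="roc_auc",
        cv=3,
        random_state=42,
    )
    search.fit(X_train, y_train)

    models = {
        "logistic_baseline": baseline,
        "random_forest_initial": initial_rf,
        "random_forest_tuned": search.best_estimator_,
    }
    return {name: evaluate(m, X_val, y_val) for name, m in models.items()}


if __name__ == "__main__":
    for name, metrics in demo().items():
        print(name, metrics)
```

Scoring the search on `roc_auc` mirrors the AUC figures reported above; results on synthetic data will not match the README's numbers.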

🔍 Model Evaluation

The precision-recall curve shows the trade-off between recall (sensitivity) and precision across classification thresholds. This view is especially useful for medical prediction tasks such as cardiovascular screening, where class imbalance is common and both false positives and false negatives carry significant clinical weight.

(Figure: precision_recall_curve_comparison)
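To make the threshold trade-off concrete, the sketch below computes precision and recall at a single threshold using only the standard library. The helper name `precision_recall_at` and the toy labels and scores are illustrative assumptions, not project outputs.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0]              # toy labels
    scores = [0.9, 0.8, 0.6, 0.3, 0.1]    # toy predicted probabilities
    for threshold in (0.5, 0.85):
        print(threshold, precision_recall_at(y_true, scores, threshold))
```

Raising the threshold from 0.5 to 0.85 on this toy data lifts precision while lowering recall, which is the trade-off the curve traces across all thresholds.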

📡 Monitoring & CloudWatch Insights

Our SageMaker deployment includes real-time monitoring using Amazon CloudWatch. The dashboard tracks CPU and memory utilization, disk activity, and invocation error rates. Below is a 3-hour snapshot of model monitoring activity. (Figure: cloudwatch_results_3hr_060925)


🧪 How to Test the Final Model

import boto3

# Initialize runtime client (change as needed)
runtime = boto3.client('sagemaker-runtime', region_name='us-east-1')

# Replace with your deployed endpoint name
endpoint_name = "cardio-logistic-monitor-endpoint"

# Example payload: a comma-separated string of 23 feature values
payload = "50,2,5.51,136.69,110,80,1,1,0,0,1,21.98,50s,Normal,30,4.55,66.12,50,0,stage1,normal,50,-1"

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload
)

prediction = response["Body"].read().decode("utf-8")
print("Prediction:", prediction.strip())
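Because the endpoint expects exactly 23 comma-separated values, it can help to build and check the payload before calling `invoke_endpoint`. The helper below is a hypothetical convenience, not part of the repository; `build_csv_payload` and its `expected_fields` parameter are names invented for this sketch, and it performs no type or range checks on individual features.

```python
def build_csv_payload(values, expected_fields=23):
    """Join feature values into a CSV row, checking the field count first."""
    fields = [str(value) for value in values]
    if len(fields) != expected_fields:
        raise ValueError(
            f"expected {expected_fields} fields, got {len(fields)}"
        )
    # Embedded separators would shift columns when the row is parsed.
    if any("," in field or "\n" in field for field in fields):
        raise ValueError("field values must not contain commas or newlines")
    return ",".join(fields)


if __name__ == "__main__":
    # Same example record as the payload shown above.
    features = [50, 2, 5.51, 136.69, 110, 80, 1, 1, 0, 0, 1, 21.98,
                "50s", "Normal", 30, 4.55, 66.12, 50, 0,
                "stage1", "normal", 50, -1]
    print(build_csv_payload(features))
```

The returned string can be passed as `Body` in the request above.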

🧠 System Architecture Overview

The diagram below illustrates the full architecture of the Aorta Guard: Machine Learning-Based Cardiovascular Risk Prediction System, built on AWS SageMaker. It showcases how raw data stored in Amazon S3 flows through Athena queries, preprocessing in SageMaker notebooks, and into Feature Store for engineered feature versioning. The CI/CD pipeline automates model training, evaluation, and deployment, while batch inference jobs deliver risk predictions at scale. Post-deployment, the system leverages SageMaker Model Monitor and Amazon CloudWatch to ensure model quality, detect drift, and maintain infrastructure health. This modular, production-ready design supports data lineage, monitoring, and retraining workflows in a clinical decision support context. (Figure: Architecture Diagram)


🎥 Presentation Video

Watch our full project walkthrough below, showcasing the data pipeline, CI/CD integration, model training, deployment, and monitoring in action:

Watch the video


👥 Team Info

AAI-540 Group 4 – Aorta Guard

  • Prema Mallikarjunan
  • Outhai Xayavongsa (Team Lead)
