This repository contains the machine learning pipeline for detecting cardiovascular disease using clinical and lifestyle indicators. Developed as part of the AAI-540 MLOps course, the project includes data ingestion, cleaning, feature engineering, model training, batch inference, feature store setup, and SageMaker + CloudWatch monitoring.
Dataset Source: Cardiovascular Disease Dataset on Kaggle
Cardiovascular disease (CVD) remains the leading cause of death globally, yet many clinical interventions are reactive rather than preventive. This project uses machine learning to proactively identify at-risk individuals based on routine health metrics.
We aim to shift from reactive care to proactive prevention using accessible, structured clinical data.
```
├── ci_cd/
│   └── ci_cd_complete.ipynb
│
├── cloudwatch_and_monitoring/
│   ├── cardio_cloudwatch_and_data_reports.ipynb
│   └── cardio_mointoring_endpoint_scheduling.ipynb
│
├── data_assets/
│   ├── cloudwatch_files/
│   │   ├── constraints.json
│   │   └── statistics.json
│   ├── data_splits/                     # Data split 40/10/40/10
│   │   ├── cardio_prod_split40.csv
│   │   ├── cardio_test_split10.csv
│   │   ├── cardio_train_split40.csv
│   │   └── cardio_val_split10.csv
│   ├── logistic/
│   │   ├── inference.py
│   │   ├── logistic_model.pkl
│   │   └── logistic_model.tar.gz
│   ├── random_forest/
│   │   ├── final_rf_model.tar.gz
│   │   └── inference_rf.py
│   ├── cardio_cleaned.csv
│   ├── cardio_engineered.csv            # Cleaned & engineered dataset
│   └── cardio_train.csv                 # Original dataset
│
├── feature_store/
│   └── cardio_engineered_feature_store_setup.ipynb
│
├── image_output/                        # All visuals used in this project
│
├── notebooks_pipeline/
│   ├── Models/                          # Logistic baseline and Random Forest
│   ├── cardio_final_model.ipynb         # Completed final notebook
│   └── cardio_inference_transform_both_models.ipynb  # Batch transform job for both models
│
├── requirements.txt
├── LICENSE                              # MIT License
└── README.md
```
To install the full environment:

```bash
pip install -r requirements.txt
```
The following datasets were used throughout the Aorta Guard machine learning pipeline, including training, validation, testing, production inference, and monitoring. All files are organized under `data_assets/`:
| File Path | Label | Shape | Description |
|---|---|---|---|
| `data_splits/cardio_train_split40.csv` | Training Split (40%) | (27,355, 24) | Used to train both baseline and optimized models |
| `data_splits/cardio_val_split10.csv` | Validation Split (10%) | (6,838, 24) | Used to validate and tune hyperparameters |
| `data_splits/cardio_test_split10.csv` | Test Split (10%) | (6,838, 24) | Used to evaluate final model performance |
| `data_splits/cardio_prod_split40.csv` | Production Reserve (40%) | (27,354, 24) | Held-out dataset for production inference and monitoring |
| `data_splits/cardio_prod_no_label.csv` | Prod No-Label (Logistic/RF) | (27,354, 23) | Inference-ready dataset for production use (labels removed) |
| `data_splits/cardio_prod_split40_cat.csv` | Production (Categorical Only) | (27,354, 23) | Categorical-column subset (used for drift/bias monitoring) |
| `data_splits/cardio_prod_split40_num.csv` | Production (Numeric Only) | (27,354, 23) | Numeric-column subset (used for drift/bias monitoring) |
| `data_splits/cardio_prod_split40_no_label.csv` | Production (No Label) | (27,354, 23) | Alternate no-label version used in monitoring and transform jobs |
| `data_splits/cardio_column_mapping.json` | Column Index Mapping | N/A | JSON file mapping index to column names for interpretability |
| `cloudwatch_files/statistics.json` | Baseline Statistics | N/A | Generated by SageMaker Model Monitor for schema and distribution baseline |
| `cloudwatch_files/constraints.json` | Data Constraints | N/A | Defines schema expectations and quality rules for monitoring |
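
The 40/10/40/10 split can be reproduced with chained `train_test_split` calls. The sketch below is illustrative only: the `random_state` and stratification on an assumed `cardio` label column are not documented choices of this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data_assets/cardio_engineered.csv")

# 40% train, then 10% val (1/6 of the remaining 60%),
# then 40% prod / 10% test from what is left.
train, rest = train_test_split(df, train_size=0.40, random_state=42, stratify=df["cardio"])
val, rest = train_test_split(rest, train_size=1 / 6, random_state=42, stratify=rest["cardio"])
prod, test = train_test_split(rest, train_size=0.80, random_state=42, stratify=rest["cardio"])

train.to_csv("data_assets/data_splits/cardio_train_split40.csv", index=False)
val.to_csv("data_assets/data_splits/cardio_val_split10.csv", index=False)
prod.to_csv("data_assets/data_splits/cardio_prod_split40.csv", index=False)
test.to_csv("data_assets/data_splits/cardio_test_split10.csv", index=False)
```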
- Top Feature Importances: The Random Forest model ranked `systolic_bp`, `chol_bmi_ratio`, and `bmi` as the most important predictors of cardiovascular disease. This confirms the critical role of circulatory and metabolic health markers in early detection.
- Cardio Outcome by Age Group: Risk increases significantly in individuals in their 50s and 60s. The highest number of positive cases is concentrated in the 50s group, emphasizing the need for proactive screening in middle age.
- BMI Distribution: Higher BMI is associated with greater cardiovascular risk, especially among those classified as overweight or obese.
- Blood Pressure Categories: Stage 1 and Stage 2 hypertension are more common in cardio-positive individuals, linking elevated blood pressure to disease risk.
- BMI Category Trends: Obese and overweight individuals showed greater prevalence of disease, highlighting BMI's diagnostic relevance.
- Cholesterol/BMI Ratio: This ratio is slightly elevated in cardio-positive cases, suggesting metabolic imbalance or lipid-related risk.
- Pulse Pressure Observations: Higher pulse pressure ranges were observed in the cardio-positive group, suggesting greater vascular strain.
- Pairplot Observations: In BMI vs. chol_bmi_ratio, a clear inverse relationship and cluster separation between classes appear, hinting at decision boundaries the model may exploit.
This pipeline engineered and transformed features from the cleaned cardiovascular dataset to enhance model accuracy and interpretability. The process included outlier removal, derived metrics, and binning based on clinical relevance.
- Input Sources: Clinical, lifestyle, and demographic indicators.
- Engineered Features:
  - `bmi`: Body Mass Index (from height and weight)
  - `pulse_pressure`: Difference between systolic and diastolic BP
  - `age_years`: Converted from days to years
  - `chol_bmi_ratio`: Cholesterol level divided by BMI
  - `age_gluc_interaction`: Interaction between age and glucose level
  - `lifestyle_score`: Composite score from smoking, alcohol, and physical activity
- Categorical Binning:
  - `bp_category`: Hypertension stage (normal, stage1, stage2)
  - `bmi_category`: Weight classification (underweight, normal, overweight, obese)
  - `age_group`: Age buckets (30s, 40s, 50s, 60s)
- Final Feature Count: 24 input features
- Final Output File: `cardio_engineered.csv`, stored in `data_assets/`
These engineered features enabled both linear (Logistic Regression) and non-linear (Random Forest) models to capture medical patterns that would be missed with raw data alone.
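
A minimal sketch of these transformations, assuming the original Kaggle column names (`age` in days, `ap_hi`/`ap_lo` for blood pressure, `gluc`, `smoke`, `alco`, `active`); the bin edges and the `lifestyle_score` formula here are illustrative assumptions, not the project's exact definitions:

```python
import pandas as pd

df = pd.read_csv("data_assets/cardio_cleaned.csv")

# Derived metrics (column names assume the original Kaggle schema)
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df["pulse_pressure"] = df["ap_hi"] - df["ap_lo"]
df["age_years"] = (df["age"] / 365.25).round(1)
df["chol_bmi_ratio"] = df["cholesterol"] / df["bmi"]
df["age_gluc_interaction"] = df["age_years"] * df["gluc"]
# Illustrative composite; the project's exact scoring formula may differ
df["lifestyle_score"] = df["active"] - df["smoke"] - df["alco"]

# Clinically motivated bins (edges are assumptions)
df["bp_category"] = pd.cut(df["ap_hi"], bins=[0, 120, 140, 300],
                           labels=["normal", "stage1", "stage2"])
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                            labels=["underweight", "normal", "overweight", "obese"])
df["age_group"] = pd.cut(df["age_years"], bins=[29, 40, 50, 60, 70],
                         labels=["30s", "40s", "50s", "60s"])

df.to_csv("data_assets/cardio_engineered.csv", index=False)
```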
We developed and evaluated three versions of our cardiovascular disease prediction models:
**Model 1: Logistic Regression (Baseline)**

- Algorithm: Logistic Regression (scikit-learn)
- Hyperparameters: `max_iter=1000`, `random_state=42`
- Training Set: 40% of the cleaned and engineered dataset
- Validation Accuracy: 73%
- Validation AUC: 0.791
- Inference Files: `logistic_model.pkl`, `inference.py`, `logistic_model.tar.gz`
- Batch Input: `cardio_prod_no_label.csv`
- Stored In: `data_assets/logistic/`
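
A hedged sketch of how this baseline could be trained and scored, assuming the splits are model-ready (numeric or already-encoded features) and that the label column is named `cardio`:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Label column name "cardio" is an assumption based on the source dataset.
train = pd.read_csv("data_assets/data_splits/cardio_train_split40.csv")
val = pd.read_csv("data_assets/data_splits/cardio_val_split10.csv")
X_train, y_train = train.drop(columns=["cardio"]), train["cardio"]
X_val, y_val = val.drop(columns=["cardio"]), val["cardio"]

# Hyperparameters as documented above
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

val_probs = clf.predict_proba(X_val)[:, 1]
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
print("Validation AUC:", roc_auc_score(y_val, val_probs))

joblib.dump(clf, "data_assets/logistic/logistic_model.pkl")
```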
**Model 2: Random Forest Classifier**

- Algorithm: Random Forest Classifier (scikit-learn)
- Hyperparameters: `n_estimators=100`, `random_state=42`
- Training Set: 40% of the engineered dataset
- Validation Accuracy: 73%
- Validation AUC: 0.797
- Inference Files: `final_rf_model.joblib`, `inference_rf.py`, `final_rf_model.tar.gz`
- Batch Input: `cardio_prod_no_label_rf.csv`
- Stored In: `data_assets/random_forest/`
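
The same splits can drive the Random Forest and reproduce the feature-importance ranking reported in the insights above; this sketch reuses `X_train`/`y_train` from the baseline example:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters as documented above; X_train/y_train from the baseline sketch
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(3))
# Expected leaders per the insights above: systolic_bp, chol_bmi_ratio, bmi
```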
**Model 3: Optimized Random Forest (Final)**

- Improved Hyperparameters: Tuned with `RandomizedSearchCV`
- Performance: Slight AUC improvement over baseline
- Reason for Selection: Better feature-importance explanations and flexible deployment
- Deployment: Used in the real-time monitoring endpoint (`cardio-logistic-monitor-endpoint`)
- Monitoring: Integrated with SageMaker Model Monitor and CloudWatch dashboards
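
A sketch of the tuning step; the search space and iteration count below are illustrative assumptions, since the README does not record the exact grid:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; adjust to the team's actual grid.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": [None, 10, 20, 30],
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",  # matches the AUC-based model comparison
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train/y_train from the baseline sketch
print("Best CV AUC:", search.best_score_)
print("Best params:", search.best_params_)
```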
The precision-recall curve illustrates the trade-off between sensitivity (recall) and the precision of our classifier at various thresholds. This is especially helpful in imbalanced medical datasets like cardiovascular prediction, where false positives and false negatives carry significant clinical weight.
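
The curve can be regenerated from the validation probabilities in the baseline sketch; the output filename under `image_output/` is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_val / val_probs come from the baseline training sketch above
precision, recall, _ = precision_recall_curve(y_val, val_probs)
ap = average_precision_score(y_val, val_probs)

plt.plot(recall, precision, label=f"AP = {ap:.3f}")
plt.xlabel("Recall (sensitivity)")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.savefig("image_output/precision_recall_curve.png")  # assumed filename
```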
Our SageMaker deployment includes real-time monitoring using Amazon CloudWatch. The dashboard tracks CPU and memory utilization, disk activity, and invocation error rates. Below is a 3-hour snapshot of model monitoring activity.
```python
import boto3

# Initialize the SageMaker runtime client (change the region as needed)
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Replace with your deployed endpoint name
endpoint_name = "cardio-logistic-monitor-endpoint"

# Example payload: a comma-separated string of 23 feature values
payload = "50,2,5.51,136.69,110,80,1,1,0,0,1,21.98,50s,Normal,30,4.55,66.12,50,0,stage1,normal,50,-1"

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

prediction = response["Body"].read().decode("utf-8")
print("Prediction:", prediction.strip())
```
The diagram below illustrates the full architecture of the Aorta Guard: Machine Learning-Based Cardiovascular Risk Prediction System, built on AWS SageMaker. It showcases how raw data stored in Amazon S3 flows through Athena queries, preprocessing in SageMaker notebooks, and into Feature Store for engineered feature versioning. The CI/CD pipeline automates model training, evaluation, and deployment, while batch inference jobs deliver risk predictions at scale. Post-deployment, the system leverages SageMaker Model Monitor and Amazon CloudWatch to ensure model quality, detect drift, and maintain infrastructure health. This modular, production-ready design supports data lineage, monitoring, and retraining workflows in a clinical decision support context.
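
A hedged sketch of the Feature Store step named in this architecture; the feature group name, record identifier, and bucket layout are assumptions (the real definitions live in `feature_store/cardio_engineered_feature_store_setup.ipynb`):

```python
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

df = pd.read_csv("data_assets/cardio_engineered.csv")
# Feature Store requires a record identifier and an event-time column.
df["record_id"] = df.index.astype("int64")
df["event_time"] = float(time.time())
# Cast object columns so their feature types resolve to String.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].astype("string")

fg = FeatureGroup(name="cardio-engineered-features",  # assumed name
                  sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri=f"s3://{session.default_bucket()}/cardio-feature-store",
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)
fg.ingest(data_frame=df, max_workers=3, wait=True)
```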
Watch our full project walkthrough below, showcasing the data pipeline, CI/CD integration, model training, deployment, and monitoring in action:
AAI-540 Group 4 - Aorta Guard
- Prema Mallikarjunan
- Outhai Xayavongsa (Team Lead)