This repository contains the machine learning pipeline for detecting cardiovascular disease using clinical and lifestyle indicators. Developed as part of the AAI-540 MLOps course, the project includes data ingestion, cleaning, feature engineering, model training, batch inference, feature store setup, and SageMaker + CloudWatch monitoring.
Dataset Source: Cardiovascular Disease Dataset on Kaggle
Cardiovascular disease (CVD) remains the leading cause of death globally, yet many clinical interventions are reactive rather than preventive. This project uses machine learning to proactively identify at-risk individuals based on routine health metrics.
We aim to shift from reactive care to proactive prevention using accessible, structured clinical data.
```
├── ci_cd/
│   └── ci_cd_complete.ipynb
│
├── cloudwatch_and_monitoring/
│   ├── cardio_cloudwatch_and_data_reports.ipynb
│   └── cardio_mointoring_endpoint_scheduling.ipynb
│
├── data_assets/
│   ├── cloudwatch_files/
│   │   ├── constraints.json
│   │   └── statistics.json
│   ├── data_splits/                     # Data split 40/10/40/10
│   │   ├── cardio_prod_split40.csv
│   │   ├── cardio_test_split10.csv
│   │   ├── cardio_train_split40.csv
│   │   └── cardio_val_split10.csv
│   ├── logistic/
│   │   ├── inference.py
│   │   ├── logistic_model.pkl
│   │   └── logistic_model.tar.gz
│   ├── random_forest/
│   │   ├── final_rf_model.tar.gz
│   │   └── inference_rf.py
│   ├── cardio_cleaned.csv
│   ├── cardio_engineered.csv            # Cleaned & engineered dataset
│   └── cardio_train.csv                 # Original dataset
│
├── feature_store/
│   └── cardio_engineered_feature_store_setup.ipynb
│
├── image_output/                        # All visuals used in this project
│
├── notebooks_pipeline/
│   ├── Models/                          # Logistic baseline and Random Forest
│   ├── cardio_final_model.ipynb         # Completed final notebook
│   └── cardio_inference_transform_both_models.ipynb  # Batch transform job for both models
│
├── requirements.txt
├── LICENSE                              # MIT License
└── README.md
```
To install the full environment:

```bash
pip install -r requirements.txt
```
The following datasets were used throughout the Aorta Guard machine learning pipeline, including training, validation, testing, production inference, and monitoring. All files are organized under `data_assets/`:
| File Path | Label | Shape | Description |
|---|---|---|---|
| `data_splits/cardio_train_split40.csv` | Training Split (40%) | (27,355, 24) | Used to train both baseline and optimized models |
| `data_splits/cardio_val_split10.csv` | Validation Split (10%) | (6,838, 24) | Used to validate and tune hyperparameters |
| `data_splits/cardio_test_split10.csv` | Test Split (10%) | (6,838, 24) | Used to evaluate final model performance |
| `data_splits/cardio_prod_split40.csv` | Production Reserve (40%) | (27,354, 24) | Held-out dataset for production inference and monitoring |
| `data_splits/cardio_prod_no_label.csv` | Prod No-Label (Logistic/RF) | (27,354, 23) | Inference-ready dataset for production use (labels removed) |
| `data_splits/cardio_prod_split40_cat.csv` | Production (Categorical Only) | (27,354, 23) | Categorical-column subset (used for drift/bias monitoring) |
| `data_splits/cardio_prod_split40_num.csv` | Production (Numeric Only) | (27,354, 23) | Numeric-column subset (used for drift/bias monitoring) |
| `data_splits/cardio_prod_split40_no_label.csv` | Production (No Label) | (27,354, 23) | Alternate no-label version used in monitoring and transform jobs |
| `data_splits/cardio_column_mapping.json` | Column Index Mapping | N/A | JSON file mapping index to column names for interpretability |
| `cloudwatch_files/statistics.json` | Baseline Statistics | N/A | Generated by SageMaker Model Monitor for schema and distribution baseline |
| `cloudwatch_files/constraints.json` | Data Constraints | N/A | Defines schema expectations and quality rules for monitoring |
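
The 40/10/40/10 split can be reproduced with chained `train_test_split` calls. The sketch below is illustrative only: the `random_state` and stratification on an assumed `cardio` label column are not documented choices of this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data_assets/cardio_engineered.csv")

# 40% train, then 10% val (1/6 of the remaining 60%),
# then 40% prod / 10% test from what is left.
train, rest = train_test_split(df, train_size=0.40, random_state=42, stratify=df["cardio"])
val, rest = train_test_split(rest, train_size=1 / 6, random_state=42, stratify=rest["cardio"])
prod, test = train_test_split(rest, train_size=0.80, random_state=42, stratify=rest["cardio"])

train.to_csv("data_assets/data_splits/cardio_train_split40.csv", index=False)
val.to_csv("data_assets/data_splits/cardio_val_split10.csv", index=False)
prod.to_csv("data_assets/data_splits/cardio_prod_split40.csv", index=False)
test.to_csv("data_assets/data_splits/cardio_test_split10.csv", index=False)
```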
- Top Feature Importances: The Random Forest model ranked `systolic_bp`, `chol_bmi_ratio`, and `bmi` as the most important predictors of cardiovascular disease. This confirms the critical role of circulatory and metabolic health markers in early detection.
- Cardio Outcome by Age Group: Risk increases significantly in individuals in their 50s and 60s. The highest number of positive cases is concentrated in the 50s group, emphasizing the need for proactive screening in middle age.
- BMI Distribution: Higher BMI is associated with greater cardiovascular risk, especially among those classified as overweight or obese.
- Blood Pressure Categories: Stage 1 and Stage 2 hypertension are more common in cardio-positive individuals, linking elevated blood pressure to disease risk.
- BMI Category Trends: Obese and overweight individuals showed greater prevalence of disease, highlighting BMI's diagnostic relevance.
- Cholesterol/BMI Ratio: This ratio is slightly elevated in cardio-positive cases, suggesting metabolic imbalance or lipid-related risk.
- Pulse Pressure Observations: Higher pulse pressure ranges were observed in the cardio-positive group, suggesting greater vascular strain.
- Pairplot Observations: In BMI vs. chol_bmi_ratio, a clear inverse relationship and cluster separation between classes appear, hinting at decision boundaries the model may exploit.
This pipeline engineered and transformed features from the cleaned cardiovascular dataset to enhance model accuracy and interpretability. The process included outlier removal, derived metrics, and binning based on clinical relevance.
- Input Sources: Clinical, lifestyle, and demographic indicators.
- Engineered Features:
  - `bmi`: Body Mass Index (from height and weight)
  - `pulse_pressure`: Difference between systolic and diastolic BP
  - `age_years`: Converted from days to years
  - `chol_bmi_ratio`: Cholesterol level divided by BMI
  - `age_gluc_interaction`: Interaction between age and glucose level
  - `lifestyle_score`: Composite score from smoking, alcohol, and physical activity
- Categorical Binning:
  - `bp_category`: Hypertension stage (normal, stage1, stage2)
  - `bmi_category`: Weight classification (underweight, normal, overweight, obese)
  - `age_group`: Age buckets (30s, 40s, 50s, 60s)
- Final Feature Count: 24 input features
- Final Output File: `cardio_engineered.csv`, stored in `data_assets/`
These engineered features enabled both linear (Logistic Regression) and non-linear (Random Forest) models to capture medical patterns that would be missed with raw data alone.
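
A minimal sketch of these transformations, assuming the original Kaggle column names (`age` in days, `ap_hi`/`ap_lo` for blood pressure, `gluc`, `smoke`, `alco`, `active`); the bin edges and the `lifestyle_score` formula here are illustrative assumptions, not the project's exact definitions:

```python
import pandas as pd

df = pd.read_csv("data_assets/cardio_cleaned.csv")

# Derived metrics (column names assume the original Kaggle schema)
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df["pulse_pressure"] = df["ap_hi"] - df["ap_lo"]
df["age_years"] = (df["age"] / 365.25).round(1)
df["chol_bmi_ratio"] = df["cholesterol"] / df["bmi"]
df["age_gluc_interaction"] = df["age_years"] * df["gluc"]
# Illustrative composite; the project's exact scoring formula may differ
df["lifestyle_score"] = df["active"] - df["smoke"] - df["alco"]

# Clinically motivated bins (edges are assumptions)
df["bp_category"] = pd.cut(df["ap_hi"], bins=[0, 120, 140, 300],
                           labels=["normal", "stage1", "stage2"])
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                            labels=["underweight", "normal", "overweight", "obese"])
df["age_group"] = pd.cut(df["age_years"], bins=[29, 40, 50, 60, 70],
                         labels=["30s", "40s", "50s", "60s"])

df.to_csv("data_assets/cardio_engineered.csv", index=False)
```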
We developed and evaluated three versions of our cardiovascular disease prediction models:
**Model 1: Logistic Regression (Baseline)**

- Algorithm: Logistic Regression (scikit-learn)
- Hyperparameters: `max_iter=1000`, `random_state=42`
- Training Set: 40% of the cleaned and engineered dataset
- Validation Accuracy: 73%
- Validation AUC: 0.791
- Inference Files: `logistic_model.pkl`, `inference.py`, `logistic_model.tar.gz`
- Batch Input: `cardio_prod_no_label.csv`
- Stored In: `data_assets/logistic/`
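
A hedged sketch of how this baseline could be trained and scored, assuming the splits are model-ready (numeric or already-encoded features) and that the label column is named `cardio`:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Label column name "cardio" is an assumption based on the source dataset.
train = pd.read_csv("data_assets/data_splits/cardio_train_split40.csv")
val = pd.read_csv("data_assets/data_splits/cardio_val_split10.csv")
X_train, y_train = train.drop(columns=["cardio"]), train["cardio"]
X_val, y_val = val.drop(columns=["cardio"]), val["cardio"]

# Hyperparameters as documented above
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

val_probs = clf.predict_proba(X_val)[:, 1]
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
print("Validation AUC:", roc_auc_score(y_val, val_probs))

joblib.dump(clf, "data_assets/logistic/logistic_model.pkl")
```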
**Model 2: Random Forest Classifier**

- Algorithm: Random Forest Classifier (scikit-learn)
- Hyperparameters: `n_estimators=100`, `random_state=42`
- Training Set: 40% of the engineered dataset
- Validation Accuracy: 73%
- Validation AUC: 0.797
- Inference Files: `final_rf_model.joblib`, `inference_rf.py`, `final_rf_model.tar.gz`
- Batch Input: `cardio_prod_no_label_rf.csv`
- Stored In: `data_assets/random_forest/`
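
The same splits can drive the Random Forest and reproduce the feature-importance ranking reported in the insights above; this sketch reuses `X_train`/`y_train` from the baseline example:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters as documented above; X_train/y_train from the baseline sketch
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(3))
# Expected leaders per the insights above: systolic_bp, chol_bmi_ratio, bmi
```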
**Model 3: Optimized Random Forest (Final)**

- Improved Hyperparameters: Tuned with `RandomizedSearchCV`
- Performance: Slight AUC improvement over baseline
- Reason for Selection: Better feature-importance explanations and flexible deployment
- Deployment: Used in the real-time monitoring endpoint (`cardio-logistic-monitor-endpoint`)
- Monitoring: Integrated with SageMaker Model Monitor and CloudWatch dashboards
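
A sketch of the tuning step; the search space and iteration count below are illustrative assumptions, since the README does not record the exact grid:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; adjust to the team's actual grid.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": [None, 10, 20, 30],
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",  # matches the AUC-based model comparison
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train/y_train from the baseline sketch
print("Best CV AUC:", search.best_score_)
print("Best params:", search.best_params_)
```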
The precision-recall curve illustrates the trade-off between sensitivity (recall) and the precision of our classifier at various thresholds. This is especially helpful in imbalanced medical datasets like cardiovascular prediction, where false positives and false negatives carry significant clinical weight.
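
The curve can be regenerated from the validation probabilities in the baseline sketch; the output filename under `image_output/` is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_val / val_probs come from the baseline training sketch above
precision, recall, _ = precision_recall_curve(y_val, val_probs)
ap = average_precision_score(y_val, val_probs)

plt.plot(recall, precision, label=f"AP = {ap:.3f}")
plt.xlabel("Recall (sensitivity)")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.savefig("image_output/precision_recall_curve.png")  # assumed filename
```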
Our SageMaker deployment includes real-time monitoring using Amazon CloudWatch. The dashboard tracks CPU and memory utilization, disk activity, and invocation error rates. Below is a 3-hour snapshot of model monitoring activity.
```python
import boto3

# Initialize the SageMaker runtime client (change the region as needed)
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Replace with your deployed endpoint name
endpoint_name = "cardio-logistic-monitor-endpoint"

# Example payload: a comma-separated string of 23 feature values
payload = "50,2,5.51,136.69,110,80,1,1,0,0,1,21.98,50s,Normal,30,4.55,66.12,50,0,stage1,normal,50,-1"

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

prediction = response["Body"].read().decode("utf-8")
print("Prediction:", prediction.strip())
```
The diagram below illustrates the full architecture of the Aorta Guard: Machine Learning-Based Cardiovascular Risk Prediction System, built on AWS SageMaker. It showcases how raw data stored in Amazon S3 flows through Athena queries, preprocessing in SageMaker notebooks, and into Feature Store for engineered feature versioning. The CI/CD pipeline automates model training, evaluation, and deployment, while batch inference jobs deliver risk predictions at scale. Post-deployment, the system leverages SageMaker Model Monitor and Amazon CloudWatch to ensure model quality, detect drift, and maintain infrastructure health. This modular, production-ready design supports data lineage, monitoring, and retraining workflows in a clinical decision support context.
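
A hedged sketch of the Feature Store step named in this architecture; the feature group name, record identifier, and bucket layout are assumptions (the real definitions live in `feature_store/cardio_engineered_feature_store_setup.ipynb`):

```python
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

df = pd.read_csv("data_assets/cardio_engineered.csv")
# Feature Store requires a record identifier and an event-time column.
df["record_id"] = df.index.astype("int64")
df["event_time"] = float(time.time())
# Cast object columns so their feature types resolve to String.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].astype("string")

fg = FeatureGroup(name="cardio-engineered-features",  # assumed name
                  sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri=f"s3://{session.default_bucket()}/cardio-feature-store",
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)
fg.ingest(data_frame=df, max_workers=3, wait=True)
```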
Watch our full project walkthrough below, showcasing the data pipeline, CI/CD integration, model training, deployment, and monitoring in action:
AAI-540 Group 4 - Aorta Guard
- Prema Mallikarjunan
- Outhai Xayavongsa (Team Lead)