This project demonstrates an end-to-end ELT pipeline, processing three datasets to answer the business question:
- "Find the patient(s) with the most generated minutes."
The solution uses Python for data extraction, cleaning, and loading; SQL for data manipulation and querying in AWS Athena.
- Python 3.8+.
- Amazon Web Services:
- An S3 bucket for storing processed datasets.
- Athena access configured with
AWS
CLI for querying.
- Steps are converted to minutes using the formula: minutes = steps * 0.002.
- A single patient can submit steps multiple times and complete multiple exercises.
- Multiple patients can have the same total minutes, so the output may include multiple rows.
- Airflow Setup: While the project includes an Airflow DAG for automation, the Airflow setup may encounter configuration or dependency issues and is currently not operational. As a fallback, you can manually run the Python scripts and Athena queries as described below.
- The query assumes datasets are consistent and S3 files are correctly formatted.
Install all required Python packages
pip install -r requirements.txt
Run ETL pipeline to clean, validate, and upload the datasets to S3
python main.py
Create external tables in Athena
python tests/test_table_creation.py
Validate Athena queries
python tests/test_athena_queries.py
Expected Results:
Top Patients Query: Outputs a ranked list of patients with the most generated minutes. Any issues or missing tables will be reported in the output logs.