Design and implement a data warehouse to manage automobile accident cases across all 49 states in the US, using a star schema and Snowflake for the data warehouse architecture.
- Data Source: This project uses data on Kaggle including 2 datasets: US Accidents (2016 - 2023) and Traffic Accidents and Vehicles
- US Accidents (2016 - 2023: This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023
- Traffic Accidents and Vehicles: every line in the file represents the involvement of a unique vehicle in a unique traffic accident, featuring various vehicle and passenger properties as columns
 
- Extract Data: Data is extractedfromcsvfile theningestedintoMinIOdata lake inbronzefolder usingPythonandAirflow
- Transform Data: Data is retrieved from MinIO'sbronzedirectory usingSparkandFastAPIto performtransformationandcleaning, then the output isloadedintoMinIO'ssilverdirectory.
- Load Data: Once the data has been cleaned, we load it into the Snowflakedatawarehouseat SchemaStagingusingPythonandAirflow.
- Warehouse: Data is loaded into stagingschema inSnowflake, Build and deploydata warehousewithStar Schemaarchitecture by creatingdimensionandfacttables, to do this we useDBTtotransformandcheck data.
- Serving: Analyze data to improve road safety, identify high-risk accident areas to implement preventative measures. Identify factors that contribute to accidents (weather, road conditions, human error). Then visualize and create reports with Power BI.
- Package and Orchestration: Components are packaged using Dockerand orchestrated usingApache Airflow.
- Apache Airflow
- Apache Spark
- Docker
- Dbt
- Snowflake
- MinIO
- FastApi
- Power BI

