A scalable ETL pipeline to collect, transform, and visualize weather data, showcasing data engineering skills for modern cloud-based analytics.
This project demonstrates a robust ETL pipeline that fetches weather data from two APIs (OpenWeatherMap and WeatherAPI), processes it using Apache Spark in Databricks, stores it in a Star Schema in MS SQL Server, orchestrates the workflow with Apache Airflow, and visualizes trends in Power BI. Built to showcase proficiency in Data Integration, Database Development, Data Warehouse Development, Cloud Concepts, and Engineering Excellence, this project is ideal for data engineering roles in the U.S. tech industry.
Keywords: Data Engineering, ETL Pipeline, Apache Airflow, Databricks, Spark, MS SQL Server, Power BI, Python, SQL, Delta Lake, Data Warehouse, Star Schema, Cloud Analytics.
- Extract: Fetches JSON data from OpenWeatherMap (temperature, humidity) and WeatherAPI (precipitation) hourly, storing raw data in Databricks DBFS (`dbfs:/Volumes/workspace/default/weather/raw/`); a minimal extraction sketch follows this list.
- Transform: Processes data in Databricks using PySpark, handling Spark part-files and NaN values, and saves the result as CSV in `dbfs:/Volumes/workspace/default/weather/transformed/`.
- Load: Populates a Star Schema in MS SQL Server (the `weather_dwh` database) for analytical queries.
- Orchestration: Uses Apache Airflow 2.8.0 to schedule and manage the ETL workflow with an hourly DAG (`weather_etl`).
- Visualization: A Power BI dashboard displays temperature and precipitation trends by city and date.
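To make the extract step concrete, here is a minimal sketch of how the two APIs could be combined, assuming a hypothetical `fetch_city()` helper, a placeholder city list, and that the script runs where the DBFS volume is mounted; the project's actual `extract_weather.py` may be organized differently.

```python
# Hypothetical sketch of the extract step (not the project's actual extract_weather.py).
# Assumes OPENWEATHER_API_KEY and WEATHERAPI_API_KEY are defined in the .env file described below,
# and that the script runs on Databricks where the volume is reachable at /Volumes/...;
# the real script might instead upload raw files through the Databricks CLI.
import json
import os
from datetime import datetime, timezone

import requests
from dotenv import load_dotenv

load_dotenv()
CITIES = ["New York", "Chicago", "Los Angeles"]  # assumed city list
RAW_DIR = "/Volumes/workspace/default/weather/raw"  # DBFS volume path from this README


def fetch_city(city: str) -> dict:
    """Merge temperature/humidity (OpenWeatherMap) with precipitation (WeatherAPI)."""
    owm = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "appid": os.environ["OPENWEATHER_API_KEY"], "units": "metric"},
        timeout=30,
    ).json()
    wapi = requests.get(
        "https://api.weatherapi.com/v1/current.json",
        params={"q": city, "key": os.environ["WEATHERAPI_API_KEY"]},
        timeout=30,
    ).json()
    return {
        "city": city,
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "temperature_c": owm.get("main", {}).get("temp"),
        "humidity_pct": owm.get("main", {}).get("humidity"),
        "precipitation_mm": wapi.get("current", {}).get("precip_mm"),
    }


if __name__ == "__main__":
    records = [fetch_city(c) for c in CITIES]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H")
    with open(f"{RAW_DIR}/weather_{stamp}.json", "w") as f:
        json.dump(records, f)
```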
- Python: Pandas, Requests, pymssql for data extraction and loading.
- SQL: Star Schema design, analytical queries in MS SQL Server.
- Apache Airflow 2.8.0: Workflow orchestration with Databricks integration.
- Databricks Community Edition: Apache Spark, Delta Lake for data transformation.
- MS SQL Server: Data Warehouse with Star Schema.
- Power BI: Visualization of weather trends.
- DBFS: Storage for raw and transformed data (simulating cloud storage).
- MS SQL Server (Express or Developer Edition) with two databases: `airflow_db` (Airflow metadata) and `weather_dwh` (Data Warehouse).
- Databricks Community Edition account.
- API keys for OpenWeatherMap and WeatherAPI.
- Python 3.8+.
- Install dependencies:
  `pip install apache-airflow==2.8.0 apache-airflow-providers-microsoft-mssql==3.9.2 apache-airflow-providers-databricks==6.7.0 pandas requests pymssql pytest databricks-cli python-dotenv connexion[swagger-ui]`
- Configure MS SQL Server (a Star Schema sketch for `weather_dwh` follows this setup list):
  - Create databases: `airflow_db` and `weather_dwh`.
  - Enable `READ_COMMITTED_SNAPSHOT`: `ALTER DATABASE airflow_db SET READ_COMMITTED_SNAPSHOT ON;`
- Configure Airflow:
  - Update `~/projects/airflow/pet/airflow.cfg`: `sql_alchemy_conn = mssql+pymssql://sa:<your_password>@localhost:1433/airflow_db`
  - Initialize the database: `airflow db init`.
  - Create an admin user: `airflow users create --username admin --password admin --firstname Admin --lastname Admin --role Admin --email kavoon.dev@gmail.com`.
  - Start Airflow: `airflow webserver -p 8080` and `airflow scheduler`.
- Configure the Databricks CLI and Airflow connection:
  - Set up a `databricks_default` connection in the Airflow UI (Admin > Connections).
- Create a `.env` file in the project root with:
  `OPENWEATHER_API_KEY=your_openweather_key`
  `WEATHERAPI_API_KEY=your_weatherapi_key`
- Install the Databricks CLI: `pip install databricks-cli`.
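The README does not spell out the exact table layout, so the following is only a sketch of one possible Star Schema for `weather_dwh`, created through pymssql (a driver the project already depends on); the table and column names are assumptions, not the project's actual DDL.

```python
# Hypothetical Star Schema sketch for weather_dwh (assumed table and column names).
import pymssql

DDL = """
IF OBJECT_ID('dim_city') IS NULL
    CREATE TABLE dim_city (
        city_id INT IDENTITY PRIMARY KEY,
        city_name NVARCHAR(100) NOT NULL UNIQUE
    );
IF OBJECT_ID('dim_date') IS NULL
    CREATE TABLE dim_date (
        date_id INT IDENTITY PRIMARY KEY,
        full_date DATE NOT NULL UNIQUE
    );
IF OBJECT_ID('fact_weather') IS NULL
    CREATE TABLE fact_weather (
        weather_id BIGINT IDENTITY PRIMARY KEY,
        city_id INT NOT NULL REFERENCES dim_city(city_id),
        date_id INT NOT NULL REFERENCES dim_date(date_id),
        temperature_c FLOAT,
        humidity_pct FLOAT,
        precipitation_mm FLOAT
    );
"""

# Connection details mirror the placeholder used in airflow.cfg above.
conn = pymssql.connect(server="localhost", user="sa",
                       password="<your_password>", database="weather_dwh")
cur = conn.cursor()
cur.execute(DDL)
conn.commit()
conn.close()
```

Power BI and the analytical queries mentioned above would then join `fact_weather` to the two dimension tables by `city_id` and `date_id`.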
- `extract_weather.py`: Fetches weather data from the APIs and saves it to DBFS.
- `transform_weather.py`: Processes data in Databricks, handles part-files and NaN values, and saves the result as CSV (a transformation sketch follows this list).
- `load_to_mssql.py`: Loads transformed data into the MS SQL Server Star Schema.
- `weather_dag.py`: Airflow DAG for hourly ETL orchestration.
- `tests/`: Unit tests (pytest) for the extract and transform steps.
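As referenced above, here is a minimal PySpark sketch of the kind of transformation `transform_weather.py` performs, assuming the raw JSON layout from the extraction sketch; the column names and the single-file CSV write are assumptions.

```python
# Hypothetical Databricks/PySpark transformation sketch (assumed column names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

RAW_PATH = "dbfs:/Volumes/workspace/default/weather/raw/"
OUT_PATH = "dbfs:/Volumes/workspace/default/weather/transformed/"

# Spark reads every JSON/part-file under the raw directory in one pass.
raw = spark.read.json(RAW_PATH)

clean = (
    raw
    # Replace NaN / missing numeric values so the downstream load does not break.
    .na.fill(0.0, subset=["temperature_c", "humidity_pct", "precipitation_mm"])
    .withColumn("observed_date", F.to_date("observed_at"))
    .dropDuplicates(["city", "observed_at"])
)

# coalesce(1) keeps the output as a single CSV part-file, which simplifies loading into MS SQL Server.
clean.coalesce(1).write.mode("overwrite").option("header", True).csv(OUT_PATH)
```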
- Run `extract_weather.py` to fetch and store raw data in DBFS.
- Execute the Databricks notebook (`/Users/your_username/transform_weather`) to transform the data.
- Run `load_to_mssql.py` to load the data into `weather_dwh`.
- Activate the `weather_etl` DAG in the Airflow UI (`localhost:8080`); a minimal DAG sketch follows this list.
- Connect Power BI to MS SQL Server (`weather_dwh`) and build a dashboard with temperature and precipitation trends.
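For orientation, the snippet below is only a rough sketch of what an hourly `weather_etl` DAG could look like with the Databricks provider; the cluster ID, notebook path, and imported callables (`run_extract`, `run_load`) are placeholders, and the project's actual `weather_dag.py` may be structured differently.

```python
# Hypothetical weather_etl DAG sketch (placeholder cluster ID, paths, and callables).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

from extract_weather import run_extract  # assumed entry points exposed by the project scripts
from load_to_mssql import run_load

with DAG(
    dag_id="weather_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_weather", python_callable=run_extract)

    transform = DatabricksSubmitRunOperator(
        task_id="transform_weather",
        databricks_conn_id="databricks_default",
        existing_cluster_id="<your_cluster_id>",  # placeholder
        notebook_task={"notebook_path": "/Users/your_username/transform_weather"},
    )

    load = PythonOperator(task_id="load_to_mssql", python_callable=run_load)

    extract >> transform >> load
```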
- SOLID Principles: Classes like `WeatherExtractor` and `WeatherTransformer` follow the Single Responsibility Principle.
- DRY: Reusable functions for data extraction and transformation.
- KISS: Simple SQL queries and modular Python code.
- Unit Tests: Pytest ensures reliability of the extract and transform steps (a test sketch follows this list).
- Data Quality: Handles Spark part-files and NaN values for robust processing.
- Documentation: Clear setup and usage instructions for reproducibility.
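As noted above, the project's own tests are not reproduced here; the snippet below is only a sketch of how the extract step could be unit-tested with pytest, mocking the HTTP calls so no API key is required. The names mirror the hypothetical extraction sketch earlier in this README.

```python
# Hypothetical pytest sketch for the extract step (mirrors the earlier extraction sketch).
from unittest.mock import patch

import extract_weather  # assumes fetch_city() lives in extract_weather.py


class FakeResponse:
    """Stand-in for requests.Response exposing only the .json() method we use."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def test_fetch_city_merges_both_apis(monkeypatch):
    monkeypatch.setenv("OPENWEATHER_API_KEY", "test")
    monkeypatch.setenv("WEATHERAPI_API_KEY", "test")

    owm = FakeResponse({"main": {"temp": 21.5, "humidity": 60}})
    wapi = FakeResponse({"current": {"precip_mm": 1.2}})

    # First call -> OpenWeatherMap, second call -> WeatherAPI.
    with patch("extract_weather.requests.get", side_effect=[owm, wapi]):
        record = extract_weather.fetch_city("Chicago")

    assert record["city"] == "Chicago"
    assert record["temperature_c"] == 21.5
    assert record["humidity_pct"] == 60
    assert record["precipitation_mm"] == 1.2
```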
This project showcases skills critical for data engineering roles:
- Data Integration: Seamlessly combines data from multiple APIs using Python and Airflow.
- Database Development: Implements a Star Schema in MS SQL Server for analytical queries.
- Data Warehouse Development: Leverages Delta Lake and Star Schema for scalable analytics.
- Cloud Concepts: Uses Databricks and DBFS, simulating Azure Blob Storage workflows.
- Engineering Excellence: Modular, tested, and well-documented code adhering to SOLID, DRY, and KISS principles.
- Certifications: Built with knowledge from the Astronomer Airflow Fundamentals Certification.
Contact: kavoon.dev@gmail.com | Website
MIT License