
Veneel77/Nasa-apod-etl-pipeline

An end-to-end Apache Airflow pipeline that extracts NASA’s Astronomy Picture of the Day (APOD) using its API, transforms the data, and loads it into a PostgreSQL database. Built using Astro CLI and Docker.


Project Overview: Airflow ETL Pipeline with Postgres and API Integration

This project involves creating an ETL (Extract, Transform, Load) pipeline using Apache Airflow. The pipeline extracts data from an external API (in this case, NASA’s Astronomy Picture of the Day (APOD) API), transforms the data, and loads it into a Postgres database. The entire workflow is orchestrated by Airflow, a platform for scheduling, monitoring, and managing workflows.

The project leverages Docker to run Airflow and Postgres as services, ensuring an isolated and reproducible environment. We also use Airflow hooks and operators to handle the ETL process efficiently.

Key Components of the Project

Airflow for Orchestration:

Airflow is used to define, schedule, and monitor the entire ETL pipeline. It manages task dependencies, ensuring that the process runs sequentially and reliably. The Airflow DAG (Directed Acyclic Graph) defines the workflow, which includes tasks for data extraction, transformation, and loading.
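For illustration, here is a minimal DAG skeleton showing how tasks are ordered sequentially (Airflow 2.x style; the DAG id, task ids, and schedule are placeholders, not the repo’s actual code):

```python
# Minimal sketch of a DAG declaring three sequential tasks.
# DAG id, task ids, and schedule are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="etl_skeleton",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # The >> operator declares dependencies: extract, then transform, then load.
    extract >> transform >> load
```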

Postgres Database:

A PostgreSQL database stores the extracted and transformed data. Postgres runs in a Docker container, which makes it easy to manage, and Docker volumes ensure data persistence. We interact with Postgres using Airflow’s PostgresHook and PostgresOperator.
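As a sketch of how the hook is typically used inside a task (the connection id my_postgres_connection is a placeholder registered in the Airflow UI, not necessarily the repo’s actual name):

```python
# Sketch: talking to Postgres from Airflow via PostgresHook.
# The connection id is a placeholder configured in the Airflow UI or env vars.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="my_postgres_connection")

# Run DDL/DML statements directly...
hook.run("CREATE TABLE IF NOT EXISTS apod_data (title TEXT, date DATE);")

# ...or fetch query results back into Python.
rows = hook.get_records("SELECT title, date FROM apod_data ORDER BY date DESC;")
```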

NASA API (Astronomy Picture of the Day):

The external API used in this project is NASA’s APOD API, which provides data about the astronomy picture of the day, including metadata such as the title, explanation, and the URL of the image. We use Airflow’s SimpleHttpOperator to extract data from the API.
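For reference, the APOD endpoint returns a flat JSON object; an abridged example of its shape, shown as a Python dict with illustrative values:

```python
# Abridged shape of an APOD response; values are illustrative, not real output.
apod_response = {
    "date": "2024-01-01",
    "title": "An Example Astronomy Picture",
    "explanation": "A short description of the image...",
    "url": "https://apod.nasa.gov/apod/image/example.jpg",
    "media_type": "image",
}
```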

Objectives of the Project

Extract Data:

The pipeline extracts astronomy-related data from NASA’s APOD API on a scheduled basis (daily, in this case).

Transform Data:

Transformations such as filtering or processing the API response are performed to ensure that the data is in a suitable format before being inserted into the database.

Load Data into Postgres:

The transformed data is loaded into a Postgres database, where it can be used for further analysis, reporting, or visualization.

Architecture and Workflow

The ETL pipeline is orchestrated in Airflow as a DAG (Directed Acyclic Graph) and consists of the following stages (a condensed sketch of the full DAG follows the list):

  1. Extract (E): The SimpleHttpOperator is used to make HTTP GET requests to NASA’s APOD API. The response is in JSON format, containing fields like the title of the picture, the explanation, and the URL to the image.
  2. Transform (T): The extracted JSON data is processed in the transform task using Airflow’s TaskFlow API (the @task decorator). This stage extracts the relevant fields (title, explanation, url, and date) and ensures they are in the correct format for the database.
  3. Load (L): The transformed data is loaded into a Postgres table using PostgresHook. If the target table doesn’t exist in the Postgres database, it is created automatically by a create-table task at the start of the DAG.
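Putting the three stages together, a condensed sketch of such a DAG (the connection ids nasa_api and my_postgres_connection, the table name apod_data, and the schedule are illustrative assumptions, not necessarily the repo’s actual values):

```python
# Condensed sketch of the full ETL DAG. Connection ids, table name, and
# schedule are illustrative assumptions, not the repo's actual configuration.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def nasa_apod_etl():
    @task
    def create_table():
        # Create the target table on the first run; a no-op afterwards.
        PostgresHook(postgres_conn_id="my_postgres_connection").run(
            """
            CREATE TABLE IF NOT EXISTS apod_data (
                title TEXT,
                explanation TEXT,
                url TEXT,
                date DATE
            );
            """
        )

    # Extract: GET /planetary/apod against an HTTP connection whose base URL
    # is https://api.nasa.gov; the parsed JSON body is pushed to XCom.
    extract = SimpleHttpOperator(
        task_id="extract_apod",
        http_conn_id="nasa_api",
        endpoint="planetary/apod",
        method="GET",
        data={"api_key": "DEMO_KEY"},  # placeholder; supply a real key
        response_filter=lambda response: response.json(),
    )

    @task
    def transform(apod: dict) -> dict:
        # Keep only the fields the target table expects.
        return {k: apod.get(k, "") for k in ("title", "explanation", "url", "date")}

    @task
    def load(record: dict) -> None:
        PostgresHook(postgres_conn_id="my_postgres_connection").run(
            "INSERT INTO apod_data (title, explanation, url, date) "
            "VALUES (%(title)s, %(explanation)s, %(url)s, %(date)s);",
            parameters=record,
        )

    create_table() >> extract
    load(transform(extract.output))


nasa_apod_etl()
```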
