This project implements a modern data transformation pipeline for airline industry analytics, designed to process and model customer review data from 500+ airlines across Skytrax Airline Quality. With over 100,000 reviews, it leverages dbt, Snowflake, and Apache Airflow, orchestrated via Astronomer, to create a scalable, production-ready workflow for comprehensive airline performance analysis.
.
├── data/ # Raw and processed data files
├── data_model/ # Data model diagrams and definitions
│ ├── schema.jpeg # Visual data model
│ └── schema.txt # Text-based schema description
├── dbt-dags/ # dbt models and Airflow DAGs
│ ├── dags/ # DAG definitions
│ ├── tests/ # Data quality tests
│ └── .astro/ # Astronomer CLI configuration
├── notebooks/ # Jupyter notebooks for Snowflake analysis
│ ├── snowflake_connection.ipynb
│ └── snowflake_connection.py
└── requirements.txt # Python package requirements

- Data Source: Skytrax Airline Reviews - 100k+ reviews from 500+ airlines
- Programming Language: Python 3.12.5
- Data Warehouse: Snowflake
- Transformation Tool: dbt
- Orchestration: Apache Airflow powered by Astronomer
Customer reviews are scraped from Skytrax AirlineQuality.com, capturing 100,000+ reviews from 500+ airlines worldwide, including structured and unstructured data elements such as:
- Flight Route (e.g., Singapore to Sydney)
- Aircraft Type (e.g., Boeing 777)
- Seat Type (e.g., Business Class)
- Type of Traveller (e.g., Solo Leisure)
- Date Flown (e.g., March 2025)
- Airline Information (Carrier name, brand identity)
- Star Ratings (Seat Comfort, Cabin Staff, Food & Beverages, Entertainment, Ground Service, Value for Money)
- Review Text and Submission Date
- Verification Flag (Trip Verified)
- Reviewer Info (Name, Country, Number of Reviews)
dim_customer
: Identity, loyalty, and flight historydim_aircraft
: Manufacturer, model, and seating layoutdim_location
: Airports, cities, and time zonesdim_date
: Calendar and fiscal date tracking
fct_review
: One row per customer review per flight across all airlines- Includes metrics (ratings), booleans (verified, recommended), categorical fields (seat type, travel type), and airline identifier
- Source Layer: Web scraping + staging
- Transformation Layer: dbt modeling + business logic
- Orchestration Layer: DAG scheduling and task dependency management via Astronomer
- Presentation Layer: Clean fact/dim tables for BI/reporting and cross-airline benchmarking
- Null checks
- Foreign key validation
- Freshness and completeness monitoring
- dbt tests for schema integrity and logic rules
Located in data_model/
:
schema.jpeg
: visual schema overviewschema.txt
: detailed textual schema
Located in dbt-dags/
:
- dbt model definitions and tests
- Airflow DAGs orchestrated via Astro CLI
- Modular structure for local development and deployment to Astronomer Cloud or Docker environments
Notebook resources in notebooks/
for:
- Establishing a connection to Snowflake
- Running exploratory queries
- Testing pipeline output
dbt-snowflake==1.9.2
pandas==2.2.3
snowflake-sqlalchemy==1.7.3
- Managed via
requirements.txt
This project uses GitHub Actions for automated data transformation pipeline:
- Push: Triggers on pushes to
main
branch - Pull Request: Runs on PRs to
main
branch - Schedule: Executes daily at 00:00 UTC
- Manual: Can be triggered via workflow_dispatch
-
Environment Setup
- Python 3.12 setup
- Dependencies installation
- dbt package installation
-
Data Transformation
- Runs dbt build process
- Uses Snowflake credentials from secrets
-
Notifications
- Sends email notifications on completion
- Includes run time, trigger info, and status
Let me know if you'd like a diagram for the Airflow DAG flow or a README.md
version with clickable section links and badges.
Business processes represent real-world events that generate measurable data. Across the global airline industry, the core business process is the collection of customer flight reviews. Each review submitted by a customer reflects their experience on a specific flight with any of the 500+ airlines and becomes a fact event. These reviews include detailed ratings on various service categories, forming the backbone of our comprehensive airline analytics platform.
The grain defines the level of detail stored in the fact table. For this model, the grain is defined as:
One customer review per flight.
Each row in the fct_review
table represents a unique review event containing metrics tied to a specific customer's flight experience across any airline in our dataset. This atomic grain ensures consistency and supports granular performance analysis across multiple service touchpoints and enables cross-airline benchmarking and industry-wide insights.
Dimension tables provide the who, what, where, when context for interpreting facts.
- Who:
dim_customer
— describes the reviewer throughcustomer_name
,nationality
, andnumber_of_flights
. - What:
dim_aircraft
— provides context on the aircraft viaaircraft_model
,aircraft_manufacturer
, andseat_capacity
. - Where:
dim_location
— captures the origin, destination, and transit points, using a combination ofcity
andairport_name
. - When:
dim_date
— captures bothdate_flown
anddate_submitted
and supports calendar and financial date logic (cal_year
,fin_quarter
, etc.).
Facts are the quantitative outputs from the review process, collected per flight review:
-
Ratings:
seat_comfort
,cabin_staff_service
,food_and_beverages
,inflight_entertainment
,ground_service
,wifi_and_connectivity
,value_for_money
-
Booleans:
verified
,recommended
-
Categorical Descriptions:
seat_type
,type_of_traveller
,airline
,review_text
These facts represent real customer input across 500+ airlines and form the foundation for performance dashboards, KPIs, customer sentiment insights, and comprehensive airline industry benchmarking.
This model follows a classic star schema structure where the fct_review
table sits at the center and joins to dimension tables via foreign keys:
Foreign Key in fct_review |
Dimension Table | Description |
---|---|---|
customer_id |
dim_customer |
Links each review to a specific customer |
date_submitted_id / date_flown_id |
dim_date |
Supports dual-date tracking (when submitted vs when flown) |
origin_location_id , destination_location_id , transit_location_id |
dim_location |
Connects review to flight locations |
aircraft_id |
dim_aircraft |
Captures aircraft-related context |
This schema supports efficient slicing, filtering, and aggregating reviews by date, location, customer, aircraft, and airline, enabling detailed insights across the global airline industry and comprehensive cross-carrier performance analysis.