Skip to content

A comprehensive ETL pipeline for analyzing passenger satisfaction data. Features a modern data architecture with Apache Airflow for extraction, dbt/Snowflake for transformation, Python/Pandas for cleaning, and interactive dashboards for visualization with NextJS.

License

Notifications You must be signed in to change notification settings

MarkPhamm/skytrax-reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 

Repository files navigation

Skytrax Global Airlines Analytics Project

image

Access our Dashboard: Global Airlines Dashboard


Repositories

Repository Owner Purpose
skytrax_data_cleaning DucLe‑2005 Cleans raw scraped data and standardizes formats using modular Python functions.
skytrax_extract_load MarkPhamm Scrapes customer reviews for all airlines on Skytrax, stages them in S3, then loads to Snowflake via Airflow‑compatible ETL scripts.
skytrax_transformation MarkPhamm Handles dbt‑based data transformation on Snowflake with CI/CD workflows via GitHub Actions.
skytrax_dashboard_website nguyentienTCU A dashboard website for visualising insights from processed airline reviews.

Team Structure

Project Leadership

Engineering Teams


1. Project Overview

This end‑to‑end analytics initiative ingests, processes, and visualises customer‑review data for every airline covered by Skytrax (AirlineQuality.com). The architecture leverages industry‑standard tooling and cloud services to provide a robust, scalable foundation for airline‑wide sentiment, operational, and competitive analysis.

Self‑selection bias: Reviews on Skytrax are self‑reported. Passengers with extreme experiences (positive or negative) are more likely to post, so KPIs derived from this data skew away from the broader flying population. Our goal is therefore directional insight, not population‑level generalisation.


2. Architecture Overview

BritishAirways

2.1 Extraction Layer

Overview

The extraction layer gathers review data for all airlines from Skytrax, stores it in S3 and prepares it for downstream processing.

2.1.1 Technology Stack

  • Python 3.12 with Pandas
  • Apache Airflow
  • AWS S3
  • Docker
  • Snowflake

2.1.2 Data Source

Skytrax review pages, e.g. https://www.airlinequality.com/airline-reviews/{airline‑slug}/

Captured fields include: star ratings, review text, flight details, passenger metadata, and category scores.

2.1.3 Extraction Process

# From main_dag.py – Extract task definition
scrape_skytrax_data = BashOperator(
    task_id="scrape_skytrax_data",
    bash_command="chmod -R 777 /opt/***/data && python /opt/airflow/tasks/scraper_extract/scraper.py"
)

Steps

  1. Iterate through the Skytrax airline index.
  2. Request paginated review HTML for each carrier.
  3. Parse and normalise each review record.
  4. Persist results to raw_data.csv.

2.1.4 Data Cleaning & Initial Transformation

clean_data = BashOperator(
    task_id="clean_data",
    bash_command="python /opt/airflow/tasks/transform/transform.py"
)

Cleaning tasks standardise date formats, handle nulls, and enforce data‑type consistency before staging to S3.

2.1.5 AWS S3 Integration

upload_cleaned_data_to_s3 = BashOperator(
    task_id="upload_cleaned_data_to_s3",
    bash_command="chmod -R 777 /opt/airflow/data && python /opt/airflow/tasks/upload_to_s3.py"
)
  • Secure IAM roles
  • Server‑side encryption
  • Versioning enabled

2.1.6 Workflow Orchestration

with DAG(
    dag_id="skytrax_pipeline",
    schedule_interval="@daily",
    default_args=default_args,
    start_date=start_date,
    catchup=True,
    max_active_runs=1,
):
    scrape_skytrax_data >> note >> clean_data >> note_clean_data >> upload_cleaned_data_to_s3

2.1.7 Snowflake Integration

snowflake_copy_operator = BashOperator(
    task_id="snowflake_copy_from_s3",
    bash_command="pip install snowflake-connector-python python-dotenv && python /opt/airflow/tasks/snowflake_load.py"
)

2.2 Data Cleaning Layer

Key steps mirror the British Airways version but operate across carriers:

  1. Column Standardisation – snake_case, special‑character cleanup.
  2. Date Formatting – ISO 8601 for both submission and flight dates.
  3. Text Cleaning – verification flag extraction; nationality normalisation.
  4. Route Parsing – origin, destination, and connections.
  5. Aircraft Standardisation – unified Airbus/Boeing nomenclature.
  6. Rating Conversion – numeric Int64 fields for uniform analysis.

Outputs feed directly to Snowflake for transformation.


2.3 Transformation Layer

2.3.1 Data Model

A star schema identical in design to the airline‑specific version:

Table Purpose
fct_review One row per review per flight with quantitative metrics
dim_customer Passenger information
dim_aircraft Aircraft attributes
dim_location Airport / city keys for origin, destination, transit
dim_date Calendar table for submission & flight dates

Incremental dbt jobs maintain freshness while minimising warehouse spend.

2.3.2 Data Quality Framework

  • Schema & relationship tests
  • Custom business‑logic assertions (e.g. rating within 0–10)
  • Freshness & completeness checks

CI/CD triggers on code pushes, PRs, weekly schedules, and manual invocations.


2.4 Visualisation Layer

2.4.1 Dashboard Highlights

  • Interactive KPI Cards: Overall satisfaction, NPS‑like scores, category averages.
  • Multi‑Dimensional Filters: Airline, aircraft, route, cabin class, traveller type.
  • Data Explorer: Drag‑and‑drop or SQL‑like querying for power users.
  • RAG Chatbot: Natural‑language Q&A across the full corpus of reviews.

3. Key Business Insights (Illustrative)

3.1 Economy‑Class Passenger Trends

Problem 1 Details Findings

  • Across airlines, ground‑staff service and boarding efficiency dominate complaints.
  • Major international hubs (e.g. LHR, CDG, JFK) see the highest negative volume—often tied to long security queues and staff shortages.
  • 92 % of low‑rating Economy reviews cite at‑airport factors rather than in‑flight experience.

Problem 1 Recommendations

  • Collaborate with ground‑handling partners to boost staffing during peak waves.
  • Deploy self‑service kiosks and real‑time queue monitoring.

3.2 Premium‑Cabin Passenger Expectations

Findings

  • Business & First passengers focus on seat comfort, bedding quality, and connectivity speed.
  • Consistency gaps between aircraft sub‑fleets (older cabins vs. refurbished) drive dissatisfaction.
  • Food quality is the second‑largest driver of 4‑star‑and‑below ratings.

Recommendations

  • Accelerate fleet‑wide seat upgrade programmes.
  • Introduce chef‑curated rotating menus with regional options.
  • Guarantee minimum bandwidth per passenger on Wi‑Fi plans.

4. Next Steps

  1. Expand Data Sources – Integrate on‑time‑performance and DOT complaint data for richer modelling.
  2. Real‑Time Ingestion – Move to CDC‑style pipelines to surface insights within hours of review publication.
  3. Predictive Modelling – Use sentiment plus operational variables to forecast future NPS movement by airline and route.
  4. Monetisation – Offer benchmarking dashboards to airlines and airports via subscription.

© 2025 Skytrax Global Airlines Analytics Project

2. Architecture Overview

About

A comprehensive ETL pipeline for analyzing passenger satisfaction data. Features a modern data architecture with Apache Airflow for extraction, dbt/Snowflake for transformation, Python/Pandas for cleaning, and interactive dashboards for visualization with NextJS.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •