
Access our Dashboard: Global Airlines Dashboard
Repository | Owner | Purpose |
---|---|---|
skytrax_data_cleaning | DucLe‑2005 | Cleans raw scraped data and standardizes formats using modular Python functions. |
skytrax_extract_load | MarkPhamm | Scrapes customer reviews for all airlines on Skytrax, stages them in S3, then loads to Snowflake via Airflow‑compatible ETL scripts. |
skytrax_transformation | MarkPhamm | Handles dbt‑based data transformation on Snowflake with CI/CD workflows via GitHub Actions. |
skytrax_dashboard_website | nguyentienTCU | A dashboard website for visualising insights from processed airline reviews. |
- Data Engineering: Leonard Dau, Thieu Nguyen, Viet Lam Nguyen
- Software Engineering: Tien Nguyen, Anh Duc Le
- Data Science: Robin Tran, Trung Dam
- Scrum Master: Hien Dinh
This end‑to‑end analytics initiative ingests, processes, and visualises customer‑review data for every airline covered by Skytrax (AirlineQuality.com). The architecture leverages industry‑standard tooling and cloud services to provide a robust, scalable foundation for airline‑wide sentiment, operational, and competitive analysis.
Self‑selection bias: Reviews on Skytrax are self‑reported. Passengers with extreme experiences (positive or negative) are more likely to post, so KPIs derived from this data skew away from the broader flying population. Our goal is therefore directional insight, not population‑level generalisation.
The extraction layer gathers review data for all airlines from Skytrax, stores it in S3 and prepares it for downstream processing.
- Repository: all_airlines_extract_load
- Python 3.12 with Pandas
- Apache Airflow
- AWS S3
- Docker
- Snowflake
Skytrax review pages, e.g.
https://www.airlinequality.com/airline-reviews/{airline‑slug}/
Captured fields include: star ratings, review text, flight details, passenger metadata, and category scores.
# From main_dag.py – Extract task definition
scrape_skytrax_data = BashOperator(
task_id="scrape_skytrax_data",
bash_command="chmod -R 777 /opt/***/data && python /opt/airflow/tasks/scraper_extract/scraper.py"
)
Steps
- Iterate through the Skytrax airline index.
- Request paginated review HTML for each carrier.
- Parse and normalise each review record.
- Persist results to
raw_data.csv
.
clean_data = BashOperator(
task_id="clean_data",
bash_command="python /opt/airflow/tasks/transform/transform.py"
)
Cleaning tasks standardise date formats, handle nulls, and enforce data‑type consistency before staging to S3.
upload_cleaned_data_to_s3 = BashOperator(
task_id="upload_cleaned_data_to_s3",
bash_command="chmod -R 777 /opt/airflow/data && python /opt/airflow/tasks/upload_to_s3.py"
)
- Secure IAM roles
- Server‑side encryption
- Versioning enabled
with DAG(
dag_id="skytrax_pipeline",
schedule_interval="@daily",
default_args=default_args,
start_date=start_date,
catchup=True,
max_active_runs=1,
):
scrape_skytrax_data >> note >> clean_data >> note_clean_data >> upload_cleaned_data_to_s3
snowflake_copy_operator = BashOperator(
task_id="snowflake_copy_from_s3",
bash_command="pip install snowflake-connector-python python-dotenv && python /opt/airflow/tasks/snowflake_load.py"
)
- Repository: all_airlines_data_cleaning
- Stack: Python 3.12.5, Pandas, NumPy, Matplotlib, Seaborn
Key steps mirror the British Airways version but operate across carriers:
- Column Standardisation – snake_case, special‑character cleanup.
- Date Formatting – ISO 8601 for both submission and flight dates.
- Text Cleaning – verification flag extraction; nationality normalisation.
- Route Parsing – origin, destination, and connections.
- Aircraft Standardisation – unified Airbus/Boeing nomenclature.
- Rating Conversion – numeric Int64 fields for uniform analysis.
Outputs feed directly to Snowflake for transformation.
- Repository: all_airlines_transformation
- Stack: dbt (Core), Snowflake, Airflow (Astronomer), GitHub Actions
A star schema identical in design to the airline‑specific version:
Table | Purpose |
---|---|
fct_review | One row per review per flight with quantitative metrics |
dim_customer | Passenger information |
dim_aircraft | Aircraft attributes |
dim_location | Airport / city keys for origin, destination, transit |
dim_date | Calendar table for submission & flight dates |
Incremental dbt jobs maintain freshness while minimising warehouse spend.
- Schema & relationship tests
- Custom business‑logic assertions (e.g. rating within 0–10)
- Freshness & completeness checks
CI/CD triggers on code pushes, PRs, weekly schedules, and manual invocations.
- Repository: all_airlines_dashboard_website
- Live Site: Global Airlines Analytics Dashboard
- Stack: Next.js, TailwindCSS, Chart.js, LangChain, ChromaDB
- Interactive KPI Cards: Overall satisfaction, NPS‑like scores, category averages.
- Multi‑Dimensional Filters: Airline, aircraft, route, cabin class, traveller type.
- Data Explorer: Drag‑and‑drop or SQL‑like querying for power users.
- RAG Chatbot: Natural‑language Q&A across the full corpus of reviews.
- Across airlines, ground‑staff service and boarding efficiency dominate complaints.
- Major international hubs (e.g. LHR, CDG, JFK) see the highest negative volume—often tied to long security queues and staff shortages.
- 92 % of low‑rating Economy reviews cite at‑airport factors rather than in‑flight experience.
- Collaborate with ground‑handling partners to boost staffing during peak waves.
- Deploy self‑service kiosks and real‑time queue monitoring.
Findings
- Business & First passengers focus on seat comfort, bedding quality, and connectivity speed.
- Consistency gaps between aircraft sub‑fleets (older cabins vs. refurbished) drive dissatisfaction.
- Food quality is the second‑largest driver of 4‑star‑and‑below ratings.
Recommendations
- Accelerate fleet‑wide seat upgrade programmes.
- Introduce chef‑curated rotating menus with regional options.
- Guarantee minimum bandwidth per passenger on Wi‑Fi plans.
- Expand Data Sources – Integrate on‑time‑performance and DOT complaint data for richer modelling.
- Real‑Time Ingestion – Move to CDC‑style pipelines to surface insights within hours of review publication.
- Predictive Modelling – Use sentiment plus operational variables to forecast future NPS movement by airline and route.
- Monetisation – Offer benchmarking dashboards to airlines and airports via subscription.
© 2025 Skytrax Global Airlines Analytics Project