This repository is a template for the final project of the Big Data course. It provides a structured environment to build an end-to-end pipeline for analyzing US domestic flight data, performing exploratory data analysis (EDA), and building predictive models (e.g., for flight cancellations or delays).
- Python 3.x
- Access to Innopolis University Hadoop Cluster: Credentials for PostgreSQL, Hive, and HDFS.
- Secrets: Create a
secrets/
directory in the project root and add:secrets/.psql.pass
: Containing your PostgreSQL password on a single line.secrets/.hive.pass
: Containing your Hive password on a single line.
- Clone the repository:
git clone https://github.com/Nazgulitos/BigData-project.git cd BigData-project
- From the project root directory, execute:
ssh team15@hadoop-01.uni.innopolis.ru -p 22 cd BigData-project python3 -m venv venv source ./venv/bin/activate pip install -r requirements.txt bash main.sh
- All results and artifacts will be stored in the
output/
directory.
data/
: Contains the dataset files (raw and processed).models/
: Contains the trained Spark ML models.notebooks/
: Jupyter/Zeppelin notebooks for experimentation and learning (not used in the final pipeline).output/
: Stores results like CSVs, text files, images from the pipeline.scripts/
: Contains all.sh
and.py
scripts for pipeline stages.sql/
: Contains all.sql
(PostgreSQL) and.hql
(HiveQL) files.requirements.txt
: Python package dependencies.main.sh
: The main script to execute the entire pipeline.
The pipeline performs the following key stages:
- Data Collection: Downloads US flight data from Kaggle and performs initial Python-based preprocessing.
- Relational DB Storage: Builds a PostgreSQL database and ingests the preprocessed data.
- HDFS Ingestion: Uses Sqoop to import data from PostgreSQL into HDFS as AVRO files with Snappy compression.
- Data Warehousing & EDA: Sets up Hive external tables on the HDFS data, performs optimizations, and runs HQL queries for EDA.
- Predictive Data Analytics: Uses Spark ML to train, evaluate, and compare models (e.g., Logistic Regression, Random Forest) for predicting flight outcomes.
We didn't remove it, just in case.
main.sh
is Immutable: You cannot change the content ofmain.sh
. The grader will run this script as-is to assess your project. Ensure your individual scripts are correctly called and function as expected withinmain.sh
.- Notebooks are for Learning Only: The
notebooks/
folder is for your exploration. Your final pipeline logic must be in.py
scripts within thescripts/
folder. Thenotebooks/
folder might be deleted during grading to ensure your pipeline doesn't depend on its content. - Idempotency: Ensure your scripts can be run multiple times without errors (e.g., by dropping tables/directories before creating them).
- Paths: All scripts should be runnable from the project's root directory.