This project showcases scalable data engineering workflows built using Hadoop, MapReduce, Hive, and Sqoop. Across multiple modules, the project demonstrates real-world applications of big data technologies, including log aggregation, airline delay analysis, TF-IDF computation, and secondary sorting.
- Practice Hadoop-based data processing using real-world datasets
- Apply Hive for SQL-style querying and aggregation at scale
- Leverage Sqoop to bridge SQL and Hadoop ecosystems
- Build multi-stage MapReduce workflows for advanced analytics (e.g., TF-IDF)
- Optimize MapReduce tasks for performance and sort customization
- Ran MapReduce jobs on Shakespeare texts
- Extracted top 10 frequent terms
- Cleaned text by removing punctuation and converting to lowercase
- Compared vocabulary richness of Shakespeare vs. Austen
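The word-count step above can be sketched in plain Python, outside Hadoop, to show the mapper/reducer logic. Function names and the cleaning regex here are illustrative, not taken from the project's scripts:

```python
import re
from collections import Counter

def clean_tokens(line):
    """Lowercase the line, strip punctuation, and split into words."""
    return re.findall(r"[a-z']+", line.lower())

def map_words(lines):
    """Streaming-style mapper: emit a (word, 1) pair per cleaned token."""
    for line in lines:
        for word in clean_tokens(line):
            yield word, 1

def reduce_counts(pairs):
    """Streaming-style reducer: sum the counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

def top_terms(lines, k=10):
    """Return the k most frequent cleaned words, as in top-10-words.txt."""
    return reduce_counts(map_words(lines)).most_common(k)
```

In the actual job, `map_words` and `reduce_counts` would live in separate mapper/reducer scripts reading stdin and writing stdout for Hadoop Streaming.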
- Parsed Hadoop logs to compute severity-level counts per minute (INFO, WARN, ERROR, FATAL)
- Output is one structured line per minute with a total count and a per-severity breakdown
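A minimal sketch of the per-minute severity aggregation, assuming a typical Hadoop log layout (`YYYY-MM-DD HH:MM:SS,ms LEVEL message`); the exact format of the project's logs and output lines may differ:

```python
import re
from collections import defaultdict

# Assumed log format (illustrative): "2017-03-15 10:21:34,120 INFO message..."
LOG_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\S* (INFO|WARN|ERROR|FATAL)\b"
)

def severity_by_minute(lines):
    """Count log entries per minute, broken down by severity level."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            minute, level = m.groups()
            counts[minute][level] += 1
    return counts

def format_summary(counts):
    """One structured output line per minute: total plus per-level breakdown."""
    out = []
    for minute in sorted(counts):
        levels = counts[minute]
        total = sum(levels.values())
        breakdown = " ".join(
            f"{lvl}={levels.get(lvl, 0)}" for lvl in ("INFO", "WARN", "ERROR", "FATAL")
        )
        out.append(f"{minute}\ttotal={total} {breakdown}")
    return out
```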
- Imported airline delay data into HDFS via Sqoop
- Computed min, max, and average flight delays by carrier
- Output sorted by average delay
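The carrier statistics can be illustrated in plain Python; the project computes these in a MapReduce job over the Sqoop-imported data, so this in-memory version only shows the aggregation and sort logic:

```python
from collections import defaultdict

def delay_stats(records):
    """Aggregate (carrier, delay_minutes) pairs into (min, max, avg) per
    carrier, returned sorted by average delay as in the module's output."""
    delays = defaultdict(list)
    for carrier, delay in records:
        delays[carrier].append(delay)
    summary = {c: (min(d), max(d), sum(d) / len(d)) for c, d in delays.items()}
    return sorted(summary.items(), key=lambda kv: kv[1][2])
```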
- Modified word count output to be sorted by frequency (descending)
- Implemented sort within MapReduce job using custom key manipulation
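One common way to get a descending sort out of Hadoop's ascending shuffle is to emit an inverted, zero-padded count as the key; the sketch below shows that idea (the exact key manipulation used in the project may differ):

```python
def invert_for_descending(word, count, width=10):
    """Emit a zero-padded complement of the count as the sort key, so the
    shuffle's ascending string sort yields counts in descending order.
    The 10**width ceiling is an illustrative upper bound on counts."""
    ceiling = 10 ** width
    return f"{ceiling - count:0{width}d}\t{word}"

def sort_by_frequency(counts):
    """Simulate the shuffle: ascending sort on inverted keys == descending counts."""
    keyed = [invert_for_descending(w, c) for w, c in counts]
    return [line.split("\t")[1] for line in sorted(keyed)]
```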
- Joined airline and delay tables using HiveQL
- Extracted airline name and computed average arrival delay
- Exported final table as CSV using Hive's external export
- Stepwise Hadoop Streaming pipeline:
- Term count per doc
- Total term count per doc
- Document frequency per term
- Final TF-IDF calculated using Hive and output as `(doc_id, term, tfidf × 1,000,000)`
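Putting the three pipeline stages together, the final score presumably follows the standard `tf × log(N/df)` formula scaled by 1,000,000. The project computes this last step in Hive; the Python sketch below shows the same combination under that assumed formula:

```python
import math

def tfidf_scores(term_counts, doc_totals, doc_freq, n_docs, scale=1_000_000):
    """Combine the three pipeline stages into scaled TF-IDF scores.
    term_counts: {(doc_id, term): count}   -- stage 1 output
    doc_totals:  {doc_id: total terms}     -- stage 2 output
    doc_freq:    {term: docs containing term} -- stage 3 output
    Formula assumed: (count / doc_total) * log(n_docs / df), scaled."""
    scores = {}
    for (doc, term), count in term_counts.items():
        tf = count / doc_totals[doc]
        idf = math.log(n_docs / doc_freq[term])
        scores[(doc, term)] = round(tf * idf * scale)
    return scores
```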
- Technologies: Hadoop, Hive, Sqoop, HDFS, HiveQL
- Programming: Python, Shell scripting, Streaming Mapper/Reducer
- Concepts: TF-IDF, MapReduce chaining, log analysis, airline performance tracking
- Data Sources: Shakespeare/Austen texts, Hadoop logs, Airline delay datasets
- `top-10-words.txt`: Ten most frequent cleaned words in the Shakespeare corpus
- `log-summary.txt`: Log entry count breakdown by minute and severity
- `airline-delay-summary.txt`: Carrier-wise delay statistics (min, max, avg)
- `worst-average-arrival-delay.txt`: Airlines ranked by average arrival delay
- `tfidf-output.txt`: Final TF-IDF scores per term and document
Each module has its own script and supporting folder structure:
- Run using the provided shell scripts (e.g., `process-log-file`, `terms-by-count`)
- Hadoop Streaming used with custom mapper and reducer Python scripts
- Hive scripts executed using `hive < script.hive`
- Sqoop used to selectively import SQL tables into HDFS
Ensure the Docker-based Hadoop environment is properly set up:

```shell
docker pull bigdata-hadoop:v7
docker run -it bigdata-hadoop:v7 /bin/bash
```