This project demonstrates the deployment and integration of Hadoop and its ecosystem tools (Apache Hive, HCatalog, and Apache Pig) for large-scale weather and air quality data analysis.
The dataset, sourced from the World Weather Repository (Kaggle), includes attributes such as temperature, humidity, visibility, UV index, and air quality indicators (CO, Ozone, NO2, PM2.5, PM10).
The goal of the project is to build a scalable big data pipeline for processing, cleaning, and analysing weather datasets and to visualise insights using Power BI for decision-making in climate-sensitive sectors.
- Understand Weather Patterns and Trends: uncover seasonal and geographical behaviours.
- Monitor and Assess Air Quality: evaluate the impact of pollutants on health and the environment.
- Visualise Climate Insights: create dashboards in Power BI to communicate findings clearly.
- HDFS (Hadoop Distributed File System): scalable data storage
- Apache Hive: SQL-like querying for structured data
- Apache HCatalog: metadata management
- Apache Pig: data transformation and cleaning
- Power BI: interactive data visualisation

Deployment of Hadoop Servers
- Configured Hadoop on Oracle VirtualBox with HDFS, YARN, and supporting tools.

Data Integration & Cleaning
- Imported Kaggle datasets into Hive via HCatalog
- Removed duplicates, null values, and outliers
- Standardised timestamps and categorised weather conditions
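
The cleaning steps listed above were carried out in Hive and Pig; the scripts are in the repository folders. As a rough illustration of the logic only, here is a minimal Python sketch. The field names (`location`, `last_updated`, `temperature_c`, `condition`), the temperature bounds, and the keyword-to-category mapping are all assumptions made for this example, not the project's actual schema or rules.

```python
from datetime import datetime

# Assumed plausible range for air temperature in Celsius; the project's
# actual outlier rule is not stated here.
TEMP_RANGE_C = (-90.0, 60.0)

# Assumed keyword-to-category mapping for the free-text condition field.
CONDITION_KEYWORDS = [
    ("rain", "Rainy"),
    ("drizzle", "Rainy"),
    ("snow", "Snowy"),
    ("cloud", "Cloudy"),
    ("overcast", "Cloudy"),
    ("sun", "Clear"),
    ("clear", "Clear"),
]


def categorise(condition):
    """Map a free-text weather condition to a coarse category."""
    text = condition.lower()
    for keyword, category in CONDITION_KEYWORDS:
        if keyword in text:
            return category
    return "Other"


def clean(rows):
    """Drop null, out-of-range, and duplicate rows; normalise the rest."""
    seen = set()
    cleaned = []
    low, high = TEMP_RANGE_C
    for row in rows:
        # Null handling: skip rows with any missing or empty field.
        if any(value is None or value == "" for value in row.values()):
            continue
        # Outlier handling: skip temperatures outside the assumed range.
        if not low <= row["temperature_c"] <= high:
            continue
        # Duplicate handling: keep the first row per (location, timestamp).
        key = (row["location"], row["last_updated"])
        if key in seen:
            continue
        seen.add(key)
        # Timestamp standardisation: "YYYY-MM-DD HH:MM" -> ISO 8601.
        timestamp = datetime.strptime(row["last_updated"], "%Y-%m-%d %H:%M")
        cleaned.append({
            "location": row["location"],
            "last_updated": timestamp.isoformat(),
            "temperature_c": row["temperature_c"],
            "condition_category": categorise(row["condition"]),
        })
    return cleaned


# Hypothetical sample rows covering each cleaning rule.
sample = [
    {"location": "Dublin", "last_updated": "2024-05-16 13:15", "temperature_c": 14.0, "condition": "Light rain"},
    {"location": "Dublin", "last_updated": "2024-05-16 13:15", "temperature_c": 14.0, "condition": "Light rain"},  # duplicate
    {"location": "Cairo", "last_updated": "2024-05-16 14:00", "temperature_c": None, "condition": "Sunny"},  # null
    {"location": "Lima", "last_updated": "2024-05-16 07:30", "temperature_c": 999.0, "condition": "Partly cloudy"},  # outlier
    {"location": "Oslo", "last_updated": "2024-05-16 13:45", "temperature_c": 9.5, "condition": "Overcast"},
]

cleaned_rows = clean(sample)
for row in cleaned_rows:
    print(row)
```

A fixed range check is used for outliers because it is easy to reason about; a statistical rule such as an interquartile-range filter would be a reasonable alternative on the full dataset.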

Descriptive Analytics
- Explored temperature, humidity, UV index, and pollutant levels
- Built analysis-ready tables (`weather_data_analysis`, `air_quality_data_analysis`)
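
To give a sense of the descriptive statistics involved, the following Python sketch computes a min/max/mean summary for one field and a per-month average for another. In the project these aggregations would run as Hive queries over the analysis tables; the column names (`last_updated`, `temperature_c`, `pm2_5`) and the sample readings here are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarise(rows, field):
    """Return min, max, and mean of one numeric field."""
    values = [row[field] for row in rows]
    return {"min": min(values), "max": max(values), "mean": round(mean(values), 2)}


def monthly_mean(rows, field):
    """Average a field per calendar month, keyed by the YYYY-MM prefix
    of an ISO 8601 timestamp string."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["last_updated"][:7]].append(row[field])
    return {month: round(mean(vals), 2) for month, vals in sorted(buckets.items())}


# Hypothetical cleaned readings; not taken from the actual dataset.
readings = [
    {"last_updated": "2024-05-16T13:15:00", "temperature_c": 14.0, "pm2_5": 8.0},
    {"last_updated": "2024-05-20T09:00:00", "temperature_c": 18.0, "pm2_5": 12.0},
    {"last_updated": "2024-06-02T15:30:00", "temperature_c": 25.0, "pm2_5": 20.0},
]

temperature_summary = summarise(readings, "temperature_c")
pm25_by_month = monthly_mean(readings, "pm2_5")

print(temperature_summary)
print(pm25_by_month)
```

Grouping on the `YYYY-MM` prefix relies on timestamps already being in ISO 8601 form, which is why timestamp standardisation comes first in the pipeline.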

Data Visualisation (Power BI)
- Weather trends across months
- Air quality impact across regions and conditions
- Visibility and UV index analysis
You can view the full set of screenshots in the repository folders:
- HCatalog: available in the `Apache HCatalog/` folder
- Hive Data Cleaning and Processing: available in the `Apache Hive/` folder
- Pig Scripts: available in the `Apache Pig/` folder
- Power BI Dashboard Screenshots: available in the `Dashboard/` folder

The interactive Power BI file (dashboard.pbix) is also included in the repository for direct exploration.