This project explores the relationship between educational attainment and nutritional quality using large-scale health and education data from the U.S. government. It uses big data tools (Hadoop, Hive, and Zeppelin) to process, query, and visualize insights from over 100,000 records across 13 years.
To set up and run this project locally, see the full step-by-step guide:
Setup Instructions (setup_instructions.md)
bigdata-education-nutrition/
├── data/
│   └── omid_dataset.csv
├── hive_queries/
│   └── queries.hql
├── slides/
│   └── presentation.pptx
├── screenshots/
│   └── *.png (Zeppelin charts)
├── README.md
└── setup_instructions.md
- Source: Data.gov - Nutrition, Physical Activity, and Obesity
- Years Covered: 2011–2023
- Records: 104,273
- Features:
  - Educational Level
  - Fruit & Vegetable Intake
  - Obesity Status
  - Physical Activity
  - U.S. States & Regions
| Tool | Purpose |
|---|---|
| HDFS | Distributed data storage |
| Hive | SQL-like query processing |
| Zeppelin | Data visualization and dashboard creation |
| CSV | Tabular data format used for ingestion |
The project follows a four-stage big data pipeline:
- Upload the .csv file to HDFS
- Create a Hive table that mirrors the dataset schema
- Run HiveQL queries for analysis
- Visualize the results using Apache Zeppelin
Dataset
└── CSV file
↓
Collection & Storage
└── Apache HDFS
↓
Analysing & Querying
└── Apache Hive
↓
Visualisation
└── Apache Zeppelin
Each stage builds on the previous one to ensure scalable storage, analysis, and reporting of large-scale health and education data.
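Once the Hive table exists, a few sanity checks can confirm that the load matches the dataset summary. This is a minimal sketch and not part of the original queries.hql. It assumes the table is named omid_dataset and exposes the Class and YearStart columns that the analysis queries rely on.

```sql
-- Row count: the dataset summary reports 104,273 records.
-- A result that is off by one may mean the CSV header row was loaded as data.
SELECT COUNT(*) AS row_count
FROM omid_dataset;

-- Distinct Class values, to confirm the exact filter strings
-- used in the WHERE clauses of the analysis queries.
SELECT Class, COUNT(*) AS n
FROM omid_dataset
GROUP BY Class;

-- Year range: expected to span 2011 to 2023.
SELECT MIN(YearStart) AS first_year, MAX(YearStart) AS last_year
FROM omid_dataset;
```

If the Class values differ from the strings in the queries (for example in spacing or capitalization), the filters would silently return no rows, so this check is worth running before the analysis.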
SELECT Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_physical_activity
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY Education;
Insight: College graduates show the highest levels of physical activity (~31.23%).
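All of the analysis queries use the same sample-size-weighted mean, SUM(Data_Value * Sample_Size) / SUM(Sample_Size), so that rows backed by larger samples count for more. The sketch below illustrates the arithmetic on two made-up rows (the values are hypothetical and do not come from the dataset) and contrasts it with a plain average.

```sql
-- Toy rows, invented for illustration only:
--   30.0% measured on 100 respondents, 20.0% measured on 300 respondents.
SELECT
  SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg,   -- (30*100 + 20*300) / 400
  AVG(Data_Value)                                  AS unweighted_avg  -- (30 + 20) / 2
FROM (
  SELECT 30.0 AS Data_Value, 100 AS Sample_Size
  UNION ALL
  SELECT 20.0 AS Data_Value, 300 AS Sample_Size
) toy;
```

The weighted mean works out to 9000 / 400 = 22.5, while the unweighted mean is 25. The weighted figure sits closer to the 20.0 row because that row represents three times as many respondents.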
SELECT Education, Question, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_obesity_rate
FROM omid_dataset
WHERE Class = 'Obesity / Weight Status' AND Data_Value IS NOT NULL
GROUP BY Education, Question;
Insight: Obesity is more prevalent in lower education groups (~36.8% vs. 23.9%).
SELECT LocationDesc, Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_low_veg_intake
FROM omid_dataset
WHERE Class = 'Fruits and Vegetables' AND Data_Value IS NOT NULL
GROUP BY LocationDesc, Education;
Insight: Poor nutrition (low vegetable intake) is common in states with lower education.
SELECT YearStart, Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_physical_activity
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY YearStart, Education;
Insight: The education–activity gap persists over time.
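One way to put a number on that gap is to compute, for each year, the difference between the weighted averages of the highest and lowest education groups. The query below is a sketch only: the labels 'College graduate' and 'Less than high school' are assumptions about how the Education column is coded and should be checked against the actual values in the table.

```sql
-- Yearly gap in weighted physical-activity rate between two education groups.
-- The two Education labels are assumed, not confirmed by this README.
SELECT
  YearStart,
  SUM(CASE WHEN Education = 'College graduate' THEN Data_Value * Sample_Size END)
    / SUM(CASE WHEN Education = 'College graduate' THEN Sample_Size END)
  - SUM(CASE WHEN Education = 'Less than high school' THEN Data_Value * Sample_Size END)
    / SUM(CASE WHEN Education = 'Less than high school' THEN Sample_Size END)
    AS activity_gap
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY YearStart
ORDER BY YearStart;
```

A gap that stays roughly constant in sign and size across years would support the insight. A NULL result for a year means one of the two groups has no matching rows, which usually points to a label mismatch.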
SELECT Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_low_veg_intake
FROM omid_dataset
WHERE Class = 'Fruits and Vegetables' AND Data_Value IS NOT NULL
GROUP BY Education;
Insight: Those with less than a high school education show the highest rates of low vegetable intake (~38.16%).
Screenshots from Zeppelin (place in /screenshots/):
- Bar chart: Physical Activity by Education
- Bar chart: Obesity by Education
- Line chart: Trends over years
- Heatmap: Low veg intake by year & education
- Education level is strongly associated with healthier behavior.
- More education → higher physical activity, lower obesity, better nutrition.
- Big data tools enable scalable, reproducible public health analytics.
- Centers for Disease Control and Prevention (2025). Nutrition, Physical Activity, and Obesity - BRFSS. data.gov
- Apache Hive: https://hive.apache.org
- Apache Zeppelin: https://zeppelin.apache.org
- Source code, queries, and documentation: Licensed under GNU GPL v3.0
- Slides and visual materials (e.g., .pptx, images): Licensed under CC BY-NC-ND 4.0
Omid Hashemzadeh