📊 Big Data analysis project exploring the relationship between educational attainment and nutrition quality using Hadoop, Hive, and Zeppelin on a US-wide health dataset.

omidcodes/bigdata-education-nutrition

📊 Big Data Analysis of Educational Attainment and Nutritional Quality

This project explores the relationship between educational attainment and nutritional quality using large-scale health and education data from the U.S. government. It uses big data tools (Hadoop, Hive, and Zeppelin) to process, query, and visualize over 100,000 records spanning 13 years.



🚀 Getting Started

To set up and run this project locally, see the full step-by-step guide:
📖 Setup Instructions → setup_instructions.md

📂 Project Structure

bigdata-education-nutrition/
├── data/
│   └── omid_dataset.csv
├── hive_queries/
│   └── queries.hql
├── slides/
│   └── presentation.pptx
├── screenshots/
│   └── *.png (Zeppelin charts)
├── README.md
└── setup_instructions.md

📈 Dataset Overview


🧱 Technology Stack

| Tool     | Purpose                                   |
|----------|-------------------------------------------|
| HDFS     | Distributed data storage                  |
| Hive     | SQL-like query processing                 |
| Zeppelin | Data visualization and dashboard creation |
| CSV      | Tabular data format used for ingestion    |

βš™οΈ System Architecture Flowchart

The project follows a four-stage big data pipeline:

  1. Upload .csv to HDFS
  2. Create Hive table to mirror dataset schema
  3. Run HiveQL queries for analysis
  4. Visualize data using Apache Zeppelin
πŸ“ Dataset
   └── csv file
        ↓
πŸ—„οΈ Collect & Storage
   └── Apache HDFS
        ↓
🧠 Analysing & Querying
   └── Apache Hive
        ↓
πŸ“Š Visualisation
   └── Apache Zeppelin

Each stage builds on the previous one to ensure scalable storage, analysis, and reporting of large-scale health and education data.
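
As a rough illustration of stage 2, the following HiveQL sketches how a table could be declared over a CSV that has already been uploaded to HDFS. It is not the project's actual DDL. The column types, the HDFS directory, and the column order are assumptions, and only the columns referenced by the queries in this README are listed. A real table would need to declare every column of the CSV in file order.

```sql
-- Hypothetical sketch: types, column order, and the HDFS path are assumptions.
-- Stage 1 happens outside Hive, for example with
--   hdfs dfs -put data/omid_dataset.csv /user/hadoop/omid_dataset/
CREATE EXTERNAL TABLE IF NOT EXISTS omid_dataset (
  YearStart     INT,
  LocationDesc  STRING,
  Class         STRING,
  Question      STRING,
  Data_Value    DOUBLE,
  Sample_Size   BIGINT,
  Education     STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Assumed HDFS directory that holds the uploaded CSV
LOCATION '/user/hadoop/omid_dataset'
-- Skip the CSV header row when reading
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Plain delimited text does not understand quoted fields, so a CSV whose text columns contain commas would be split incorrectly. Hive's OpenCSVSerde handles quoting, but it reads every column as STRING, so numeric columns would then need an explicit CAST in the queries.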


🧠 Key Hive Queries

1. 📊 Physical Activity by Education Level

SELECT Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_physical_activity
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY Education;

Insight: College graduates show the highest levels of physical activity (~31.23%).
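
The queries in this section weight each row's Data_Value by its Sample_Size, so rows backed by more survey respondents count for more than rows backed by few. A minimal sketch of how one might see the effect of that choice is to compute the plain and the weighted average side by side. The output column names here are illustrative only.

```sql
-- Compare an unweighted mean with the sample-size-weighted mean per education group.
SELECT Education,
       AVG(Data_Value)                                  AS unweighted_avg,
       SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY Education;
```

If the two columns differ noticeably for a group, the group's rows vary a lot in sample size, which is the case where weighting matters most.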


2. βš–οΈ Obesity Rates by Education Level

SELECT Education, Question, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_obesity_rate
FROM omid_dataset
WHERE Class = 'Obesity / Weight Status' AND Data_Value IS NOT NULL
GROUP BY Education, Question;

Insight: Obesity is more prevalent in lower education groups than in higher education groups (~36.8% vs. ~23.9%).


3. 🥦 Vegetable Intake by State and Education

SELECT LocationDesc, Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_low_veg_intake
FROM omid_dataset
WHERE Class = 'Fruits and Vegetables' AND Data_Value IS NOT NULL
GROUP BY LocationDesc, Education;

Insight: Poor nutrition (low vegetable intake) is common in states with lower education.


4. 📉 Trend: Physical Activity Over Time

SELECT YearStart, Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_physical_activity
FROM omid_dataset
WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
GROUP BY YearStart, Education;

Insight: The education–activity gap persists over time.
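
One possible way to put a number on the gap for each year is to wrap the trend query in an outer query that takes the spread between the highest and lowest education group. This is a sketch rather than part of the project's queries.hql, and the names per_group and education_gap are made up for illustration.

```sql
-- For each year, the difference between the highest and lowest
-- weighted average across education groups.
SELECT YearStart,
       MAX(weighted_avg) - MIN(weighted_avg) AS education_gap
FROM (
  SELECT YearStart,
         Education,
         SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg
  FROM omid_dataset
  WHERE Class = 'Physical Activity' AND Data_Value IS NOT NULL
  GROUP BY YearStart, Education
) per_group
GROUP BY YearStart
ORDER BY YearStart;
```

The ORDER BY keeps the years in sequence, which is convenient when the result feeds a Zeppelin line chart.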


5. 🥗 Low Vegetable Intake by Education Level

SELECT Education, SUM(Data_Value * Sample_Size) / SUM(Sample_Size) AS weighted_avg_low_veg_intake
FROM omid_dataset
WHERE Class = 'Fruits and Vegetables' AND Data_Value IS NOT NULL
GROUP BY Education;

Insight: Those with less than a high school education show the highest rates of low vegetable intake (~38.16%).
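
The query above averages every row in the 'Fruits and Vegetables' class. If that class holds more than one survey question (an assumption worth checking, not something this README states), the result blends them into one figure. A quick check, sketched here, is to list the questions the class contains before interpreting the average.

```sql
-- List the distinct survey questions inside the class and how many rows each has.
SELECT Question, COUNT(*) AS row_count
FROM omid_dataset
WHERE Class = 'Fruits and Vegetables'
GROUP BY Question;
```

If more than one question appears, adding the exact Question value to the WHERE clause would restrict the average to a single measure.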


📊 Visualization Examples

Screenshots from Zeppelin (place in /screenshots/):

  • Bar chart: Physical Activity by Education
  • Bar chart: Obesity by Education
  • Line chart: Trends over years
  • Heatmap: Low veg intake by year & education

🎯 Key Takeaways

  • Education level strongly predicts healthier behavior.
  • More education → higher physical activity, lower obesity, better nutrition.
  • Big data tools enable scalable, reproducible public health analytics.

📚 References

  1. Centers for Disease Control and Prevention (2025). Nutrition, Physical Activity, and Obesity - BRFSS. data.gov
  2. Apache Hive: https://hive.apache.org
  3. Apache Zeppelin: https://zeppelin.apache.org

πŸ›‘οΈ License

  • πŸ”“ Source code, queries, and documentation: Licensed under GNU GPL v3.0
  • πŸ–ΌοΈ Slides and visual materials (e.g., .pptx, images): Licensed under CC BY-NC-ND 4.0

✍️ Author

Omid Hashemzadeh
