
Big Data & Cloud Computing – Assignments 1 & 2

Lazaros Panitsidis – MSc Data Science 2025


📌 Overview

The coursework is split into two independent assignments:

| No. | Focus | Technology | Goal |
|-----|-------|------------|------|
| 1 | Classic MapReduce analytics on car-sales data | Python + mrjob (stand-alone & Hadoop) | Practise mapper/reducer patterns and Hadoop execution |
| 2 | Large-scale analytics on TechCrunch posts and car-sales data | PySpark 3.5 (DataFrame API) | Demonstrate Spark transformations, caching, and window functions |

Both notebooks are self-contained and resolve CSV paths automatically (via pathlib → file://…), so they run on any machine without edits.
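The path resolution mentioned above can be sketched as follows (the file name matches the dataset; the variable names are illustrative):

```python
from pathlib import Path

# Resolve the CSV relative to the current working directory and
# convert it to a file:// URI that both mrjob and Spark accept.
csv_path = Path("car_prices.csv").resolve()
csv_uri = csv_path.as_uri()  # e.g. file:///home/user/project/car_prices.csv
```

Because `resolve()` always produces an absolute path, `as_uri()` yields a valid `file://` URI on any machine, which is what makes the notebooks portable without edits.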


📂 Datasets

| Dataset | File | Size / Rows | Description | Used in |
|---------|------|-------------|-------------|---------|
| Vehicle Sales | car_prices.csv | 85 MB · 558,837 rows | 16-column log of U.S. vehicle sales | Assignments 1 & 2 |
| TechCrunch Posts | techcrunch_posts.csv | 19 MB · 42,422 rows | 2010 TechCrunch posts with metadata & in-links | Assignment 2 |

Data-quality rule: drop rows missing any of the following: year, make, odometer, color, interior, sellingprice, saledate.
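In plain Python, the rule amounts to a per-row completeness check over the listed columns (the helper name and sample rows are illustrative):

```python
REQUIRED = ["year", "make", "odometer", "color", "interior",
            "sellingprice", "saledate"]

def is_complete(row: dict) -> bool:
    """Keep a row only if every required field is present and non-empty."""
    return all(row.get(col) not in (None, "") for col in REQUIRED)

rows = [
    {"year": "2015", "make": "Kia", "odometer": "5000", "color": "white",
     "interior": "black", "sellingprice": "21500", "saledate": "2014-12-16"},
    {"year": "2015", "make": "Kia", "odometer": "", "color": "white",
     "interior": "black", "sellingprice": "21500", "saledate": "2014-12-16"},
]
clean = [r for r in rows if is_complete(r)]  # second row is dropped
```

In Spark the null case of the same rule can be expressed as `df.dropna(subset=REQUIRED)`; note that `dropna` alone does not catch empty strings, so an explicit filter may still be needed.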


🗂️ Assignment 1 – MapReduce with mrjob

Tasks

  • Task 1: Yearly statistics – count, total value, avg odometer, avg age
  • Task 2: Yearly statistics by brand – same metrics grouped by make and year
  • Task 3: Sales by exterior color – count and total value per color
  • Task 4: Top color combinations – count per (exterior, interior) pair
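Task 1's mapper/reducer pattern can be sketched without mrjob as plain functions; the simplified ISO-style `saledate` parsing and the sample rows below are illustrative, not the dataset's actual format:

```python
from collections import defaultdict

def mapper(row):
    """Emit (year, (1, price, odometer, age)) for one sale record."""
    year = int(row["year"])
    sale_year = int(row["saledate"][:4])  # assumes an ISO-style date
    yield year, (1, float(row["sellingprice"]),
                 float(row["odometer"]), sale_year - year)

def reducer(year, values):
    """Aggregate count, total value, avg odometer, and avg age per year."""
    count = total = odo = age = 0
    for c, price, odometer, car_age in values:
        count += c
        total += price
        odo += odometer
        age += car_age
    return year, {"count": count, "total_value": total,
                  "avg_odometer": odo / count, "avg_age": age / count}

# Shuffle step: group mapper output by key, then reduce each group.
rows = [{"year": "2014", "sellingprice": "20000", "odometer": "10000",
         "saledate": "2015-06-01"},
        {"year": "2014", "sellingprice": "30000", "odometer": "20000",
         "saledate": "2015-06-02"}]
groups = defaultdict(list)
for row in rows:
    for key, val in mapper(row):
        groups[key].append(val)
results = dict(reducer(k, v) for k, v in groups.items())
```

In mrjob the same two functions become `MRJob.mapper` and `MRJob.reducer` methods, and the framework supplies the shuffle.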

Execution Modes

  • Stand-alone (local filesystem)
  • Distributed (Hadoop + HDFS)
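The two modes differ only in the mrjob runner flag; the HDFS path below is illustrative:

```shell
# Stand-alone: mrjob's default inline runner, local filesystem
python task1.py car_prices.csv > task1_out/standalone.txt

# Distributed: submit to a Hadoop cluster, reading from HDFS
python task1.py -r hadoop hdfs:///data/car_prices.csv
```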

Each task script (taskX.py) writes outputs to:

taskX_out/
├── taskX_standalone_<runtime>s
└── taskX_distributed_<runtime>s

🗂️ Assignment 2 – PySpark Analytics

TechCrunch Blog Tasks

  • 1. Most active dates – post count per date
  • 2. Most cited authors – total inlinks per author
  • 3. Average post length – average word count per author
  • 4. Author h-index – computed with window functions
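The h-index definition used in task 4 (the largest h such that an author has h posts with at least h inlinks each) can be checked against a plain-Python reference; the notebook computes it with Spark window functions instead:

```python
def h_index(inlink_counts):
    """Largest h such that h posts each have at least h inlinks."""
    ranked = sorted(inlink_counts, reverse=True)
    h = 0
    for rank, inlinks in enumerate(ranked, start=1):
        if inlinks >= rank:
            h = rank  # the rank-th post still has >= rank inlinks
        else:
            break
    return h

h_index([10, 8, 5, 4, 3])  # -> 4
```

The window-function version follows the same logic: rank posts per author by descending inlink count, then take the maximum rank where count >= rank.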

Vehicle Sales Tasks

  • 1. Yearly stats – count, total value, avg odometer, avg age
  • 2. Yearly stats by brand – same metrics grouped by make and year
  • 3. Exterior color analysis – total sales and volume per color
  • 4. Color combination analysis – count per (exterior, interior) pair
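Task 4's pair counting reduces to a group-by over (exterior, interior); a plain-Python equivalent of the DataFrame aggregation, with made-up sample rows:

```python
from collections import Counter

sales = [
    {"color": "black", "interior": "black"},
    {"color": "black", "interior": "beige"},
    {"color": "black", "interior": "black"},
    {"color": "white", "interior": "black"},
]
# Count each (exterior, interior) pair, then take the most frequent one.
combo_counts = Counter((s["color"], s["interior"]) for s in sales)
top = combo_counts.most_common(1)  # [(("black", "black"), 2)]
```

In PySpark the same result comes from `df.groupBy("color", "interior").count()` followed by a descending sort on the count column.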

Technical Highlights

  • Pure DataFrame API (no RDDs)
  • Auto-resolved file paths with pathlib.Path().as_uri()
  • Caching/unpersisting where needed for memory efficiency
  • Use of Spark SQL and window functions for aggregation logic

📁 Project Structure

Big-Data-Cloud-Computing/
├── BDCC_2025_Assignment1/
│   ├── Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb
│   ├── task1.py … task4.py
│   ├── car_prices.csv
│   └── task*_out/
└── BDCC_2025_Assignment2/
    ├── Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb
    ├── techcrunch_posts.csv
    ├── car_prices.csv
    └── README.md

📤 Submission Instructions

Submit the following notebooks:

  • Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb
  • Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb

Do not include dataset files.
Ensure both notebooks run correctly on any machine without code edits.


👤 Author

Lazaros Panitsidis


📅 Deadlines

  • Assignment 1: 05/05/2025 at 22:00
  • Assignment 2: 02/06/2025 at 22:00

⚠️ No extensions will be granted. Please verify correctness, performance, and reproducibility before submission.