Lazaros Panitsidis – MSc Data Science 2025
The coursework is split into two independent assignments:
| No. | Focus | Technology | Goal |
|---|---|---|---|
| 1 | Classic MapReduce analytics on car-sales data | Python + mrjob (stand-alone & Hadoop) | Practise mapper/reducer patterns and Hadoop execution |
| 2 | Large-scale analytics on TechCrunch posts and car-sales data | PySpark 3.5 (DataFrame API) | Demonstrate Spark transformations, caching and window functions |
Both notebooks are self-contained and resolve CSV paths automatically (via `pathlib` → `file://…`), so they run on any machine without edits.
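As a sketch of that path resolution (the actual notebook cells may differ), a relative CSV name can be turned into an absolute `file://` URI like this — `car_prices.csv` here is just an illustrative filename:

```python
from pathlib import Path

# Resolve a dataset name relative to the current working directory,
# then convert it to a file:// URI that Spark and mrjob both accept.
csv_uri = Path("car_prices.csv").resolve().as_uri()
print(csv_uri)  # e.g. file:///home/user/project/car_prices.csv
```

Because the path is resolved at run time, the same cell works unchanged on any machine.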
| Dataset | File | Size / Rows | Description | Used in |
|---|---|---|---|---|
| Vehicle Sales | `car_prices.csv` | 85 MB · 558,837 rows | 16-column log of U.S. vehicle sales | Assignments 1 & 2 |
| TechCrunch Posts | `techcrunch_posts.csv` | 19 MB · 42,422 rows | 2010 TechCrunch posts with metadata & in-links | Assignment 2 |
Data-quality rule: drop rows missing any of the following fields: `year`, `make`, `odometer`, `color`, `interior`, `sellingprice`, `saledate`.
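A minimal illustration of that rule (field names taken from the list above; the actual scripts may apply it inside the mrjob mapper or via a Spark `dropna`):

```python
# Required fields from the data-quality rule above.
REQUIRED = ["year", "make", "odometer", "color", "interior",
            "sellingprice", "saledate"]

def keep_row(row: dict) -> bool:
    """Return True only if every required field is present and non-empty."""
    return all(row.get(field) not in (None, "") for field in REQUIRED)

# Example: the second record has an empty color, so it is dropped.
rows = [
    {"year": "2014", "make": "Kia", "odometer": "16639", "color": "white",
     "interior": "black", "sellingprice": "21500", "saledate": "2014-12-16"},
    {"year": "2015", "make": "BMW", "odometer": "1331", "color": "",
     "interior": "black", "sellingprice": "30000", "saledate": "2015-01-14"},
]
clean = [r for r in rows if keep_row(r)]  # keeps only the Kia record
```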
- Task 1: Yearly statistics – count, total value, avg odometer, avg age
- Task 2: Yearly statistics by brand – same metrics grouped by `make` and `year`
- Task 3: Sales by exterior color – count and total value per color
- Task 4: Top color combinations – count per (exterior, interior) pair
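As a rough sketch of the mapper/reducer pattern behind Task 3 — plain Python functions here with a tiny in-memory shuffle; the real `taskX.py` scripts use mrjob's `MRJob` class:

```python
from collections import defaultdict

def mapper(row):
    """Emit (exterior_color, (count, value)) for one sale record."""
    yield row["color"], (1, float(row["sellingprice"]))

def reducer(color, pairs):
    """Sum counts and total selling value per color."""
    counts, values = zip(*pairs)
    yield color, (sum(counts), sum(values))

# Minimal driver standing in for the MapReduce shuffle phase.
rows = [{"color": "white", "sellingprice": "21500"},
        {"color": "white", "sellingprice": "30000"},
        {"color": "black", "sellingprice": "27750"}]
grouped = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        grouped[key].append(value)
results = dict(kv for key, vals in grouped.items() for kv in reducer(key, vals))
# results → {"white": (2, 51500.0), "black": (1, 27750.0)}
```

Tasks 1, 2 and 4 follow the same shape, only with different keys (year, make/year, color pair) and value tuples.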
- Stand-alone (local filesystem)
- Distributed (Hadoop + HDFS)
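The two execution modes map onto mrjob's runners; the commands below are illustrative, and the HDFS path and username are placeholders:

```shell
# Stand-alone: mrjob's default local runner, reading from the local filesystem
python task1.py car_prices.csv > task1_standalone.out

# Distributed: mrjob's Hadoop runner, reading the input from HDFS
python task1.py -r hadoop hdfs:///user/<user>/car_prices.csv > task1_distributed.out
```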
Each task script (`taskX.py`) writes its outputs to:

```
taskX_out/
├── taskX_standalone_<runtime>s
└── taskX_distributed_<runtime>s
```
1. Most active dates – post count per date
2. Most cited authors – total inlinks per author
3. Average post length – average word count per author
4. Author h-index – computed with window functions
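The window-function idea behind the h-index task can be checked in plain Python: rank an author's posts by in-links in descending order, then h is the largest rank r whose post still has at least r in-links (in Spark this corresponds to `row_number()` over a window partitioned by author). A hypothetical sketch of just that logic:

```python
def h_index(inlinks):
    """h = largest r such that the r-th most-cited post has >= r in-links."""
    ranked = sorted(inlinks, reverse=True)
    # Since ranked is non-increasing, c >= r holds for exactly the first h ranks.
    return sum(1 for r, c in enumerate(ranked, start=1) if c >= r)

print(h_index([5, 4, 2]))  # → 2 (two posts with at least 2 in-links each)
print(h_index([1, 1]))     # → 1
```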
1. Yearly stats – count, total value, avg odometer, avg age
2. Yearly stats by brand – same metrics grouped by make and year
3. Exterior color analysis – total sales and volume per color
4. Color combination analysis – count per (exterior, interior) pair
- Pure DataFrame API (no RDDs)
- Auto-resolved file paths with `pathlib.Path().as_uri()`
- Caching/unpersisting where needed for memory efficiency
- Use of Spark SQL and window functions for aggregation logic
```
Big-Data-Cloud-Computing/
├── BDCC_2025_Assignment1/
│   ├── Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb
│   ├── task1.py … task4.py
│   ├── car_prices.csv
│   └── task*_out/
└── BDCC_2025_Assignment2/
    ├── Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb
    ├── techcrunch_posts.csv
    ├── car_prices.csv
    └── README.md
```
Submit the following notebooks:

- `Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb`
- `Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb`
Do not include dataset files.
Ensure both notebooks run correctly on any machine without code edits.
Lazaros Panitsidis
- Assignment 1: 05/05/2025 at 22:00
- Assignment 2: 02/06/2025 at 22:00
⚠️ No extensions will be granted. Please verify correctness, performance, and reproducibility before submission.