Lazaros Panitsidis – MSc Data Science 2025
The coursework is split into two independent assignments:
| No. | Focus | Technology | Goal |
|---|---|---|---|
| 1 | Classic MapReduce analytics on car-sales data | Python + mrjob (stand-alone & Hadoop) | Practise mapper/reducer patterns and Hadoop execution |
| 2 | Large-scale analytics on TechCrunch posts and car-sales data | PySpark 3.5 (DataFrame API) | Demonstrate Spark transformations, caching and window functions |
Both notebooks are self-contained and resolve CSV paths automatically (via `pathlib` → `file://…`), so they run on any machine without edits.
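As a sketch of that path resolution (the actual notebook cells may differ), a relative CSV name can be turned into an absolute `file://` URI like this — `car_prices.csv` here is just an illustrative filename:

```python
from pathlib import Path

# Resolve a dataset name relative to the current working directory,
# then convert it to a file:// URI that Spark and mrjob both accept.
csv_uri = Path("car_prices.csv").resolve().as_uri()
print(csv_uri)  # e.g. file:///home/user/project/car_prices.csv
```

Because the path is resolved at run time, the same cell works unchanged on any machine.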
| Dataset | File | Size / Rows | Description | Used in |
|---|---|---|---|---|
| Vehicle Sales | `car_prices.csv` | 85 MB · 558,837 rows | 16-column log of U.S. vehicle sales | Assignments 1 & 2 |
| TechCrunch Posts | `techcrunch_posts.csv` | 19 MB · 42,422 rows | 2010 TechCrunch posts with metadata & in-links | Assignment 2 |
Data-quality rule: drop rows missing any of the following fields: `year`, `make`, `odometer`, `color`, `interior`, `sellingprice`, `saledate`.
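A minimal illustration of that rule (field names taken from the list above; the actual scripts may apply it inside the mrjob mapper or via a Spark `dropna`):

```python
# Required fields from the data-quality rule above.
REQUIRED = ["year", "make", "odometer", "color", "interior",
            "sellingprice", "saledate"]

def keep_row(row: dict) -> bool:
    """Return True only if every required field is present and non-empty."""
    return all(row.get(field) not in (None, "") for field in REQUIRED)

# Example: the second record has an empty color, so it is dropped.
rows = [
    {"year": "2014", "make": "Kia", "odometer": "16639", "color": "white",
     "interior": "black", "sellingprice": "21500", "saledate": "2014-12-16"},
    {"year": "2015", "make": "BMW", "odometer": "1331", "color": "",
     "interior": "black", "sellingprice": "30000", "saledate": "2015-01-14"},
]
clean = [r for r in rows if keep_row(r)]  # keeps only the Kia record
```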
- Task 1: Yearly statistics – count, total value, avg odometer, avg age
- Task 2: Yearly statistics by brand – same metrics grouped by `make` and `year`
- Task 3: Sales by exterior color – count and total value per color
- Task 4: Top color combinations – count per (exterior, interior) pair
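As a rough sketch of the mapper/reducer pattern behind Task 3 — plain Python functions here with a tiny in-memory shuffle; the real `taskX.py` scripts use mrjob's `MRJob` class:

```python
from collections import defaultdict

def mapper(row):
    """Emit (exterior_color, (count, value)) for one sale record."""
    yield row["color"], (1, float(row["sellingprice"]))

def reducer(color, pairs):
    """Sum counts and total selling value per color."""
    counts, values = zip(*pairs)
    yield color, (sum(counts), sum(values))

# Minimal driver standing in for the MapReduce shuffle phase.
rows = [{"color": "white", "sellingprice": "21500"},
        {"color": "white", "sellingprice": "30000"},
        {"color": "black", "sellingprice": "27750"}]
grouped = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        grouped[key].append(value)
results = dict(kv for key, vals in grouped.items() for kv in reducer(key, vals))
# results → {"white": (2, 51500.0), "black": (1, 27750.0)}
```

Tasks 1, 2 and 4 follow the same shape, only with different keys (year, make/year, color pair) and value tuples.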
- Stand-alone (local filesystem)
- Distributed (Hadoop + HDFS)
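The two execution modes map onto mrjob's runners; the commands below are illustrative, and the HDFS path and username are placeholders:

```shell
# Stand-alone: mrjob's default local runner, reading from the local filesystem
python task1.py car_prices.csv > task1_standalone.out

# Distributed: mrjob's Hadoop runner, reading the input from HDFS
python task1.py -r hadoop hdfs:///user/<user>/car_prices.csv > task1_distributed.out
```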
Each task script (`taskX.py`) writes its outputs to:

```
taskX_out/
├── taskX_standalone_<runtime>s
└── taskX_distributed_<runtime>s
```
1. Most active dates – post count per date
2. Most cited authors – total inlinks per author
3. Average post length – average word count per author
4. Author h-index – computed with window functions
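The window-function idea behind the h-index task can be checked in plain Python: rank an author's posts by in-links in descending order, then h is the largest rank r whose post still has at least r in-links (in Spark this corresponds to `row_number()` over a window partitioned by author). A hypothetical sketch of just that logic:

```python
def h_index(inlinks):
    """h = largest r such that the r-th most-cited post has >= r in-links."""
    ranked = sorted(inlinks, reverse=True)
    # Since ranked is non-increasing, c >= r holds for exactly the first h ranks.
    return sum(1 for r, c in enumerate(ranked, start=1) if c >= r)

print(h_index([5, 4, 2]))  # → 2 (two posts with at least 2 in-links each)
print(h_index([1, 1]))     # → 1
```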
1. Yearly stats – count, total value, avg odometer, avg age
2. Yearly stats by brand – same metrics grouped by make and year
3. Exterior color analysis – total sales and volume per color
4. Color combination analysis – count per (exterior, interior) pair
- Pure DataFrame API (no RDDs)
- Auto-resolved file paths with `pathlib.Path().as_uri()`
- Caching/unpersisting where needed for memory efficiency
- Use of Spark SQL and window functions for aggregation logic
```
Big-Data-Cloud-Computing/
├── BDCC_2025_Assignment1/
│   ├── Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb
│   ├── task1.py … task4.py
│   ├── car_prices.csv
│   └── task*_out/
└── BDCC_2025_Assignment2/
    ├── Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb
    ├── techcrunch_posts.csv
    ├── car_prices.csv
    └── README.md
```
Submit the following notebooks:

- `Assignment_1_MapReduce_2025_Lazaros_Panitsidis.ipynb`
- `Assignment_2_Spark_2025_Lazaros_Panitsidis.ipynb`
Do not include dataset files.
Ensure both notebooks run correctly on any machine without code edits.
Lazaros Panitsidis
- Assignment 1: 05/05/2025 at 22:00
- Assignment 2: 02/06/2025 at 22:00
⚠️ No extensions will be granted. Please verify correctness, performance, and reproducibility before submission.