Diploma thesis for ECE NTUA in the course 3189 Advanced Topics in Database Systems under the supervision of Prof. Dimitrios Tsoumakos.

kon-si/ntua_thesis

Diploma Thesis

Title: Data Lakehouse Performance Study.
Author: Konstantinos Sideris.
Supervisor: Dimitrios Tsoumakos.
Institution: Department of Computer Science, National Technical University of Athens.
Date: June 2025.

Abstract

As organisations increasingly adopt lakehouse architectures to support big data analytics, understanding the performance trade-offs of using enhanced storage layers instead of standard data lake architectures is essential. This master's dissertation presents a comprehensive performance evaluation of two leading data lakehouse solutions, Delta Lake and Apache Hudi, focusing on both batch and stream processing workloads. Through the benchmarking process, we compare Delta Lake and Hudi against standard data lake implementations, which consist of a simple storage layer queried by an analytics engine, in this case HDFS and Apache Spark. Being built on top of data lakes, lakehouses leverage their strengths while introducing new features, such as ACID transactions, schema enforcement, schema evolution and data governance mechanisms, to address the issues data lakes face. Additionally, they introduce optimisations, such as indexing, data skipping and partition pruning, to further improve on them. Throughout this thesis, we present these features and, through benchmarks, evaluate how they improve performance and whether the added functionality justifies the use of lakehouses, even in cases where they may underperform.

Keywords: Big Data, Data Lakes, Batch Processing, Stream Processing, Delta Lake, Apache Hudi.
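
As a rough illustration of two of the optimisations named in the abstract, the toy sketch below shows the idea behind partition pruning and data skipping in plain Python. It is not the Delta Lake or Hudi API and is not taken from the thesis; the `FileStats` class, the `files_to_scan` function and the file names are hypothetical, and real lakehouses keep comparable statistics in their own table metadata.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file metadata: a partition value plus min/max of one column."""
    path: str
    partition: str
    min_value: int
    max_value: int


def files_to_scan(files, partition, lower, upper):
    """Return paths that may hold rows in `partition` with lower <= value <= upper."""
    selected = []
    for f in files:
        if f.partition != partition:
            continue  # partition pruning: the whole partition is irrelevant
        if f.max_value < lower or f.min_value > upper:
            continue  # data skipping: min/max range cannot overlap the filter
        selected.append(f.path)
    return selected


# Illustrative metadata only; these files do not exist in the repository.
files = [
    FileStats("part-0.parquet", "2025-01", 0, 99),
    FileStats("part-1.parquet", "2025-01", 100, 199),
    FileStats("part-2.parquet", "2025-02", 0, 199),
]

print(files_to_scan(files, "2025-01", 120, 150))
```

In this sketch only one of the three files has to be read for the example filter; a plain data lake without such metadata would generally have to open every file to answer the same query.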

Repository Structure

  • batch_proc/ Files related to the batch processing section of the thesis.
  • setup/ Overview of the hardware and software used, and of the installation process.
  • stream_proc/ Files related to the stream processing section of the thesis.
  • ntua_thesis.pdf Full text of the diploma thesis in Greek and English.
  • ntua_thesis_presentation.pdf Presentation of the diploma thesis in Greek.

Full Text

The complete thesis is available here.
