Title: Data Lakehouse Performance Study.
Author: Konstantinos Sideris.
Supervisor: Dimitrios Tsoumakos.
Institution: Department of Computer Science, National Technical University of Athens.
Date: June 2025.
As organisations increasingly adopt lakehouse architectures to support big data analytics, understanding the performance trade-offs of using enhanced storage layers instead of standard data lake architectures is essential. This master's dissertation presents a comprehensive performance evaluation of two leading data lakehouse solutions, Delta Lake and Apache Hudi, covering both batch and stream processing workloads. In the benchmarks, we compare Delta Lake and Hudi against a standard data lake implementation, which consists of a simple storage layer queried by an analytics engine, in this case HDFS and Apache Spark. Because lakehouses are built on top of data lakes, they retain the strengths of data lakes while introducing new features, such as ACID transactions, schema enforcement, schema evolution and data governance mechanisms, to address the issues data lakes face. They also introduce optimisations, such as indexing, data skipping and partition pruning, to improve performance further. Throughout this thesis, we present these features and use benchmarks to evaluate how they improve performance and whether the added functionality justifies the use of lakehouses, even in cases where they may underperform.
Keywords: Big Data, Data Lakes, Batch Processing, Stream Processing, Delta Lake, Apache Hudi.
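The abstract names data skipping and partition pruning among the lakehouse optimisations. As a rough illustration of the idea only, the sketch below filters a list of files using per-file metadata before any data is read. The FileStats record, its field names and the files_to_scan helper are hypothetical and are not taken from Delta Lake, Hudi or the thesis code; the actual engines keep richer statistics in their own table metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file metadata of the kind a table log might record (hypothetical layout)."""
    path: str
    partition: str   # value of the partition column, e.g. an event date
    min_value: int   # minimum of a numeric column within this file
    max_value: int   # maximum of the same column within this file


def files_to_scan(files, partition, lower, upper):
    """Return paths that may contain rows with the given partition value
    and lower <= value <= upper. All other files are skipped unread."""
    selected = []
    for f in files:
        if f.partition != partition:
            continue  # partition pruning: wrong partition, never opened
        if f.max_value < lower or f.min_value > upper:
            continue  # data skipping: value range cannot overlap the filter
        selected.append(f.path)
    return selected


# Illustrative metadata, not measurements from the thesis.
files = [
    FileStats("part-0.parquet", "2025-06-01", 0, 99),
    FileStats("part-1.parquet", "2025-06-01", 100, 199),
    FileStats("part-2.parquet", "2025-06-02", 0, 199),
]

print(files_to_scan(files, "2025-06-01", 120, 150))  # ['part-1.parquet']
```

A plain data lake without such metadata would typically have to open all three files to answer the same filter, which is the kind of difference the benchmarks in the thesis set out to measure.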
batch_proc/
Files related to the batch processing section of the thesis.
setup/
Overview of the hardware and software used, and of the installation process.
stream_proc/
Files related to the stream processing section of the thesis.
ntua_thesis.pdf
Full text of the diploma thesis in Greek and English.
ntua_thesis_presentation.pdf
Presentation of the diploma thesis in Greek.
The complete thesis is available here.