Name | Email |
---|---|
Lukas Chin | lchin10@bu.edu |
Lucia Gill | luciagil@bu.edu |
David Li | dav@bu.edu |
Jason Li | jli3469@bu.edu |
Jiawei Sun | alinajw@bu.edu |
Name | Email |
---|---|
Dan Lambright | dlambrig@gmail.com |
Orran Krieger | okrieg@bu.edu |
Ata Turk | ataturk@bu.edu |
To install and run locally, ensure you have the dependencies:
sudo apt install openjdk-11-jdk maven libatomic1
Begin by initializing the Postgres and ClickHouse Docker images. It is recommended to configure Docker so it can be run by non-root users.
scripts/initialize_postgres_docker.sh
scripts/initialize_clickhouse_docker.sh
Once the Docker containers have been initialized, ensure that they are up and running with:
docker ps
Now let's compile our HTAP system and use ETL!
cd etl-demo
mvn clean && mvn package && mvn spring-boot:run
This should compile the system, and a simple webpage built with Spring Boot will be hosted on localhost:8081. By visiting the webpage, you can see all of the parameters required to trigger the ETL process between the OLTP PostgreSQL database and the OLAP ClickHouse database. Before starting the ETL process between the two databases, we first need data in the OLTP database to transfer over to the OLAP database! We put together a quick way to generate mock "social media" data so you can trigger the ETL and observe the functionality of our program.
For more detailed instructions and function explanations, please check the etl-demo_readme
To generate this mock data, go back to the sql folder in the repository root and run the Python script that generates it. Then go back to the scripts folder in the root and run add_postgres_data.sh to automatically add this mock data into the PostgreSQL database. For more detailed instructions, check out SETUP.md.
cd ../sql/dummy-data
python3 testing_data_generator.py
cd ../../scripts
./add_postgres_data.sh
Now that we have mock data in PostgreSQL, we can trigger an ETL process to the OLAP ClickHouse database. On the webpage, fill in the following parameters:
- OLTP Database URL: jdbc:postgresql://localhost:5432/social_media
- OLTP Username: postgres
- OLTP Password: dbos
- OLAP Database URL: jdbc:clickhouse://localhost:8123
- OLAP Username: default
- OLAP Password: (left empty)
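For reference, here is a minimal sketch of how these parameters map onto plain JDBC connections. This is illustrative only (not the actual etl-demo code) and assumes the standard PostgreSQL and ClickHouse JDBC drivers are on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionCheck {
    public static void main(String[] args) throws SQLException {
        // OLTP side: PostgreSQL with the credentials used by the demo containers
        Connection oltp = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/social_media", "postgres", "dbos");

        // OLAP side: ClickHouse; the default user has an empty password
        Connection olap = DriverManager.getConnection(
                "jdbc:clickhouse://localhost:8123", "default", "");

        System.out.println("OLTP connected: " + !oltp.isClosed());
        System.out.println("OLAP connected: " + !olap.isClosed());

        oltp.close();
        olap.close();
    }
}
```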
Click Start ETL! This will begin the ETL process. Once started, it will continue processing ETL batches every 10 minutes until prompted to stop. You can view the OLAP ClickHouse database directly by clicking OLAP Database, where you can observe all rows of each table in each ClickHouse database.
ETL only applies to newly added or changed data, so when ETL is started again, only new or changed data is propagated to the OLAP database, improving ETL performance and speed.
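As a rough illustration of this incremental behavior, the sketch below pulls only rows changed since the last ETL batch. It assumes a hypothetical posts table with an updated_at column; the actual etl-demo may track changes differently:

```java
import java.sql.*;

public class IncrementalExtractSketch {
    // Extract only rows modified since the last successful ETL batch.
    // 'lastSync' would normally be persisted between batches.
    static ResultSet extractChangedRows(Connection oltp, Timestamp lastSync) throws SQLException {
        PreparedStatement stmt = oltp.prepareStatement(
                "SELECT * FROM posts WHERE updated_at > ?");  // hypothetical schema
        stmt.setTimestamp(1, lastSync);
        return stmt.executeQuery();
    }
}
```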
You can also try our HTAP program, hosted on MOC NERC, here.
- What is Epoxy
Epoxy is a protocol for providing ACID transactions across diverse data stores. It uses a primary transactional DBMS as a transaction coordinator for transactions among it and several potentially non-transactional secondary data stores. Epoxy is implemented in shim layers co-located with these data stores which intercept and interpose on client requests, but do not require modifications to the stores themselves.
- Basic Structure
Epoxy leverages the Postgres transactional database as the coordinator and extends multiversion concurrency control (MVCC) for cross-data-store isolation. It provides isolation as well as atomicity and durability through its optimistic concurrency control (OCC) plus two-phase commit (2PC) protocol. Epoxy was implemented as a bolt-on shim layer for five diverse data stores: Postgres, MySQL, Elasticsearch, MongoDB, and Google Cloud Storage (GCS).
- How it functions
- Each Epoxy transaction is linked to a snapshot, representing the set of all past transactions visible to it.
- Epoxy secondary store shims enhance record versions with metadata to facilitate transactional read operations. Visibility of a record version to a transaction is determined by the presence of beginTxn in the transaction's snapshot and the absence of endTxn in that snapshot (see the sketch after this list).
- Due to the two-phase commit (2PC) protocol used by Epoxy, the secondary stores prepare first inside their databases, and then the primary concludes the transaction with a commit (if all operations succeed on all data stores) or an abort (if any operation fails).
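A minimal sketch of the visibility rule described above, using hypothetical RecordVersion and Snapshot types (not the actual Epoxy classes):

```java
import java.util.Set;

// Hypothetical types for illustration; the real Epoxy shims store this
// metadata alongside each record version in the secondary store.
class RecordVersion {
    final long beginTxn;   // transaction that created this version
    final Long endTxn;     // transaction that superseded it, or null if still current
    RecordVersion(long beginTxn, Long endTxn) { this.beginTxn = beginTxn; this.endTxn = endTxn; }
}

class Snapshot {
    final Set<Long> visibleTxns;   // ids of all past transactions visible to this transaction
    Snapshot(Set<Long> visibleTxns) { this.visibleTxns = visibleTxns; }

    // A version is visible if its creating transaction is in the snapshot
    // and its superseding transaction (if any) is not.
    boolean isVisible(RecordVersion v) {
        boolean created = visibleTxns.contains(v.beginTxn);
        boolean superseded = v.endTxn != null && visibleTxns.contains(v.endTxn);
        return created && !superseded;
    }
}
```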
The goal of this project is to create an implementation of Epoxy that enables accessing multiple databases (MySQL, ClickHouse, Postgres) or storage systems (S3) through a single interface, allowing rapid reads and writes between databases.
- David
- Wants: David is the creator of a new social media platform called BUBook and needs to store every action that every user takes on his platform, including liking posts, followers, favorites, etc.
- Needs: David needs a fast database to collect all of this information, and he wants it to be updated in real-time.
- How: When David likes a post, a write operation is triggered. Epoxy (the coordinator) then starts the transaction, handling the necessary write operations to an OLTP database, which might mean retrieving a row with all columns and then updating specific columns. Next, the transaction is committed if all operations are successful in the related data stores.
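As a sketch of what such a write could look like at the JDBC level (the posts table and likes column are hypothetical, not the actual BUBook schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LikePostSketch {
    // Increment a post's like counter inside a single OLTP transaction.
    static void likePost(Connection oltp, long postId) throws SQLException {
        oltp.setAutoCommit(false);  // start an explicit transaction
        try (PreparedStatement stmt = oltp.prepareStatement(
                "UPDATE posts SET likes = likes + 1 WHERE id = ?")) {  // hypothetical schema
            stmt.setLong(1, postId);
            stmt.executeUpdate();
            oltp.commit();           // commit only if the update succeeded
        } catch (SQLException e) {
            oltp.rollback();         // abort on any failure
            throw e;
        }
    }
}
```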
- Jason
- Wants: Jason is the main software developer of BUBook and wants to write a new algorithm every week to get better user retention on the platform.
- Needs: Jason needs to extract, transform, and load all of the data from David's fast database into an analytical database so that he can process the data and write a new and more effective algorithm.
- How: Jason uses an ETL (Extract, Transform, Load) pipeline to process the data. First, he extracts the relevant data from David's OLTP database. Then, he transforms this data into a format suitable for analysis, potentially aggregating user actions, calculating engagement metrics, and deriving insights. Finally, he loads this processed data into an OLAP database optimized for complex queries and data analysis. This ETL process runs periodically, ensuring Jason has up-to-date data for developing and refining his user retention algorithms.
- Lucia
- Wants: Lucia is the marketing manager of BUBook and wants to better advertise the platform so that more new users register and begin using BUBook.
- Needs: Lucia needs to fetch the data from Jason's analytical database to create inferences on making more captivating digital ads.
- How: When Lucia fetches the data, a read operation is triggered. Epoxy then starts the transaction, handling the necessary read operations against the OLAP database, which might mean retrieving information from a column across all rows. Next, the transaction is committed if all operations are successful in the related data stores.
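As an illustration of the kind of analytical read Jason and Lucia would run against the OLAP store, here is a minimal JDBC sketch against ClickHouse (the user_actions table and its columns are hypothetical, not the actual BUBook schema):

```java
import java.sql.*;

public class EngagementReportSketch {
    // Aggregate per-user engagement from the OLAP store, e.g., for ad targeting.
    static void printTopUsers(Connection olap) throws SQLException {
        String query =
                "SELECT user_id, count(*) AS actions " +   // hypothetical schema
                "FROM user_actions " +
                "GROUP BY user_id " +
                "ORDER BY actions DESC " +
                "LIMIT 10";
        try (Statement stmt = olap.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                System.out.println(rs.getLong("user_id") + " -> " + rs.getLong("actions"));
            }
        }
    }
}
```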
The scope of this project is to build a Hybrid Transactional/Analytical Processing (HTAP) system using Epoxy, integrating OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) databases through a unified interface. This Epoxy-based solution will enable transactions across diverse data stores without relying on the costly two-phase commit protocol. We will provide both a cloud and a local version.
- Integration of OLTP and OLAP Databases
- Implement an HTAP system using Epoxy to combine OLTP (Postgres) and OLAP (ClickHouse) databases.
- Handle both real-time and batched updates with a focus on atomicity (ensuring complete or no transactions) and isolation (transactions are executed independently of each other).
- OLTP Operations (Handling Transactions)
- Develop and optimize OLTP operations in MySQL for real-time, high-frequency transactions.
- Support transactional integrity and maintain data consistency within transactions.
- OLAP Operations (Running Analytics)
- Implement OLAP operations using DuckDB to handle large-scale analytical queries, enabling batch processing and complex data analysis.
- Ensure that OLAP queries can efficiently process large datasets without affecting the performance of real-time OLTP transactions.
- ETL Workflow Development (a new feature, different from vanilla Epoxy)
- Build an Extract-Transform-Load (ETL) process to transfer data between OLTP and OLAP systems, ensuring seamless updates and synchronization between the two systems.
- Improve write efficiency in the ETL data transfer process while (ideally) maintaining the same level of isolation as Epoxy.
- Performance Optimization
- Optimize the system for performance, including disk page compression and other improvements for minimizing time and resources during transactions.
- Benchmarking and Testing
- Benchmark the HTAP system against existing solutions (e.g., Apache Hive) to evaluate its performance on an existing database.
- Two-Phase Commit Protocol
- Definition: the two-phase commit protocol (2PC, tupac) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction.
- Two phases:
In the commit-request phase (or voting phase), the coordinator process sends a prepare request to all the participating nodes. Each participant tries to execute the transaction locally. If the local transaction was successful and the participant is ready to commit, it votes "Yes"; otherwise, it votes "No".
In the commit phase, based on the participants' votes, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (if any has voted "No") and notifies all the participants of the result. The participants then take the needed action (commit or abort) with their local transactional resources (also called recoverable resources, e.g., database data) and their respective portions of the transaction's other output (if applicable).
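A minimal sketch of the coordinator's decision logic across the two phases, using a hypothetical Participant interface (for illustration only):

```java
import java.util.List;

// Hypothetical participant interface, just to illustrate the 2PC flow.
interface Participant {
    boolean prepare();   // vote "Yes" (true) if the local transaction can commit
    void commit();
    void abort();
}

class TwoPhaseCommitSketch {
    // Phase 1: collect votes. Phase 2: commit only if every participant voted "Yes".
    static void run(List<Participant> participants) {
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allYes = false; break; }
        }
        for (Participant p : participants) {
            if (allYes) p.commit(); else p.abort();
        }
    }
}
```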
- Related resources: Wikipedia's 2PC definition, "Two-phase commit and beyond" by Murat
- The project will not rely on the traditional two-phase commit protocol for ensuring transactional integrity, as Epoxy is designed to operate without it.
- Modifications to Database Systems
- There will be no modifications to the OLTP and OLAP database systems, since the sole goal of this project is to integrate the two through Epoxy's interface.
A high-level architecture of the Hybrid Transactional/Analytical Processing (HTAP) system using Epoxy. The system is designed to handle both transactional and analytical workloads seamlessly, ensuring real-time insights and consistency.
Architectural Diagram from Paper
Simplified View of HTAP Diagram
Walkthrough of the Architectural Structure:
- Client Requests: User actions trigger requests that are received by the Hybrid Transactional/Analytical Processing Database system.
- Transaction Coordinator: A user action (e.g., placing an order, updating a record) triggers the Epoxy layer, which manages the transaction lifecycle and initiates communication with the OLTP database.
- Transactional Database: An Online Transaction Processing (OLTP) database (e.g., MySQL, FoundationDB) processes and stores transactional data in real time.
- Epoxy Shim: Executes local operations and interacts with the Transaction Coordinator to ensure consistency and isolation across data stores.
- Data Updates: The OLTP database processes the transaction, updating relevant records in real-time. This could involve writing to multiple tables depending on the nature of the operation.
- ETL Pipeline: Epoxy replicates the changes through an ETL workflow to a columnar store (e.g., DuckDB) designed for analytical processing. This replication occurs simultaneously, ensuring that both transactional and analytical databases are updated in real time.
- Conflict Resolution: Before finalizing the transaction, Epoxy performs a consistency check to detect any conflicts between the OLTP and OLAP stores. This step is crucial for maintaining data integrity and ensuring that all changes are valid.
- Commit Process: If no conflicts are detected, the transaction is committed in the OLTP database. At this point, changes become visible across all involved data stores, allowing for immediate data access.
- Real-time Analytics: After the commit, users can run analytical queries on the OLAP store. These queries leverage the most up-to-date transactional data, enabling real-time insights and analytics that can inform decision-making and operational adjustments.
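To make the walkthrough concrete, here is a high-level sketch of the flow in Java. All of the interfaces here (OltpStore, OlapStore, EtlPipeline) are hypothetical stand-ins, not the actual Epoxy or etl-demo APIs, and the conflict-resolution and commit steps from the walkthrough are reduced to comments:

```java
// Hypothetical interfaces sketching the request flow described above.
interface OltpStore { void applyWrite(String action); }
interface OlapStore { void applyBatch(Iterable<String> changes); }
interface EtlPipeline { Iterable<String> extractAndTransform(); }

class HtapFlowSketch {
    private final OltpStore oltp;
    private final OlapStore olap;
    private final EtlPipeline etl;

    HtapFlowSketch(OltpStore oltp, OlapStore olap, EtlPipeline etl) {
        this.oltp = oltp;
        this.olap = olap;
        this.etl = etl;
    }

    // Client request -> transaction coordinator -> OLTP write in real time.
    void handleClientRequest(String action) {
        oltp.applyWrite(action);
    }

    // ETL pipeline replicates changes to the OLAP store; in the full design,
    // a consistency check would run before the changes become visible.
    void runEtlBatch() {
        Iterable<String> changes = etl.extractAndTransform();
        olap.applyBatch(changes);
    }
}
```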
Design Implications and Discussion:
The design decisions for the HTAP system focus on leveraging Epoxy's transaction logic to enable fast, batched writes to OLAP data stores without conflicting with the OLTP transaction system. This approach ensures data consistency and integrity across both workloads. Additionally, developing an ETL (Extract-Transform-Load) workflow facilitates efficient data transfer between OLTP and OLAP systems, enhancing real-time analytics capabilities. By prioritizing these elements, the architecture remains robust and adaptable, supporting timely decision-making across various applications while maintaining high performance.
Minimum Viable Product:
- Implement the distributed transaction coordination system or develop a similar middleware for managing database operations across multiple data stores
- Include an OLTP and an OLAP database in the Epoxy layer
- Write an ETL (Extract-Transform-Load) implementation that works with Epoxy and communicates between the two databases
- Test and benchmark our program with complex databases
Stretch Goals:
- Implement S3 as a database
- Improve write performance
- Work with the Massachusetts Open Cloud (MOC)
First, identify whether your databases are local or public. They must be either both local or both public.
If the databases are public:
- Make sure both databases have a public IP, a public port number, and a username and password
- Access the website through this link: https://etl-service-hybrid-tx-analytical-epoxy-31f481.apps.shift.nerc.mghpcc.org/
- For a step-by-step explanation, follow the steps inside the CloudReadMe.md file inside the ETL folder.
If the databases are local:
- Clone this GitHub repository
- Save the Docker image inside the Apiary scripts folder
- Follow the steps inside the ETL README.md file.
If you are interested in using our mock datasets, follow the setup here.
Sprint 1. Find datasets and implement Epoxy
Sprint video | Sprint slides
Sprint 2. Add the two databases onto the Epoxy layer, and add dataset in the OLTP
Sprint video | Sprint slides
Sprint 3. Implement ETL
Sprint video | Sprint slides
Sprint 4. Benchmark against similar ETL implementations
Sprint video | Sprint slides
Sprint 5. Improve write performance
Sprint video | Sprint slides
Final Presentation
Sprint video | Sprint slides
- Epoxy original paper: https://petereliaskraft.net/res/p2732-kraft.pdf
- Epoxy simplified explanation: https://muratbuffalo.blogspot.com/2023/11/epoxy-acid-transactions-across-diverse.html
- Epoxy GitHub: https://github.com/DBOS-project/apiary
- ETL Cloud Website: https://etl-service-hybrid-tx-analytical-epoxy-31f481.apps.shift.nerc.mghpcc.org/