A streamlined ELT (Extract, Load, Transform) pipeline demonstrating real-time data processing from PostgreSQL to Apache Iceberg using RisingWave as the stream processor, following the medallion architecture pattern.
This project showcases how to build a modern data lake system that follows ELT principles with a clear separation into bronze, silver, and gold layers. The pipeline leverages RisingWave for real-time stream processing and StarRocks as the high-performance query engine.
For a detailed step-by-step tutorial, check out the Medium article.
- PostgreSQL: Source database containing transactional data
- RisingWave: Stream processing platform for real-time data transformation
- Apache Iceberg: Table format for huge analytic datasets
- Apache Amoro: Lakehouse management system with REST Catalog
- StarRocks: High-performance analytical database
- MinIO: S3-compatible object storage
The pipeline follows the medallion architecture with three distinct layers:
- Bronze Layer: Raw data ingestion from PostgreSQL
- Silver Layer: Cleaned and validated data
- Gold Layer: Business-level aggregations and final datasets
- Docker and Docker Compose
- Basic understanding of SQL and data engineering concepts
- Clone the repository:
git clone https://github.com/dwickyferi/iceberg-elt-with-risingwave
cd iceberg-elt-with-risingwave
- Start the services:
docker-compose up -d
- Access the components:
- RisingWave: localhost:4566 (PostgreSQL protocol)
  - Database: dev
  - User: postgres
  - Password: (empty)
- Apache Amoro UI: http://localhost:1630
  - Username: admin
  - Password: admin
- StarRocks: localhost:9030 (MySQL protocol)
  - User: root
  - Password: (empty)
Setup steps:
- Create the sample tables in PostgreSQL (sales_raw and invoice_raw)
- Configure PostgreSQL for CDC (Change Data Capture), e.g. set wal_level = logical
- Create a PostgreSQL CDC source in RisingWave
- Create the corresponding tables in RisingWave
- Implement the data transformations
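The CDC setup above can be sketched in RisingWave SQL (run through psql on port 4566). The hostname, credentials, and column definitions below are assumptions based on the docker-compose defaults and the sales_raw table mentioned earlier; adjust them to match your actual schema:

```sql
-- Sketch: shared CDC source in RisingWave (connection values are assumptions).
CREATE SOURCE pg_cdc WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres',
    port = '5432',
    username = 'postgres',
    password = 'postgres',
    database.name = 'dev'
);

-- Ingest one upstream table through the shared CDC source.
-- The column list here is illustrative, not the project's real schema.
CREATE TABLE sales_raw (
    sale_id INT PRIMARY KEY,
    product_name VARCHAR,
    quantity INT,
    sale_date DATE
) FROM pg_cdc TABLE 'public.sales_raw';
```

A table created this way stays continuously in sync with the PostgreSQL source, which is what the bronze layer builds on.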
Bronze layer:
- Direct ingestion of raw data
- No transformations applied
- Source system data preserved as-is
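Since the bronze layer applies no transformations, it can be a straight pass-through of the CDC table. A minimal sketch, assuming the sales_raw CDC table exists in RisingWave:

```sql
-- Bronze: expose the raw CDC data unchanged as a materialized view.
CREATE MATERIALIZED VIEW bronze_sales AS
SELECT * FROM sales_raw;
```

Keeping an untouched bronze copy means downstream layers can be rebuilt without re-ingesting from PostgreSQL.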
Silver layer:
- Data cleaning and validation
- Standardization of fields
- Basic transformations and joins
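A silver view might clean and standardize the bronze data like the sketch below. The specific columns and validation rules are illustrative assumptions, not the project's actual logic:

```sql
-- Silver: cleaned and validated view built on bronze.
CREATE MATERIALIZED VIEW silver_sales AS
SELECT
    sale_id,
    LOWER(TRIM(product_name)) AS product_name,  -- standardize text fields
    quantity,
    sale_date
FROM bronze_sales
WHERE quantity > 0                 -- drop invalid rows
  AND product_name IS NOT NULL;    -- basic validation
```

Because this is a materialized view, RisingWave keeps it incrementally updated as new CDC events arrive.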
Gold layer:
- Business-level aggregations
- Summary tables for reporting
- Optimized for analytical queries
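The gold layer can then aggregate the silver view and sink the result into Iceberg. The aggregation below is a sketch; the sink's catalog URI, warehouse path, and database/table names are assumptions for this stack and must match your Amoro and MinIO configuration:

```sql
-- Gold: business-level aggregation, optimized for analytical queries.
CREATE MATERIALIZED VIEW gold_daily_sales AS
SELECT
    sale_date,
    product_name,
    SUM(quantity) AS total_quantity
FROM silver_sales
GROUP BY sale_date, product_name;

-- Sink the gold view into an Iceberg table via the REST catalog.
-- All connection values below are assumptions, not verified endpoints.
CREATE SINK gold_daily_sales_sink FROM gold_daily_sales
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'sale_date,product_name',
    catalog.type = 'rest',
    catalog.uri = 'http://amoro:1630/api/iceberg/rest',
    warehouse.path = 's3://warehouse',
    database.name = 'gold',
    table.name = 'daily_sales'
);
```

The upsert sink keeps the Iceberg table consistent with the materialized view as aggregates change.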
StarRocks integration:
- Configure an external catalog pointing at the Iceberg tables
- Set up table access
- Implement analytical queries
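The external-catalog step can be sketched in StarRocks SQL (connect on port 9030 over the MySQL protocol). The catalog name, REST URI, and MinIO credentials below are assumptions tied to this docker-compose stack:

```sql
-- Attach the Iceberg REST catalog to StarRocks (values are assumptions).
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://amoro:1630/api/iceberg/rest",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "admin",
    "aws.s3.secret_key" = "password",
    "aws.s3.enable_path_style_access" = "true"
);

-- Query an Iceberg table through the external catalog
-- (database and table names are illustrative).
SELECT * FROM iceberg_catalog.gold.daily_sales LIMIT 10;
```

Once the catalog is attached, gold-layer tables are queryable from StarRocks without copying any data.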
For complete implementation details and step-by-step instructions, refer to the SQL scripts in the sql/ directory:
- data_raw.sql: Initial data setup
- bronze.sql: Bronze layer implementation
- silver.sql: Silver layer transformations
- gold.sql: Gold layer aggregations
- starrocks.sql: StarRocks integration
Contributions are welcome! Please feel free to submit a Pull Request.
For any questions or suggestions, please create an issue in the repository.