- Overview
- Key Benefits
- Components and Architecture
- Getting Started
- Using the Integration
- Implementing Hot/Cold Data Strategy
- Project Structure
- Troubleshooting
- Advanced Topics
- References
- License
## Overview

This project demonstrates how to integrate ClickHouse with Apache Iceberg tables using ClickHouse's DataLakeCatalog connector. It provides a complete end-to-end setup where you can query and analyze data stored in Iceberg tables directly from ClickHouse without any ETL processes.
- **Apache Iceberg**: An open table format for huge analytic datasets
- **ClickHouse**: A high-performance columnar database
- **Integration**: ClickHouse connecting to Iceberg tables through DataLakeCatalog
> **Note**: This integration requires ClickHouse version 25.3 or later. The DataLakeCatalog feature is marked as experimental in these versions.
## Key Benefits

- **Cost Optimization**: Store historical or cold data in object storage (S3/MinIO) while maintaining query capabilities
- **Query Federation**: Use ClickHouse's powerful analytics on both hot data (in ClickHouse native tables) and cold data (in Iceberg)
- **Unified Data Access**: Query across multiple storage tiers with a single interface
## Components and Architecture

This project sets up a Docker-based environment with the following components:
- **ClickHouse Server**: Database engine for querying Iceberg data
- **Iceberg REST Catalog**: HTTP service for managing Iceberg metadata (using `tabulario/iceberg-rest`)
- **MinIO**: S3-compatible object storage for the actual data files
- **Python Utilities**: Tools for schema creation and data generation
```
┌─────────────────────────────────────────────────────┐
│                                                     │
│   ┌───────────────┐          ┌────────────────┐     │
│   │  ClickHouse   │──────────│  Iceberg REST  │     │
│   │    Server     │          │    Catalog     │     │
│   └───────────────┘          └────────────────┘     │
│           │                          │              │
│           │                          ▼              │
│           │                  ┌────────────────┐     │
│           │                  │    MinIO/S3    │     │
│           │                  │  Object Store  │     │
│           │                  └────────────────┘     │
│           │                          ▲              │
│           └──────────────────────────┘              │
│                Direct data access                   │
│                                                     │
└─────────────────────────────────────────────────────┘
```
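The Python utilities above generate the sample telemetry that lands in the Iceberg tables. As a rough sketch of what such a generator might look like (the function and its defaults are hypothetical; the field names mirror the `iot_battery.battery_v2` columns queried later in this README):

```python
import random
from datetime import datetime, timedelta, timezone

def generate_battery_readings(n, serials=("battery-01", "battery-02"), seed=42):
    """Generate n synthetic battery telemetry rows.

    Field names mirror the iot_battery.battery_v2 columns used in the
    example queries; the value ranges here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    rows = []
    for i in range(n):
        rows.append({
            "event_time": start + timedelta(minutes=i),
            "battery_serial": rng.choice(serials),
            "state_of_charge": round(rng.uniform(0.0, 100.0), 2),   # percent
            "state_of_health": round(rng.uniform(80.0, 100.0), 2),  # percent
        })
    return rows

if __name__ == "__main__":
    for row in generate_battery_readings(5):
        print(row["event_time"], row["battery_serial"], row["state_of_charge"])
```

Rows shaped like this can then be written to the Iceberg table via PyIceberg or any Parquet writer pointed at the warehouse bucket.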
## Getting Started

### Prerequisites

- Docker and Docker Compose
- Make (optional but recommended)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/yourname/clickhouse-iceberg.git
   cd clickhouse-iceberg
   ```

2. Start all services:

   ```bash
   make start
   ```

   This will start the ClickHouse, Iceberg REST Catalog, and MinIO services.

3. Check service status:

   ```bash
   docker ps
   ```

4. Wait for initialization (services may take a minute to fully initialize).
| Service | Access Method | URL/Port |
|---|---|---|
| ClickHouse | CLI client | `make clickhouse-client` |
| ClickHouse HTTP | Browser/curl | http://localhost:8123 |
| MinIO Console | Browser | http://localhost:9001 (login: `minio-root-user` / `minio-root-password`) |
| Iceberg REST API | HTTP | http://localhost:8181 |
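Before moving on, it can help to confirm each service responds. A quick sketch using each service's standard health/ping endpoint, assuming the default ports above:

```bash
# ClickHouse: should return "Ok."
curl http://localhost:8123/ping

# Iceberg REST catalog: returns the catalog configuration as JSON
curl http://localhost:8181/v1/config

# MinIO liveness check (served on the S3 API port)
curl -i http://localhost:9002/minio/health/live
```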
## Using the Integration

Once connected to ClickHouse, create a database connection to Iceberg:
```sql
-- Enable the experimental feature first
SET allow_experimental_database_iceberg = 1;

-- Create a connection to the Iceberg catalog
CREATE DATABASE iceberg_catalog
ENGINE = DataLakeCatalog('http://iceberg-catalog:8181', 'minio-root-user', 'minio-root-password')
SETTINGS
    catalog_type = 'rest',
    warehouse = 'warehouse',
    storage_endpoint = 'http://minio:9000/warehouse';

-- List available tables in the Iceberg catalog
SHOW TABLES FROM iceberg_catalog;
```
```sql
-- Basic query
SELECT * FROM iceberg_catalog.`iot_battery.battery_v2` LIMIT 10;

-- Structured query with filters
SELECT
    event_time,
    battery_serial,
    state_of_charge,
    state_of_health
FROM iceberg_catalog.`iot_battery.battery_v2`
WHERE battery_serial = 'battery-01'
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY event_time DESC;

-- View the table schema
DESCRIBE TABLE iceberg_catalog.`iot_battery.battery_v2`;

-- Show the table creation statement
SHOW CREATE TABLE iceberg_catalog.`iot_battery.battery_v2`;
```
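The hot/cold strategy mentioned earlier boils down to querying both tiers in one statement. A hedged sketch, assuming a native MergeTree table named `battery_hot` with the same columns (that table name is hypothetical and not part of this project's setup):

```sql
-- Recent ("hot") rows from a native ClickHouse table,
-- historical ("cold") rows from the Iceberg table, in one result
SELECT event_time, battery_serial, state_of_charge
FROM battery_hot
WHERE event_time >= now() - INTERVAL 7 DAY

UNION ALL

SELECT event_time, battery_serial, state_of_charge
FROM iceberg_catalog.`iot_battery.battery_v2`
WHERE event_time < now() - INTERVAL 7 DAY;
```

The non-overlapping time predicates keep each row from appearing twice across tiers.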
## Project Structure

```
clickhouse-iceberg/
├── Makefile                          # Main control file
├── clickhouse/                       # ClickHouse configuration
│   ├── Makefile                      # ClickHouse service commands
│   ├── config/                       # Server config
│   ├── docker-compose.yaml           # ClickHouse service definition
│   ├── how_query.sql                 # Example queries
│   ├── migrations/                   # SQL migrations
│   │   └── 00004_create_lakekeeper_catalog.sql
│   └── users/                        # User settings
├── iceberg/                          # Iceberg catalog service
│   ├── docker-compose.yaml           # Iceberg service definition
│   └── python/                       # Schema creation scripts
│       ├── Dockerfile                # Python environment for Iceberg tools
│       └── create_battery_schema.py  # Data creation script
├── minio/                            # MinIO configuration
│   └── docker-compose.yaml           # MinIO service definition
└── data/                             # Persistent data storage
    ├── clickhouse/                   # ClickHouse data files
    ├── iceberg/                      # Iceberg metadata logs
    └── minio/                        # MinIO object storage data
```
### ClickHouse Server

The ClickHouse server is configured with:

- HTTP interface on port 8123
- Native protocol on port 9000
- Custom configurations in the `./clickhouse/config` directory
- Database migrations in `./clickhouse/migrations`

Key configuration files:

- `users.xml`: Authentication and user permissions
- `config.xml`: Server configuration settings
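For orientation, a minimal sketch of what a `users.xml` entry typically looks like in ClickHouse (the actual contents of this project's file may differ):

```xml
<clickhouse>
    <users>
        <default>
            <!-- Empty password: local development only -->
            <password></password>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>
    </users>
</clickhouse>
```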
### Iceberg REST Catalog

The Iceberg REST Catalog:

- Uses the `tabulario/iceberg-rest` Docker image (note: the source for this image is now maintained at `databricks/iceberg-rest-image`)
- Exposes an HTTP API on port 8181
- Manages Iceberg table metadata
- Connects to MinIO for storage
- Is configured to use the `warehouse` bucket
### MinIO

MinIO provides:

- S3-compatible storage on port 9002
- Web console on port 9001 (login: `minio-root-user` / `minio-root-password`)
- Storage for the actual data files in Parquet format
- Configuration with user `minio-root-user`

Key buckets:

- `warehouse`: Main storage for Iceberg data files (browse via http://localhost:9001)
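To inspect the bucket from the command line instead of the web console, the MinIO client (`mc`) can be pointed at the S3 endpoint. A sketch assuming the ports and credentials above (the alias name `local` is arbitrary):

```bash
# Register the local MinIO endpoint under an alias
mc alias set local http://localhost:9002 minio-root-user minio-root-password

# List the Iceberg data and metadata files in the warehouse bucket
mc ls --recursive local/warehouse
```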
The project includes a sample IoT battery dataset, created by `./iceberg/python/create_battery_schema.py`.

To follow service logs:

```bash
make iceberg-logs
make minio-logs
make clickhouse-logs
```
## Troubleshooting
### Common Issues
1. **Services not starting properly**
- Check Docker process: `docker ps -a`
- Ensure no port conflicts with existing services
- Check network connectivity between containers: `docker network inspect iceberg_network`
2. **Cannot connect to ClickHouse**
- Verify ClickHouse is running: `docker ps | grep clickhouse`
- Check if ports are exposed correctly: `netstat -tuln | grep 8123`
- Try connecting directly to container: `docker exec -it clickhouse-server clickhouse-client`
3. **Tables not appearing in Iceberg catalog**
- Ensure the Iceberg REST service is healthy: `curl http://localhost:8181/v1/namespaces`
- Check MinIO is accessible and the warehouse bucket exists
- Review Iceberg schema creation logs: `docker logs iceberg-catalog`
4. **"Database not found" errors**
- Ensure the experimental flag is enabled: `SET allow_experimental_database_iceberg = 1;`
- Check catalog connection parameters
- Verify MinIO credentials are correct
### Logs
- **ClickHouse logs**: `make clickhouse-logs`
- **Iceberg catalog logs**: `docker logs iceberg-catalog`
- **MinIO logs**: `docker logs minio`
- **Schema creation logs**: Check the files in `./data/iceberg/schema-logs/`
## Advanced Topics
### Extending the Schema
To create additional tables in the Iceberg catalog, modify the Python schema creation script at `./iceberg/python/create_battery_schema.py`. For example:
```python
from pyiceberg.partitioning import PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType, TimestampType

# Define a new schema for vehicle data
vehicle_schema = Schema(
    NestedField(1, "vehicle_id", StringType(), required=True),
    NestedField(2, "timestamp", TimestampType(), required=True),
    NestedField(3, "speed", DoubleType(), required=False),
    NestedField(4, "location", StringType(), required=False),
)

# Create the (unpartitioned) table in the catalog
catalog.create_table(
    identifier=f"{namespace}.vehicles",
    schema=vehicle_schema,
    partition_spec=PartitionSpec(),
)
```
### Performance Considerations

For larger datasets:

- **Partitioning**: Add appropriate partitioning to Iceberg tables

  ```python
  # Partition by day for time-series data
  partition_spec = PartitionSpec(
      PartitionField(source_id=1, field_id=100, transform=DayTransform(), name="event_day")
  )
  ```
## References

- ClickHouse Documentation
- ClickHouse Data Lake Connector
- Apache Iceberg Documentation
- Iceberg REST Catalog API
- tabulario/iceberg-rest
- MinIO Documentation
- PyIceberg Documentation
## License

This project is licensed under the MIT License - see the LICENSE file for details.