A Java-based tool for migrating Delta Lake tables to Apache Iceberg format using AWS Glue Catalog for metadata management and S3 for storage.
This tool helps you migrate existing Delta Lake tables stored in S3 to Apache Iceberg format while maintaining all data and metadata through AWS Glue Catalog. It's designed to handle large-scale migrations with batch processing capabilities using CSV configuration files.
Key features:

- Batch Migration: Process multiple tables defined in CSV files
- AWS Glue Integration: Automatic table registration in Glue Catalog
- S3 Storage: Seamless migration between S3 locations
- Data Validation: Count verification before and after migration
- Atomic Operations: Safe migration with temporary tables
- Logging: Comprehensive logging for monitoring and debugging
Prerequisites:

- Java 11+
- Apache Spark 3.4+
- AWS CLI configured with appropriate permissions
- Access to AWS S3 and Glue services
Your AWS credentials need permissions for:
- S3: Read/Write/Delete on source and target buckets
- Glue: Create/Read/Update/Delete tables and databases
- CloudWatch: Logs (optional, for monitoring)
```bash
git clone <repository-url>
cd migracao_delta_iceberg
mvn clean package
```

Set the required AWS environment variables:

```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=123456789012
```

Create a CSV file with your table mappings (see sample_models.csv):
```csv
table_name,partition_column
customer_data,created_date
order_history,order_date
product_catalog,last_updated
```
Run the migration:

```bash
java -cp target/migracao_delta_iceberg-1.0-SNAPSHOT.jar com.example.app.DeltaToIcebergMigration
```

The migration tool reads table configurations from CSV files with the following format:
| Column | Description | Example |
|---|---|---|
| `table_name` | Name of the Delta table to migrate | `customer_data` |
| `partition_column` | Column used for partitioning (optional) | `created_date` |
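For illustration, parsing such a file needs nothing beyond the standard library. The sketch below is a hypothetical reader, not the tool's actual implementation; the `TableConfig` class is invented here for clarity:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ModelCsvReader {
    /** Hypothetical holder for one row of models.csv; the tool's internal model may differ. */
    static class TableConfig {
        final String tableName;
        final String partitionColumn; // empty when the table is unpartitioned
        TableConfig(String tableName, String partitionColumn) {
            this.tableName = tableName;
            this.partitionColumn = partitionColumn;
        }
    }

    /** Reads "table_name,partition_column" rows, skipping the header line. */
    static List<TableConfig> read(String csvPath) throws IOException {
        List<TableConfig> configs = new ArrayList<>();
        List<String> lines = Files.readAllLines(Paths.get(csvPath));
        for (int i = 1; i < lines.size(); i++) { // i = 1 skips the header
            String line = lines.get(i).trim();
            if (line.isEmpty()) continue;
            String[] cols = line.split(",", -1);
            configs.add(new TableConfig(cols[0].trim(),
                    cols.length > 1 ? cols[1].trim() : ""));
        }
        return configs;
    }
}
```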
The tool automatically configures Spark with optimized settings for Delta/Iceberg operations:
- Memory allocation: 10GB driver memory
- Catalog integration: Both Delta and Iceberg catalogs
- S3 optimization: Vectorized reading disabled for compatibility
- Date handling: Corrected rebase mode for date/timestamp columns
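As a sketch, a session along these lines would cover the settings above. The catalog name, warehouse path, and method of setting driver memory are assumptions for illustration; the tool's real configuration lives in `DeltaToIcebergMigration.java`:

```java
import org.apache.spark.sql.SparkSession;

public class SparkSetup {
    static SparkSession build(String warehouse) {
        // warehouse is an illustrative S3 path, e.g. "s3://my-bucket/warehouse"
        return SparkSession.builder()
                .appName("delta-to-iceberg-migration")
                .config("spark.driver.memory", "10g")
                // Register both extension sets so Delta reads and Iceberg writes coexist
                .config("spark.sql.extensions",
                        "io.delta.sql.DeltaSparkSessionExtension,"
                      + "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                // Delta handles the default session catalog
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                // Iceberg catalog backed by AWS Glue
                .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.iceberg.catalog-impl",
                        "org.apache.iceberg.aws.glue.GlueCatalog")
                .config("spark.sql.catalog.iceberg.warehouse", warehouse)
                // Compatibility settings mentioned above
                .config("spark.sql.parquet.enableVectorizedReader", "false")
                .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
                .getOrCreate();
    }
}
```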
```bash
# Migrate all tables defined in models.csv
java -cp target/migracao_delta_iceberg-1.0-SNAPSHOT.jar com.example.app.DeltaToIcebergMigration
```

Use the included utility to inspect your Glue Catalog tables:

```bash
java -cp target/migracao_delta_iceberg-1.0-SNAPSHOT.jar com.example.app.ListGlueTablesDetailed
```

This utility helps you:
- Verify successful migrations
- Check table metadata
- Confirm Iceberg table format
- Inspect storage locations and parameters
Each migration runs through the following steps:

1. Read Configuration: Load table definitions from CSV
2. Delta Reading: Read the existing Delta table from S3
3. Temporary Creation: Create a temporary Iceberg table
4. Data Transfer: Copy data to Iceberg format
5. Original Cleanup: Remove the original Delta files
6. Final Creation: Create the final Iceberg table
7. Validation: Verify record counts match
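The staging-and-validation core (steps 2-4 and 7) can be sketched with Spark's `DataFrameWriterV2` API. This is a simplified illustration rather than the tool's exact code; it assumes the default `iceberg` catalog and `data_lake_snapshot` database described below:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MigrationSketch {
    static void migrate(SparkSession spark, String deltaPath, String tableName) {
        // Step 2: read the existing Delta table from S3
        Dataset<Row> source = spark.read().format("delta").load(deltaPath);
        long expected = source.count();

        // Steps 3-4: stage the data as a temporary Iceberg table
        String temp = "iceberg.data_lake_snapshot." + tableName + "_tmp";
        source.writeTo(temp).using("iceberg").createOrReplace();

        // Step 7: verify record counts match before anything destructive
        long actual = spark.table(temp).count();
        if (actual != expected) {
            throw new IllegalStateException(
                "Count mismatch for " + tableName + ": " + expected + " vs " + actual);
        }
        // Steps 5-6 (removing the original Delta files and creating the
        // final Iceberg table) are omitted from this sketch.
    }
}
```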
Project layout:

```text
migracao_delta_iceberg/
├── src/main/java/com/example/app/
│   ├── DeltaToIcebergMigration.java   # Main migration logic
│   ├── ListGlueTablesDetailed.java    # Table inspection utility
│   └── util/
│       └── Logger.java                # Custom logging utility
├── src/main/resources/
│   └── META-INF/services/
│       └── org.apache.iceberg.catalog.CatalogPlugin
├── sample_models.csv                  # Example configuration
├── pom.xml                            # Maven dependencies
└── README.md                          # This file
```
The tool uses environment variables to construct S3 paths:
```bash
# Source bucket pattern
s3://data-lake-snapshot-{AWS_ACCOUNT_ID}-{AWS_REGION}

# Target bucket (same as source by default)
s3://data-lake-snapshot-{AWS_ACCOUNT_ID}-{AWS_REGION}
```

Default configuration:

- Catalog: `iceberg`
- Database: `data_lake_snapshot`

Modify these in the source code if needed for your environment.
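A minimal sketch of deriving that bucket name at runtime (the class name here is illustrative):

```java
public class BucketResolver {
    /** Builds "s3://data-lake-snapshot-{AWS_ACCOUNT_ID}-{AWS_REGION}". */
    static String snapshotBucket() {
        String accountId = System.getenv("AWS_ACCOUNT_ID"); // e.g. 123456789012
        String region = System.getenv("AWS_REGION");        // e.g. us-east-1
        if (accountId == null || region == null) {
            throw new IllegalStateException(
                    "AWS_ACCOUNT_ID and AWS_REGION must be set");
        }
        return String.format("s3://data-lake-snapshot-%s-%s", accountId, region);
    }
}
```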
For large tables, adjust the Spark memory settings in the code:

```java
.config("spark.driver.memory", "20g")          // Increase for large schemas
.config("spark.executor.memory", "8g")         // Add if using cluster mode
.config("spark.sql.adaptive.enabled", "true")  // Enable adaptive query execution
```

The tool provides detailed logging for each migration step:
- INFO: Progress updates and success messages
- WARN: Non-critical issues and warnings
- ERROR: Migration failures with stack traces
- DEBUG: Detailed operation logs (enable as needed)
- Permission Errors: Verify AWS credentials and policies
- Memory Issues: Increase driver memory for large tables
- Schema Conflicts: Check for incompatible data types
- Network Timeouts: Configure appropriate timeout values
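For the network timeout case, one option is to raise Spark's network settings alongside the memory settings shown earlier; the values below are illustrative starting points, not the tool's defaults:

```java
.config("spark.network.timeout", "600s")           // Spark's default is 120s
.config("spark.executor.heartbeatInterval", "60s") // must stay well below spark.network.timeout
```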
After migration, verify your tables:
```bash
# Check the table exists in Glue
aws glue get-table --database-name data_lake_snapshot --name your_table_name

# Verify Iceberg format
java -cp target/migracao_delta_iceberg-1.0-SNAPSHOT.jar com.example.app.ListGlueTablesDetailed
```

Best practices:

- Batch Size: Process tables in smaller batches for better error handling
- Partitioning: Ensure proper partition column selection for query performance
- Resource Allocation: Monitor memory usage and adjust as needed
- Network: Use same AWS region for all resources
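Beyond the CLI check above, the Glue table's parameters can also be inspected with the AWS SDK for Java v2; Iceberg tables registered through the Glue catalog carry a `table_type=ICEBERG` parameter. A sketch, with example database and table names:

```java
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;

public class VerifyIcebergFormat {
    public static void main(String[] args) {
        try (GlueClient glue = GlueClient.create()) {
            Table table = glue.getTable(GetTableRequest.builder()
                            .databaseName("data_lake_snapshot")
                            .name("customer_data") // example table name
                            .build())
                    .table();
            // Iceberg tables registered in Glue carry table_type=ICEBERG
            System.out.println("table_type = " + table.parameters().get("table_type"));
            System.out.println("location   = " + table.storageDescriptor().location());
        }
    }
}
```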
Contributions are welcome:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
This project is open source and available under the MIT License.
For issues and questions:
- Check existing issues in the GitHub repository
- Create a new issue with detailed information
- Include logs and configuration details
Note: This tool modifies S3 data by deleting original Delta files. Always backup your data before running migrations in production environments.