A tool for migrating Apache Iceberg tables from a general-purpose S3 bucket to a target Amazon S3 Tables bucket. Supports complete migration of table structures, data, and snapshot history.
- Complete migration of Apache Iceberg tables
- Preservation of table snapshot history
- Support for point-in-time table state migration
- Automatic collection and validation of table schemas
- Support for large-scale data migration
- Error handling and logging
- Python 3.7+
- Apache Spark 3.4+
- AWS credentials configured with access to the source and target buckets
- Required Python packages:
- pyspark
- boto3
- pandas
Collect all snapshot information from the source tables:

```bash
python 1_collect_snapshots_info.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database>
```
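What step 1 collects can be sketched with Iceberg's `snapshots` metadata table, which Spark exposes for every Iceberg table. A minimal illustration (catalog, database, and output path are placeholders; the actual script's logic may differ):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-snapshots").getOrCreate()

catalog, database = "my_catalog", "my_db"  # placeholder names

tables = [row.tableName for row in spark.sql(f"SHOW TABLES IN {catalog}.{database}").collect()]

snapshot_info = {}
for table in tables:
    # Iceberg exposes snapshot history through the `snapshots` metadata table.
    rows = spark.sql(
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {catalog}.{database}.{table}.snapshots "
        f"ORDER BY committed_at"
    ).collect()
    snapshot_info[table] = [
        {
            "snapshot_id": row.snapshot_id,
            "committed_at": str(row.committed_at),
            "operation": row.operation,
        }
        for row in rows
    ]

with open("snapshot-info.json", "w") as f:
    json.dump(snapshot_info, f, indent=2)
```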
Collect table structure information at a specific timestamp:

```bash
python 2_collect_database_schema_info.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json> \
  --timestamp "<ISO-timestamp>"
```
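A schema snapshot at a point in time can be taken with Iceberg's time-travel read; the resulting DataFrame's schema reflects the table as of that moment. A rough sketch, assuming the `as-of-timestamp` read option (epoch milliseconds) and placeholder names:

```python
import json
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-schema").getOrCreate()

catalog, database, table = "my_catalog", "my_db", "my_table"  # placeholder names
# Iceberg's time-travel read option takes epoch milliseconds.
ts_millis = int(datetime.fromisoformat("2024-01-01T00:00:00").timestamp() * 1000)

# Time-travel read: the DataFrame's schema is the table schema at that timestamp.
df = (
    spark.read
    .option("as-of-timestamp", str(ts_millis))
    .table(f"{catalog}.{database}.{table}")
)

schema_info = {table: json.loads(df.schema.json())}
with open("schema-info.json", "w") as f:
    json.dump(schema_info, f, indent=2)
```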
Create table structures at the target location:

```bash
python 3_create_target_tables.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database> \
  --schema-info-file <path-to-schema-info.json>
```
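Table creation could then replay the captured schemas as `CREATE TABLE ... USING iceberg` statements. A simplified sketch (partitioning and table properties are omitted; all names are placeholders):

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("create-target-tables").getOrCreate()

target_catalog, database = "s3tables_catalog", "my_db"  # placeholder names

with open("schema-info.json") as f:
    schema_info = json.load(f)

for table, schema_json in schema_info.items():
    schema = StructType.fromJson(schema_json)
    # Render "col1 TYPE, col2 TYPE, ..." as a DDL column list.
    columns = ", ".join(f"{f.name} {f.dataType.simpleString()}" for f in schema.fields)
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {target_catalog}.{database}.{table} "
        f"({columns}) USING iceberg"
    )
```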
Validate schema consistency between source and target tables:

```bash
python 4_verify_table_schema.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database>
```
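A minimal version of this check compares the Spark schemas of each table pair, for example (catalog names are placeholders; a production check would also cover partitioning and table properties):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-schema").getOrCreate()

source_catalog, target_catalog, database = "src_catalog", "dst_catalog", "my_db"  # placeholders

tables = [row.tableName for row in spark.sql(f"SHOW TABLES IN {source_catalog}.{database}").collect()]
for table in tables:
    src = spark.table(f"{source_catalog}.{database}.{table}").schema
    dst = spark.table(f"{target_catalog}.{database}.{table}").schema
    # Field names, types, and nullability must match exactly.
    print(f"{'OK' if src == dst else 'MISMATCH'}: {table}")
```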
Execute data migration for a specific point in time:

```bash
python 5_migrate_tables_data.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json>
```
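Conceptually, the migration pins each read to a collected snapshot ID and appends the result into the target table. A sketch using Iceberg's `snapshot-id` read option (names and the choice of snapshot are placeholder assumptions):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migrate-data").getOrCreate()

source_catalog, target_catalog, database = "src_catalog", "dst_catalog", "my_db"  # placeholders

with open("snapshot-info.json") as f:
    snapshot_info = json.load(f)

for table, snapshots in snapshot_info.items():
    snapshot_id = snapshots[-1]["snapshot_id"]  # e.g. the latest collected snapshot
    # Pin the read to a specific Iceberg snapshot, then append into the target table.
    df = (
        spark.read
        .option("snapshot-id", str(snapshot_id))
        .table(f"{source_catalog}.{database}.{table}")
    )
    df.writeTo(f"{target_catalog}.{database}.{table}").append()
```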
Verify data integrity after the tables have been migrated:

```bash
python 6_verify_data_consistency.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json>
```
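The count and checksum comparison might look like the following sketch, where the checksum is an order-independent sum of per-row hashes (a coarse check; names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("verify-data").getOrCreate()

source_catalog, target_catalog, database, table = "src_catalog", "dst_catalog", "my_db", "my_table"  # placeholders

src = spark.table(f"{source_catalog}.{database}.{table}")
dst = spark.table(f"{target_catalog}.{database}.{table}")

# Record-count check.
assert src.count() == dst.count(), "row counts differ"

def checksum(df):
    # Hash every row, then sum the hashes; the result is independent of row order.
    return df.select(F.sum(F.hash(*df.columns)).alias("cs")).first().cs

assert checksum(src) == checksum(dst), "checksums differ"
print("verification passed")
```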
- `--catalog-name`: Name of the Iceberg catalog
- `--warehouse-uri`: S3 warehouse URI
- `--database`: Name of the database to migrate
- `--snapshot-info-file`: Path to the snapshot information file
- `--schema-info-file`: Path to the schema information file
- `--timestamp`: Migration timestamp (ISO format)
- This tool performs point-in-time migration rather than incremental synchronization
- Ensure sufficient S3 storage space before starting
- Test the migration process in a non-production environment first
- Configure appropriate Spark settings for large table migrations (see the sketch after this list)
- Maintain a stable network connection during the migration
- Back up important data before migrating
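As an illustration of the Spark settings note above, a SparkSession wired up for both catalogs might look like this sketch. The catalog names, warehouse locations, memory sizes, and the S3 Tables catalog class (from the Amazon S3 Tables catalog library for Iceberg) are assumptions to adapt:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-migration")
    # Enable Iceberg's SQL extensions (time travel, metadata tables, ...).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Source catalog: Iceberg tables in a general-purpose S3 bucket (placeholder URI).
    .config("spark.sql.catalog.src_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.src_catalog.warehouse", "s3://source-bucket/warehouse")
    # Target catalog: assumed to use the S3 Tables catalog implementation (placeholder ARN).
    .config("spark.sql.catalog.dst_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dst_catalog.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.dst_catalog.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket")
    # Memory headroom for large table migrations; tune to your workload.
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)
```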
1. Snapshot Collection
   - Collects all snapshot information from the source tables
   - Records snapshot timestamps and IDs
2. Schema Collection
   - Gathers table structures and metadata
   - Records partition information and column details
3. Target Creation
   - Creates tables in the target location
   - Maintains the original schema and properties
4. Schema Verification
   - Validates schema consistency
   - Ensures the tables were created correctly
5. Data Migration
   - Transfers data using the specified snapshots
   - Maintains data integrity and history
6. Data Verification
   - Validates record counts
   - Verifies data consistency using checksums
   - Confirms the migration completed successfully
- No support for incremental synchronization
- Migrates table state at a specific point in time
- Requires re-running the entire process to update to a newer state
Common issues and solutions:
- Network connectivity issues: Check AWS credentials and network settings
- Memory errors: Adjust Spark configuration parameters
- Schema mismatch: Verify source and target table structures
- Migration interruption: Restart the migration process
Contributions are welcome! Please feel free to submit issues and pull requests.
For support and questions, please create an issue in the GitHub repository.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.