A tool for migrating Apache Iceberg tables from a general-purpose S3 bucket to a target Amazon S3 Tables bucket. Supports complete migration of table structures, data, and snapshot history.
- Complete migration of Apache Iceberg tables
- Preservation of table snapshot history
- Support for point-in-time table state migration
- Automatic collection and validation of table schemas
- Support for large-scale data migration
- Error handling and logging
- Python 3.7+
- Apache Spark 3.4+
- AWS credentials configured with access to the source and target buckets
- Required Python packages:
- pyspark
- boto3
- pandas
Collect all snapshot information from the source tables:

```bash
python 1_collect_snapshots_info.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database>
```
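What step 1 collects can be sketched with Iceberg's `snapshots` metadata table, which Spark exposes for every Iceberg table. A minimal illustration (catalog, database, and output path are placeholders; the actual script's logic may differ):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-snapshots").getOrCreate()

catalog, database = "my_catalog", "my_db"  # placeholder names

tables = [row.tableName for row in spark.sql(f"SHOW TABLES IN {catalog}.{database}").collect()]

snapshot_info = {}
for table in tables:
    # Iceberg exposes snapshot history through the `snapshots` metadata table.
    rows = spark.sql(
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {catalog}.{database}.{table}.snapshots "
        f"ORDER BY committed_at"
    ).collect()
    snapshot_info[table] = [
        {
            "snapshot_id": row.snapshot_id,
            "committed_at": str(row.committed_at),
            "operation": row.operation,
        }
        for row in rows
    ]

with open("snapshot-info.json", "w") as f:
    json.dump(snapshot_info, f, indent=2)
```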
Collect table structure information at a specific timestamp:

```bash
python 2_collect_database_schema_info.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json> \
  --timestamp "<ISO-timestamp>"
```
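A schema snapshot at a point in time can be taken with Iceberg's time-travel read; the resulting DataFrame's schema reflects the table as of that moment. A rough sketch, assuming the `as-of-timestamp` read option (epoch milliseconds) and placeholder names:

```python
import json
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-schema").getOrCreate()

catalog, database, table = "my_catalog", "my_db", "my_table"  # placeholder names
# Iceberg's time-travel read option takes epoch milliseconds.
ts_millis = int(datetime.fromisoformat("2024-01-01T00:00:00").timestamp() * 1000)

# Time-travel read: the DataFrame's schema is the table schema at that timestamp.
df = (
    spark.read
    .option("as-of-timestamp", str(ts_millis))
    .table(f"{catalog}.{database}.{table}")
)

schema_info = {table: json.loads(df.schema.json())}
with open("schema-info.json", "w") as f:
    json.dump(schema_info, f, indent=2)
```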
Create table structures at the target location:

```bash
python 3_create_target_tables.py \
  --catalog-name <catalog-name> \
  --warehouse-uri <warehouse-uri> \
  --database <database> \
  --schema-info-file <path-to-schema-info.json>
```
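Table creation could then replay the captured schemas as `CREATE TABLE ... USING iceberg` statements. A simplified sketch (partitioning and table properties are omitted; all names are placeholders):

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("create-target-tables").getOrCreate()

target_catalog, database = "s3tables_catalog", "my_db"  # placeholder names

with open("schema-info.json") as f:
    schema_info = json.load(f)

for table, schema_json in schema_info.items():
    schema = StructType.fromJson(schema_json)
    # Render "col1 TYPE, col2 TYPE, ..." as a DDL column list.
    columns = ", ".join(f"{f.name} {f.dataType.simpleString()}" for f in schema.fields)
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {target_catalog}.{database}.{table} "
        f"({columns}) USING iceberg"
    )
```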
Validate schema consistency between source and target tables:

```bash
python 4_verify_table_schema.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database>
```
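A minimal version of this check compares the Spark schemas of each table pair, for example (catalog names are placeholders; a production check would also cover partitioning and table properties):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-schema").getOrCreate()

source_catalog, target_catalog, database = "src_catalog", "dst_catalog", "my_db"  # placeholders

tables = [row.tableName for row in spark.sql(f"SHOW TABLES IN {source_catalog}.{database}").collect()]
for table in tables:
    src = spark.table(f"{source_catalog}.{database}.{table}").schema
    dst = spark.table(f"{target_catalog}.{database}.{table}").schema
    # Field names, types, and nullability must match exactly.
    print(f"{'OK' if src == dst else 'MISMATCH'}: {table}")
```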
Execute data migration for a specific point in time:

```bash
python 5_migrate_tables_data.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json>
```
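Conceptually, the migration pins each read to a collected snapshot ID and appends the result into the target table. A sketch using Iceberg's `snapshot-id` read option (names and the choice of snapshot are placeholder assumptions):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migrate-data").getOrCreate()

source_catalog, target_catalog, database = "src_catalog", "dst_catalog", "my_db"  # placeholders

with open("snapshot-info.json") as f:
    snapshot_info = json.load(f)

for table, snapshots in snapshot_info.items():
    snapshot_id = snapshots[-1]["snapshot_id"]  # e.g. the latest collected snapshot
    # Pin the read to a specific Iceberg snapshot, then append into the target table.
    df = (
        spark.read
        .option("snapshot-id", str(snapshot_id))
        .table(f"{source_catalog}.{database}.{table}")
    )
    df.writeTo(f"{target_catalog}.{database}.{table}").append()
```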
Verify data integrity after the tables have been migrated:

```bash
python 6_verify_data_consistency.py \
  --source-catalog <source-catalog> \
  --target-catalog <target-catalog> \
  --database <database> \
  --snapshot-info-file <path-to-snapshot-info.json>
```
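The count and checksum comparison might look like the following sketch, where the checksum is an order-independent sum of per-row hashes (a coarse check; names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("verify-data").getOrCreate()

source_catalog, target_catalog, database, table = "src_catalog", "dst_catalog", "my_db", "my_table"  # placeholders

src = spark.table(f"{source_catalog}.{database}.{table}")
dst = spark.table(f"{target_catalog}.{database}.{table}")

# Record-count check.
assert src.count() == dst.count(), "row counts differ"

def checksum(df):
    # Hash every row, then sum the hashes; the result is independent of row order.
    return df.select(F.sum(F.hash(*df.columns)).alias("cs")).first().cs

assert checksum(src) == checksum(dst), "checksums differ"
print("verification passed")
```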
- `--catalog-name`: Name of the Iceberg catalog
- `--warehouse-uri`: S3 warehouse URI
- `--database`: Name of the database to migrate
- `--snapshot-info-file`: Path to the snapshot information file
- `--schema-info-file`: Path to the schema information file
- `--timestamp`: Migration timestamp (ISO format)
- This tool performs point-in-time migration rather than incremental synchronization
- Ensure sufficient S3 storage space before starting
- Test the migration process in a non-production environment first
- Configure appropriate Spark settings for large table migrations (see the sketch after this list)
- Maintain a stable network connection during the migration
- Back up important data before migrating
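As an illustration of the Spark settings note above, a SparkSession wired up for both catalogs might look like this sketch. The catalog names, warehouse locations, memory sizes, and the S3 Tables catalog class (from the Amazon S3 Tables catalog library for Iceberg) are assumptions to adapt:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-migration")
    # Enable Iceberg's SQL extensions (time travel, metadata tables, ...).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Source catalog: Iceberg tables in a general-purpose S3 bucket (placeholder URI).
    .config("spark.sql.catalog.src_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.src_catalog.warehouse", "s3://source-bucket/warehouse")
    # Target catalog: assumed to use the S3 Tables catalog implementation (placeholder ARN).
    .config("spark.sql.catalog.dst_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dst_catalog.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.dst_catalog.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket")
    # Memory headroom for large table migrations; tune to your workload.
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)
```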
1. Snapshot Collection
   - Collects all snapshot information from the source tables
   - Records snapshot timestamps and IDs
2. Schema Collection
   - Gathers table structures and metadata
   - Records partition information and column details
3. Target Creation
   - Creates tables in the target location
   - Maintains the original schema and properties
4. Schema Verification
   - Validates schema consistency
   - Ensures the tables were created correctly
5. Data Migration
   - Transfers data using the specified snapshots
   - Maintains data integrity and history
6. Data Verification
   - Validates record counts
   - Verifies data consistency using checksums
   - Confirms the migration completed successfully
- No support for incremental synchronization
- Migrates table state at a specific point in time
- Requires re-running the entire process to update to a newer state
Common issues and solutions:
- Network connectivity issues: Check AWS credentials and network settings
- Memory errors: Adjust Spark configuration parameters
- Schema mismatch: Verify source and target table structures
- Migration interruption: Restart the migration process
Contributions are welcome! Please feel free to submit issues and pull requests.
For support and questions, please create an issue in the GitHub repository.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.