WIP: Adding support for Delta to Iceberg migration #5331

Closed

Conversation

ericlgoodman
Contributor

This is a WIP PR (I'm still adding unit tests and testing edge cases), but I wanted to put it out here to get initial feedback.

It adds support for migrating a Delta Lake table to an Iceberg table. Currently this only creates a new table and effectively compresses the history of the Delta Lake table into a single commit. In the future, we can add functionality to optionally carry over the history of the Delta Lake table.

It also assumes that the user's current Delta Lake table resides in what has now been configured to be an Iceberg catalog. I'm thinking of adding an optional destinationCatalog for scenarios where users are attempting to move between catalogs.

Contributor

@amogh-jahagirdar left a comment

Left some comments and questions. I'd also suggest rebasing and running spotlessApply so the discussion can focus more on the fundamental aspects and less on style nits.


private final long numFilesImported;

public BaseMigrateDeltaLakeTableActionResult(long numFilesImported) {
Contributor

Would it make sense to capture aspects like the total data size and how long the migration procedure took in the result?
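
For illustration, a minimal sketch of what such an extended result could look like; the extra fields and their names are hypothetical, not part of this PR:

// Hypothetical extension of the result class; totalDataSizeBytes and durationMs are
// illustrative names, not part of this PR.
public class BaseMigrateDeltaLakeTableActionResult {
  private final long numFilesImported;
  private final long totalDataSizeBytes; // e.g. the sum of AddFile sizes from the Delta snapshot
  private final long durationMs;         // wall-clock time the migration procedure took

  public BaseMigrateDeltaLakeTableActionResult(
      long numFilesImported, long totalDataSizeBytes, long durationMs) {
    this.numFilesImported = numFilesImported;
    this.totalDataSizeBytes = totalDataSizeBytes;
    this.durationMs = durationMs;
  }

  public long numFilesImported() {
    return numFilesImported;
  }

  public long totalDataSizeBytes() {
    return totalDataSizeBytes;
  }

  public long durationMs() {
    return durationMs;
  }
}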

@@ -293,7 +293,7 @@ private static PartitionSpec identitySpec(Schema schema, Collection<Column> colu
return identitySpec(schema, names);
}

private static PartitionSpec identitySpec(Schema schema, List<String> partitionNames) {
public static PartitionSpec identitySpec(Schema schema, List<String> partitionNames) {
Contributor

Why is the change in access modifiers?

.collect(Collectors.toList());
}

private static PartitionSpec getPartitionSpecFromDeltaSnapshot(
Contributor

I get what the function is doing but does it need to be static?

}
}

private static List<SparkTableUtil.SparkPartition> getSparkPartitionsFromDeltaSnapshot(
Contributor

same, does it need to be static?

Comment on lines 384 to 386
boolean checkDuplicateFiles,
PartitionSpec nullableSpec,
List<SparkPartition> nullablePartitions) {
Contributor

I get that we need to pass in the computed partition spec and Spark partitions, but I don't think we should change any public API signatures; we can add a new one.

Contributor

+1, avoid changing existing public methods, we can do something like:

public static void importSparkTable(SparkSession spark, TableIdentifier sourceTableIdent, Table targetTable,
                                    String stagingDir, Map<String, String> partitionFilter,
                                    boolean checkDuplicateFiles) {
  PartitionSpec spec = SparkSchemaUtil.specForTable(spark, sourceTableIdentWithDB.unquotedString());
  ...
  importSparkTable(..., spec, sourceTablePartitions);
}

public static void importSparkTable(SparkSession spark, TableIdentifier sourceTableIdent, Table targetTable,
                                    String stagingDir, Map<String, String> partitionFilter,
                                    boolean checkDuplicateFiles,
                                    PartitionSpec spec,
                                    List<SparkPartition> partitions) {
  ...
}

Comment on lines 178 to 278
return updatedSnapshot.getAllFiles()
    .stream()
    // Map each partition to the list of files within it
    .collect(Collectors.groupingBy(AddFile::getPartitionValues))
    .entrySet()
    .stream()
    .map(entry -> {
      // We don't care which file we take since they will all have the same prefix.
      // The arbitrary file will have a path that looks like "partition1/partition2/file.parquet";
      // we're interested in the part prior to the filename.
      AddFile addFile = entry.getValue().get(0);
      String pathBeforeFileName = addFile.getPath().substring(0, addFile.getPath().lastIndexOf("/"));
      String fullPath = new Path(deltaLogPath, pathBeforeFileName).toString();

      return new SparkTableUtil.SparkPartition(
          entry.getKey(), // Map containing the names and values of the partitions
          fullPath,
          // Delta tables only support parquet
          "parquet");
    })
    .collect(Collectors.toList());
}
Contributor

Should we parallelize this computation? For example, build the grouping by partition values concurrently across files. It could be overkill, depending on the number of files in the snapshot.
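
A minimal sketch of the parallel grouping suggested here, using a parallel stream and a concurrent collector over the snapshot's files (names mirror the snippet above; this is not code from the PR):

import io.delta.standalone.actions.AddFile;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

// Sketch: group Delta AddFiles by partition values concurrently, replacing the
// sequential Collectors.groupingBy in the snippet above.
static ConcurrentMap<Map<String, String>, List<AddFile>> groupFilesByPartition(List<AddFile> allFiles) {
  return allFiles
      .parallelStream()
      .collect(Collectors.groupingByConcurrent(AddFile::getPartitionValues));
}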

@ericlgoodman
Contributor Author

Adding here my primary concern with this PR - and in general a concern going forward with Spark and using multiple tables such as Delta Lake and Iceberg.

Spark reads tables through whatever catalog is located at the first part of a table's identifier. There can only be 1 catalog per identifier, and different catalogs have different capabilities. For example, the DeltaCatalog can read Delta Lake and generic Hive tables, and the SparkSessionCatalog can read Iceberg + Hive tables.

In theory, in order to read from multiple table types in one Spark session, a user would initialize a DeltaCatalog at, say, delta and then the SparkSessionCatalog at iceberg. Then all their Delta Lake tables would be located at delta.my_delta_database.my_delta_lake_table and all their Iceberg tables at iceberg.my_iceberg_database.my_iceberg_table. Unfortunately, this doesn't work out of the box. Both of these catalog implementations are designed to be used by overriding the default Spark catalog, which is located at spark_catalog. CatalogExtension, which DeltaCatalog and SparkSessionCatalog both inherit from, contains a method setDelegateCatalog(CatalogPlugin delegate). As the Javadoc reads:

/**
 * This will be called only once by Spark to pass in the Spark built-in session catalog, after
 * {@link #initialize(String, CaseInsensitiveStringMap)} is called.
 */
void setDelegateCatalog(CatalogPlugin delegate);

A user can fix this issue by manually calling this method during Spark setup and setting the delegate to the one in the default Spark catalog. But most users presumably are not doing this, and some users might face difficulty depending on their service provider and how much abstraction/configuration has been taken away from them during setup.
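
To make the setup described above concrete, here is a sketch of the side-by-side catalog configuration; the catalog names delta and iceberg and the Hive-backed Iceberg catalog type are illustrative assumptions, and, as noted, this configuration alone is not sufficient:

import org.apache.spark.sql.SparkSession;

// Sketch: register the Delta catalog at "delta" and Iceberg's SparkSessionCatalog at
// "iceberg". This is not sufficient on its own, since both implementations expect to be
// installed as spark_catalog and to receive the built-in session catalog via
// setDelegateCatalog().
SparkSession spark = SparkSession.builder()
    .config("spark.sql.catalog.delta", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")  // assumes a Hive metastore backing
    .getOrCreate();

// Delta tables would then be addressed as delta.my_delta_database.my_delta_lake_table
// and Iceberg tables as iceberg.my_iceberg_database.my_iceberg_table.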

This basically means that in today's world, it doesn't seem realistic that users have a simple way to use one Spark session to read/migrate between different table types. This might make sense to implement first, as users may find that a Delta/Iceberg/Hudi table makes sense for them in one context but a different one is preferable in another.

When it comes to migration, there are basically two options:

  1. Create a more abstract Catalog implementation that can read Iceberg/Delta/Hudi/Hive tables dynamically, similar to what happens in the Trino Hive connector. The connector inspects the table properties and determines at runtime whether to redirect to another connector. Similarly, a Spark catalog could simply delegate to specific catalogs if it sees certain table type specific properties.
  2. Provide an easier method for users to not have to override the default catalog for these table type specific catalog implementations. If the Delta catalog was located at delta, and Iceberg at iceberg, then users could just keep their different table types in different catalogs and migration could take an optional parameter of the new desired catalog.

Contributor

@jackye1995 left a comment

So based on my understanding, the difference between this action and the existing migrate action for Hive tables is that it uses DeltaLog to get the partition spec, schema, and partitions of the table. However, I don't see the logic for reproducing the Delta transactions as Iceberg snapshots, which is the key value of such a migration.

I understand the challenge of doing this in the context of Spark, but I think we can actually do better and not rely on the Spark context at all. What we can do is have a BaseMigrateDeltaLakeTableAction that takes in an icebergCatalog instance and a Delta Lake table location. With those two pieces of information, we can fully replay the DeltaLog information and create an Iceberg table out of that.

Then we can add a Spark-based implementation, where the icebergCatalog can be retrieved from SparkSessionCatalog.icebergCatalog(), and the location can be retrieved from a Delta table name plus the Spark default catalog's Delta table serde path property.

What do you think?
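
For illustration, a rough sketch of the engine-agnostic action described above; the class, field, and method names are hypothetical, not the API that was eventually merged:

import io.delta.standalone.DeltaLog;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical sketch: the action only needs an Iceberg Catalog, a target identifier,
// and the Delta table location; no Spark session is required to read the DeltaLog.
public class BaseMigrateDeltaLakeTableAction {
  private final Catalog icebergCatalog;
  private final TableIdentifier newTableIdentifier;
  private final String deltaTableLocation;
  private final Configuration hadoopConf;

  public BaseMigrateDeltaLakeTableAction(
      Catalog icebergCatalog,
      TableIdentifier newTableIdentifier,
      String deltaTableLocation,
      Configuration hadoopConf) {
    this.icebergCatalog = icebergCatalog;
    this.newTableIdentifier = newTableIdentifier;
    this.deltaTableLocation = deltaTableLocation;
    this.hadoopConf = hadoopConf;
  }

  public void execute() {
    // Read the Delta transaction log directly from the table location.
    DeltaLog deltaLog = DeltaLog.forTable(hadoopConf, deltaTableLocation);
    // From the DeltaLog, derive the schema, partition spec, and data files, replay the
    // Delta versions as Iceberg snapshots, and commit them to a table created through
    // icebergCatalog.createTable(newTableIdentifier, ...).
  }
}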

@@ -63,6 +63,8 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
implementation("org.apache.parquet:parquet-column")
implementation("org.apache.parquet:parquet-hadoop")

implementation ("io.delta:delta-standalone_${scalaVersion}")
Contributor

Can we make this compileOnly? Not everyone wants this as part of their Spark runtime.

io.delta.standalone.Snapshot updatedSnapshot,
StructType structType
) {
Type converted = SparkTypeVisitor.visit(structType, new SparkTypeToType(structType));
Contributor

we can use SparkSchemaUtil.convert
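
For reference, a sketch of the suggested call; the variable name follows the snippet above:

import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.spark.sql.types.StructType;

// SparkSchemaUtil.convert(StructType) wraps the Spark type visitor and returns an
// Iceberg Schema, so the explicit SparkTypeVisitor/SparkTypeToType call is not needed.
Schema icebergSchema = SparkSchemaUtil.convert(structType);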


private void renameAndBackupSourceTable() {
try {
LOG.info("Renaming {} as {} for backup", identifier, backupIdent);
Contributor

"Renaming Delta Lake table ..."

@@ -616,6 +621,11 @@ public static Dataset<Row> loadMetadataTable(SparkSession spark, Table table, Me
return Dataset.ofRows(spark, DataSourceV2Relation.create(metadataTable, Some.empty(), Some.empty(), options));
}

public static String getIcebergMetadataLocation(Table table) {
Contributor

This is hard-coded in many places; I think we can raise a separate PR just to clean that up.
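
For context, a sketch of the helper in question, assuming its body mirrors the pattern currently duplicated across the Spark actions:

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

// Sketch of the shared helper: use the write.metadata.path table property if set,
// otherwise default to "<table location>/metadata".
public static String getIcebergMetadataLocation(Table table) {
  return table.properties().getOrDefault(
      TableProperties.WRITE_METADATA_LOCATION, table.location() + "/metadata");
}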

"Cannot find any partitions in table %s", sourceTableIdent);
importSparkPartitions(spark, sourceTablePartitions, targetTable, spec, stagingDir, checkDuplicateFiles);
importSparkPartitions(spark, partitions, targetTable, nullableSpec, stagingDir, checkDuplicateFiles);
Contributor

Prefer not to change the variable name if possible.

@ericlgoodman marked this pull request as draft August 23, 2022 16:27
@github-actions bot added the data label Aug 23, 2022
@@ -121,7 +121,7 @@ private MigrateTable.Result doExecute() {

Some<String> backupNamespace = Some.apply(backupIdent.namespace()[0]);
TableIdentifier v1BackupIdent = new TableIdentifier(backupIdent.name(), backupNamespace);
String stagingLocation = getMetadataLocation(icebergTable);
String stagingLocation = SparkTableUtil.getIcebergMetadataLocation(icebergTable);
Contributor Author

Need to remove this change

@@ -124,7 +124,7 @@ private SnapshotTable.Result doExecute() {
ensureNameMappingPresent(icebergTable);

TableIdentifier v1TableIdent = v1SourceTable().identifier();
String stagingLocation = getMetadataLocation(icebergTable);
String stagingLocation = SparkTableUtil.getIcebergMetadataLocation(icebergTable);
Contributor Author

Same here

String nameMappingString = table.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
NameMapping nameMapping =
nameMappingString != null ? NameMappingParser.fromJson(nameMappingString) : null;
return TableMigrationUtil.getParquetMetrics(
Contributor

we should do a check to make sure the file is indeed Parquet
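
A minimal sketch of such a guard, assuming a String variable path holding the data file's location is in scope; the variable name and the extension-based check are assumptions, not code from the PR:

// Sketch: fail fast if the file is not Parquet before collecting Parquet metrics.
if (!path.toLowerCase(java.util.Locale.ROOT).endsWith(".parquet")) {
  throw new IllegalArgumentException(
      String.format("Cannot get Parquet metrics for non-Parquet file: %s", path));
}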

return spec == null ? PartitionSpec.unpartitioned() : spec;
}

private StructType getStructTypeFromDeltaSnapshot() {
Contributor

Why not just convert directly from the Delta schema to the Iceberg schema? Here we end up converting to Spark and then converting to Iceberg, and as a result you had to make a few Spark classes and methods public.

private final String deltaTableLocation;
private final Identifier newIdentifier;

MigrateDeltaLakeTableSparkAction(
Contributor

Can we have a class BaseMigrateDeltaLakeTableAction? Lots of logic could be shared when we want to extend this feature to other engines like Flink and Trino.

Contributor

And that class can live in the iceberg-core library.

@JonasJ-ap
Contributor

Hi @ericlgoodman. My name is Rushan Jiang, a CS undergrad at CMU. I am interested in learning and contributing to this migration support. I saw you did not update this PR for some time. Would you mind allowing me to continue your work?

I appreciate your time and consideration.

@ericlgoodman
Contributor Author

Hi @ericlgoodman. My name is Rushan Jiang, a CS undergrad at CMU. I am interested in learning and contributing to this migration support. I saw you did not update this PR for some time. Would you mind allowing me to continue your work?

I appreciate your time and consideration.

Followed up with you on the Iceberg Slack.

@jackye1995
Contributor

Closing since the PR is merged.

@jackye1995 closed this Mar 9, 2023