Delta: Support Snapshot Delta Lake Table to Iceberg Table #6449

Merged: 57 commits into apache:master, Feb 7, 2023

Conversation

@JonasJ-ap (Contributor) commented Dec 19, 2022:

This PR follows #5331 and introduces a new module called iceberg-delta-lake, which provides support for snapshotting a Delta Lake table to an Iceberg table by translating schemas and committing all Delta logs through an Iceberg transaction.

The current implementation relies on delta-standalone to read the DeltaLog of the given Delta tables. Since delta-standalone only supports tables with minReaderVersion=1 and minWriterVersion=2, the snapshot action does not support features from higher protocol versions, such as ColumnMapping.

Most of the tests for this module are based on Spark 3.3 and therefore run under the integrationTest task.
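For readers of this thread, a rough usage sketch based on the integration-test call quoted later in the conversation. The .execute() call, the result accessor, the package imports, and the identifier/location strings are assumptions for illustration rather than confirmed API details:

```java
import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationSparkIntegration;
import org.apache.iceberg.delta.SnapshotDeltaLakeTable;
import org.apache.spark.sql.SparkSession;

public class SnapshotDeltaLakeExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("delta-to-iceberg").getOrCreate();

    // Hypothetical identifier and Delta table location, for illustration only.
    String newTableIdentifier = "my_catalog.my_db.my_snapshot_table";
    String deltaTableLocation = "s3://my-bucket/path/to/delta/table";

    SnapshotDeltaLakeTable.Result result =
        DeltaLakeToIcebergMigrationSparkIntegration.snapshotDeltaLakeTable(
                spark, newTableIdentifier, deltaTableLocation)
            .execute(); // assumed action-style terminal call

    // Assumed result accessor reporting how many data files were imported.
    System.out.println("Imported data files: " + result.snapshotDataFilesCount());
  }
}
```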

@JonasJ-ap JonasJ-ap reopened this Dec 19, 2022
@JonasJ-ap JonasJ-ap marked this pull request as draft December 19, 2022 05:16
@JonasJ-ap JonasJ-ap changed the title Migrate delta to iceberg WIP: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022
@JonasJ-ap JonasJ-ap changed the title WIP: Adding support for Migrating Delta Lake Table to Iceberg Table WIP, Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022

@JonasJ-ap (Contributor, Author) left a comment:

I added an abstract class called BaseMigrateDeltaLakeTableAction. As suggested in #5331 (review), this class takes in an Iceberg catalog and a Delta Lake table's path.

.build();
}

protected abstract Metrics getMetricsForFile(Table table, String fullFilePath, FileFormat format);

@JonasJ-ap (Contributor, Author):

Based on my understanding, we must access the relevant file format's package (e.g. iceberg-parquet) to get the metrics. Hence, I had to make this method abstract, since this class currently lives in iceberg-core. I feel like we may want to implement this in some other way (e.g. make this class more abstract, add a new package called iceberg-migration, etc.). Please correct me if I misunderstand something about processing data files.

Contributor:

Agree, I was proposing iceberg-deltalake because, for example, we could later also have iceberg-hudi, and people could then take one dependency for their migration purposes instead of multiple.
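To make the metrics point above concrete, a minimal sketch of what a Parquet-specific implementation of getMetricsForFile might look like, assuming access to iceberg-parquet (ParquetUtil) and the Hadoop-based InputFile; this is illustrative, not the PR's code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopInputFile;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.ParquetUtil;

class ParquetMetricsSketch {
  // Reading column-level metrics requires the file format's module
  // (here iceberg-parquet), which is why the base class keeps the method abstract.
  static Metrics metricsForFile(Table table, String fullFilePath, FileFormat format, Configuration conf) {
    if (format != FileFormat.PARQUET) {
      throw new UnsupportedOperationException("Sketch only covers Parquet, got: " + format);
    }
    InputFile inputFile = HadoopInputFile.fromLocation(fullFilePath, conf);
    return ParquetUtil.fileMetrics(inputFile, MetricsConfig.forTable(table));
  }
}
```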

@JonasJ-ap JonasJ-ap changed the title WIP, Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table WIP: Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022

@RussellSpitzer (Member):

I love seeing this functionality but I'm not sure it should be a first class citizen in the repo. The issue is that this would require us to pull in all the Delta dependencies, which would increase the complexity of our build as well as the versioning we have to keep up with. Is there a way to accomplish this without as heavy a dependency requirement?

@jackye1995 (Contributor):

> I love seeing this functionality but I'm not sure it should be a first class citizen in the repo.

+1, what about having an iceberg-delta-lake module for this feature? That module can hold the core logic for the conversion. I think there is also strong interest in Trino in having a CONVERT procedure, which could import this module and leverage the shared logic.

The Delta dependencies could be marked as compileOnly, so no additional dependency will be pulled into the runtime jar. Basically it's like any vendor integration. Would that work for minimizing dependencies?

@jackye1995 (Contributor) left a comment:

Thanks for taking this over from me and Eric, really appreciate the help! As a starting point, I think Russell and you also agree that this is better placed in a separate module, so let's do that first and see how it goes.

@nastra (Contributor) left a comment:

would be good to also get input from @rdblue or @danielcweeks here.

*
* <pre>
* root
* |-- address_nested: struct (nullable = true)

Contributor:

Should we eventually be testing with all supported types rather than hardcoding them?

@JonasJ-ap (Contributor, Author) commented Jan 24, 2023:

Agree. Thank you for pointing this out. I added another dataframe which includes all the data types that Delta Lake supports (and that are also supported by Iceberg). I will discuss the details below.

* use assertj for all tests

* add null check for the spark integration method

* use a method to generate the hardcode dataframe

* drop iceberg table afterwards

* add typetest table

* test all delta lake types

* test conversion of NullType

* fix format issue

* add a second dataframe

* refactor the integration test

* correctly decoded delta's path

* fix wrong decoding

* fix wrong decoding 2

@JonasJ-ap (Contributor, Author) left a comment:

@nastra Thank you very much for your reviews! Based on your suggestions, I refactored the integration test and also added additional checks to ensure correctness.

.withColumn("shortCol", expr("CAST(longCol AS SHORT)"))
.withColumn("mapCol", expr("MAP(longCol, decimalCol)"))
.withColumn("arrayCol", expr("ARRAY(longCol)"))
.withColumn("structCol", expr("STRUCT(mapCol, arrayCol)"));

@JonasJ-ap (Contributor, Author):

This dataframe aims to include all the data types supported by Delta Lake that are also supported by Iceberg after the conversion.

The included types are:

  • ArrayType
  • BinaryType
  • BooleanType
  • ByteType
  • DateType
  • DecimalType
  • DoubleType
  • FloatType
  • IntegerType
  • LongType
  • MapType
  • ShortType
  • StringType
  • StructType
  • TimestampType

ref: Delta Lake Types

NullType is not supported by Iceberg, and its conversion is therefore caught in the unit test for schema conversion.

.withColumn("mapCol1", expr("MAP(structCol1, structCol2)"))
.withColumn("mapCol2", expr("MAP(longCol, dateString)"))
.withColumn("mapCol3", expr("MAP(dateCol, arrayCol)"))
.withColumn("structCol3", expr("STRUCT(structCol2, mapCol3, arrayCol)"));

@JonasJ-ap (Contributor, Author):

I removed the hardcoded dataframe and instead used the Spark session to generate a nested dataframe. This dataframe helps us verify the correctness of the recursive schema conversion and the name mapping.
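A condensed sketch of generating such a nested dataframe through the Spark session; the column names mirror the snippet above, while the base columns and row count are illustrative:

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class NestedDataFrameSketch {
  static Dataset<Row> nestedDataFrame(SparkSession spark) {
    return spark
        .range(0, 5) // illustrative row count
        .toDF("longCol")
        .withColumn("dateCol", expr("CURRENT_DATE()"))
        .withColumn("dateString", expr("CAST(dateCol AS STRING)"))
        .withColumn("arrayCol", expr("ARRAY(longCol)"))
        .withColumn("structCol1", expr("STRUCT(longCol)"))
        .withColumn("structCol2", expr("STRUCT(dateString)"))
        // nested combinations exercise recursive schema conversion and name mapping
        .withColumn("mapCol1", expr("MAP(structCol1, structCol2)"))
        .withColumn("mapCol2", expr("MAP(longCol, dateString)"))
        .withColumn("mapCol3", expr("MAP(dateCol, arrayCol)"))
        .withColumn("structCol3", expr("STRUCT(structCol2, mapCol3, arrayCol)"));
  }
}
```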

SnapshotDeltaLakeTable.Result result =
DeltaLakeToIcebergMigrationSparkIntegration.snapshotDeltaLakeTable(
spark, newTableIdentifier, typeTestTableLocation)
.tableProperty(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false")

@JonasJ-ap (Contributor, Author) commented Jan 24, 2023:

The testTypeDataFrame includes a timestamp column, which Delta Lake/Spark writes as INT96 by default. According to #4196 and #4200, the vectorized reader currently does not support INT96, so we must disable it.

I will also add this to the docs in #6600.

* remove a redundant map collector in commitDeltaVersionLogToIcebergTransaction

* get the earliest possible version rather than hard code from 0

* add unit test to check if table exists

* refactor action extracted from the versionlog

* fix format issue

* move non-share table write operation to the test itself, instead of in before()

* fix type
this.deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
this.deltaLakeFileIO = new HadoopFileIO(conf);
// get the earliest version available in the delta lake table
this.deltaStartVersion = deltaLog.getVersionAtOrAfterTimestamp(0L);

@JonasJ-ap (Contributor, Author):

Sometimes the earliest available log version may not be 0. It is better to let the DeltaLog tell us which version is the earliest one available.

dataFileActions =
deltaLog.getSnapshotForVersionAsOf(deltaStartVersion).getAllFiles().stream()
.map(addFile -> (Action) addFile)
.collect(Collectors.toList());

@JonasJ-ap (Contributor, Author) commented Jan 25, 2023:

It is more logically reasonable to extract the data files from the snapshot of the start version instead of from the version log, since some versions may already have been deleted from the table.

Also, doing this allows us to add options letting users choose which version to start the snapshot from.
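Putting the two snippets above together, a compact sketch of the overall flow: take the data files from the snapshot at the earliest available version, commit them through an Iceberg transaction, then replay later version logs. The toDataFile helper is hypothetical, and the per-version action handling is only outlined in comments:

```java
import java.util.Iterator;
import java.util.List;

import io.delta.standalone.DeltaLog;
import io.delta.standalone.VersionLog;
import io.delta.standalone.actions.Action;
import io.delta.standalone.actions.AddFile;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Transaction;

class DeltaReplaySketch {
  // Hypothetical helper: converts a Delta AddFile (path, size, partition values)
  // into an Iceberg DataFile with metrics; the PR builds this via DataFiles.builder.
  static DataFile toDataFile(AddFile addFile) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  static void snapshotAndReplay(DeltaLog deltaLog, Transaction transaction) {
    // Let the Delta log report the earliest version that is still available
    // instead of assuming version 0 exists.
    long startVersion = deltaLog.getVersionAtOrAfterTimestamp(0L);

    // Commit all data files of the start version's snapshot as one append.
    List<AddFile> initialFiles =
        deltaLog.getSnapshotForVersionAsOf(startVersion).getAllFiles();
    AppendFiles append = transaction.newAppend();
    initialFiles.forEach(addFile -> append.appendFile(toDataFile(addFile)));
    append.commit();

    // Replay each later version log on top of the initial snapshot.
    Iterator<VersionLog> changes = deltaLog.getChanges(startVersion + 1, false /* failOnDataLoss */);
    while (changes.hasNext()) {
      for (Action action : changes.next().getActions()) {
        // The PR groups AddFile/RemoveFile actions per version and commits them
        // as append, delete, or overwrite operations on the same transaction.
      }
    }
    transaction.commitTransaction();
  }
}
```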

@danielcweeks danielcweeks requested review from danielcweeks and removed request for danielcweeks February 6, 2023 05:56
this.deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
this.deltaLakeFileIO = new ResolvingFileIO();
this.deltaLakeFileIO.initialize(ImmutableMap.of());
this.deltaLakeFileIO.setConf(conf);

@JonasJ-ap (Contributor, Author):

I switched to using ResolvingFileIO here. The reason I initialize the IO directly here is that, based on the previous discussion (#6449 (comment)), we would like to maintain consistency between the reader and the writer. Since the DeltaLog relies on the Hadoop configuration, deltaLakeFileIO should also use the Hadoop configuration instead of other table properties. Hence, we initialize the IO with an empty map.

@jackye1995, @danielcweeks, could you please help me confirm the ResolvingFileIO initialization process here? Thank you very much!

Contributor:

I think we actually want to use HadoopFileIO specifically in this case, because the Hadoop configuration should dictate how the Delta files are read. If we use ResolvingFileIO, then for a Delta table on S3 it requires users to provide another set of configurations for S3FileIO, instead of using the S3 FileSystem that is already configured to access the Delta logs.

@JonasJ-ap (Contributor, Author) commented Feb 7, 2023:

Thank you for your explanation. I had misunderstood that ResolvingFileIO would load the Hadoop configuration into S3FileIO during instantiation. After re-reading the source code, I now understand that is not the case, because S3FileIO is not Hadoop-configurable. I will revert the implementation here to HadoopFileIO to maintain consistency between the DeltaLog and the FileIO.
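For reference, a minimal sketch of the reverted approach, where one Hadoop Configuration is shared between the Delta log reader and the Iceberg FileIO so both resolve files the same way; how the Configuration is obtained here is an assumption:

```java
import io.delta.standalone.DeltaLog;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;

class ConsistentIoSketch {
  static void open(String deltaTableLocation) {
    // One Configuration drives both sides, so the Delta log reads and the
    // Iceberg file IO use the same FileSystem settings (e.g. the same S3A config).
    Configuration conf = new Configuration(); // assumption: in practice supplied by the engine
    DeltaLog deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
    HadoopFileIO fileIO = new HadoopFileIO(conf);
    // ... read version logs via deltaLog and data/metadata files via fileIO
  }
}
```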

@yyanyy (Contributor) left a comment:

LGTM and thanks for the contribution! Minor nits that might not be worth blocking on.

} else if (path.endsWith(ORC_SUFFIX)) {
return FileFormat.ORC;
} else {
throw new ValidationException("Cannot determine file format from path %s", path);

Contributor:

nit: not sure if we should continue to handle non-Parquet suffixes and treat this as "cannot determine" rather than "do not support" here after we removed the dependencies, as now they will probably fail with Cannot get metrics from file format.

@JonasJ-ap (Contributor, Author):

Thank you for pointing this out. I removed support for ORC and Avro and changed the error message to "do not support ...".
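A minimal sketch of the resulting suffix check, assuming only Parquet data files remain supported and ValidationException is still the exception type used:

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.exceptions.ValidationException;

class DeltaFileFormatSketch {
  private static final String PARQUET_SUFFIX = ".parquet";

  // Only Parquet is supported after the ORC/Avro dependencies were dropped;
  // any other suffix is rejected up front rather than failing later on metrics.
  static FileFormat determineFileFormat(String path) {
    if (path.endsWith(PARQUET_SUFFIX)) {
      return FileFormat.PARQUET;
    }
    throw new ValidationException("Do not support file format in path %s", path);
  }
}
```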

@jackye1995 (Contributor):

Looks like all the comments are addressed. Synced offline with @danielcweeks and @nastra and I believe they don't have further comments either.

Given that this PR has been open for quite some time, and there are quite a few subsequent updates I would like to see, I will go ahead and merge it, and create some issues tracking features that can be added to improve the migration experience.

Thanks @JonasJ-ap and @ericlgoodman for the contribution, and thanks everyone for the review!

@jackye1995 jackye1995 merged commit eeb055a into apache:master Feb 7, 2023