Delta: Support Snapshot Delta Lake Table to Iceberg Table #6449

Merged: 57 commits into apache:master, Feb 7, 2023

Conversation

@JonasJ-ap (Contributor) commented Dec 19, 2022:

This PR follows #5331 and introduces a new module called iceberg-delta-lake, which provides support for snapshotting a Delta Lake table to an Iceberg table by translating schemas and committing all Delta logs through an Iceberg transaction.

The current implementation relies on delta-standalone to read the DeltaLog of the given Delta tables. Since delta-standalone only supports tables with minReaderVersion=1 and minWriterVersion=2, the snapshot action does not support features from higher protocol versions, such as ColumnMapping.

Most of the tests for this module are based on Spark 3.3 and therefore run under the integrationTest task.
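For readers of this thread, a rough usage sketch based on the integration-test call quoted later in the conversation. The .execute() call, the result accessor, the package imports, and the identifier/location strings are assumptions for illustration rather than confirmed API details:

```java
import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationSparkIntegration;
import org.apache.iceberg.delta.SnapshotDeltaLakeTable;
import org.apache.spark.sql.SparkSession;

public class SnapshotDeltaLakeExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("delta-to-iceberg").getOrCreate();

    // Hypothetical identifier and Delta table location, for illustration only.
    String newTableIdentifier = "my_catalog.my_db.my_snapshot_table";
    String deltaTableLocation = "s3://my-bucket/path/to/delta/table";

    SnapshotDeltaLakeTable.Result result =
        DeltaLakeToIcebergMigrationSparkIntegration.snapshotDeltaLakeTable(
                spark, newTableIdentifier, deltaTableLocation)
            .execute(); // assumed action-style terminal call

    // Assumed result accessor reporting how many data files were imported.
    System.out.println("Imported data files: " + result.snapshotDataFilesCount());
  }
}
```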

@JonasJ-ap JonasJ-ap reopened this Dec 19, 2022
@JonasJ-ap JonasJ-ap marked this pull request as draft December 19, 2022 05:16
@JonasJ-ap JonasJ-ap changed the title Migrate delta to iceberg WIP: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022
@JonasJ-ap JonasJ-ap changed the title WIP: Adding support for Migrating Delta Lake Table to Iceberg Table WIP, Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022

@JonasJ-ap (Contributor, Author) left a comment:

I added an abstract class called BaseMigrateDeltaLakeTableAction. As suggested in #5331 (review), this class takes in an Iceberg catalog and a Delta Lake table's path.

.build();
}

protected abstract Metrics getMetricsForFile(Table table, String fullFilePath, FileFormat format);

@JonasJ-ap (Contributor, Author):

Based on my understanding, we must access the relevant file format's package (e.g. iceberg-parquet) to get the metrics. Hence, I had to make this method abstract, since this class currently lives in iceberg-core. I feel like we may want to implement this in some other way (e.g. make this class more abstract, add a new package called iceberg-migration, etc.). Please correct me if I misunderstand something about processing data files.

Contributor:

Agree, I was proposing iceberg-deltalake because, for example, we could later also have iceberg-hudi, and people could then take one dependency for their migration purposes instead of multiple.
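To make the metrics point above concrete, a minimal sketch of what a Parquet-specific implementation of getMetricsForFile might look like, assuming access to iceberg-parquet (ParquetUtil) and the Hadoop-based InputFile; this is illustrative, not the PR's code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopInputFile;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.ParquetUtil;

class ParquetMetricsSketch {
  // Reading column-level metrics requires the file format's module
  // (here iceberg-parquet), which is why the base class keeps the method abstract.
  static Metrics metricsForFile(Table table, String fullFilePath, FileFormat format, Configuration conf) {
    if (format != FileFormat.PARQUET) {
      throw new UnsupportedOperationException("Sketch only covers Parquet, got: " + format);
    }
    InputFile inputFile = HadoopInputFile.fromLocation(fullFilePath, conf);
    return ParquetUtil.fileMetrics(inputFile, MetricsConfig.forTable(table));
  }
}
```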

@JonasJ-ap JonasJ-ap changed the title WIP, Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table WIP: Core, Spark: Adding support for Migrating Delta Lake Table to Iceberg Table Dec 19, 2022

@RussellSpitzer (Member):

I love seeing this functionality but I'm not sure it should be a first class citizen in the repo. The issue is that this would require us to pull in all the Delta dependencies, which would increase the complexity of our build as well as the versioning we have to keep up with. Is there a way to accomplish this without as heavy a dependency requirement?

@jackye1995 (Contributor):

> I love seeing this functionality but I'm not sure it should be a first class citizen in the repo.

+1, what about having an iceberg-delta-lake module for this feature? That module can hold the core logic for the conversion. I think there is also strong interest in Trino in having a CONVERT procedure, which could import this module and leverage the shared logic.

The Delta dependencies could be marked as compileOnly, so no additional dependency will be pulled into the runtime jar. Basically it's like any vendor integration. Would that work for minimizing dependencies?

@jackye1995 (Contributor) left a comment:

Thanks for taking this over from me and Eric, really appreciate the help! As a starting point, I think Russell and you also agree that this is better placed in a separate module, so let's do that first and see how it goes.

@nastra (Contributor) left a comment:

would be good to also get input from @rdblue or @danielcweeks here.

*
* <pre>
* root
* |-- address_nested: struct (nullable = true)

Contributor:

Should we eventually be testing with all supported types rather than hardcoding them?

@JonasJ-ap (Contributor, Author) commented Jan 24, 2023:

Agree. Thank you for pointing this out. I added another dataframe which includes all the data types that Delta Lake supports (and that are also supported by Iceberg). I will discuss the details below.

* use assertj for all tests

* add null check for the spark integration method

* use a method to generate the hardcode dataframe

* drop iceberg table afterwards

* add typetest table

* test all delta lake types

* test conversion of NullType

* fix format issue

* add a second dataframe

* refactor the integration test

* correctly decoded delta's path

* fix wrong decoding

* fix wrong decoding 2

@JonasJ-ap (Contributor, Author) left a comment:

@nastra Thank you very much for your reviews! Based on your suggestions, I refactored the integration test and also added additional checks to ensure correctness.

.withColumn("shortCol", expr("CAST(longCol AS SHORT)"))
.withColumn("mapCol", expr("MAP(longCol, decimalCol)"))
.withColumn("arrayCol", expr("ARRAY(longCol)"))
.withColumn("structCol", expr("STRUCT(mapCol, arrayCol)"));

@JonasJ-ap (Contributor, Author):

This dataframe aims to include all the data types supported by Delta Lake that are also supported by Iceberg after the conversion.

The included types are:

  • ArrayType
  • BinaryType
  • BooleanType
  • ByteType
  • DateType
  • DecimalType
  • DoubleType
  • FloatType
  • IntegerType
  • LongType
  • MapType
  • ShortType
  • StringType
  • StructType
  • TimestampType

ref: Delta Lake Types

NullType is not supported by Iceberg, and its conversion is therefore caught in the unit test for schema conversion.

.withColumn("mapCol1", expr("MAP(structCol1, structCol2)"))
.withColumn("mapCol2", expr("MAP(longCol, dateString)"))
.withColumn("mapCol3", expr("MAP(dateCol, arrayCol)"))
.withColumn("structCol3", expr("STRUCT(structCol2, mapCol3, arrayCol)"));

@JonasJ-ap (Contributor, Author):

I removed the hardcoded dataframe and instead used the Spark session to generate a nested dataframe. This dataframe helps us verify the correctness of the recursive schema conversion and the name mapping.
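A condensed sketch of generating such a nested dataframe through the Spark session; the column names mirror the snippet above, while the base columns and row count are illustrative:

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class NestedDataFrameSketch {
  static Dataset<Row> nestedDataFrame(SparkSession spark) {
    return spark
        .range(0, 5) // illustrative row count
        .toDF("longCol")
        .withColumn("dateCol", expr("CURRENT_DATE()"))
        .withColumn("dateString", expr("CAST(dateCol AS STRING)"))
        .withColumn("arrayCol", expr("ARRAY(longCol)"))
        .withColumn("structCol1", expr("STRUCT(longCol)"))
        .withColumn("structCol2", expr("STRUCT(dateString)"))
        // nested combinations exercise recursive schema conversion and name mapping
        .withColumn("mapCol1", expr("MAP(structCol1, structCol2)"))
        .withColumn("mapCol2", expr("MAP(longCol, dateString)"))
        .withColumn("mapCol3", expr("MAP(dateCol, arrayCol)"))
        .withColumn("structCol3", expr("STRUCT(structCol2, mapCol3, arrayCol)"));
  }
}
```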

SnapshotDeltaLakeTable.Result result =
DeltaLakeToIcebergMigrationSparkIntegration.snapshotDeltaLakeTable(
spark, newTableIdentifier, typeTestTableLocation)
.tableProperty(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false")

@JonasJ-ap (Contributor, Author) commented Jan 24, 2023:

The testTypeDataFrame includes a timestamp column, which Delta Lake/Spark writes as INT96 by default. According to #4196 and #4200, the vectorized reader currently does not support INT96, so we must disable it.

I will also add this to the docs in #6600.

* remove a redundant map collector in commitDeltaVersionLogToIcebergTransaction

* get the earliest possible version rather than hard code from 0

* add unit test to check if table exists

* refactor action extracted from the versionlog

* fix format issue

* move non-share table write operation to the test itself, instead of in before()

* fix type
this.deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
this.deltaLakeFileIO = new HadoopFileIO(conf);
// get the earliest version available in the delta lake table
this.deltaStartVersion = deltaLog.getVersionAtOrAfterTimestamp(0L);

@JonasJ-ap (Contributor, Author):

Sometimes the earliest available log version may not be 0. It is better to let the DeltaLog tell us which version is the earliest one available.

dataFileActions =
deltaLog.getSnapshotForVersionAsOf(deltaStartVersion).getAllFiles().stream()
.map(addFile -> (Action) addFile)
.collect(Collectors.toList());

@JonasJ-ap (Contributor, Author) commented Jan 25, 2023:

It is more logically reasonable to extract the data files from the snapshot of the start version instead of from the version log, since some versions may already have been deleted from the table.

Also, doing this allows us to add options letting users choose which version to start the snapshot from.
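Putting the two snippets above together, a compact sketch of the overall flow: take the data files from the snapshot at the earliest available version, commit them through an Iceberg transaction, then replay later version logs. The toDataFile helper is hypothetical, and the per-version action handling is only outlined in comments:

```java
import java.util.Iterator;
import java.util.List;

import io.delta.standalone.DeltaLog;
import io.delta.standalone.VersionLog;
import io.delta.standalone.actions.Action;
import io.delta.standalone.actions.AddFile;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Transaction;

class DeltaReplaySketch {
  // Hypothetical helper: converts a Delta AddFile (path, size, partition values)
  // into an Iceberg DataFile with metrics; the PR builds this via DataFiles.builder.
  static DataFile toDataFile(AddFile addFile) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  static void snapshotAndReplay(DeltaLog deltaLog, Transaction transaction) {
    // Let the Delta log report the earliest version that is still available
    // instead of assuming version 0 exists.
    long startVersion = deltaLog.getVersionAtOrAfterTimestamp(0L);

    // Commit all data files of the start version's snapshot as one append.
    List<AddFile> initialFiles =
        deltaLog.getSnapshotForVersionAsOf(startVersion).getAllFiles();
    AppendFiles append = transaction.newAppend();
    initialFiles.forEach(addFile -> append.appendFile(toDataFile(addFile)));
    append.commit();

    // Replay each later version log on top of the initial snapshot.
    Iterator<VersionLog> changes = deltaLog.getChanges(startVersion + 1, false /* failOnDataLoss */);
    while (changes.hasNext()) {
      for (Action action : changes.next().getActions()) {
        // The PR groups AddFile/RemoveFile actions per version and commits them
        // as append, delete, or overwrite operations on the same transaction.
      }
    }
    transaction.commitTransaction();
  }
}
```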

@danielcweeks danielcweeks requested review from danielcweeks and removed request for danielcweeks February 6, 2023 05:56
this.deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
this.deltaLakeFileIO = new ResolvingFileIO();
this.deltaLakeFileIO.initialize(ImmutableMap.of());
this.deltaLakeFileIO.setConf(conf);

@JonasJ-ap (Contributor, Author):

I switched to using ResolvingFileIO here. The reason I initialize the IO directly here is that, based on the previous discussion (#6449 (comment)), we would like to maintain consistency between the reader and the writer. Since the DeltaLog relies on the Hadoop configuration, deltaLakeFileIO should also use the Hadoop configuration instead of other table properties. Hence, we initialize the IO with an empty map.

@jackye1995, @danielcweeks, could you please help me confirm the ResolvingFileIO initialization process here? Thank you very much!

Contributor:

I think we actually want to use HadoopFileIO specifically in this case, because the Hadoop configuration should dictate how the Delta files are read. If we use ResolvingFileIO, then for a Delta table on S3 it requires users to provide another set of configurations for S3FileIO, instead of using the S3 FileSystem that is already configured to access the Delta logs.

@JonasJ-ap (Contributor, Author) commented Feb 7, 2023:

Thank you for your explanation. I had misunderstood that ResolvingFileIO would load the Hadoop configuration into S3FileIO during instantiation. After re-reading the source code, I now understand that is not the case, because S3FileIO is not Hadoop-configurable. I will revert the implementation here to HadoopFileIO to maintain consistency between the DeltaLog and the FileIO.
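For reference, a minimal sketch of the reverted approach, where one Hadoop Configuration is shared between the Delta log reader and the Iceberg FileIO so both resolve files the same way; how the Configuration is obtained here is an assumption:

```java
import io.delta.standalone.DeltaLog;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;

class ConsistentIoSketch {
  static void open(String deltaTableLocation) {
    // One Configuration drives both sides, so the Delta log reads and the
    // Iceberg file IO use the same FileSystem settings (e.g. the same S3A config).
    Configuration conf = new Configuration(); // assumption: in practice supplied by the engine
    DeltaLog deltaLog = DeltaLog.forTable(conf, deltaTableLocation);
    HadoopFileIO fileIO = new HadoopFileIO(conf);
    // ... read version logs via deltaLog and data/metadata files via fileIO
  }
}
```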

@yyanyy (Contributor) left a comment:

LGTM and thanks for the contribution! Minor nits that might not be worth blocking on.

} else if (path.endsWith(ORC_SUFFIX)) {
return FileFormat.ORC;
} else {
throw new ValidationException("Cannot determine file format from path %s", path);

Contributor:

nit: not sure if we should continue to handle non-Parquet suffixes and treat this as "cannot determine" rather than "do not support" here after we removed the dependencies, as now they will probably fail with Cannot get metrics from file format.

@JonasJ-ap (Contributor, Author):

Thank you for pointing this out. I removed support for ORC and Avro and changed the error message to "do not support ...".
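A minimal sketch of the resulting suffix check, assuming only Parquet data files remain supported and ValidationException is still the exception type used:

```java
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.exceptions.ValidationException;

class DeltaFileFormatSketch {
  private static final String PARQUET_SUFFIX = ".parquet";

  // Only Parquet is supported after the ORC/Avro dependencies were dropped;
  // any other suffix is rejected up front rather than failing later on metrics.
  static FileFormat determineFileFormat(String path) {
    if (path.endsWith(PARQUET_SUFFIX)) {
      return FileFormat.PARQUET;
    }
    throw new ValidationException("Do not support file format in path %s", path);
  }
}
```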

@jackye1995 (Contributor):

Looks like all the comments are addressed. Synced offline with @danielcweeks and @nastra and I believe they don't have further comments either.

Given that this PR has been open for quite some time, and there are quite a few subsequent updates I would like to see, I will go ahead and merge it, and create some issues tracking features that can be added to improve the migration experience.

Thanks @JonasJ-ap and @ericlgoodman for the contribution, and thanks everyone for the review!

@jackye1995 jackye1995 merged commit eeb055a into apache:master Feb 7, 2023