-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Support appending files with different specs #9860
Core: Support appending files with different specs #9860
Conversation
8320a0e
to
beeba63
Compare
909d37c
to
916a7b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about the validation since we're really looking at possibilities of having a replace/overwrite that crosses partition structures. @szehon-ho and @rdblue will probably need to take a closer look as well.
After an initial glance, I'm not sure the validations still hold.
After looking a little closer, I think this is ok. We're actually doubling up on the validation and checking for any changes across each partition spec, so I think this is ok.
core/src/main/java/org/apache/iceberg/BaseReplacePartitions.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just had some nits!
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
126b4fd
to
724a1d8
Compare
660bab4
to
989f622
Compare
Simplified the PR to just multi-partition-append use case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really sorry for the delayed review on this @fqaiser94 I see this PR came up in discussion on the kafka commit coordination PR #10351 (comment).
I do remember the context of this PR and I just took another pass, at least from me it looks good. I'll leave it open for a bit in case @nastra @danielcweeks are interested, otherwise I'll merge.
No problems, and thanks all for the reviews 😄 |
What is the problem?
Currently the
table.newAppend()
API expects users to provide Datafiles with the same PartitionSpec via.appendFile()
.Failure to do so raises a
ValidationException("Invalid data file, expected spec id: %d", dataSpec.specId())
.CMIIW but the Iceberg spec doesn't seem to impose any such restriction.
The only related restriction I could find was in the manifests section which says:
We can easily work around this by writing multiple manifests, one for each spec for which files are being appended.
Why is this change needed/valuable?
In the iceberg-kafka-connect project, we've seen that when users evolve the PartitionSpec of the table, often they'll end up in a situation where Datafiles with different PartitionSpecs might be inflight and committing these DataFiles together as part of the same snapshot becomes impossible due to the aforementioned
ValidationException
.While we could work around this by committing DataFiles with different PartitionSpecs as separate snapshots, this makes it complex for us to correctly associate valuable (watermarking) metadata with each snapshot in the snapshot properties. In addition, it makes the table snapshot history unnecessarily longer. It would be more ideal if we could avoid these issues.
Related work