Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Support appending files with different specs #9860

Merged

Conversation

fqaiser94
Copy link
Contributor

@fqaiser94 fqaiser94 commented Mar 3, 2024

What is the problem?

Currently the table.newAppend() API expects users to provide Datafiles with the same PartitionSpec via .appendFile().
Failure to do so raises a ValidationException("Invalid data file, expected spec id: %d", dataSpec.specId()).

CMIIW but the Iceberg spec doesn't seem to impose any such restriction.
The only related restriction I could find was in the manifests section which says:

A manifest stores files for a single partition spec.

We can easily work around this by writing multiple manifests, one for each spec for which files are being appended.

Why is this change needed/valuable?

In the iceberg-kafka-connect project, we've seen that when users evolve the PartitionSpec of the table, often they'll end up in a situation where Datafiles with different PartitionSpecs might be inflight and committing these DataFiles together as part of the same snapshot becomes impossible due to the aforementioned ValidationException.

While we could work around this by committing DataFiles with different PartitionSpecs as separate snapshots, this makes it complex for us to correctly associate valuable (watermarking) metadata with each snapshot in the snapshot properties. In addition, it makes the table snapshot history unnecessarily longer. It would be more ideal if we could avoid these issues.

Related work

@github-actions github-actions bot added the core label Mar 3, 2024
@fqaiser94 fqaiser94 force-pushed the append_data_files_with_multiple_specs branch from 8320a0e to beeba63 Compare March 3, 2024 17:10
@fqaiser94 fqaiser94 force-pushed the append_data_files_with_multiple_specs branch 2 times, most recently from 909d37c to 916a7b2 Compare March 4, 2024 16:25
Copy link
Contributor

@danielcweeks danielcweeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the validation since we're really looking at possibilities of having a replace/overwrite that crosses partition structures. @szehon-ho and @rdblue will probably need to take a closer look as well.

After an initial glance, I'm not sure the validations still hold.

After looking a little closer, I think this is ok. We're actually doubling up on the validation and checking for any changes across each partition spec, so I think this is ok.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just had some nits!

@fqaiser94 fqaiser94 force-pushed the append_data_files_with_multiple_specs branch from 660bab4 to 989f622 Compare July 3, 2024 16:29
@fqaiser94
Copy link
Contributor Author

Simplified the PR to just multi-partition-append use case.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really sorry for the delayed review on this @fqaiser94 I see this PR came up in discussion on the kafka commit coordination PR #10351 (comment).

I do remember the context of this PR and I just took another pass, at least from me it looks good. I'll leave it open for a bit in case @nastra @danielcweeks are interested, otherwise I'll merge.

@amogh-jahagirdar amogh-jahagirdar merged commit 5f970a8 into apache:main Jul 18, 2024
49 checks passed
@fqaiser94 fqaiser94 deleted the append_data_files_with_multiple_specs branch July 20, 2024 16:47
@fqaiser94
Copy link
Contributor Author

Really sorry for the delayed review on this @fqaiser94 I see this PR came up in discussion on the kafka commit coordination PR #10351 (comment).

I do remember the context of this PR and I just took another pass, at least from me it looks good. I'll leave it open for a bit in case @nastra @danielcweeks are interested, otherwise I'll merge.

No problems, and thanks all for the reviews 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants