Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Ensure data read order reflects commit sequence in Iceberg tables #6341

Merged

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Nov 7, 2024

Before this change:
Data files were read in the order they appeared in the snapshot, which sometimes led to new data appearing at the top of the table instead of at the bottom.

This PR:

  • Fixes the ordering of data files to reflect the commit sequence, ensuring that newer data appears at the end of the table, as expected.
  • Also, allows comparison between different TableLocationKey implementations using fully qualified class names as a falback.

@malhotrashivam malhotrashivam added this to the 0.37.0 milestone Nov 7, 2024
@malhotrashivam malhotrashivam self-assigned this Nov 7, 2024
Copy link
Contributor

@lbooker42 lbooker42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small thing left.

devinrsmith
devinrsmith previously approved these changes Nov 11, 2024
return uri.compareTo(otherTyped.uri);
}
// When comparing with non-iceberg location key, we want to compare both partitions and URI
return super.compareTo(other);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good first step, but I'm realizing we have some trouble here.
We may have neglected this, but TLKs really must be comparable to all TLKs, not just TLKs of the same provenance.
We need fallback comparisons.

We have two paths forward:

  1. Keep behavior as-is, and defer to a new issue: Go back to throwing an exception on this line for now. Create an issue to define an ordering.
  2. Fix the problem now.

My proposal for fixing, whether we do it here or in a new issue:

  1. Make unrelated implementations compare to one another based on fully-qualified class name. This imposes an arbitrary but consistent ordering, and should roughly keep distinct sources together.
  2. Make PartitionedTLK use order + partitions, followed by classname for tie-breaking.
  3. Make URITLK order + partitions, then break ties by URI (if the other is a URITLK), and lastly break ties by classname.
  4. Make IcebergTLK compare to other IcbergTLKs by Catalog name and TableIdentifier (which should also be added to equals and hashcode), then as currently defined in this PR, then to other URITLKs as defined by the URITLK rules, and so on.

Might want opinions from @abaranec and @devinrsmith

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Comment on lines 99 to 100
return new IcebergTableParquetLocationKey(tableUuid, catalogName, tableIdentifier, manifestFile, dataFile,
fileUri, 0, partitions, parquetInstructions, channelsProvider);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sure is a lot of constructor args. Sorry. Builder? (Probably not worth it for something called in only once place.)

rcaudy
rcaudy previously approved these changes Nov 18, 2024
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy if @abaranec and @devinrsmith are.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

rcaudy
rcaudy previously approved these changes Nov 19, 2024
@malhotrashivam malhotrashivam merged commit 6127a39 into deephaven:main Nov 20, 2024
18 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants