-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add change proposal for conversion of deletion vectors #634
base: main
Are you sure you want to change the base?
Conversation
paths are encoded and referenced in the json files. With each DML operation, the commit log records the removal of the | ||
original data file and its addition with updated details including the newly added deletion vector. | ||
|
||
Example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file
Say some rows are updated in file_a.parquet
. Wouldn't this result in one more row, that contains the new values for the deleted rows?
Delta spec doesn't describe this detail as well. Is the order of application determined by the order of actions in the log file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great question!
I haven't looked into the reference implementation that would resolve conflicting actions (add and remove in this case) for the same data file. Maybe the reference implementation requires the delete vectors descriptions to also match if two actions need to be equal.
I agree that the delta spec should clarify this. I'm happy to take a look at it, but it would be best to confirm on the Delta Lake dev list as well.
|
||
## Rollout/Adoption Plan | ||
|
||
This change does not cause any breaking changes. Before this change, delete vectors were ignored and would cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the proposal is to only implement V2 positional deletes. Is that correct? Should that be added here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @vigneshc
Thanks for reviewing the proposal!
Iceberg V3 hasn't been released yet. The good news is that its implementation aligns with Delta Lake, but it will be a bit early to dive into the details and implementation for V3 conversions in this RFC. There's also another active thread on modularizing XTable code, with plans to have separate modules for V2 and V3.
I hope I got your question right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ashvin. I was asking to explicitly add that the rfc currently only covers V2 spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RFC is recommending a generic model that allows us to evolve alongside the different table formats. The V2 spec is mainly in the RFC to illustrate why this is the recommended approach, so I don't know if we need to explicitly call this out in the RFC. We can add the versions support in our docs after we get a first iteration completed.
Fixes #631
This is a documentation change only. This document provides a high-level proposal of deletion vector conversion in XTable, focusing on the current implementation of deletion vectors in Delta Lake and Iceberg.