Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add change proposal for conversion of deletion vectors #634

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ashvina
Copy link
Contributor

@ashvina ashvina commented Jan 31, 2025

Fixes #631

This is a documentation change only. This document provides a high-level proposal of deletion vector conversion in XTable, focusing on the current implementation of deletion vectors in Delta Lake and Iceberg.

@ashvina ashvina linked an issue Jan 31, 2025 that may be closed by this pull request
2 tasks
@vinishjail97 vinishjail97 added the rfc PR's proposing an RFC label Feb 5, 2025
paths are encoded and referenced in the json files. With each DML operation, the commit log records the removal of the
original data file and its addition with updated details including the newly added deletion vector.

Example:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file

Say some rows are updated in file_a.parquet. Wouldn't this result in one more row, that contains the new values for the deleted rows?
Delta spec doesn't describe this detail as well. Is the order of application determined by the order of actions in the log file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great question!
I haven't looked into the reference implementation that would resolve conflicting actions (add and remove in this case) for the same data file. Maybe the reference implementation requires the delete vectors descriptions to also match if two actions need to be equal.
I agree that the delta spec should clarify this. I'm happy to take a look at it, but it would be best to confirm on the Delta Lake dev list as well.


## Rollout/Adoption Plan

This change does not cause any breaking changes. Before this change, delete vectors were ignored and would cause

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the proposal is to only implement V2 positional deletes. Is that correct? Should that be added here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vigneshc
Thanks for reviewing the proposal!

Iceberg V3 hasn't been released yet. The good news is that its implementation aligns with Delta Lake, but it will be a bit early to dive into the details and implementation for V3 conversions in this RFC. There's also another active thread on modularizing XTable code, with plans to have separate modules for V2 and V3.

I hope I got your question right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ashvin. I was asking to explicitly add that the rfc currently only covers V2 spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC is recommending a generic model that allows us to evolve alongside the different table formats. The V2 spec is mainly in the RFC to illustrate why this is the recommended approach, so I don't know if we need to explicitly call this out in the RFC. We can add the versions support in our docs after we get a first iteration completed.

@ashvina ashvina closed this Feb 19, 2025
@ashvina ashvina reopened this Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc PR's proposing an RFC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add RFC for deletion vector conversion support in XTable
4 participants