Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add add_columns job #3180

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

westonpace
Copy link
Contributor

My main goal is to create a flexible add_columns job which is distributed. Both the calculation of the new columns can be distributed and the eventual alignment can be distributed. This is done by incurring a penalty that we need to write the data twice.

However, this allows the "adding data" step to be independent of any fragment configuration.

…taset from a source dataset

Add a new "add columns job" which utilizes this to make a distributable add_columns
@github-actions github-actions bot added enhancement New feature or request python labels Nov 26, 2024
@westonpace
Copy link
Contributor Author

Leaving in draft for now as it's not fully complete. Although the tests pass and we could maybe call this phase 1. I'd like to eventually get the align columns task working for row ids and arbitrary join keys.

Main concerns:

  • Inside align columns task we have basically created a distributed version of Updater. Is there a way these two things can be merged to avoid needing two separate implementations?
  • I need to add some version validation (see below)
  • Still need to supply impls for row ids and arbitrary join keys

Validation of read versions is still missing but here is my thoughts:

  • A read version is captured when the add columns job is created
  • When the add phase is finished (and we create the align job) we do a version check
    • If the dataset has been overwritten then fail
    • If the user is using dataset offsets (or anything "address-based") then fail if there has been a compaction (potentially fail if there has been an update but we can allow that)
    • Capture the read version again for the align phase
  • When we commit the align phase we need to be stricter
    • If there has been a compaction then fail regardless of what kind of join column is being used (this might be handled already)

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 88.37209% with 50 lines in your changes missing coverage. Please review.

Project coverage is 78.10%. Comparing base (e6c2343) to head (8c524a9).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/dataset/align.rs 88.64% 33 Missing and 14 partials ⚠️
rust/lance/src/dataset.rs 72.72% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3180      +/-   ##
==========================================
+ Coverage   78.00%   78.10%   +0.09%     
==========================================
  Files         243      244       +1     
  Lines       82889    83091     +202     
  Branches    82889    83091     +202     
==========================================
+ Hits        64660    64897     +237     
+ Misses      15005    14978      -27     
+ Partials     3224     3216       -8     
Flag Coverage Δ
unittests 78.10% <88.37%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants