feat: add add_columns job #3180

westonpace · 2024-11-26T21:15:56Z

My main goal is to create a flexible add_columns job which is distributed. Both the calculation of the new columns can be distributed and the eventual alignment can be distributed. This is done by incurring a penalty that we need to write the data twice.

However, this allows the "adding data" step to be independent of any fragment configuration.

…taset from a source dataset Add a new "add columns job" which utilizes this to make a distributable add_columns

westonpace · 2024-11-26T21:20:22Z

Leaving in draft for now as it's not fully complete. Although the tests pass and we could maybe call this phase 1. I'd like to eventually get the align columns task working for row ids and arbitrary join keys.

Main concerns:

Inside align columns task we have basically created a distributed version of Updater. Is there a way these two things can be merged to avoid needing two separate implementations?
I need to add some version validation (see below)
Still need to supply impls for row ids and arbitrary join keys

Validation of read versions is still missing but here is my thoughts:

A read version is captured when the add columns job is created
When the add phase is finished (and we create the align job) we do a version check
- If the dataset has been overwritten then fail
- If the user is using dataset offsets (or anything "address-based") then fail if there has been a compaction (potentially fail if there has been an update but we can allow that)
- Capture the read version again for the align phase
When we commit the align phase we need to be stricter
- If there has been a compaction then fail regardless of what kind of join column is being used (this might be handled already)

codecov-commenter · 2024-11-26T21:48:22Z

Codecov Report

Attention: Patch coverage is 88.37209% with 50 lines in your changes missing coverage. Please review.

Project coverage is 78.10%. Comparing base (e6c2343) to head (8c524a9).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/align.rs	88.64%	33 Missing and 14 partials ⚠️
rust/lance/src/dataset.rs	72.72%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3180      +/-   ##
==========================================
+ Coverage   78.00%   78.10%   +0.09%     
==========================================
  Files         243      244       +1     
  Lines       82889    83091     +202     
  Branches    82889    83091     +202     
==========================================
+ Hits        64660    64897     +237     
+ Misses      15005    14978      -27     
+ Partials     3224     3216       -8

Flag	Coverage Δ
unittests	`78.10% <88.37%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Adds a job for "aligning fragments" which adds columns to a target da…

8c524a9

…taset from a source dataset Add a new "add columns job" which utilizes this to make a distributable add_columns

github-actions bot added enhancement New feature or request python labels Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add add_columns job #3180

feat: add add_columns job #3180

westonpace commented Nov 26, 2024

westonpace commented Nov 26, 2024

codecov-commenter commented Nov 26, 2024

feat: add add_columns job #3180

Are you sure you want to change the base?

feat: add add_columns job #3180

Conversation

westonpace commented Nov 26, 2024

westonpace commented Nov 26, 2024

codecov-commenter commented Nov 26, 2024

Codecov Report