Part 2: propagate transform in visit_scan_files #612

Open
wants to merge 8 commits into
base: main

Collaborator

@nicklan nicklan commented Dec 20, 2024

Propagate the computed transforms from #607 through calls to visit_scan_files.

How was this change tested?

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 60.41667% with 19 lines in your changes missing coverage. Please review.

Project coverage is 84.02%. Comparing base (d999b5c) to head (779a662).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ffi/src/scan.rs | 0.00% | 16 Missing ⚠️ |
| kernel/src/scan/mod.rs | 88.46% | 0 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #612      +/-   ##
==========================================
+ Coverage   83.92%   84.02%   +0.09%     
==========================================
  Files          75       76       +1     
  Lines       17277    17565     +288     
  Branches    17277    17565     +288     
==========================================
+ Hits        14500    14759     +259     
- Misses       2078     2091      +13     
- Partials      699      715      +16     

Collaborator

@scovich scovich left a comment

Just double checking -- this PR makes almost no functional changes (other than printing the transforms)? It's mostly wiring to prepare for the next PR?

ffi/src/scan.rs Outdated
}

// #[no_mangle]
// /// allow probing into a CStringMap. If the specified key is in the map, kernel will call
Collaborator

Suggested change
- // /// allow probing into a CStringMap. If the specified key is in the map, kernel will call
+ // /// allow probing into a CTransformMap. If the specified key is in the map, kernel will call

(Though this whole chunk of code should probably just be moved to the later FFI PR?)

@nicklan nicklan force-pushed the part-2-propogate-transform-in-visit-scan-files branch from ff5d5fe to e88db7e Compare January 10, 2025 01:23
Collaborator Author

nicklan commented Jan 10, 2025

> Just double checking -- this PR makes almost no functional changes (other than printing the transforms)? It's mostly wiring to prepare for the next PR?

Yep, that's right. Just keeping things in logical chunks as much as possible. Hopefully that makes the reviews easier.

@zachschuermann zachschuermann changed the title Part 2: propogate transform in visit_scan_files Part 2: propagate transform in visit_scan_files Jan 15, 2025
nicklan added a commit that referenced this pull request Jan 23, 2025
…and return it. (#607)

## What changes are proposed in this pull request?

This is the first step in moving to expressions as the way to describe
transformations when reading data. This PR:
- Computes a "static" transform: a set of entries, each of which is
either a column expression that is passed through unchanged or enough
metadata for lower levels to fill in a "fixup" expression
- Passes the static transform into the iterator that parses each
`Add` file
- When parsing an `Add` file that needs fix-ups (only partition
columns today), creates the correct expression and inserts it into a
row-indexed map
- Returns this map so the caller can find out, for a given row, which
expression (if any) needs to be applied when reading the specified row
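
As a rough illustration of the steps listed above, here is a minimal sketch. Everything in it is hypothetical and heavily simplified: `Expr`, `TransformEntry`, `file_transform`, and `row_transforms` are stand-ins invented for this example and are not the kernel's actual types or functions. It assumes a fix-up simply substitutes a file's partition value as a literal.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, much-simplified stand-in for the kernel's expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(Option<String>),
    Struct(Vec<Expr>),
}
type ExprRef = Arc<Expr>;

/// One entry of the "static" transform (assumed shape, for illustration only).
#[derive(Debug, Clone)]
enum TransformEntry {
    /// Physical column that is passed through unchanged.
    PassThrough(String),
    /// Partition column: only the name is known up front; the value is per file.
    Partition(String),
}

/// Build the expression for one `Add` file, or `None` if no fix-up is needed.
fn file_transform(
    static_transform: &[TransformEntry],
    partition_values: &HashMap<String, String>,
) -> Option<ExprRef> {
    let needs_fixup = static_transform
        .iter()
        .any(|entry| matches!(entry, TransformEntry::Partition(_)));
    if !needs_fixup {
        return None;
    }
    let fields: Vec<Expr> = static_transform
        .iter()
        .map(|entry| match entry {
            TransformEntry::PassThrough(name) => Expr::Column(name.clone()),
            // A missing partition value becomes a null literal in this sketch.
            TransformEntry::Partition(name) => {
                Expr::Literal(partition_values.get(name).cloned())
            }
        })
        .collect();
    Some(Arc::new(Expr::Struct(fields)))
}

/// Collect per-row expressions into a row-indexed map. `add_rows` pairs the
/// row index of each `Add` action with that file's partition values.
fn row_transforms(
    static_transform: &[TransformEntry],
    add_rows: &[(usize, HashMap<String, String>)],
) -> HashMap<usize, ExprRef> {
    let mut map = HashMap::new();
    for (row, partition_values) in add_rows {
        if let Some(expr) = file_transform(static_transform, partition_values) {
            map.insert(*row, expr);
        }
    }
    map
}
```

A caller would then look up a row index in the returned map; a missing entry means no expression needs to be applied for that row.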

Follow-up PRs:
* #612: Propagate this information through when using `visit_scan_files`
* #613: Actually use the data to do transformation and remove
`transform_to_logical` entirely
* #614: Make this work over FFI and use it
* (TODO): Clean up any existing code in the scan building that is now
overcomplicated

Each of those is more invasive and ends up touching significant code, so
I'm staging this as much as possible to make reviews easier.


## How was this change tested?

Unit tests, and inspection of the resulting expressions when run on tables.
@nicklan nicklan force-pushed the part-2-propogate-transform-in-visit-scan-files branch from da29cc7 to 18b29db Compare January 23, 2025 00:43
@nicklan nicklan marked this pull request as ready for review January 23, 2025 00:44
@nicklan nicklan force-pushed the part-2-propogate-transform-in-visit-scan-files branch from 18b29db to de5fd07 Compare January 23, 2025 21:48
@@ -213,7 +220,10 @@ impl<T> RowVisitor for ScanFileVisitor<'_, T> {
mod tests {
use std::collections::HashMap;

use crate::scan::test_utils::{add_batch_simple, run_with_validate_callback};
use crate::{
Collaborator

nit: maybe flatten these imports?

Comment on lines 314 to 318
if row < transforms.len() {
    transforms[row].clone()
} else {
    None
}
Collaborator

Suggested change
- if row < transforms.len() {
-     transforms[row].clone()
- } else {
-     None
- }
+ transforms.get(row).cloned().flatten()
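
To see why the suggested one-liner behaves the same as the explicit bounds check, here is a small self-contained sketch. `ExprRef` is a hypothetical stand-in (an `Arc<String>`) for the kernel's `ExpressionRef`, and both function names are made up for this comparison.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the kernel's `ExpressionRef`.
type ExprRef = Arc<String>;

/// Original form: explicit bounds check, then clone the slot.
fn get_transform_checked(transforms: &[Option<ExprRef>], row: usize) -> Option<ExprRef> {
    if row < transforms.len() {
        transforms[row].clone()
    } else {
        None
    }
}

/// Suggested form, step by step:
/// - `get(row)`  yields `Option<&Option<ExprRef>>` (`None` when out of bounds)
/// - `cloned()`  yields `Option<Option<ExprRef>>`
/// - `flatten()` yields `Option<ExprRef>`
fn get_transform_chained(transforms: &[Option<ExprRef>], row: usize) -> Option<ExprRef> {
    transforms.get(row).cloned().flatten()
}
```

Both versions return `None` for an out-of-bounds row and for an in-bounds row whose slot is `None`, so a transform vector shorter than the data is handled without panicking.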

Collaborator

It looks like this only has a single call site. Now that it has reduced to a one-liner, should we just inline the code directly where it's used? Or do we anticipate other uses coming soon?

Collaborator Author

Note it's pub. My thinking was that, much like the issues with deletion vectors being shorter than the actual number of rows, this is a case where it's easy to mess up if the transform vec is shorter than the data, so let's have a nice interface that makes it easy. Given that it's now a one-liner, maybe we don't need it though. Thoughts?

@@ -398,5 +407,12 @@ pub unsafe extern "C" fn visit_scan_data(
callback,
};
// TODO: return ExternResult to caller instead of panicking?
Collaborator

Seems pretty easy to address this TODO? Maybe we can go ahead and do that.

Collaborator Author

That will require changes in the C code, which I was mostly trying to avoid here. I added a reminder for myself here for when I do the final fixup that requires lots of C changes.

@@ -138,12 +141,14 @@ pub type ScanCallback<T> = fn(
pub fn visit_scan_files<T>(
    data: &dyn EngineData,
    selection_vector: &[bool],
    transforms: &Vec<Option<ExpressionRef>>,
Collaborator

Any reason for &Vec<Option<ExpressionRef>> instead of &[Option<ExpressionRef>]?

Collaborator Author

Nope, good catch. Usually clippy gets these, but it didn't this time :)
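
For context on why the slice form is usually preferred (it is what clippy's `ptr_arg` lint normally flags), here is a small sketch. `ExprRef` and `count_transforms` are hypothetical names used only for this illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the kernel's `ExpressionRef`.
type ExprRef = Arc<String>;

/// Taking a slice lets callers pass a `&Vec<_>`, an array, or a sub-slice,
/// whereas a `&Vec<_>` parameter would force every caller to own a `Vec`.
fn count_transforms(transforms: &[Option<ExprRef>]) -> usize {
    transforms.iter().filter(|t| t.is_some()).count()
}
```

A `&Vec<Option<ExprRef>>` coerces to `&[Option<ExprRef>]` automatically at the call site, so existing callers that hold a `Vec` keep compiling after the signature change.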

@@ -355,6 +362,7 @@ fn rust_callback(
    size: i64,
    kernel_stats: Option<delta_kernel::scan::state::Stats>,
    dv_info: DvInfo,
    _transform: Option<ExpressionRef>,
Collaborator

Any particular reason not to update the callback in this PR as well, so we can pass this on to the engine?

Collaborator Author

Well, mostly I was trying to keep it separated. This PR was focused on getting visit_scan_files to pass things along. If I update the callback here, I have to update all the C code too, which brings a bunch into this PR, and I thought it was better to keep all the C changes to part 4.
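
To make "getting visit_scan_files to pass things along" concrete, here is a rough sketch of a visitor that hands each selected row's transform to a callback. It is not the kernel's implementation: `ExprRef`, `ScanCallbackSketch`, `visit_scan_files_sketch`, and `collect_file` are hypothetical, the file list is reduced to a slice of paths, and treating rows past the end of the selection vector as selected is an assumption of this sketch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the kernel's `ExpressionRef`.
type ExprRef = Arc<String>;

/// Simplified callback: receives the context, a file path, and the transform
/// (if any) that applies to that file.
type ScanCallbackSketch<T> = fn(context: &mut T, path: &str, transform: Option<ExprRef>);

/// Visit every selected row and pass its transform through to `callback`.
fn visit_scan_files_sketch<T>(
    paths: &[&str],
    selection_vector: &[bool],
    transforms: &[Option<ExprRef>],
    mut context: T,
    callback: ScanCallbackSketch<T>,
) -> T {
    for (row, &path) in paths.iter().enumerate() {
        // Assumption: rows beyond the selection vector count as selected.
        let selected = selection_vector.get(row).copied().unwrap_or(true);
        if !selected {
            continue;
        }
        // A transform vector shorter than the data simply yields `None`.
        let transform = transforms.get(row).cloned().flatten();
        callback(&mut context, path, transform);
    }
    context
}

/// Example callback that records what it was given.
fn collect_file(
    context: &mut Vec<(String, Option<ExprRef>)>,
    path: &str,
    transform: Option<ExprRef>,
) {
    context.push((path.to_string(), transform));
}
```

A callback that does not need the transform yet can ignore the argument, as `rust_callback` does with its `_transform` parameter in the hunk above.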

@@ -281,7 +284,7 @@ pub struct CStringMap {
/// # Safety
///
/// The engine is responsible for providing a valid [`CStringMap`] pointer and [`KernelStringSlice`]
- pub unsafe extern "C" fn get_from_map(
+ pub unsafe extern "C" fn get_from_string_map(
Collaborator

Seems like a reasonable name change, but why this PR?
(also -- do we anticipate exposing other map types through FFI in the future?)

Collaborator Author

I think I mostly just noticed it back when we might have had a "transform map" (which we no longer will have), and thought it was a good change. I can move it to another PR, but it does seem to make more sense this way. TBD on other types, but I imagine eventually we'll find something :)

Labels
breaking-change Change that will require a version bump
3 participants