Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: more economic data skipping with datafusion #2772

Merged
merged 6 commits into from
Aug 14, 2024

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Aug 14, 2024

Description

This PR adresses some inefficiencies when creating stats for file pruning using the PruningPredicate. Rather then generating stats from file actions which we generate ad-hoc from teh underlying arrow data, we now read the stats directly from the raw file data - i.e. avoiding expensive roundtrips through file actions.

To make this work, we needed to include parsing partition values (adding the partitionValues_parsed field to action data) during log replay. As a follow-up we should make more use of the parsed partiton values #2771

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Aug 14, 2024
.ok()?;
results.push(result.record_batch().clone());
}
let batch = concat_batches(results[0].schema_ref(), &results).ok()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it cheaper to concat first and then run it through the Evaluator?

Copy link
Collaborator Author

@roeap roeap Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, at the very least that would give more opportunity for parallelism. in fact we should already concatenate the batches on EagerSnapshot. However the schemas of these batches are not yet normalized, so we cannot concatenate them yet.

For this we need to do either some internal casting/filtering in the log replay, or even better do column selection when reading the checkpoints ...

@roeap roeap enabled auto-merge August 14, 2024 20:09
rtyler
rtyler previously approved these changes Aug 14, 2024
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manually tested, some rustdocs can come later 😄

Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃 i'm sure it's fine 🙃

@rtyler rtyler disabled auto-merge August 14, 2024 20:26
@rtyler rtyler enabled auto-merge August 14, 2024 20:26
@rtyler rtyler added this pull request to the merge queue Aug 14, 2024
Merged via the queue into delta-io:main with commit d3a7967 Aug 14, 2024
18 checks passed
@roeap roeap deleted the feature/data-skipping branch August 14, 2024 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants