Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add offset pushdown to parquet #3848

Merged
merged 1 commit into from
Mar 14, 2023
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

A follow up to #3633 that also allows pushing down an offset

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 12, 2023
@tustvold tustvold requested a review from thinkharderdev March 12, 2023 17:45
Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +407 to +408
selectors.push(RowSelector::skip(skipped_count + offset));
selectors.push(RowSelector::select(selected_count - offset));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip and select cannot be interleaved? i.e. all skip selections are in front of select selection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operator is skipping the first offset rows, so will always start with a skip, the remaining selection afterwards can be interleaved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I meant the way you calculate skpped_count and selected_count. If original selectors are mixed with skip and select, is the logic still correct?

E.g., 1st selector skip 5 row, 2nd selector select 1 row, 3rd selector skip 2 rows, 4 selector select 5 rows.

With offset 2, this produces 1st selector skip 7 + 2 rows, 2nd selector select 6 - 2 rows. But original 2nd selector's 1 row is skipped now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the correct behaviour no, the offset is an offset into the selected rows? So in this case you would expect the last 4 rows, skipping the first 2 that were selected originally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's correct as skipped rows only count until selected count is larger than offset.

Comment on lines +407 to +408
selectors.push(RowSelector::skip(skipped_count + offset));
selectors.push(RowSelector::select(selected_count - offset));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's correct as skipped rows only count until selected count is larger than offset.

@tustvold tustvold merged commit dfb8c76 into apache:master Mar 14, 2023
@ursabot
Copy link

ursabot commented Mar 14, 2023

Benchmark runs are scheduled for baseline = c156715 and contender = dfb8c76. dfb8c76 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

metesynnada pushed a commit to synnada-ai/arrow-rs that referenced this pull request Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants