Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in page skipping #2552

Merged
merged 4 commits into from
Sep 5, 2022
Merged

Conversation

thinkharderdev
Copy link
Contributor

Which issue does this PR close?

Closes #2543

Rationale for this change

Discovered some small bugs while testing the page pruning logic in ParquetRecordBatchStream. We were improperly reading data pages when not necessary in a few cases. When we have only fetched the pages required for a given row selection this causes an error.

What changes are included in this PR?

We had a method GenericColumnReader::has_next which will check if the current page is fully decoded and then load the next page (if it exists). This method was used in a few places to test whether the column was exhausted. This can cause issues since it will load the next page, which may be skipped according to the current row selection.

Added a method GenericColumnReader::has_next which will test whether the column is exhausted without loading the next page and changed a few places to use that.

I think a slightly more thorough refactor is not a bad idea since the API is a bit confusing as is, but this should solve the immediate bugs.

Are there any user-facing changes?

No

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Aug 22, 2022
@@ -2623,7 +2623,7 @@ mod tests {
let re = builder.build(Cursor::new(json_content));
assert_eq!(
re.err().unwrap().to_string(),
r#"Json error: Expected JSON record to be an object, found Array [Number(1), String("hello")]"#,
r#"Json error: Expected JSON record to be an object, found Array([Number(1), String("hello")])"#,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this is about but this test was failing for me so I fixed it

// Check to see if the column is exhausted. Only peek the next page since in
// case we are reading to a page boundary and do not actually need to read
// the next page.
let end_of_column = !self.column_reader.as_mut().unwrap().peek_next()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should ultimately get cleaned up a bit. It is confusing since we need to peek_next but also call has_next below (since the next page needs to get loaded). It feels like the page-level logic wants to be encapsulated inside GenericColumnReader but it inevitable leaks out in places like this.

@thinkharderdev
Copy link
Contributor Author

@Ted-Jiang @tustvold

@github-actions github-actions bot removed the arrow Changes to the arrow crate label Aug 22, 2022
@tustvold
Copy link
Contributor

tustvold commented Aug 22, 2022

Not a fully formed thought, but perhaps we could avoid the has_next check entirely by breaking if reading the next batch returns fewer than the requested number of records?

// Skip first page
RowSelector::skip(10),
// Multiple selects in same page
RowSelector::select(3),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run this test without this pr change, still passed🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this test is actually unrelated to the fix. I originally thought the bug was in the scan_ranges so I added this test. But it turned out to be unrelated and I figured I might as well leave another unit test in. Sorry for the confusion :)

@thinkharderdev
Copy link
Contributor Author

Was testing with this fix and ran into another bug along similar lines. Not sure what is causing it just yet but I think we need to take a more holistic look at this.

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
@thinkharderdev
Copy link
Contributor Author

Was testing with this fix and ran into another bug along similar lines. Not sure what is causing it just yet but I think we need to take a more holistic look at this.

Sorry guys, got a bit distracted with other things. I think the latest commit should fix the last bug I could find. I've run this against some fairly large datasets and not seen any more issues so I think it should be good to go.

@tustvold tustvold merged commit dc4ccf8 into apache:master Sep 5, 2022
@ursabot
Copy link

ursabot commented Sep 5, 2022

Benchmark runs are scheduled for baseline = 30ab9bb and contender = dc4ccf8. dc4ccf8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold
Copy link
Contributor

tustvold commented Sep 5, 2022

This appears to be failing on master - https://github.com/apache/arrow-rs/runs/8193393646?check_suite_focus=true could you possibly take a look?

@thinkharderdev
Copy link
Contributor Author

This appears to be failing on master - https://github.com/apache/arrow-rs/runs/8193393646?check_suite_focus=true could you possibly take a look?

Will do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic when first data page is skipped using ColumnChunkData::Sparse
4 participants