-
Notifications
You must be signed in to change notification settings - Fork 866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in page skipping #2552
Fix bug in page skipping #2552
Conversation
arrow/src/json/reader.rs
Outdated
@@ -2623,7 +2623,7 @@ mod tests { | |||
let re = builder.build(Cursor::new(json_content)); | |||
assert_eq!( | |||
re.err().unwrap().to_string(), | |||
r#"Json error: Expected JSON record to be an object, found Array [Number(1), String("hello")]"#, | |||
r#"Json error: Expected JSON record to be an object, found Array([Number(1), String("hello")])"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this is about but this test was failing for me so I fixed it
// Check to see if the column is exhausted. Only peek the next page since in | ||
// case we are reading to a page boundary and do not actually need to read | ||
// the next page. | ||
let end_of_column = !self.column_reader.as_mut().unwrap().peek_next()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should ultimately get cleaned up a bit. It is confusing since we need to peek_next
but also call has_next
below (since the next page needs to get loaded). It feels like the page-level logic wants to be encapsulated inside GenericColumnReader
but it inevitable leaks out in places like this.
Not a fully formed thought, but perhaps we could avoid the has_next check entirely by breaking if reading the next batch returns fewer than the requested number of records? |
// Skip first page | ||
RowSelector::skip(10), | ||
// Multiple selects in same page | ||
RowSelector::select(3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I run this test without this pr change, still passed🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this test is actually unrelated to the fix. I originally thought the bug was in the scan_ranges
so I added this test. But it turned out to be unrelated and I figured I might as well leave another unit test in. Sorry for the confusion :)
Was testing with this fix and ran into another bug along similar lines. Not sure what is causing it just yet but I think we need to take a more holistic look at this. |
d58dfa3
to
b2d6da2
Compare
Sorry guys, got a bit distracted with other things. I think the latest commit should fix the last bug I could find. I've run this against some fairly large datasets and not seen any more issues so I think it should be good to go. |
Benchmark runs are scheduled for baseline = 30ab9bb and contender = dc4ccf8. dc4ccf8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This appears to be failing on master - https://github.com/apache/arrow-rs/runs/8193393646?check_suite_focus=true could you possibly take a look? |
Will do |
Which issue does this PR close?
Closes #2543
Rationale for this change
Discovered some small bugs while testing the page pruning logic in
ParquetRecordBatchStream
. We were improperly reading data pages when not necessary in a few cases. When we have only fetched the pages required for a given row selection this causes an error.What changes are included in this PR?
We had a method
GenericColumnReader::has_next
which will check if the current page is fully decoded and then load the next page (if it exists). This method was used in a few places to test whether the column was exhausted. This can cause issues since it will load the next page, which may be skipped according to the current row selection.Added a method
GenericColumnReader::has_next
which will test whether the column is exhausted without loading the next page and changed a few places to use that.I think a slightly more thorough refactor is not a bad idea since the API is a bit confusing as is, but this should solve the immediate bugs.
Are there any user-facing changes?
No