Fix null_count of columns returned by chunked_parquet_reader #13111
Conversation
cpp/src/io/parquet/page_data.cu
Outdated
s->nesting_info[thread_depth].valid_count = 0;
s->nesting_info[thread_depth].value_count = 0;
s->nesting_info[thread_depth].null_count = 0;
s->page.nesting_decode[thread_depth].valid_count = 0;
I'm not sure I understand this fix. s->nesting_info points either to the values in the actual pages or to the decode cache. Up on line 1026 this gets set, so with the original code this loop will be clearing the correct one. In the case where we're using the cache, the value computed for null_count gets copied back to the data in the pages at the very end here:
cudf/cpp/src/io/parquet/page_data.cu, line 2070 in 5638d44:
if (s->nesting_info == s->nesting_decode_cache) {
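The cache-then-copy-back pattern being discussed can be illustrated with a simplified sketch. The struct names mirror the conversation, but these are hypothetical stand-ins, not cudf's actual definitions:

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the page nesting structures.
struct NestingInfo { int valid_count; int value_count; int null_count; };

constexpr int kMaxDepth = 4;

struct PageState {
  NestingInfo nesting_decode[kMaxDepth];        // authoritative per-page data
  NestingInfo nesting_decode_cache[kMaxDepth];  // fast scratch copy
  NestingInfo* nesting_info;                    // points at one of the two
};

// Setup chooses whether to decode against the cache (as on the line-1026
// setup mentioned above); the reset loop then clears whichever buffer
// nesting_info points to, so either choice starts from zero.
void decode_page(PageState* s, bool use_cache, int nulls_seen) {
  s->nesting_info = use_cache ? s->nesting_decode_cache : s->nesting_decode;
  for (int d = 0; d < kMaxDepth; ++d) {
    s->nesting_info[d].null_count = 0;
  }
  s->nesting_info[0].null_count += nulls_seen;

  // The back-copy at the very end: if we worked in the cache, the computed
  // null_count must be written back to the real page data.
  if (s->nesting_info == s->nesting_decode_cache) {
    for (int d = 0; d < kMaxDepth; ++d) {
      s->nesting_decode[d].null_count = s->nesting_decode_cache[d].null_count;
    }
  }
}
```

The bug under discussion arises when decoding exits before reaching that final back-copy, leaving the reset visible only in the cache.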
Thanks. Then I suspect something is wrong with the back copy. I'll look into it.
Thank you for your help reworking the fix :)
Please take another look.
I can't speak to the logic of the change, but I can confirm that merging these changes makes tests pass on #13104.
Looks good. Just the one issue. I'm going to run this on the spark integration tests before giving the thumbs up.
Spark integration tests passed.
LGTM!
-  if (!t) s->page = *p;
+  if (!t) {
+    s->page = *p;
+    s->nesting_info = nullptr;
So IIUC this is the key change that stops it from looking in the decode cache?
Oops, I didn't update the description. This null prevents null_count_back_copier from copying nesting_decode_cache to nesting_decode when the nesting info hasn't been set up because of an early exit from setupLocalPageInfo.
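The RAII back-copier plus the nullptr guard can be sketched as follows. This is a hypothetical, simplified illustration of the mechanism described above, not cudf's actual implementation; the names mirror the PR but the bodies are invented:

```cpp
#include <cassert>

// Simplified stand-ins for the page nesting structures.
struct NestingInfo { int null_count; };
constexpr int kMaxDepth = 4;

struct PageState {
  NestingInfo nesting_decode[kMaxDepth];
  NestingInfo nesting_decode_cache[kMaxDepth];
  NestingInfo* nesting_info = nullptr;  // null until setup succeeds
};

// RAII guard: on destruction, copy cached null counts back to the page
// data -- but only if the nesting info was actually set up to use the cache.
struct null_count_back_copier {
  PageState* s;
  ~null_count_back_copier() {
    if (s->nesting_info != nullptr &&
        s->nesting_info == s->nesting_decode_cache) {
      for (int d = 0; d < kMaxDepth; ++d) {
        s->nesting_decode[d].null_count =
            s->nesting_decode_cache[d].null_count;
      }
    }
  }
};

bool decode(PageState* s, bool setup_fails, int nulls) {
  null_count_back_copier guard{s};  // destructor runs on every return path
  s->nesting_info = nullptr;        // the key change: clear before setup
  if (setup_fails) return false;    // early exit: guard copies nothing
  s->nesting_info = s->nesting_decode_cache;
  s->nesting_info[0].null_count = nulls;
  return true;                      // normal exit: guard copies back
}
```

Because the guard's destructor runs on every exit path, an early return from setup can no longer leave stale counts in the page data, and the nullptr check keeps it from copying an uninitialized cache.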
/merge
Description

The chunked Parquet reader returns columns with incorrect null counts: the counts are cumulative sums that include all previous chunks.

The root cause is that nesting_decode_cache is not copied back to nesting_decode when gpuDecodePageData returns early, so previously computed null counts are only reset in the cache. With this PR, we use RAII to make sure cached decode info is always copied back in gpuDecodePageData.

Also fixed column_buffer::empty_like to return a zero null count and an empty null mask.

Checklist