Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null_count of columns returned by chunked_parquet_reader #13111

Merged
merged 10 commits into from
Apr 13, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Apr 11, 2023

Description

Chunked Parquet reader returns columns with incorrect null counts - the counts are cumulative sums that include all previous chunks.
Root cause is that nesting_decode_cache is not copied back to nesting_decode when gpuDecodePageData returns early, so previously computed null counts are only reset in the cache.
With this PR, we use RAII to make sure cached decode info is always copied back in gpuDecodePageData.
Also fixed column_buffer::empty_like to return zero null count and empty null mask.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Apr 11, 2023
@vuule vuule self-assigned this Apr 11, 2023
@vuule vuule changed the title Fix null_count of columns returned by chunked_parquet_reader Fix null_count of columns returned by chunked_parquet_reader Apr 11, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 11, 2023
@vuule vuule requested a review from nvdbaranec April 11, 2023 00:26
s->nesting_info[thread_depth].valid_count = 0;
s->nesting_info[thread_depth].value_count = 0;
s->nesting_info[thread_depth].null_count = 0;
s->page.nesting_decode[thread_depth].valid_count = 0;
Copy link
Contributor

@nvdbaranec nvdbaranec Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this fix . s->nesting_info points to either the values in the actual pages, or to the decode cache. Up on line 1026 this gets set. So with the original code, this loop will be clearing the correct one. In the case where we're using the cache, the value computed for null_count gets copied back to the data in the pages at the very end here:

if (s->nesting_info == s->nesting_decode_cache) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Then I suspect something is wrong with the back copy. I'll look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your help reworking the fix :)
Please take another look.

@vyasr
Copy link
Contributor

vyasr commented Apr 11, 2023

I can't speak to the logic of the change, but I can confirm that merging these changes makes tests pass on #13104.

@vuule vuule requested a review from nvdbaranec April 11, 2023 21:28
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just the one issue. I'm going to run this on the spark integration tests before giving the thumbs up.

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@vuule vuule marked this pull request as ready for review April 12, 2023 18:26
@vuule vuule requested a review from a team as a code owner April 12, 2023 18:26
@vuule vuule requested a review from PointKernel April 12, 2023 18:26
@nvdbaranec
Copy link
Contributor

Spark integration tests passed.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

if (!t) s->page = *p;
if (!t) {
s->page = *p;
s->nesting_info = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC this is the key change that stops it from looking in the decode cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I didn't update the description.
This null prevents null_count_back_copier from copying nesting_decode_cache to nesting_decode when the nesting info hasn't been set up because of an early exit from setupLocalPageInfo.

@vuule
Copy link
Contributor Author

vuule commented Apr 13, 2023

/merge

@rapids-bot rapids-bot bot merged commit 5764ba5 into rapidsai:branch-23.06 Apr 13, 2023
@vuule vuule deleted the bug-chunk_read_pq-null_count branch April 13, 2023 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants