Fix issues with parquet chunked reader #12488

nvdbaranec · 2023-01-06T17:51:57Z

Fixes two issues:

A very old issue where we were incorrectly setting the output nesting depth in the PageNestingInfo struct in some cases. Previously, this was benign, but the chunked reader broke because of it. I fixed this and did some variable renaming to make things more clear.
Fixed an issue with an optimization that was added for the chunked reader: We were incorrectly determining when we could early out for a given page during the preprocessing step.

There is an issue as far as generating a test for this PR. The conditions under which the second bug occurs require that values from a given row span multiple pages. I couldn't get the cudf writer to make this happen. The alternative is to use a pre-created file and use the python tests, but the issue there is that we don't expose skip_rows/num_rows through that API. So I have no good way of building a test.

This fix has been vetted against a known real-world failure case.

Fixes #12376

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

- The num_nesting_levels field in the gpu::PageNestingInfo struct wasn't using the right value in all cases. This was benign previously, but the chunked reader fell victim to it. - Fixed an issue with an optimization included in the chunked reader : we weren't properly determining which pages could be ignored during the preprocess step in some cases.

cpp/src/io/parquet/page_data.cu

codecov · 2023-01-06T19:16:28Z

Codecov Report

Base: 86.58% // Head: 85.70% // Decreases project coverage by -0.88% ⚠️

Coverage data is based on head (8df80e7) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.02   #12488      +/-   ##
================================================
- Coverage         86.58%   85.70%   -0.88%     
================================================
  Files               155      155              
  Lines             24368    24865     +497     
================================================
+ Hits              21098    21311     +213     
- Misses             3270     3554     +284

Impacted Files	Coverage Δ
python/cudf/cudf/_version.py	`1.41% <0.00%> (-98.59%)`	⬇️
python/cudf/cudf/core/buffer/spill_manager.py	`72.50% <0.00%> (-7.50%)`	⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py	`91.07% <0.00%> (-1.78%)`	⬇️
python/cudf/cudf/utils/dtypes.py	`77.85% <0.00%> (-1.61%)`	⬇️
python/cudf/cudf/options.py	`86.11% <0.00%> (-1.59%)`	⬇️
python/cudf/cudf/core/single_column_frame.py	`94.30% <0.00%> (-1.27%)`	⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py	`98.38% <0.00%> (-1.01%)`	⬇️
python/dask_cudf/dask_cudf/io/csv.py	`96.34% <0.00%> (-1.00%)`	⬇️
python/dask_cudf/dask_cudf/io/parquet.py	`91.81% <0.00%> (-0.59%)`	⬇️
python/cudf/cudf/core/multiindex.py	`91.66% <0.00%> (-0.51%)`	⬇️
... and 44 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

ttnghia · 2023-01-09T23:07:56Z

Thanks for working on this. In addition to the code, do we have any unit test to catch the bug?

cpp/src/io/parquet/parquet_gpu.hpp

cpp/src/io/parquet/page_data.cu

nvdbaranec · 2023-01-21T00:20:38Z

At the end of the day, it hasn't been possible to get cudf or Spark to generate a useful test file here. Should I just check the "New or existing tests cover this change" box anyway? I'm not sure what else to do.

vuule · 2023-01-21T00:29:02Z

At the end of the day, it hasn't been possible to get cudf or Spark to generate a useful test file here. Should I just check the "New or existing tests cover this change" box anyway? I'm not sure what else to do.

I think it should be left unchecked. The reason for the missing test is documented in the description.

vuule

LGTM
As expected, this one's really in the weeds. It would be best if the other C++ review comes from @ttnghia or @PointKernel.

PointKernel

LGTM

nvdbaranec · 2023-01-23T16:22:07Z

/merge

Fixes an issue with a particular arrangement of page data related to lists. Specifically, it is possible for page `N` to contain "0" rows because the values for the row it is a part of start on page `N-1` and end on page `N+1`. This was defeating logic in the decode kernel that would erroneously cause these values to be skipped. Similar to #12488 this is only reproducible with data out in the wild. In this case, we have a file that we could in theory check in to create a test with, but it is 16 MB so it's fairly large. Looking for feedback on whether this is too big. Authors: - https://github.com/nvdbaranec - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #12698

nvdbaranec added 2 commits January 4, 2023 18:00

Merge branch 'branch-23.02' into nested_page_bug_fix

0c3c1a3

nvdbaranec added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Jan 6, 2023

nvdbaranec requested a review from a team as a code owner January 6, 2023 17:51

nvdbaranec requested review from elstehle and mythrocks January 6, 2023 17:51

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 6, 2023

Merge branch 'branch-23.02' into nested_page_bug_fix

62d9483

ttnghia reviewed Jan 6, 2023

View reviewed changes

cpp/src/io/parquet/page_data.cu Show resolved Hide resolved

nvdbaranec mentioned this pull request Jan 6, 2023

[BUG] The parquet chunked reader can fail for certain list cases. NVIDIA/spark-rapids#7472

Closed

vuule reviewed Jan 9, 2023

View reviewed changes

cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved

cpp/src/io/parquet/page_data.cu Show resolved Hide resolved

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved

Changes from PR comments.

10c90b1

GregoryKimball assigned nvdbaranec Jan 17, 2023

nvdbaranec requested review from ttnghia and vuule January 20, 2023 16:24

vuule requested a review from PointKernel January 20, 2023 22:52

Merge branch 'branch-23.02' into nested_page_bug_fix

8df80e7

vuule approved these changes Jan 21, 2023

View reviewed changes

PointKernel approved these changes Jan 23, 2023

View reviewed changes

rapids-bot bot merged commit 2bfe9e3 into rapidsai:branch-23.02 Jan 23, 2023

nvdbaranec mentioned this pull request Feb 3, 2023

Handle parquet list data corner case #12698

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issues with parquet chunked reader #12488

Fix issues with parquet chunked reader #12488

nvdbaranec commented Jan 6, 2023 •

edited by sameerz

Loading

codecov bot commented Jan 6, 2023 •

edited

Loading

ttnghia commented Jan 9, 2023

nvdbaranec commented Jan 21, 2023

vuule commented Jan 21, 2023

vuule left a comment

PointKernel left a comment

nvdbaranec commented Jan 23, 2023

Fix issues with parquet chunked reader #12488

Fix issues with parquet chunked reader #12488

Conversation

nvdbaranec commented Jan 6, 2023 • edited by sameerz Loading

Checklist

codecov bot commented Jan 6, 2023 • edited Loading

Codecov Report

ttnghia commented Jan 9, 2023

nvdbaranec commented Jan 21, 2023

vuule commented Jan 21, 2023

vuule left a comment

Choose a reason for hiding this comment

PointKernel left a comment

Choose a reason for hiding this comment

nvdbaranec commented Jan 23, 2023

nvdbaranec commented Jan 6, 2023 •

edited by sameerz

Loading

codecov bot commented Jan 6, 2023 •

edited

Loading