Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with parquet chunked reader #12488

Merged
merged 5 commits into from
Jan 23, 2023

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Jan 6, 2023

Fixes two issues:

  • A very old issue where we were incorrectly setting the output nesting depth in the PageNestingInfo struct in some cases. Previously, this was benign, but the chunked reader broke because of it. I fixed this and did some variable renaming to make things more clear.

  • Fixed an issue with an optimization that was added for the chunked reader: We were incorrectly determining when we could early out for a given page during the preprocessing step.

There is an issue as far as generating a test for this PR. The conditions under which the second bug occurs require that values from a given row span multiple pages. I couldn't get the cudf writer to make this happen. The alternative is to use a pre-created file and use the python tests, but the issue there is that we don't expose skip_rows/num_rows through that API. So I have no good way of building a test.

This fix has been vetted against a known real-world failure case.

Fixes #12376

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

- The num_nesting_levels field in the gpu::PageNestingInfo struct wasn't using the right value in all cases. This was
benign previously, but the chunked reader fell victim to it.
- Fixed an issue with an optimization included in the chunked reader : we weren't properly determining which pages
could be ignored during the preprocess step in some cases.
@nvdbaranec nvdbaranec added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Jan 6, 2023
@nvdbaranec nvdbaranec requested a review from a team as a code owner January 6, 2023 17:51
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 6, 2023
@codecov
Copy link

codecov bot commented Jan 6, 2023

Codecov Report

Base: 86.58% // Head: 85.70% // Decreases project coverage by -0.88% ⚠️

Coverage data is based on head (8df80e7) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02   #12488      +/-   ##
================================================
- Coverage         86.58%   85.70%   -0.88%     
================================================
  Files               155      155              
  Lines             24368    24865     +497     
================================================
+ Hits              21098    21311     +213     
- Misses             3270     3554     +284     
Impacted Files Coverage Δ
python/cudf/cudf/_version.py 1.41% <0.00%> (-98.59%) ⬇️
python/cudf/cudf/core/buffer/spill_manager.py 72.50% <0.00%> (-7.50%) ⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py 91.07% <0.00%> (-1.78%) ⬇️
python/cudf/cudf/utils/dtypes.py 77.85% <0.00%> (-1.61%) ⬇️
python/cudf/cudf/options.py 86.11% <0.00%> (-1.59%) ⬇️
python/cudf/cudf/core/single_column_frame.py 94.30% <0.00%> (-1.27%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 98.38% <0.00%> (-1.01%) ⬇️
python/dask_cudf/dask_cudf/io/csv.py 96.34% <0.00%> (-1.00%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.81% <0.00%> (-0.59%) ⬇️
python/cudf/cudf/core/multiindex.py 91.66% <0.00%> (-0.51%) ⬇️
... and 44 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ttnghia
Copy link
Contributor

ttnghia commented Jan 9, 2023

Thanks for working on this. In addition to the code, do we have any unit test to catch the bug?

cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_data.cu Show resolved Hide resolved
cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@nvdbaranec
Copy link
Contributor Author

At the end of the day, it hasn't been possible to get cudf or Spark to generate a useful test file here. Should I just check the "New or existing tests cover this change" box anyway? I'm not sure what else to do.

@vuule
Copy link
Contributor

vuule commented Jan 21, 2023

At the end of the day, it hasn't been possible to get cudf or Spark to generate a useful test file here. Should I just check the "New or existing tests cover this change" box anyway? I'm not sure what else to do.

I think it should be left unchecked. The reason for the missing test is documented in the description.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
As expected, this one's really in the weeds. It would be best if the other C++ review comes from @ttnghia or @PointKernel.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nvdbaranec
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2bfe9e3 into rapidsai:branch-23.02 Jan 23, 2023
rapids-bot bot pushed a commit that referenced this pull request Feb 8, 2023
Fixes an issue with a particular arrangement of page data related to lists.  Specifically, it is possible for page `N` to contain "0" rows because the values for the row it is a part of start on page `N-1` and end on page `N+1`.  This was defeating logic in the decode kernel that would erroneously cause these values to be skipped.

Similar to #12488 this is only reproducible with data out in the wild.  In this case, we have a file that we could in theory check in to create a test with, but it is 16 MB so it's fairly large.  Looking for feedback on whether this is too big.

Authors:
  - https://github.com/nvdbaranec
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12698
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] ParquetChunkedReaderTest.TestChunkedReadSimpleData device-side assert with libcudf compiled with Debug
4 participants