Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing #11752

Merged

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Sep 23, 2022

Fixes an issue where using user bounds with parquet files containing both nested and non-nested types could result in incorrect row counts for the non-nested columns. Originally reported by @etseidl

The nature of the fix also implements a longstanding desired optimization: when running the preprocess step for nested types, ignore pages for non-nested hierarchies. This can result in significant speedups for files containing only a few nested columns.

The tests added for this PR seem to tease a bug in the parquet writer into happening (#11748) so I will leave this as a draft until that issue is resolved.

…taining a mix of nested and non-nested types would

result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types
do not end up getting visited by the kernel.
@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Sep 23, 2022
@nvdbaranec nvdbaranec requested a review from a team as a code owner September 23, 2022 16:06
@nvdbaranec nvdbaranec marked this pull request as draft September 23, 2022 16:07
@nvdbaranec nvdbaranec added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 23, 2022
…ists. Fixed an additional issue in the decoding where flat column types underneath

structs could end up ignoring skip_rows/num_rows.
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. Python Affects Python cuDF API. labels Sep 27, 2022
@nvdbaranec nvdbaranec marked this pull request as ready for review September 27, 2022 22:52
@nvdbaranec nvdbaranec requested review from a team as code owners September 27, 2022 22:52
@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 27, 2022
@ttnghia ttnghia changed the base branch from branch-22.10 to branch-22.12 September 27, 2022 23:13
@ttnghia ttnghia changed the base branch from branch-22.12 to branch-22.10 September 27, 2022 23:13
@ttnghia
Copy link
Contributor

ttnghia commented Sep 27, 2022

Wrongly switched branch. You may need to reverse changes from branch 22.12 and up merge with 22.10 instead.

@nvdbaranec nvdbaranec changed the base branch from branch-22.10 to branch-22.12 September 28, 2022 14:55
@nvdbaranec nvdbaranec removed request for a team, mroeschke and charlesbluca September 28, 2022 14:57
@nvdbaranec nvdbaranec removed gpuCI Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. labels Sep 28, 2022
@nvdbaranec
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is more understandable now. Thanks!

Comment on lines +1864 to +1865
// it would be better/safer to be checking (schema.max_repetition_level > 0) here, but there's
// no easy way to get at that info here. we'd have to move this function into reader_impl.cu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right. This ended up being a bug :) If you have a struct at the top of a nested hierarchy, this logic fails. The max_repetition_level check is the correct one.

cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 5, 2022

Codecov Report

Base: 87.40% // Head: 87.50% // Increases project coverage by +0.09% 🎉

Coverage data is based on head (d2e409a) compared to base (f72c4ce).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11752      +/-   ##
================================================
+ Coverage         87.40%   87.50%   +0.09%     
================================================
  Files               133      133              
  Lines             21833    21826       -7     
================================================
+ Hits              19084    19099      +15     
+ Misses             2749     2727      -22     
Impacted Files Coverage Δ
python/cudf/cudf/core/udf/__init__.py 50.00% <ø> (ø)
python/cudf/cudf/utils/ioutils.py 79.47% <ø> (ø)
...thon/dask_cudf/dask_cudf/tests/test_distributed.py 18.86% <ø> (+4.94%) ⬆️
python/cudf/cudf/__init__.py 90.69% <100.00%> (ø)
python/cudf/cudf/core/scalar.py 90.52% <100.00%> (+1.25%) ⬆️
python/dask_cudf/dask_cudf/sorting.py 93.29% <100.00%> (+2.11%) ⬆️
python/strings_udf/strings_udf/__init__.py 86.27% <100.00%> (-10.61%) ⬇️
python/strings_udf/strings_udf/_typing.py 94.73% <0.00%> (-1.06%) ⬇️
python/cudf/cudf/testing/dataset_generator.py 72.83% <0.00%> (-0.42%) ⬇️
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4525474 into rapidsai:branch-22.12 Oct 5, 2022
rapids-bot bot pushed a commit that referenced this pull request Oct 13, 2022
Fixes NVIDIA/spark-rapids#6718

There was a bug introduced recently #11752 where an insufficient check for whether an input column contained repetition information could cause incorrect results for column hierarchies with structs at the root.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Jim Brennan (https://github.com/jbrennan333)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11910
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants