Fix an issue with one_level_list schemas in parquet reader. #10750

nvdbaranec · 2022-04-27T21:07:02Z

Partially addresses: #10733

For one particular encoding of list schemas (a legacy one-level form that Spark sometimes seems to use), the parquet reader was accidentally propagating incorrect nesting information between columns. The cause was a simple bug: an extra value was not being popped off a stack.
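To illustrate the kind of bug described above, here is a minimal Python sketch. It is not the libcudf code: the `Node` type, `is_one_level_list`, `collect_nesting`, and the `pop_extra` flag are all hypothetical names invented for this example. It assumes a schema walk that keeps a stack of nesting levels, where a one-level list node contributes two levels (the list and its element) and therefore needs two pops.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical schema node; not a libcudf or parquet type."""
    name: str
    repeated: bool = False
    children: list = field(default_factory=list)


def is_one_level_list(node):
    # Assumption for this sketch: a repeated leaf with no wrapping group
    # is treated as a one-level list.
    return node.repeated and not node.children


def collect_nesting(root, pop_extra=True):
    """Return {leaf name: nesting path} for every leaf under root."""
    stack, result = [], {}

    def visit(node):
        stack.append(node.name)
        one_level = is_one_level_list(node)
        if one_level:
            # One schema node stands for both the list and its element,
            # so it pushes a second nesting level.
            stack.append(node.name + ".element")
        if not node.children:
            result[node.name] = list(stack)
        for child in node.children:
            visit(child)
        if one_level and pop_extra:
            stack.pop()  # the extra pop; skipping it reproduces the bug
        stack.pop()

    for child in root.children:
        visit(child)
    return result


root = Node("schema", children=[Node("a", repeated=True), Node("b")])
fixed = collect_nesting(root, pop_extra=True)
buggy = collect_nesting(root, pop_extra=False)
```

With the extra pop in place, column `b` sees only its own nesting. Without it, a stale entry left by `a` leaks into the path recorded for `b`, which is the cross-column propagation the description refers to.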

Note: this is simply a fix so that the files read correctly. The internal data in the file is actually of binary type, and cudf converts it to string columns. This PR does not add support for binary as a real type in cudf.

PointKernel

LGTM. Thanks!

mythrocks

Thanks for explaining it to me, @nvdbaranec.

mythrocks · 2022-04-27T23:34:50Z

I realize there isn't a good way to test the readers without checking in the parquet file itself. :/

nvdbaranec · 2022-04-28T18:16:52Z

rerun tests

codecov · 2022-04-28T19:29:00Z

Codecov Report

Merging #10750 (005949b) into branch-22.06 (d6e3068) will increase coverage by 0.06%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10750      +/-   ##
================================================
+ Coverage         86.36%   86.43%   +0.06%     
================================================
  Files               142      143       +1     
  Lines             22302    22444     +142     
================================================
+ Hits              19261    19399     +138     
- Misses             3041     3045       +4

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/dask_cudf/dask_cudf/io/parquet.py | `92.39% <0.00%> (-1.40%)` | ⬇️ |
| python/cudf/cudf/api/types.py | `89.36% <0.00%> (-0.44%)` | ⬇️ |
| python/cudf/cudf/core/dataframe.py | `93.74% <0.00%> (-0.01%)` | ⬇️ |
| python/cudf/cudf/core/frame.py | `93.41% <0.00%> (ø)` | |
| python/cudf/cudf/core/index.py | `92.31% <0.00%> (ø)` | |
| python/cudf/cudf/core/dtypes.py | `97.30% <0.00%> (ø)` | |
| python/cudf/cudf/_lib/__init__.py | `100.00% <0.00%> (ø)` | |
| python/cudf/cudf/testing/dataset_generator.py | `73.25% <0.00%> (ø)` | |
| ...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py | `100.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/_internals/expressions.py | `92.85% <0.00%> (ø)` | |
| ... and 7 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 75f3873...005949b.

nvdbaranec · 2022-04-29T18:04:15Z

@gpucibot merge

Fix an issue with one_level_list schemas which were causing nesting i…

005949b

…nformation to propagate between columns, causing crashes.

nvdbaranec added the bug, 4 - Needs Review, and non-breaking labels Apr 27, 2022

nvdbaranec requested review from a team as code owners April 27, 2022 21:07

nvdbaranec requested review from shwina, rgsl888prabhu and karthikeyann April 27, 2022 21:07

github-actions bot added the Python and libcudf labels Apr 27, 2022

nvdbaranec requested a review from PointKernel April 27, 2022 21:09

PointKernel approved these changes Apr 27, 2022

mythrocks approved these changes Apr 27, 2022

galipremsagar approved these changes Apr 28, 2022

rapids-bot bot merged commit 9b8d26f into rapidsai:branch-22.06 Apr 29, 2022

nvdbaranec mentioned this pull request May 23, 2022

[FEA] Parquet support for reading binary and repeated binary #10733

Closed

NVnavkumar mentioned this pull request Jun 14, 2022

Enable the spark.sql.parquet.binaryAsString=true configuration option on the GPU NVIDIA/spark-rapids#5830

Merged
