Verify compression type in Parquet reader #10610

vuule · 2022-04-06T19:49:41Z

This PR adds a compression type check for each column chunk in the input file.
The reader throws if an unsupported compression type is used.
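
A minimal sketch of what a per-chunk check of this kind might look like, written in Python for illustration rather than taken from the PR's C++ changes. The `Codec` values follow the Parquet format's `CompressionCodec` enum; `SUPPORTED`, `ColumnChunk`, and `verify_compression` are hypothetical names, and the supported set is an assumption chosen for the example, not a statement of what libcudf accepts.

```python
from dataclasses import dataclass
from enum import Enum


class Codec(Enum):
    """Compression codecs, numbered as in the Parquet format definition."""
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5
    ZSTD = 6


# Assumption: an illustrative supported set, not libcudf's actual list.
SUPPORTED = frozenset({Codec.UNCOMPRESSED, Codec.SNAPPY, Codec.GZIP})


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical stand-in for one column chunk's footer metadata."""
    column: str
    codec: Codec


def verify_compression(row_groups):
    """Raise on the first chunk whose codec is not in SUPPORTED.

    `row_groups` is a list of row groups, each a list of ColumnChunk.
    """
    for rg_idx, chunks in enumerate(row_groups):
        for chunk in chunks:
            if chunk.codec not in SUPPORTED:
                raise RuntimeError(
                    f"Unsupported compression type {chunk.codec.name} "
                    f"in row group {rg_idx}, column {chunk.column!r}"
                )
```

Checking every chunk, rather than only the first, matters because the Parquet format records the codec per column chunk, so a single file can mix codecs.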

codecov · 2022-04-06T20:58:46Z

Codecov Report

Merging #10610 (f1dcfd8) into branch-22.06 (956c7b5) will increase coverage by 0.03%.
The diff coverage is 88.97%.

❗ Current head f1dcfd8 differs from pull request most recent head 82b437d. Consider uploading reports for the commit 82b437d to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10610      +/-   ##
================================================
+ Coverage         86.30%   86.34%   +0.03%     
================================================
  Files               140      140              
  Lines             22255    22280      +25     
================================================
+ Hits              19207    19237      +30     
+ Misses             3048     3043       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/frame.py | `94.75% <ø> (+1.02%)` | ⬆️ |
| python/dask_cudf/dask_cudf/tests/test_accessor.py | `98.41% <ø> (ø)` | |
| python/cudf/cudf/core/indexed_frame.py | `91.77% <87.93%> (-0.87%)` | ⬇️ |
| python/cudf/cudf/core/column/lists.py | `90.62% <100.00%> (+0.57%)` | ⬆️ |
| python/cudf/cudf/core/dataframe.py | `93.59% <100.00%> (ø)` | |
| python/cudf/cudf/core/series.py | `95.28% <100.00%> (-0.01%)` | ⬇️ |
| python/cudf/cudf/core/column/column.py | `89.45% <0.00%> (+0.10%)` | ⬆️ |
| python/cudf/cudf/core/column/string.py | `89.10% <0.00%> (+0.12%)` | ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | `91.72% <0.00%> (+0.22%)` | ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | `84.49% <0.00%> (+0.30%)` | ⬆️ |
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

brandon-b-miller

Fairly minor comment: usually we don't propagate libcudf errors directly to the user from Python. Is there a way the reader code can query the codec before it hits libcudf and raise the error there?

bdice · 2022-04-07T18:02:55Z

> Fairly minor comment: usually we don't propagate libcudf errors directly to the user from Python. Is there a way the reader code can query the codec before it hits libcudf and raise the error there?

The current design looks fine to me. We pass quite a few low-level errors from libcudf, to my knowledge (perhaps especially in I/O code?). I would avoid re-implementing this codec check in Python if we can let it fail in C++ and bubble through Cython's exception handling. The exception would be cases that need to be pre-emptively stopped at a higher layer (Python), but that doesn't seem to apply here.

vuule · 2022-04-07T19:16:48Z

> > Fairly minor comment: usually we don't propagate libcudf errors directly to the user from Python. Is there a way the reader code can query the codec before it hits libcudf and raise the error there?
>
> The current design looks fine to me. We pass quite a few low-level errors from libcudf, to my knowledge (perhaps especially in I/O code?). I would avoid re-implementing this codec check in Python if we can let it fail in C++ and bubble through Cython's exception handling. The exception would be cases that need to be pre-emptively stopped at a higher layer (Python), but that doesn't seem to apply here.

Thank you for the comments!

Unfortunately, there's no cheap way to catch this error in Python. I don't think we can avoid propagating the C++ exception here.
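
For callers, the practical effect is that the failure surfaces as a Python exception raised from inside the read call. Below is a rough sketch of handling it, assuming the C++ exception reaches Python as a `RuntimeError` (Cython's default translation for C++ exceptions that have no more specific mapping). `read_parquet_stub` and `try_read` are hypothetical stand-ins for illustration and are not part of cuDF.

```python
def read_parquet_stub(path, codec):
    """Hypothetical reader stand-in that rejects an unsupported codec.

    A real reader would discover the codec from the file's metadata;
    it is passed in here only to keep the sketch self-contained.
    """
    if codec not in {"UNCOMPRESSED", "SNAPPY"}:  # illustrative set only
        raise RuntimeError(f"Unsupported compression type: {codec}")
    return {"path": path, "rows": 0}


def try_read(path, codec):
    """Return (result, None) on success or (None, message) on failure."""
    try:
        return read_parquet_stub(path, codec), None
    except RuntimeError as err:
        # The low-level error is passed through unchanged; no Python-side
        # pre-check of the codec is performed.
        return None, str(err)
```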

vuule · 2022-04-07T19:17:07Z

@gpucibot merge

vuule added 2 commits April 6, 2022 12:46

add check

23fb427

add test

d136afb

vuule added labels bug (Something isn't working), cuIO (cuIO issue), non-breaking (Non-breaking change) Apr 6, 2022

vuule self-assigned this Apr 6, 2022

github-actions bot added labels Python (Affects Python cuDF API.), libcudf (Affects libcudf (C++/CUDA) code.) Apr 6, 2022

vuule added 2 commits April 6, 2022 16:22

rename test file

031e9da

clean up

82b437d

vuule marked this pull request as ready for review April 7, 2022 07:57

vuule requested review from a team as code owners April 7, 2022 07:57

vuule requested review from bdice, brandon-b-miller and mythrocks April 7, 2022 07:57

brandon-b-miller approved these changes Apr 7, 2022

PointKernel approved these changes Apr 7, 2022

bdice approved these changes Apr 7, 2022

rapids-bot bot merged commit 018924f into rapidsai:branch-22.06 Apr 7, 2022

vuule deleted the bug-pq-check-comp-type branch April 7, 2022 19:17
