Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify compression type in Parquet reader #10610

Merged
merged 4 commits into from
Apr 7, 2022

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Apr 6, 2022

Closes #10602

This PR adds a compression type check for each chunk in the input file.
Reader throws in an unsupported compression is used.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Apr 6, 2022
@vuule vuule self-assigned this Apr 6, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 6, 2022
@codecov
Copy link

codecov bot commented Apr 6, 2022

Codecov Report

Merging #10610 (f1dcfd8) into branch-22.06 (956c7b5) will increase coverage by 0.03%.
The diff coverage is 88.97%.

❗ Current head f1dcfd8 differs from pull request most recent head 82b437d. Consider uploading reports for the commit 82b437d to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10610      +/-   ##
================================================
+ Coverage         86.30%   86.34%   +0.03%     
================================================
  Files               140      140              
  Lines             22255    22280      +25     
================================================
+ Hits              19207    19237      +30     
+ Misses             3048     3043       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 94.75% <ø> (+1.02%) ⬆️
python/dask_cudf/dask_cudf/tests/test_accessor.py 98.41% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 91.77% <87.93%> (-0.87%) ⬇️
python/cudf/cudf/core/column/lists.py 90.62% <100.00%> (+0.57%) ⬆️
python/cudf/cudf/core/dataframe.py 93.59% <100.00%> (ø)
python/cudf/cudf/core/series.py 95.28% <100.00%> (-0.01%) ⬇️
python/cudf/cudf/core/column/column.py 89.45% <0.00%> (+0.10%) ⬆️
python/cudf/cudf/core/column/string.py 89.10% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.72% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 956c7b5...82b437d. Read the comment docs.

@vuule vuule marked this pull request as ready for review April 7, 2022 07:57
@vuule vuule requested review from a team as code owners April 7, 2022 07:57
Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fairly minor comment, usually we don't propagate libcudf errors directly to the user from python. Is there a way the reader code can query the codec before it hits libcudf and error there?

@bdice
Copy link
Contributor

bdice commented Apr 7, 2022

Fairly minor comment, usually we don't propagate libcudf errors directly to the user from python. Is there a way the reader code can query the codec before it hits libcudf and error there?

The current design looks fine to me. We pass quite a few low-level errors from libcudf, to my knowledge (perhaps especially in I/O code?). I would avoid re-implementing this codec check in Python if we can let it fail in C++ and bubble through Cython's exception handling. The exception would be cases that need to be pre-emptively stopped at a higher layer (Python), but that doesn't seem to apply here.

@vuule
Copy link
Contributor Author

vuule commented Apr 7, 2022

Fairly minor comment, usually we don't propagate libcudf errors directly to the user from python. Is there a way the reader code can query the codec before it hits libcudf and error there?

The current design looks fine to me. We pass quite a few low-level errors from libcudf, to my knowledge (perhaps especially in I/O code?). I would avoid re-implementing this codec check in Python if we can let it fail in C++ and bubble through Cython's exception handling. The exception would be cases that need to be pre-emptively stopped at a higher layer (Python), but that doesn't seem to apply here.

Thank you for the comments!

Unfortunately, there's no cheap way to catch this error in Python. I don't think we can avoid propagating the C++ exception here.

@vuule
Copy link
Contributor Author

vuule commented Apr 7, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 018924f into rapidsai:branch-22.06 Apr 7, 2022
@vuule vuule deleted the bug-pq-check-comp-type branch April 7, 2022 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf should throw exception when reading parquet file with unsupported codec
4 participants