Throw an exception if an unsupported page encoding is detected in Parquet reader #12754

etseidl · 2023-02-10T00:32:13Z

Description

If the Parquet reader comes across a page encoded with an unsupported encoding, the call to decode page data silently fails, leading to either an empty table or unrelated exceptions being thrown. This PR adds code to validate the page encodings after the page headers are decoded.
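To illustrate the idea of validating encodings right after the page headers are decoded, here is a minimal host-side sketch. It is not the code from this PR: `PageInfo` and `validate_page_encodings` are hypothetical names, the set of encodings treated as supported is an assumption made for the example, and a plain `std::runtime_error` stands in for whatever the reader actually throws. The enum values follow the Parquet format specification.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Encoding values as listed in the Parquet format specification.
enum class Encoding : uint8_t {
  PLAIN                   = 0,
  GROUP_VAR_INT           = 1,  // Deprecated, never used
  PLAIN_DICTIONARY        = 2,
  RLE                     = 3,
  BIT_PACKED              = 4,  // Deprecated
  DELTA_BINARY_PACKED     = 5,
  DELTA_LENGTH_BYTE_ARRAY = 6,
  DELTA_BYTE_ARRAY        = 7,
  RLE_DICTIONARY          = 8,
  BYTE_STREAM_SPLIT       = 9,
};

// Hypothetical stand-in for a decoded page header.
struct PageInfo {
  Encoding encoding;
};

// ASSUMPTION: this supported set is illustrative only; the real reader
// defines its own list.
constexpr bool is_supported_encoding(Encoding enc)
{
  switch (enc) {
    case Encoding::PLAIN:
    case Encoding::PLAIN_DICTIONARY:
    case Encoding::RLE:
    case Encoding::RLE_DICTIONARY: return true;
    default: return false;
  }
}

// Fail loudly before any decoding starts, instead of letting the decode
// step silently produce an empty table.
void validate_page_encodings(std::vector<PageInfo> const& pages)
{
  for (auto const& page : pages) {
    if (!is_supported_encoding(page.encoding)) {
      throw std::runtime_error("Unsupported page encoding detected");
    }
  }
}
```

Under these assumptions, a caller would run `validate_page_encodings(pages)` once after header decoding, so a file containing, say, a `DELTA_BINARY_PACKED` page is rejected with a clear message rather than failing later in an unrelated place.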

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rapids-bot · 2023-02-10T00:32:18Z

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

etseidl · 2023-02-10T00:33:12Z

Not sure why clang-format changed the lines at the top of parquet_test.cpp

cpp/src/io/parquet/reader_impl_preprocess.cu

cpp/tests/io/parquet_test.cpp

mythrocks · 2023-02-10T20:29:17Z

cpp/src/io/parquet/parquet_common.hpp

@@ -86,11 +86,12 @@ enum class Encoding : uint8_t {
  GROUP_VAR_INT           = 1,  // Deprecated, never used
  PLAIN_DICTIONARY        = 2,
  RLE                     = 3,
-  BIT_PACKED              = 4,
+  BIT_PACKED              = 4,  // Deprecated


Question: Could you please clarify how this change relates to the change in the reader?

I'm just updating the enum to make it match the current parquet spec, which lists BIT_PACKED as a deprecated encoding. Can remove if you'd prefer.

Can you please add more comments about why, when, and how this enum value was deprecated, as well as a tracking issue (if any)?

@ttnghia I did some digging through git blame on the parquet-format README.md. I found the commit but couldn't find an associated JIRA issue. I added a little more context to the comment.

mythrocks

LGTM, mostly. A couple of questions. We should, however, switch out of using thrust::all_of().

mythrocks

LGTM!

mythrocks · 2023-02-13T18:06:23Z

I've added a couple of labels (Bug, Non-breaking) to get the CI going.

cpp/src/io/parquet/reader_impl_preprocess.cu

ttnghia · 2023-02-13T19:59:31Z

cpp/src/io/parquet/reader_impl_preprocess.cu

+
+  // validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016)
+  auto const num_valid_pages = static_cast<size_t>(thrust::count_if(
+    rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) {
+      return is_supported_encoding(page.encoding);
+    }));
+
+  CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");


Wait, I just noticed that we already call pages.device_to_host(stream, true); on the line right above. That means we can call is_supported_encoding in host code:

Suggested change

-  // validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016)
-  auto const num_valid_pages = static_cast<size_t>(thrust::count_if(
-    rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) {
-      return is_supported_encoding(page.encoding);
-    }));
-
-  CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");
+  // validate page encodings
+  auto const num_valid_pages = static_cast<size_t>(std::count_if(
+    pages.begin(), pages.end(), [](auto const& page) {
+      return is_supported_encoding(page.encoding);
+    }));
+
+  CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");

Note: The code above still needs to be reformatted.

Can it be any_of/all_of now?

I wanted to use the thrust version to do the check in parallel, since the number of pages can easily get into the thousands (or higher). Do you think that's not worth worrying about? If so, then yes, I'll switch back to std::any_of/all_of.

The runtime of thousands of such trivial checks will be negligible. So you don't need to be worried about it.

Ok, it's back on the host. It's slower by 30us 🤣 (for a totally unscientific single pass of nsys profile).

+30us from 1us, or from 1ms? :D

30us to 60us :D
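For reference, once the check runs on the host, the counting form and the `all_of` form give the same answer. The sketch below is only an illustration: `is_supported` is a hypothetical predicate over raw integer encoding values (it treats values below 4 as supported purely for the example), standing in for `is_supported_encoding(page.encoding)`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical predicate; the threshold is arbitrary and for illustration only.
inline bool is_supported(int encoding) { return encoding < 4; }

// Counting form, mirroring the count_if workaround: visits every element.
inline bool all_supported_by_count(std::vector<int> const& encodings)
{
  auto const num_valid = static_cast<std::size_t>(
    std::count_if(encodings.begin(), encodings.end(), is_supported));
  return num_valid == encodings.size();
}

// all_of form: same result, but stops at the first unsupported element.
inline bool all_supported(std::vector<int> const& encodings)
{
  return std::all_of(encodings.begin(), encodings.end(), is_supported);
}
```

Both functions return true for an empty input, and `all_of` has the small advantage of returning as soon as one unsupported value is seen, which fits the point made above that thousands of such trivial checks are cheap on the host.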

etseidl · 2023-02-13T21:47:46Z

ok to test?

vuule · 2023-02-13T21:50:38Z

/ok to test

etseidl · 2023-02-22T17:39:37Z

looks like CI is fixed, can this be tested again please?

vuule · 2023-02-22T17:41:39Z

/ok to test

etseidl · 2023-02-28T23:24:24Z

Any objections to merging this?

vuule · 2023-03-01T22:04:23Z

/ok to test

vuule · 2023-03-04T01:03:01Z

/merge

etseidl added 3 commits February 9, 2023 14:07

test for unsupported page encodings and throw an error if one is dete…

ae74d50

…cted

add test

04fcd01

use sizeof

30c0ecb

etseidl requested a review from a team as a code owner February 10, 2023 00:32

etseidl requested review from harrism and mythrocks February 10, 2023 00:32

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code.) Feb 10, 2023

use correct version of clang-format

03a5960

mythrocks reviewed Feb 10, 2023

cpp/src/io/parquet/reader_impl_preprocess.cu (outdated)

mythrocks reviewed Feb 10, 2023

cpp/tests/io/parquet_test.cpp (outdated)

mythrocks reviewed Feb 10, 2023

mythrocks requested changes Feb 10, 2023

etseidl and others added 4 commits February 10, 2023 12:56

replace thrust::any_of with thrust::count_if

f231b3a

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

5f1ef57

move test to python

85c2bf6

Merge branch 'feature/validate_encodings' of github.com:etseidl/cudf …

21ed189

…into feature/validate_encodings

etseidl requested a review from a team as a code owner February 10, 2023 23:38

etseidl requested review from bdice and brandon-b-miller February 10, 2023 23:38

github-actions bot added the Python label (Affects Python cuDF API.) Feb 10, 2023

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

0fa7f3f

galipremsagar approved these changes Feb 13, 2023

mythrocks approved these changes Feb 13, 2023

mythrocks assigned etseidl Feb 13, 2023

mythrocks added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Feb 13, 2023

ttnghia reviewed Feb 13, 2023

cpp/src/io/parquet/reader_impl_preprocess.cu

etseidl and others added 6 commits February 13, 2023 10:31

update comment about deprecation of BIT_PACKED

b5573eb

Merge branch 'feature/validate_encodings' of github.com:etseidl/cudf …

f1af811

…into feature/validate_encodings

add comment per review suggestion

febeb02

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

0015e08

add comment to update is_supported_encoding

aec40aa

Merge branch 'feature/validate_encodings' of github.com:etseidl/cudf …

e6fef99

…into feature/validate_encodings

ttnghia reviewed Feb 13, 2023

switch check back to host code

432c9df

ttnghia approved these changes Feb 13, 2023

Merge branch 'branch-23.04' into feature/validate_encodings

03ee353

vuule approved these changes Feb 14, 2023

etseidl added 5 commits February 15, 2023 09:55

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

abf18aa

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

431f4c6

Merge branch 'branch-23.04' into feature/validate_encodings

2ca3746

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

872103a

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

8f287fc

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

9a54c69

vuule and others added 3 commits February 28, 2023 15:28

Merge branch 'branch-23.04' into feature/validate_encodings

7bb6898

Merge branch 'branch-23.04' into feature/validate_encodings

4005eef

Merge branch 'rapidsai:branch-23.04' into feature/validate_encodings

4a45626

rapids-bot bot merged commit 2689bb6 into rapidsai:branch-23.04 Mar 4, 2023

etseidl deleted the feature/validate_encodings branch March 6, 2023 18:36

etseidl mentioned this pull request Nov 15, 2023

[FEA] Improve exception message when unknown Parquet page encoding detected #14209

Closed
