Throw an exception if an unsupported page encoding is detected in Parquet reader #12754

Merged
merged 26 commits
Mar 4, 2023

Conversation

etseidl
Contributor

@etseidl etseidl commented Feb 10, 2023

Description

If the Parquet reader comes across a page encoded with an unsupported encoding, the call to decode page data silently fails, leading to either an empty table or unrelated exceptions being thrown. This PR adds code to validate the page encodings after the page headers are decoded.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner February 10, 2023 00:32
@etseidl etseidl requested review from harrism and mythrocks February 10, 2023 00:32
@rapids-bot

rapids-bot bot commented Feb 10, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 10, 2023
@etseidl
Contributor Author

etseidl commented Feb 10, 2023

Not sure why clang-format changed the lines at the top of parquet_test.cpp

@@ -86,11 +86,12 @@ enum class Encoding : uint8_t {
   GROUP_VAR_INT = 1, // Deprecated, never used
   PLAIN_DICTIONARY = 2,
   RLE = 3,
-  BIT_PACKED = 4,
+  BIT_PACKED = 4, // Deprecated
Contributor

Question: Could you please clarify how this change relates to the change in the reader?

Contributor Author

I'm just updating the enum to make it match the current parquet spec, which lists BIT_PACKED as a deprecated encoding. Can remove if you'd prefer.

Contributor

Can you please add more comments about why/when/how etc. this enum is deprecated? As well as tracking issue (if any)?

Contributor Author

@ttnghia I did some digging through git blame on the parquet-format README.md. I found the commit but couldn't find an associated JIRA issue. I added a little more context to the comment.
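
For readers following the enum discussion, here is a minimal sketch of what an encoding enum with deprecation notes and a matching predicate could look like. The numeric values follow the Parquet format specification. The name is_supported_encoding comes from the review snippet in this PR, but its body and the set of encodings it accepts are assumptions for illustration, not the libcudf implementation.

```cpp
#include <cstdint>

// Encoding identifiers as listed in the Apache Parquet format specification.
// This is an illustrative copy, not the libcudf header.
enum class Encoding : std::uint8_t {
  PLAIN                   = 0,
  GROUP_VAR_INT           = 1,  // Deprecated, never used
  PLAIN_DICTIONARY        = 2,
  RLE                     = 3,
  BIT_PACKED              = 4,  // Deprecated in the spec in favor of the RLE/bit-packing hybrid
  DELTA_BINARY_PACKED     = 5,
  DELTA_LENGTH_BYTE_ARRAY = 6,
  DELTA_BYTE_ARRAY        = 7,
  RLE_DICTIONARY          = 8,
  BYTE_STREAM_SPLIT       = 9,
};

// Assumed body: the accepted set below is a placeholder chosen for this sketch.
constexpr bool is_supported_encoding(Encoding enc)
{
  switch (enc) {
    case Encoding::PLAIN:
    case Encoding::PLAIN_DICTIONARY:
    case Encoding::RLE:
    case Encoding::RLE_DICTIONARY: return true;
    default: return false;
  }
}
```

Keeping deprecated values in the enum lets the reader recognize them in a page header and reject them explicitly, instead of treating them as unknown bytes.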

Contributor

@mythrocks mythrocks left a comment

LGTM, mostly. A couple of questions. We should, however, switch out of using thrust::all_of().

@etseidl etseidl requested a review from a team as a code owner February 10, 2023 23:38
@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 10, 2023
Contributor

@mythrocks mythrocks left a comment

LGTM!

@mythrocks mythrocks added bug Something isn't working non-breaking Non-breaking change labels Feb 13, 2023
@mythrocks
Contributor

I've added a couple of labels (Bug, Non-breaking) to get the CI going.

Comment on lines 344 to 351

// validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016)
auto const num_valid_pages = static_cast<size_t>(thrust::count_if(
rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) {
return is_supported_encoding(page.encoding);
}));

CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");
Contributor

@ttnghia ttnghia Feb 13, 2023

Wait, I just noticed that we already call pages.device_to_host(stream, true); on the line right above. That means we can call is_supported_encoding in host code:

Suggested change
-  // validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016)
-  auto const num_valid_pages = static_cast<size_t>(thrust::count_if(
-    rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) {
-      return is_supported_encoding(page.encoding);
-    }));
-  CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");
+  // validate page encodings
+  auto const num_valid_pages = static_cast<size_t>(std::count_if(
+    pages.begin(), pages.end(), [](auto const& page) {
+      return is_supported_encoding(page.encoding);
+    }));
+  CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected");

Note: The code above still needs to be reformatted.

Contributor

Can it be any_of/all_of now?

Contributor Author

I wanted to use the thrust version to do the check in parallel, since the number of pages can easily get into the thousands (or higher). Do you think that's not worth worrying about? If so, then yes, I'd switch back to std::any_of/all_of.

Contributor

The runtime of thousands of such trivial checks will be negligible. So you don't need to be worried about it.

Contributor Author

Ok, it's back on the host. It's slower by 30us 🤣 (for a totally unscientific single pass of nsys profile).
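
For reference, the host-side check discussed in this thread can be sketched roughly as follows. This is a standalone illustration, not the merged code: PageInfo, validate_page_encodings, the trimmed Encoding enum, and the body of is_supported_encoding are hypothetical stand-ins, and std::runtime_error stands in for CUDF_EXPECTS.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Trimmed, illustrative subset of the Parquet encodings.
enum class Encoding : std::uint8_t {
  PLAIN            = 0,
  PLAIN_DICTIONARY = 2,
  RLE              = 3,
  BIT_PACKED       = 4,  // Deprecated
  RLE_DICTIONARY   = 8
};

// Hypothetical stand-in for the reader's per-page metadata.
struct PageInfo {
  Encoding encoding;
};

// Assumed predicate body; the real supported set may differ.
bool is_supported_encoding(Encoding enc)
{
  return enc == Encoding::PLAIN || enc == Encoding::PLAIN_DICTIONARY ||
         enc == Encoding::RLE || enc == Encoding::RLE_DICTIONARY;
}

// Throws if any page uses an encoding the reader cannot decode.
void validate_page_encodings(std::vector<PageInfo> const& pages)
{
  bool const all_supported =
    std::all_of(pages.begin(), pages.end(), [](auto const& page) {
      return is_supported_encoding(page.encoding);
    });
  if (!all_supported) { throw std::runtime_error("Unsupported page encoding detected"); }
}
```

Because the page headers have already been copied to the host at this point, the check is a single linear pass with early exit on the first unsupported page, which is why the per-call cost stays in the microsecond range even for thousands of pages.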

Contributor

+30us from 1us, or from 1ms? :D

Contributor Author

30us to 60us :D

@etseidl
Contributor Author

etseidl commented Feb 13, 2023

ok to test?

@vuule
Contributor

vuule commented Feb 13, 2023

/ok to test

@etseidl
Contributor Author

etseidl commented Feb 22, 2023

looks like CI is fixed, can this be tested again please?

@vuule
Contributor

vuule commented Feb 22, 2023

/ok to test

@etseidl
Contributor Author

etseidl commented Feb 28, 2023

Any objections to merging this?

@vuule
Contributor

vuule commented Mar 1, 2023

/ok to test

@vuule
Contributor

vuule commented Mar 4, 2023

/merge

@rapids-bot rapids-bot bot merged commit 2689bb6 into rapidsai:branch-23.04 Mar 4, 2023
@etseidl etseidl deleted the feature/validate_encodings branch March 6, 2023 18:36
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
5 participants