-
Notifications
You must be signed in to change notification settings - Fork 933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throw an exception if an unsupported page encoding is detected in Parquet reader #12754
Throw an exception if an unsupported page encoding is detected in Parquet reader #12754
Conversation
Pull requests from external contributors require approval from a |
Not sure why clang-format changed the lines at the top of parquet_test.cpp |
@@ -86,11 +86,12 @@ enum class Encoding : uint8_t { | |||
GROUP_VAR_INT = 1, // Deprecated, never used | |||
PLAIN_DICTIONARY = 2, | |||
RLE = 3, | |||
BIT_PACKED = 4, | |||
BIT_PACKED = 4, // Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Could you please clarify how this change relates to the change in the reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just updating the enum to make it match the current parquet spec, which lists BIT_PACKED as a deprecated encoding. Can remove if you'd prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add more comments about why/when/how etc. this enum is deprecated? As well as tracking issue (if any)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ttnghia I did some digging through git blame on the parquet-format README.md. I found the commit but couldn't find an associated JIRA issue. I added a little more context to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, mostly. A couple of questions. We should, however, switch out of using thrust::all_of()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I've added a couple of labels ( |
…into feature/validate_encodings
…into feature/validate_encodings
|
||
// validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016) | ||
auto const num_valid_pages = static_cast<size_t>(thrust::count_if( | ||
rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) { | ||
return is_supported_encoding(page.encoding); | ||
})); | ||
|
||
CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I just see that we already call pages.device_to_host(stream, true);
in the line right above. That means we can call is_supported_encoding
on host code:
// validate page encodings (avoiding use of thrust::any_of() NVIDIA/thrust #1016) | |
auto const num_valid_pages = static_cast<size_t>(thrust::count_if( | |
rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), [] __device__(auto const& page) { | |
return is_supported_encoding(page.encoding); | |
})); | |
CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected"); | |
// validate page encodings | |
auto const num_valid_pages = static_cast<size_t>(std::count_if( | |
pages.begin(), pages.end(), [] auto const& page) { | |
return is_supported_encoding(page.encoding); | |
})); | |
CUDF_EXPECTS(num_valid_pages == pages.size(), "Unsupported page encoding detected"); |
Note: The code above still needs to be reformatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be any_of/all_of now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wanting to use the thrust version to do the check in parallel since the number of pages can easily get into the thousands (or higher). Do you thinks that's not worth worrying about? Then, yes, I'd switch back to std::any_of/all_of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime of thousands of such trivial checks will be negligible. So you don't need to be worried about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it's back on the host. It's slower by 30us 🤣 (for a totally unscientific single pass of nsys profile).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+30us from 1us, or from 1ms? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30us to 60 us :D
ok to test? |
/ok to test |
looks like CI is fixed, can this be tested again please? |
/ok to test |
Any objections to merging this? |
/ok to test |
/merge |
Description
If the Parquet reader comes across a page encoded with an unsupported encoding, the call to decode page data silently fails, leading to either an empty table or unrelated exceptions being thrown. This PR adds code to validate the page encodings after the page headers are decoded.
Checklist