-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle empty string correctly in Parquet statistics #14257
Conversation
@karthikeyann could you please check the changes to the predicate pushdown to make sure that's correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, lovely plumage!
using optional_binary = parquet_field_optional<std::vector<uint8_t>, parquet_field_binary>; | ||
using optional_int64 = parquet_field_optional<int64_t, parquet_field_int64>; | ||
|
||
auto op = std::make_tuple(optional_binary(1, s->max), | ||
optional_binary(2, s->min), | ||
optional_int64(3, s->null_count), | ||
optional_int64(4, s->distinct_count), | ||
optional_binary(5, s->max_value), | ||
optional_binary(6, s->min_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful, the investment into parquet_field_optional
is paying off!
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
/ok to test |
/merge |
Description
An empty string should be a valid minimum value for a string column, but the current parquet writer considers an empty string to have no value when writing the column chunk statistics. This PR changes all fields in the Statistics struct to be
thrust::optional
to help distinguish between a valid empty string and no value.Checklist