Handle empty string correctly in Parquet statistics #14257

Merged
8 commits merged into rapidsai:branch-23.12 from stats_optional
Oct 10, 2023

Conversation

@etseidl etseidl commented Oct 5, 2023

Description

An empty string should be a valid minimum value for a string column, but the current Parquet writer treats an empty string as having no value when writing the column chunk statistics. This PR changes all fields in the Statistics struct to thrust::optional so that a valid empty string can be distinguished from a missing value.
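
The distinction described above can be illustrated with a minimal sketch. It uses std::optional as a stand-in for thrust::optional so that it compiles without CUDA, and the names column_stats, to_bytes, compute_stats, and has_min_by_size are hypothetical, chosen for illustration only; they are not libcudf's definitions.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a statistics struct (not the libcudf definition).
// An unset optional means "no statistic was written".
struct column_stats {
  std::optional<std::vector<uint8_t>> min_value;
  std::optional<std::vector<uint8_t>> max_value;
};

// Convert a string to the raw bytes a binary statistic would hold.
inline std::vector<uint8_t> to_bytes(std::string const& s)
{
  return std::vector<uint8_t>(s.begin(), s.end());
}

// Compute min/max over the values of a string column chunk.
// With no values at all, both fields stay unset.
inline column_stats compute_stats(std::vector<std::string> const& values)
{
  column_stats stats;
  if (values.empty()) { return stats; }
  auto const mm = std::minmax_element(values.begin(), values.end());
  stats.min_value = to_bytes(*mm.first);
  stats.max_value = to_bytes(*mm.second);
  return stats;
}

// A size-based presence check: the kind of test that cannot tell
// an empty-string minimum apart from a missing minimum.
inline bool has_min_by_size(std::vector<uint8_t> const& raw_min)
{
  return !raw_min.empty();
}
```

For the values "", "b", "a", min_value is engaged and holds zero bytes, while for an empty input it stays unset. A size-based check reports "no minimum" in both cases, which is the ambiguity the optional fields remove.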

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner October 5, 2023 22:29
@etseidl etseidl requested review from bdice and karthikeyann October 5, 2023 22:29

copy-pr-bot bot commented Oct 5, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 5, 2023

etseidl commented Oct 5, 2023

@karthikeyann could you please check the changes to the predicate pushdown to make sure that's correct?

@vuule vuule self-requested a review October 5, 2023 23:10
@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Oct 6, 2023

@vuule vuule left a comment

looks good, lovely plumage!

cpp/src/io/parquet/parquet.hpp (outdated, resolved)
Comment on lines +772 to +780
    using optional_binary = parquet_field_optional<std::vector<uint8_t>, parquet_field_binary>;
    using optional_int64  = parquet_field_optional<int64_t, parquet_field_int64>;

    auto op = std::make_tuple(optional_binary(1, s->max),
                              optional_binary(2, s->min),
                              optional_int64(3, s->null_count),
                              optional_int64(4, s->distinct_count),
                              optional_binary(5, s->max_value),
                              optional_binary(6, s->min_value));

Beautiful, the investment into parquet_field_optional is paying off!

vuule commented Oct 9, 2023

/ok to test

vuule commented Oct 10, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Oct 10, 2023

ttnghia commented Oct 10, 2023

/merge

@rapids-bot rapids-bot bot merged commit aa8b0f8 into rapidsai:branch-23.12 Oct 10, 2023
57 checks passed
@etseidl etseidl deleted the stats_optional branch October 10, 2023 23:16