Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] min-max Statistics doesn't work well when one of min-max being truncated #43382

Closed
mapleFU opened this issue Jul 23, 2024 · 9 comments

Comments

@mapleFU
Copy link
Member

mapleFU commented Jul 23, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The problem

The min-max statistics would being truncated during write, as the code below:

    EncodedStatistics chunk_statistics = GetChunkStatistics();
    chunk_statistics.ApplyStatSizeLimits(
        properties_->max_statistics_size(descr_->path()));
    chunk_statistics.set_is_signed(SortOrder::SIGNED == descr_->sort_order());

ApplyStatSizeLimits will try to truncate min-max if greater than properties_->max_statistics_size(descr_->path())) , which default is 4096 Bytes

  // From parquet-mr
  // Don't write stats larger than the max size rather than truncating. The
  // rationale is that some engines may use the minimum value in the page as
  // the true minimum for aggregations and there is no way to mark that a
  // value has been truncated and is a lower bound and not in the page.
  void ApplyStatSizeLimits(size_t length) {
    if (max_.length() > length) {
      has_max = false;
      max_.clear();
    }
    if (min_.length() > length) {
      has_min = false;
      min_.clear();
    }
  }

The code is right here.

But during consuming this api, the code is here:

template <typename DType>
static std::shared_ptr<Statistics> MakeTypedColumnStats(
    const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
  // If ColumnOrder is defined, return max_value and min_value
  if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
    return MakeStatistics<DType>(
        descr, metadata.statistics.min_value, metadata.statistics.max_value,
        metadata.num_values - metadata.statistics.null_count,
        metadata.statistics.null_count, metadata.statistics.distinct_count,
        metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
        metadata.statistics.__isset.null_count,
        metadata.statistics.__isset.distinct_count);
  }
  // Default behavior
  return MakeStatistics<DType>(
      descr, metadata.statistics.min, metadata.statistics.max,
      metadata.num_values - metadata.statistics.null_count,
      metadata.statistics.null_count, metadata.statistics.distinct_count,
      metadata.statistics.__isset.max || metadata.statistics.__isset.min,
      metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
}

The problem is that || is being used for min-max statistics existence. And the final result just have a has_min_max_state.

As a result, for example, a statistics has :

min: ""
max: "..." <-- an 10000Bytes string

The stored is has_min: true, min: "", has_max: false. And the loaded stats is has_min_max:true, min="", max="", which is a bug here.

Solving

This is because currently, HasMinMax is "has min or max", we can have solvings below:

  1. Change MakeTypedColumnStats to use && rather than ||
  2. Propose a new api for HasMinAndMax, and use this api for pruning.

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Jul 23, 2024

@pitrou @emkornfield @wgtmac

@mapleFU
Copy link
Member Author

mapleFU commented Jul 23, 2024

Besides, the min-max api looks like:

  /// \brief Return true if the min and max statistics are set. Obtain
  /// with TypedStatistics<T>::min and max
  virtual bool HasMinMax() const = 0;

@wgtmac
Copy link
Member

wgtmac commented Jul 23, 2024

There was a related discussion on whether it is valid to have min/max value from one side only: #34112 (comment)

@mapleFU
Copy link
Member Author

mapleFU commented Jul 23, 2024

There was a related discussion on whether it is valid to have min/max value from one side only: #34112 (comment)

This make sense, I can also separate to two separate api. Let's hear what other decide

@mapleFU
Copy link
Member Author

mapleFU commented Aug 3, 2024

Updated: I'll separate the implemetation into two separate patch:

  1. The first one is test and remove the || in HasMinMax, changing the syntax to both has-min and has-max
  2. The second one is add HasMin and HasMax.

@emkornfield
Copy link
Contributor

These seem like good changes. Thank you for doing this @mapleFU

@emkornfield
Copy link
Contributor

CC @jp0317

@kou
Copy link
Member

kou commented Aug 5, 2024

Updated: I'll separate the implemetation into two separate patch:

1. The first one is test and remove the `||` in HasMinMax, changing the syntax to both has-min and has-max

2. The second one is add `HasMin` and `HasMax`.

+1

@wgtmac wgtmac changed the title [C++][Parquet] min-max Statistics doesn't works well when one of min-max beging truncated [C++][Parquet] min-max Statistics doesn't work well when one of min-max being truncated Aug 5, 2024
mapleFU added a commit that referenced this issue Aug 5, 2024
…e of min-max is truncated (#43383)

### Rationale for this change

See #43382

### What changes are included in this PR?

Change stats has min-max from min || max to &&

### Are these changes tested?

* [x] TODO

### Are there any user-facing changes?

Might affect interface using HasMinMax

**This PR includes breaking changes to public APIs.**

* GitHub Issue: #43382

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
@mapleFU mapleFU added this to the 18.0.0 milestone Aug 5, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Aug 5, 2024

Issue resolved by pull request 43383
#43383

@mapleFU mapleFU closed this as completed Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants