[C++][Parquet] min-max Statistics doesn't work well when one of min-max being truncated #43382
Comments
Besides, the min-max API looks like:
/// \brief Return true if the min and max statistics are set. Obtain
/// with TypedStatistics<T>::min and max
virtual bool HasMinMax() const = 0; |
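A caller usually treats a true result from this check as permission to read both bounds. The sketch below shows why that matters for pruning. `LoadedStatsSketch` and `CanSkipForGreaterThan` are hypothetical names invented for illustration and are not part of the Parquet C++ API.

```cpp
#include <string>

// Hypothetical, simplified view of statistics after loading; not Arrow's class.
struct LoadedStatsSketch {
  bool has_min_max = false;
  std::string min;
  std::string max;
};

// Returns true when the predicate `value > lower_bound` provably matches no
// rows, so the row group could be skipped. Without usable bounds, never skip.
bool CanSkipForGreaterThan(const LoadedStatsSketch& stats,
                           const std::string& lower_bound) {
  if (!stats.has_min_max) return false;
  // If even the largest value is not above the bound, nothing can match.
  return stats.max <= lower_bound;
}
```

With `has_min_max == true` but an empty, meaningless `max`, this check would skip data that may contain matching rows, which is the kind of wrong result a "min or max" interpretation risks.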
There was a related discussion on whether it is valid to have min/max value from one side only: #34112 (comment) |
This makes sense; I can also split this into two separate APIs. Let's hear what others decide. |
Updated: I'll split the implementation into two separate patches:
|
These seem like good changes. Thank you for doing this @mapleFU |
CC @jp0317 |
+1 |
…e of min-max is truncated (#43383)

### Rationale for this change

See #43382

### What changes are included in this PR?

Change stats has min-max from min || max to &&

### Are these changes tested?

* [x] TODO

### Are there any user-facing changes?

Might affect interface using HasMinMax

**This PR includes breaking changes to public APIs.**

* GitHub Issue: #43382

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Issue resolved by pull request 43383 |
Describe the bug, including details regarding any error messages, version, and platform.
The problem
The min-max statistics can be truncated during write, as in the code below:
`ApplyStatSizeLimits` will try to truncate min-max if it is larger than `properties_->max_statistics_size(descr_->path())`, which defaults to 4096 bytes. The code is right here.
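A minimal sketch of how a size limit can leave only one side of the statistics set. It assumes that an oversized value is dropped by clearing its flag rather than shortened, which is what the stored example in this issue (`has_max: false`) suggests. `EncodedStatsSketch` and its members are hypothetical names, not the real writer code.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the encoded statistics written to the file.
struct EncodedStatsSketch {
  std::string min;
  std::string max;
  bool has_min = false;
  bool has_max = false;

  void SetMin(std::string value) {
    min = std::move(value);
    has_min = true;
  }

  void SetMax(std::string value) {
    max = std::move(value);
    has_max = true;
  }

  // Assumption: a value longer than `limit` bytes is dropped by clearing its
  // flag. Each side is checked independently, so one can survive alone.
  void ApplySizeLimit(std::size_t limit) {
    if (min.size() > limit) has_min = false;
    if (max.size() > limit) has_max = false;
  }
};
```

For example, an empty `min` together with a 5000-byte `max` and a 4096-byte limit ends up with `has_min` still true and `has_max` false.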
But when consuming this API, the code is here:
The problem is that `||` is used to decide whether the min-max statistics exist, and the final result has only a single `has_min_max_` state. As a result, for example, a statistics has:
The stored value is `has_min: true, min: "", has_max: false`, and the loaded stats are `has_min_max: true, min="", max=""`, which is a bug.
Solving
This is because `HasMinMax` currently means "has min or max". Possible solutions:
* Change `MakeTypedColumnStats` to use `&&` rather than `||`.
* Add `HasMinAndMax`, and use this API for pruning.
Component(s)
C++, Parquet
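The difference between the two checks discussed under Solving can be sketched as follows. This is an illustration only: `StoredStatsSketch` and `HasMinOrMax` are hypothetical, and `HasMinAndMax` here only borrows the name proposed above, not an existing signature.

```cpp
#include <string>

// Hypothetical view of the per-side flags stored in the file metadata.
struct StoredStatsSketch {
  bool has_min = false;
  std::string min;
  bool has_max = false;
  std::string max;
};

// "Has min or max": the behavior this issue reports as a bug, because a
// single surviving side is enough to report both bounds as present.
bool HasMinOrMax(const StoredStatsSketch& stats) {
  return stats.has_min || stats.has_max;
}

// "Has min and max": both sides must be present before either is trusted.
bool HasMinAndMax(const StoredStatsSketch& stats) {
  return stats.has_min && stats.has_max;
}
```

For the stored example `has_min: true, min: "", has_max: false`, the `||` check reports statistics as present while the `&&` check does not, so a reader using the latter would not treat the empty `max` as a real bound.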