-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet bool column SortOrder being UNSIGNED? #6544
Comments
Given a boolean doesn't have a sign, I would have expected it to be unsigned |
thanks @tustvold for the reply. Yeah that makes sense to me too, but both c++ and java implementation go with |
That is because older versions of those libraries used to write invalid statistics, see a few lines earlier where it checks the parquet version - https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc#L1482 This codepath should not be being encountered with data written by parquet-rs AFAICT |
yeah, i got that. But wouldn't it be better if parquet-rs can read files (with correct statistics or rejecting invalid statistics) written by other implementations? |
We faithfully reproduce the statistics as written within the file, one could raise the prospect of adding similar validity checking, perhaps you might file a separate issue for this if there isn't one already. However, as for the topic of this issue, perhaps you could provider a reproducer of the actual issue you are running into? AFAICT we are writing unsigned, which the type objectively is, and some other implementations choose to write signed, likely for backwards compatibility, but I can't see what issue this could cause. |
Thanks for the reply! I think a potential issue would be affecting whether we can use statistics to avoid unnecessary reading, e.g., pruning data based on the min/max. If parquet-rs cannot identify invalid statistics, then the user won't be able to tell if the statistics are trustable or would have risk of causing incorrect results. And I agree that it's more of how to valid statistics rather than whether bool should have a |
Maybe we can try to create a bug reproducer using a system like https://datafusion.apache.org/ or something else that uses the rust parquet reader (i.e. see if whatever statistics case we are concerned about can generate incorrect results) |
Describe the bug
Hi, please close if it's intended. It seems that SortOrder for bool is UNSIGNED in rust but SIGNED in c++ implementation. Should the rust implementation also be SIGNED?
To Reproduce
n/a
Expected behavior
Consistency?
Additional context
n/a
The text was updated successfully, but these errors were encountered: