Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet bool column SortOrder being UNSIGNED? #6544

Open
jp0317 opened this issue Oct 11, 2024 · 7 comments
Open

Parquet bool column SortOrder being UNSIGNED? #6544

jp0317 opened this issue Oct 11, 2024 · 7 comments
Labels

Comments

@jp0317
Copy link
Contributor

jp0317 commented Oct 11, 2024

Describe the bug
Hi, please close if it's intended. It seems that SortOrder for bool is UNSIGNED in rust but SIGNED in c++ implementation. Should the rust implementation also be SIGNED?

To Reproduce
n/a

Expected behavior
Consistency?

Additional context
n/a

@jp0317 jp0317 added the bug label Oct 11, 2024
@tustvold
Copy link
Contributor

Given a boolean doesn't have a sign, I would have expected it to be unsigned

@jp0317
Copy link
Contributor Author

jp0317 commented Oct 11, 2024

thanks @tustvold for the reply. Yeah that makes sense to me too, but both c++ and java implementation go with SIGNED, which can affect if the column statistics are considered correct.

@tustvold
Copy link
Contributor

That is because older versions of those libraries used to write invalid statistics, see a few lines earlier where it checks the parquet version - https://github.com/apache/arrow/blob/main/cpp/src/parquet/metadata.cc#L1482

This codepath should not be being encountered with data written by parquet-rs AFAICT

@jp0317
Copy link
Contributor Author

jp0317 commented Oct 11, 2024

yeah, i got that. But wouldn't it be better if parquet-rs can read files (with correct statistics or rejecting invalid statistics) written by other implementations?

@tustvold
Copy link
Contributor

We faithfully reproduce the statistics as written within the file, one could raise the prospect of adding similar validity checking, perhaps you might file a separate issue for this if there isn't one already. However, as for the topic of this issue, perhaps you could provider a reproducer of the actual issue you are running into? AFAICT we are writing unsigned, which the type objectively is, and some other implementations choose to write signed, likely for backwards compatibility, but I can't see what issue this could cause.

@jp0317
Copy link
Contributor Author

jp0317 commented Oct 11, 2024

Thanks for the reply! I think a potential issue would be affecting whether we can use statistics to avoid unnecessary reading, e.g., pruning data based on the min/max. If parquet-rs cannot identify invalid statistics, then the user won't be able to tell if the statistics are trustable or would have risk of causing incorrect results.

And I agree that it's more of how to valid statistics rather than whether bool should have a SortType of UNSIGNED or SIGNED, i.e., the validation logic should handle it properly.

@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

Maybe we can try to create a bug reproducer using a system like https://datafusion.apache.org/ or something else that uses the rust parquet reader (i.e. see if whatever statistics case we are concerned about can generate incorrect results)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants