[Python] Expose more metadata in pyarrow.parquet.ParquetFile.metadata #34180
Comments
@deanm0000 the statistics indeed don't have those exact min_value / max_value keys, but it does have |
So my thoughts are that some readers and optimizers use the "newer" |
Starting from the next version (#34112), the min/max Python attributes will check for min_value/max_value in the actual Parquet metadata and, if those are not present, fall back to the deprecated min/max values.
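To make the described fallback concrete, here is a minimal pure-Python sketch. It is not the actual Arrow code: the `resolve_min_max` helper and the plain dict standing in for the Thrift statistics struct are hypothetical, and resolving each bound independently is an assumption.

```python
from typing import Any, Mapping, Optional, Tuple


def resolve_min_max(stats: Mapping[str, Any]) -> Tuple[Optional[Any], Optional[Any]]:
    """Prefer min_value/max_value, else use the deprecated min/max fields.

    `stats` is a hypothetical dict stand-in for the raw Parquet statistics.
    """
    def pick(new_key: str, old_key: str) -> Optional[Any]:
        # Use the newer field when it is set, otherwise the deprecated one.
        value = stats.get(new_key)
        return value if value is not None else stats.get(old_key)

    return pick("min_value", "min"), pick("max_value", "max")


# A file that carries both sets of fields: the newer ones win.
both = resolve_min_max({"min": 0, "max": 9, "min_value": 1, "max_value": 3})
# An older file that only carries the deprecated fields.
legacy = resolve_min_max({"min": 0, "max": 9})
```

With this sketch, `both` resolves to the newer pair and `legacy` to the deprecated pair, so callers see one min/max regardless of which fields the writer set.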
Does the writer write min_value/max_value, and if so, has it done that for a while?
Yes, AFAIK we have been writing those values for a long time. As a quick test, I used pyarrow 1.0 (from almost 3 years ago) to write a small table and inspected the Thrift metadata of the file with parquet-tools: both min/max and min_value/max_value are set.
Describe the enhancement requested
I'm not sure whether this issue applies to every Arrow implementation, including pyarrow, or only to C++, but it is related to #14870.
I'm guessing it affects pyarrow, because
pq.ParquetFile.metadata.to_dict()['row_groups'][0]['columns'][0]['statistics'].keys()
doesn't have the min_value and max_value keys.
So the feature request is to include min_value and max_value in that metadata.
Additionally, I think there is metadata indicating whether a column is sorted (I might be confused on that point); if there is, it would be good to expose that too.
Component(s)
Python