[Python] Expose more metadata in pyarrow.parquet.ParquetFile.metadata #34180

Open
deanm0000 opened this issue Feb 14, 2023 · 5 comments


@deanm0000

Describe the enhancement requested

I'm not sure whether this issue pertains to all Arrow implementations, including pyarrow, or just C++, but it is related to #14870.

I'm guessing it affects pyarrow, since pq.ParquetFile.metadata.to_dict()['row_groups'][0]['columns'][0]['statistics'].keys() doesn't include the min_value and max_value keys.

So the feature request is to include min_value and max_value in that metadata.

Additionally, I think there's metadata on whether or not a column is sorted (I might be confused on that point), but if there is, it'd be good to see that too.

Component(s)

Python

@AlenkaF AlenkaF changed the title Expose more metadata in pyarrow.parquet.ParquetFile.metadata [Python] Expose more metadata in pyarrow.parquet.ParquetFile.metadata Feb 14, 2023
@jorisvandenbossche
Member

@deanm0000 the statistics indeed don't have those exact min_value / max_value keys, but they do have min / max and min_raw / max_raw keys. Do those already do what you are looking for?
(It seems the _raw ones are only available as attributes on the object, e.g. meta.row_group(0).column(0).statistics.min_raw, and not in the dict.)
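
For anyone wanting to check what a given file exposes, here is a minimal sketch, assuming pyarrow is installed. The helper name statistics_keys and the in-memory sample table are illustrative only and not part of pyarrow. The helper lists the keys in each column's statistics dict, and demo() reads the _raw values from the statistics object, since they do not appear in the dict.

```python
def statistics_keys(metadata_dict):
    """Map (row_group_index, column_path) -> sorted statistics keys.

    `metadata_dict` is expected to have the shape returned by
    ParquetFile.metadata.to_dict(). Columns without statistics
    map to an empty list.
    """
    result = {}
    for rg_index, row_group in enumerate(metadata_dict["row_groups"]):
        for column in row_group["columns"]:
            stats = column.get("statistics") or {}
            result[(rg_index, column.get("path_in_schema"))] = sorted(stats)
    return result


def demo():
    # Imported lazily so the helper above stays usable without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a tiny table to an in-memory buffer (no files needed).
    sink = pa.BufferOutputStream()
    pq.write_table(pa.table({"x": [3, 1, 2]}), sink)
    meta = pq.ParquetFile(pa.BufferReader(sink.getvalue())).metadata

    # Keys available through the dict representation.
    print(statistics_keys(meta.to_dict()))

    # Attribute access on the statistics object itself.
    stats = meta.row_group(0).column(0).statistics
    print(stats.has_min_max, stats.min, stats.max)
    print(stats.min_raw, stats.max_raw)
```

Calling demo() prints the dict keys per column followed by the attribute values; which keys appear may vary between pyarrow versions.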

@deanm0000
Author

So my thinking is that some readers and optimizers use the "newer" min_value and max_value stats when they plan queries and filters (hopefully pyarrow.dataset is included in that, or will be). I'd like a way to verify that my parquet files have those stats. Since the min and max stats are deprecated, it seems fewer libraries will even look at them if they exist. As for min_raw and max_raw, I've never heard of them, so I'm not sure how valuable they are.

@jorisvandenbossche
Member

Starting from the next version (#34112), the min/max Python attributes will check for min_value/max_value in the actual Parquet metadata and use them if they are present, and otherwise fall back to the deprecated min/max values.
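
The fallback described here can be pictured with a small pure-Python sketch. It only illustrates the described behavior and is not the actual implementation from #34112. The function name resolve_min_max and the dict standing in for the raw Parquet statistics fields are hypothetical, and treating the two fields as a pair (rather than falling back field by field) is an assumption.

```python
def resolve_min_max(raw_stats):
    """Prefer min_value/max_value, else fall back to deprecated min/max.

    `raw_stats` is a hypothetical dict standing in for the statistics
    fields stored in the Parquet file metadata; an absent field is
    either missing from the dict or set to None.
    """
    new_min = raw_stats.get("min_value")
    new_max = raw_stats.get("max_value")
    # Assumption: the newer fields are only used when both are present.
    if new_min is not None and new_max is not None:
        return new_min, new_max
    return raw_stats.get("min"), raw_stats.get("max")


# Newer fields present: they win over the deprecated ones.
print(resolve_min_max({"min": 0, "max": 9, "min_value": 1, "max_value": 5}))  # (1, 5)
# Only deprecated fields present: fall back to them.
print(resolve_min_max({"min": 0, "max": 9}))  # (0, 9)
```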

@deanm0000
Author

Does the writer write min_value/max_value, and if so, has it done that for a while?

@jorisvandenbossche
Member

Yes, AFAIK we have been writing those values for a long time. As a quick test, I wrote a small table using pyarrow 1.0 (from almost 3 years ago) and inspected the thrift metadata of the file with parquet-tools: both min/max and min_value/max_value are set.
