-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract Parquet statistics from Interval
column
#10801
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @marvinlanhenke -- this looks very nice 👌 -- I left some comments but I think it is quite close.
Note that there are some conflicts, likely with #10711 which merged a few hours ago
I will file a ticket upstream in arrow-rs to support writing statistics for intervals (thank you for the code pointers)
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
I filed apache/arrow-rs#5847. I didn't quite understand your suggestion about two tickets |
I was thinking about two tickets:
I think those are two distinct issues? |
@alamb thanks for the review. I have adressed your comments PTAL. |
Yes I think you are right -- any chance you can file a ticket in arrow-rs to track writing IntervalMonthDayNano? |
Sure, I can do that. I filed: apache/arrow-rs#5849 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you very much @marvinlanhenke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm @marvinlanhenke kudos for the test description
As per the format specification, it is incorrect to read statistics for interval columns, as there is no defined sort order - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval |
@alamb |
I think removing it would be fine @marvinlanhenke -- thank you |
* feat: add make_batch + basic test
Which issue does this PR close?
Closes #10752.
Rationale for this change
Since parquet does not support statistics for
Interval
columns, this PR only prepares test cases and some necessary setup.Once parquet supports statistics, a follow-up PR will be needed to finish the implementation.
What changes are included in this PR?
get_statistic
match arm (stub)should_panic
Are these changes tested?
Are there any user-facing changes?
No.