Extract Parquet statistics from Interval column #10801

Merged: 7 commits, Jun 6, 2024

marvinlanhenke
Contributor

Which issue does this PR close?

Closes #10752.

Rationale for this change

Since parquet does not support statistics for Interval columns, this PR only prepares test cases and some necessary setup.
Once parquet supports statistics, a follow-up PR will be needed to finish the implementation.

What changes are included in this PR?

  • added test cases for Interval columns
  • added a stub match arm in get_statistic
  • added comments marking the parts that are not yet supported
  • for now, the test cases are marked should_panic
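
As a rough illustration of the stub-plus-should_panic pattern listed above, here is a minimal sketch using only the standard library. The names `ColumnType` and `min_statistic` are hypothetical stand-ins, not the actual DataFusion code, and the use of `unimplemented!` in the stub arm is an assumption about how such a stub might look.

```rust
// Hypothetical stand-ins for illustration; not the DataFusion or arrow-rs types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int64,
    Interval,
}

/// Returns the minimum of `values` for column types whose statistics are supported.
fn min_statistic(column_type: ColumnType, values: &[i64]) -> Option<i64> {
    match column_type {
        ColumnType::Int64 => values.iter().copied().min(),
        // Stub arm (assumption): panics until interval statistics exist.
        ColumnType::Interval => unimplemented!("interval statistics are not supported"),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // The test documents the expected future behaviour, but is marked
    // should_panic so that it passes while the stub is still in place.
    #[test]
    #[should_panic]
    fn interval_min_is_not_yet_supported() {
        let _ = min_statistic(ColumnType::Interval, &[1, 2, 3]);
    }
}

fn main() {
    // Supported type: the minimum is computed normally.
    println!("{:?}", min_statistic(ColumnType::Int64, &[3, 1, 2]));
}
```

With this layout, a test only needs its should_panic attribute removed once the stub arm is replaced by a real implementation.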

Are these changes tested?

Are there any user-facing changes?

No.

github-actions bot added the core (Core DataFusion crate) label on Jun 5, 2024
@marvinlanhenke
Contributor Author

marvinlanhenke commented Jun 5, 2024

@alamb PTAL.

As described in the PR I tried to prepare as much as possible, although statistics are not yet supported.

Yet another issue: The IntervalUnit::MonthDayNano is also not supported.

So I guess we have to open 2 tickets in arrow-rs?

Contributor

@alamb alamb left a comment

Thank you @marvinlanhenke -- this looks very nice 👌 -- I left some comments but I think it is quite close.

Note that there are some conflicts, likely with #10711 which merged a few hours ago

I will file a ticket upstream in arrow-rs to support writing statistics for intervals (thank you for the code pointers)

datafusion/core/tests/parquet/mod.rs (outdated, resolved)
@alamb
Contributor

alamb commented Jun 5, 2024

> @alamb PTAL.
>
> As described in the PR I tried to prepare as much as possible, although statistics are not yet supported.
>
> Yet another issue: The IntervalUnit::MonthDayNano is also not supported.
>
> So I guess we have to open 2 tickets in arrow-rs?

I filed apache/arrow-rs#5847. I didn't quite understand your suggestion about two tickets

@marvinlanhenke
Contributor Author

> > @alamb PTAL.
> > As described in the PR I tried to prepare as much as possible, although statistics are not yet supported.
> > Yet another issue: The IntervalUnit::MonthDayNano is also not supported.
> > So I guess we have to open 2 tickets in arrow-rs?
>
> I filed apache/arrow-rs#5847. I didn't quite understand your suggestion about two tickets

I was thinking about two tickets:

  1. Interval statistics not supported; possibly due to those lines
  2. Type IntervalUnit::MonthDayNano not supported at all by the arrow_writer

I think those are two distinct issues?

@marvinlanhenke
Contributor Author

@alamb thanks for the review. I have addressed your comments, PTAL.

@alamb
Contributor

alamb commented Jun 5, 2024

>   1. Interval statistics not supported; possibly due to those lines
>   2. Type IntervalUnit::MonthDayNano not supported at all by the arrow_writer
>
> I think those are two distinct issues?

Yes I think you are right -- any chance you can file a ticket in arrow-rs to track writing IntervalMonthDayNano?

@marvinlanhenke
Contributor Author

marvinlanhenke commented Jun 5, 2024

> >   1. Interval statistics not supported; possibly due to those lines
> >   2. Type IntervalUnit::MonthDayNano not supported at all by the arrow_writer
> >
> > I think those are two distinct issues?
>
> Yes I think you are right -- any chance you can file a ticket in arrow-rs to track writing IntervalMonthDayNano?

Sure, I can do that.

I filed: apache/arrow-rs#5849

Contributor

@alamb alamb left a comment

Looks good to me -- thank you very much @marvinlanhenke

Contributor

@comphead comphead left a comment

lgtm @marvinlanhenke kudos for the test description

@comphead comphead merged commit 089b232 into apache:main Jun 6, 2024
23 checks passed
@tustvold
Contributor

tustvold commented Jun 11, 2024

As per the format specification, it is incorrect to read statistics for interval columns, as there is no defined sort order - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval
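
To illustrate why an interval has no single natural sort order, here is a small sketch. It assumes the Parquet INTERVAL layout of three unsigned 32-bit fields (months, days, milliseconds); the `Interval` struct and its `millis_with_month_length` helper are hypothetical and exist only for this example.

```rust
// Hypothetical illustration type; not the arrow-rs or parquet crate API.
// Deriving Ord compares fields in declaration order: months, days, millis.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Interval {
    months: u32,
    days: u32,
    millis: u32,
}

impl Interval {
    /// Elapsed milliseconds, assuming every month lasts `days_per_month` days.
    fn millis_with_month_length(&self, days_per_month: u64) -> u64 {
        let days = self.months as u64 * days_per_month + self.days as u64;
        days * 86_400_000 + self.millis as u64
    }
}

fn main() {
    let one_month = Interval { months: 1, days: 0, millis: 0 };
    let thirty_days = Interval { months: 0, days: 30, millis: 0 };

    // Field-by-field comparison ranks one month above thirty days.
    println!("field order: {:?}", one_month.cmp(&thirty_days));

    // Comparing by elapsed time depends on which month the interval is applied to.
    for days_per_month in [28u64, 31] {
        let a = one_month.millis_with_month_length(days_per_month);
        let b = thirty_days.millis_with_month_length(days_per_month);
        println!("{}-day month: {:?}", days_per_month, a.cmp(&b));
    }
}
```

Because the ordering of "1 month" and "30 days" flips with the calendar context, any min/max recorded for such a column would depend on an arbitrary choice of comparison.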

@marvinlanhenke
Contributor Author

marvinlanhenke commented Jun 11, 2024

@alamb
Since support for Interval statistics is neither planned nor possible, should we remove the code added here? As it might never be supported, we should probably get rid of potentially dead code. Less code, less problems...

@alamb
Contributor

alamb commented Jun 12, 2024

> @alamb Since support for Interval statistics is neither planned nor possible, should we remove the code added here? As it might never be supported, we should probably get rid of potentially dead code. Less code, less problems...

I think removing it would be fine @marvinlanhenke -- thank you

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
Labels
core Core DataFusion crate
Development

Successfully merging this pull request may close these issues.

Extract parquet statistics from Interval columns
4 participants