-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view
, rename config option to schema_force_view_types
#12232
Fix parquet statistics for ListingTable and Utf8View with schema_force_string_view
, rename config option to schema_force_view_types
#12232
Conversation
…ll solve the issue
schema_force_string_view
.schema_force_string_view
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- this PR is really nicely thought out and implemented.
I think the idea of putting the "should we use view types" on to the ParquetFormat
is much nicer than passing around boolean flags
I left some suggestoons on this PR for naming and comments
I think now that the next arrow release is in #12032 , perhaps you can merge up this branch and fixup the tests in preparation for merge?
@@ -380,6 +380,10 @@ config_namespace! { | |||
/// the filters are applied in the same order as written in the query | |||
pub reorder_filters: bool, default = false | |||
|
|||
/// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change so that all the reading configuration values are before the writing ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reviewing / considering this PR, I wonder if we should (in a follow on PR) rename this config flag to be schema_force_view_types
as it also applies to binary columns 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah...I just did it all in this PR because it felt weird to have 2 naming conventions at once. Hopefully that's ok. 😅
cb30be7
to
2438136
Compare
schema_force_string_view
.schema_force_string_view
, rename config option to schema_force_view_types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I think this PR looks good to merge from my perspective
…ew types are used
2438136
to
02036fb
Compare
Thanks again @wiedld |
Which issue does this PR close?
Closes #12123
Rationale for this change
On write: parquet file written with utf8/large-utf & binary/large-binary schema (so is in metadata).
On read: we would like to be able to read as the more performant view types.
Previous work has already used the
schema_force_string_view
to read into utf8view and binaryview, by passing around a bool to the ParquetOpener.This work is to focused on getting the parquet statistics, on read, to properly compute when reading as view types.
What changes are included in this PR?
schema_force_string_view
up a few lines, to be with the "read" (not write) config options.schema_force_string_view
There are two tests marked as incomplete: an expected panic and a commented out test. Both tests will successfully pass once the next (anticipated) arrow release occurs. In order to prove this, there is a PoC over here which branches of the in-progress WIP for the arrow upgrade and then adds the commits from this PR.. I merged in the latest main with the arrow-rs release (thank you!) and those tests are now passing.Are these changes tested?
Yes.
Are there any user-facing changes?
No.