Allow to read parquet binary column as UTF8 type #6539
.build()
.expect("reader with schema");

arrow_reader.next();
Suggested change:
- arrow_reader.next();
+ arrow_reader.next().unwrap_err();
As this should error, given the data isn't actually UTF-8
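The point about the data not being valid UTF-8 can be illustrated without the Parquet reader at all. The following is a minimal sketch using only the Rust standard library. `validate_as_string` and `first_invalid_offset` are hypothetical helpers, not part of the parquet or arrow crates; they only mimic the kind of check a reader has to perform before exposing binary values as strings.

```rust
use std::str::Utf8Error;

/// Hypothetical helper (not part of the parquet crate): checks that a
/// binary value is valid UTF-8 before treating it as a string.
fn validate_as_string(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical helper: returns the byte offset of the first invalid
/// UTF-8 sequence, or `None` if the whole slice is valid.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    validate_as_string(bytes).err().map(|e| e.valid_up_to())
}
```

For example, `first_invalid_offset(&[b'a', b'b', 0xff])` reports offset 2, because `0xff` can never appear in valid UTF-8. A test that feeds such bytes through a string-typed schema should therefore assert on the error rather than ignore the result.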
.column(0)
.as_any()
.downcast_ref::<StringArray>()
.expect("downcast to string")
.iter()
Suggested change:
- .column(0)
- .as_any()
- .downcast_ref::<StringArray>()
- .expect("downcast to string")
- .iter()
+ .column(0)
+ .as_string::<i32>()
+ .iter()
And the same below
@@ -57,6 +57,11 @@ fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
(DataType::Utf8, DataType::LargeUtf8) => hint,
(DataType::Binary, DataType::LargeBinary) => hint,

// Read as Utf8
❤️
}

#[test]
#[should_panic(expected = "Invalid UTF8 sequence at")]
❤️
Thank you @goldmedal and @tustvold -- this looks great to me
Which issue does this PR close?
No related issue.
Rationale for this change
While working on apache/datafusion#12788 (comment) in DataFusion, I found that we can't read a Parquet binary column as a string type (Utf8, LargeUtf8, or Utf8View) through ArrowReaderOptions::with_schema. I think it makes sense to read such a column as strings if the user guarantees that its binary values are valid strings.
What changes are included in this PR?
I added some matching rules to apply_hint in parquet/src/arrow/schema/primitive.rs to handle the binary-to-string cases.
Are there any user-facing changes?
no
cc @alamb
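For readers following along, the matching rules described under "What changes are included in this PR?" can be modeled roughly as below. This is a simplified sketch, not the code in parquet/src/arrow/schema/primitive.rs: the `DataType` enum here is a small stand-in for arrow's much larger type, the exact set of binary-to-string arms is assumed from the description (Utf8, LargeUtf8, Utf8View), and the fallback of keeping the Parquet-derived type is an assumption of this model.

```rust
/// Simplified stand-in for arrow's `DataType`, with only the variants
/// needed for this illustration. The real type is not `Copy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Binary,
    LargeBinary,
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Toy model of hint resolution: accept the hint when it is a compatible
/// reinterpretation of the Parquet-derived type, otherwise keep the
/// Parquet-derived type (assumed fallback).
fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
    match (parquet, hint) {
        // Offset widening, as in the context lines of the diff
        (DataType::Utf8, DataType::LargeUtf8) => hint,
        (DataType::Binary, DataType::LargeBinary) => hint,
        // Read binary as a string type (assumed shape of the new rules)
        (DataType::Binary, DataType::Utf8)
        | (DataType::Binary, DataType::LargeUtf8)
        | (DataType::Binary, DataType::Utf8View) => hint,
        // Anything else: ignore the hint
        _ => parquet,
    }
}
```

In this model, `apply_hint(DataType::Binary, DataType::Utf8)` yields `DataType::Utf8`, while an unrelated pair such as `(DataType::Int32, DataType::Utf8)` keeps `DataType::Int32`. Accepting the hint only changes the declared type; the bytes still have to pass UTF-8 validation when they are read.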