Extension type support #2444
Thank you for the example, I can reproduce this. The parquet file is produced with the following base64-encoded arrow schema:
I took this and converted it back to raw bytes, then trimmed the first 8 bytes (as they're Arrow-specific delimiter information, not part of the flatbuffer).
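In Rust terms, those two steps look roughly like this (a sketch, not the exact tooling used above; the file names and the use of the `base64` crate are assumptions):

```rust
// Hypothetical sketch: decode the base64 ARROW:schema value and strip the
// 8-byte Arrow IPC framing prefix to recover the raw flatbuffer bytes.
use base64::Engine;

fn main() {
    // "schema.b64" is a placeholder for wherever the base64 string was saved.
    let encoded = std::fs::read_to_string("schema.b64").expect("read base64 schema");
    let bytes = base64::engine::general_purpose::STANDARD
        .decode(encoded.trim())
        .expect("valid base64");
    // The first 8 bytes are IPC framing, not flatbuffer data.
    std::fs::write("schema.fb", &bytes[8..]).expect("write flatbuffer bytes");
}
```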
I then decoded the flatbuffer
And got the error
This fits with the error that the Rust implementation is returning: the schema has non-UTF-8 data encoded in a string field, which is technically illegal. If I tell flatc to ignore this
I get the decoded data
I'm not really sure what to make of this; KeyValue is defined as
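For reference, the relevant definition in the Arrow format's Schema.fbs is (quoted from memory, so check the upstream file):

```
/// user defined key value pairs to add custom metadata to arrow
/// key namespacing is the responsibility of the user
table KeyValue {
  key: string;
  value: string;
}
```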
Both key and value are flatbuffer strings, which per the flatbuffer spec must contain valid UTF-8. I will try to get some clarity on what is going on here; my understanding of the specification is that the Rust implementation is correct to refuse this schema...
Ok, so it would appear that this is a known issue where pyarrow writes ill-formed flatbuffers (here) for extension types. There isn't really much we can do here: a flatbuffer string field should not contain non-UTF-8 data, and in the case of Rust, permitting this would not be sound (it could lead to UB). Having spoken with @jorgecarleitao, I'm led to believe arrow2 also takes the approach of rejecting this. The proper solution to the problem is for pyarrow to either base64 encode the payloads, or for the arrow specification to change.

That being said, the embedded metadata is a pickled Python class, which likely isn't hugely useful to a Rust client anyway. Perhaps you could use skip_arrow_metadata to tell the parquet reader to ignore the malformed embedded arrow schema and instead infer the data from the underlying parquet schema?
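As a rough illustration of that last suggestion, here is a minimal sketch against the parquet crate's arrow_reader module (the file name is a placeholder, and the exact builder API may differ between versions):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("extension.parquet")?;

    // Skip the embedded (malformed) ARROW:schema metadata and let the reader
    // infer an arrow schema from the parquet schema instead.
    let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
    let reader = builder.build()?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```

With the embedded schema skipped, the reader falls back to the schema inferred from the parquet metadata, so the extension annotation is simply lost rather than causing an error.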
@tustvold Thanks a lot! I replaced the generator with this one; it basically changed the
What is the behavior of the C++ parser?
Happy to be corrected on this, but I don't believe there is a standard for extension types; one could conceivably make the case that such a concept would be an oxymoron... If there is a general-purpose data type that is missing from the standard set, I'm sure the community would be willing to consider additions to the arrow specification, and this would be the path to portability for that data type. There is always going to be a trade-off between expressiveness and portability, with extension types sacrificing the latter in favour of the former. I'm not sure there is a way around this...

FWIW just supporting the standard arrow types is rife with excitement #1666
Closing this as I don't believe it is tracking any missing functionality; feel free to reopen if I am mistaken.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
My friends wrote a data format, lance, which can be converted to and from parquet, and I want to write another implementation in Rust.
However, they used the EXTENSION type, which does not seem to be implemented in arrow-rs.
Describe the solution you'd like
Let the reader convert parquet files containing the EXTENSION type to Arrow.
Describe alternatives you've considered
Additional context
Python code I used to generate such a file.
Rust code that failed to read it
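Roughly, the failing read looks like the following minimal sketch (the file name and error handling here are placeholders, not the original code):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("extension.parquet")?;
    // Building the reader decodes the embedded ARROW:schema metadata,
    // which is where the non-UTF-8 flatbuffer string is rejected.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        println!("{:?}", batch?);
    }
    Ok(())
}
```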
error log