[C++][Python] Metadata from C data interface is not valid utf8 #20107

Open
asfimport opened this issue Feb 8, 2022 · 3 comments

Comments

@asfimport
Collaborator

While trying to roundtrip an extension from schema.metadata (see ARROW-13855 for details), I got invalid utf8, which imo goes against

A binary string describing the type’s metadata [1]

Specifically, a field

field = pyarrow.field("aa", UuidType())

contains the following:

key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28

with the value's data for this key being:

[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]

This is not valid UTF-8 (see e.g. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
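
For readers without Rust at hand, the same check can be sketched in Python. The helper name `utf8_error` is made up for this illustration; the byte values are the ones listed above.

```python
# The 28 bytes reported for the "ARROW:extension:metadata" value.
VALUE = bytes([
    128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
    105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
])


def utf8_error(data: bytes):
    """Return the UnicodeDecodeError raised by strict UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = utf8_error(VALUE)
if err is not None:
    # err.start is the offset of the first offending byte.
    print(f"invalid UTF-8 at byte offset {err.start}: {VALUE[err.start]:#04x}")
```

The leading byte 128 (0x80) is a UTF-8 continuation byte, so it cannot start a sequence and strict decoding fails right at offset 0.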

Maybe I am reading the values incorrectly? (null point?)
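
As a cross-check on the reading itself, here is a minimal sketch of the key/value layout as I understand it from the C data interface documentation in [1]: an int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length and the value bytes, all in native byte order and without null terminators. The function names `encode_metadata` and `decode_metadata` are hypothetical, and the buffer is built locally rather than read from a real `ArrowSchema`.

```python
import struct

# Native byte order, 4-byte signed integer (assumed from the spec in [1]).
I32 = "=i"


def encode_metadata(pairs):
    """Serialize (key, value) byte pairs into the assumed metadata layout."""
    out = [struct.pack(I32, len(pairs))]
    for key, value in pairs:
        out.append(struct.pack(I32, len(key)))
        out.append(key)
        out.append(struct.pack(I32, len(value)))
        out.append(value)
    return b"".join(out)


def decode_metadata(buf):
    """Parse the assumed layout back into a list of (key, value) byte pairs."""
    offset = 0

    def read_i32():
        nonlocal offset
        (n,) = struct.unpack_from(I32, buf, offset)
        offset += 4
        return n

    def read_bytes(n):
        nonlocal offset
        chunk = buf[offset:offset + n]
        offset += n
        return chunk

    pairs = []
    for _ in range(read_i32()):
        key = read_bytes(read_i32())
        value = read_bytes(read_i32())
        pairs.append((key, value))
    return pairs


PAIRS = [
    (b"ARROW:extension:name", b"arrow.py_extension_type"),
    (b"ARROW:extension:metadata", bytes([
        128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
        105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
    ])),
]

decoded = decode_metadata(encode_metadata(PAIRS))
for key, value in decoded:
    print(len(key), key, len(value))
```

The lengths this produces (20, 23, 24, 28) match the ones reported above, which suggests the values were read correctly and the bytes really are non-UTF-8. Keys and values are kept as `bytes` on purpose, since nothing in this layout forces a text encoding.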

[1] https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-15613. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Jorge Leitão / @jorgecarleitao:
cc @pitrou

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
There is actually a discussion about relaxing the UTF-8 requirement for IPC metadata values (see the message recently posted by @jorisvandenbossche, "Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams").

In short: yes, Arrow C++ and PyArrow can put arbitrary binary data in metadata values.

Also cc @lidavidm @emkornfield

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
(Side note: this might be just for quick testing, but if you actually want to use the extension type on the Rust side as well, you should probably define it in Python as a subclass of pyarrow.ExtensionType rather than pyarrow.PyExtensionType. The latter uses a pickle dump of the class as the serialized metadata, which you presumably won't be able to use in Rust.)
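
To illustrate the pickle point, the 28 bytes from the report can be disassembled with the standard-library `pickletools` module, which walks the opcodes without executing the pickle. This is only a sketch for inspecting those bytes; it does not involve pyarrow.

```python
import pickletools

# The "ARROW:extension:metadata" value from the original report.
VALUE = bytes([
    128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
    105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
])

# genops yields (opcode, argument, position) without unpickling anything.
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(VALUE)]

for name, arg in ops:
    print(name, arg)
```

The stream starts with a `PROTO` opcode (protocol 3, hence the leading 0x80 byte) followed by a `GLOBAL` opcode naming `test_sql UuidType`, i.e. a reference to a Python class in the reporter's own module. A non-Python consumer has no meaningful way to resolve that reference.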
