Implement Python Schema in Rust #684
Conversation
This is a really nice improvement, thanks @wjones127 for picking it up!

We will not be able to roundtrip field metadata until apache/arrow-rs#478 is addressed.
(force-pushed from 0aba557 to 197386c)
```diff
@@ -42,7 +57,7 @@ impl TryFrom<&schema::SchemaTypeArray> for ArrowField {

     fn try_from(a: &schema::SchemaTypeArray) -> Result<Self, ArrowError> {
         Ok(ArrowField::new(
-            "element",
+            "item",
```
This and the changes below are necessary to bring these in line with the default field names in Rust and C++ Arrow, allowing easy round tripping between Arrow types and Delta Lake types.
I think this change might cause issues when reading complex nested types. For example, a map of arrays gives:

```
pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: string>> from map<string, list<element: string> ('map_arrays')>
```

It seems PySpark always uses `element`, and PyArrow cannot cast `element` -> `item` for some complex types. I have not seen any example where `entries` vs `key_value` causes problems.
I created a test to reproduce this issue https://github.com/Tom-Newton/delta-rs/pull/13
@wjones127 @Tom-Newton - is this something blocking us from merging, or would this conflict somehow with #714?
It's unrelated to #714. But it relates to making map types work correctly.
I have reported an upstream issue: https://issues.apache.org/jira/browse/ARROW-17349

For now, I guess I'll see if I can align things with using `element`, since Spark seems to not allow customizing that field name.
Tried two approaches and neither worked very well:

- First, I tried changing back to always converting Delta Array types to ListTypes with "element" as the field name. With that change, we cannot roundtrip Arrow -> Delta -> Arrow without re-mapping the field names, which I think will be annoying beyond just our test cases.
- Second, I tried changing so that whenever we read and write, we always use the "element" field name. But PyArrow errors when trying to write a table with a schema not matching the output schema, including field names. And if we try to cast, we will run into exactly the same error you are getting now when reading.
Since this case isn't something we already support, I'm thinking the best course of action is to fix the issue upstream. Does that seem acceptable @Tom-Newton ?
From my perspective, this PR is a big improvement and does not introduce any regression, so I certainly think it is good to merge. However, if this particular change were not needed, it would have been even more helpful for making map types work.
I would like to understand why always using `element` doesn't work though.
> I would like to understand why always using `element` doesn't work though.
First, there is the annoyance that it breaks roundtripping:

```python
pa_type = pa.list_(pa.int32())  # uses `item` as field name
delta_type = deltalake.schema.ArrayType.from_pyarrow(pa_type)  # drops the field name
pa_type_back = delta_type.to_pyarrow()  # adds `element` as field name
pa_type == pa_type_back  # is False
```
But more importantly, it means that when writing to a Delta table, we need to make sure we use the field name "element" for list types. Otherwise when reading, we'll get the same error you were getting, but now complaining it can't cast from "item" to "element" instead of the other way around. That means when we write data, we have to cast any other field name to "element". But casting the field names is what's broken in the first place! So instead the writer would be broken in this edge case.
I guess ultimately it's a tradeoff: if we use "item", we break the reader in a certain edge case. If we use "element", we break the writer in that edge case and we break pyarrow type roundtripping. The latter outcome seems slightly worse to me (plus it's more code to write and maintain). Either way, we should have this fixed by PyArrow 10.0.0, which should be released in October.
Thanks for explaining 🙂. So if I understand correctly, the issue is that `write_deltalake` supports arbitrary pyarrow schemas, so depending on whether the user's data uses `item` or `element`, a cast may be needed.
For the round tripping, I guess you could make the opposite argument: `pa.list_(pa.field("element", pa.int32()))` cannot be round tripped with the current implementation.
My personal opinion is that I would rather the writer is broken in this edge case but I'm only one user with a vested interest 😄 .
Looks great overall! Left some minor comments.
```python
    ("z", ArrayType(StructType([Field("x", "integer", True)]), True), True, None),
]

# TODO: are there field names we should reject?
```
The only thing I can think of is when we use the field names for partitioning. We have some tests for special characters. I'm not sure, though, whether we could (or should be able to) handle an "=" appearing in the field name.
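To illustrate why an "=" in a field name is awkward: hive-style partition paths use "=" as the name/value separator, so the value can be percent-encoded but the name's "=" would make the path ambiguous to parse. A hypothetical sketch (not the actual delta-rs path logic):

```python
from urllib.parse import quote

def partition_path(name: str, value: str) -> str:
    # Hive-style layout: "<name>=<encoded value>". Special characters
    # in the value are percent-encoded, so they round trip safely.
    return f"{name}={quote(value, safe='')}"

assert partition_path("year", "2022") == "year=2022"
# An "=" inside the value is unambiguous once encoded:
assert partition_path("k", "a=b") == "k=a%3Db"
# But an "=" inside the field name itself would yield "a=b=1",
# which splits ambiguously on the first "=".
```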
I think I may leave this for later. There may be complications with column mapping.
This is really high quality work, thanks @wjones127 !
(force-pushed from eef3fb7 to 8cdd31b)
Description
These changes move the schema code down into Rust, removing (most of) the JSON serialization and making the Python bindings closer to the Rust implementation. This will make it easier to add methods onto the Rust DeltaSchema and expose them in Python.
It also adds `to_pyarrow()` and `from_pyarrow()` methods to each of the classes, since this was easy to implement.

It does mean some API changes, such as removing the `DataType` base class, because we can't yet properly do inheritance in PyO3. I had to rewrite the unit tests for schemas, but I left the unit tests for all other modules unchanged (minus moving away from a deprecated function).

Also, I moved the main type stubs into the module, which I think means we will now ship them with the code. I haven't confirmed that yet though.
Related Issue(s)
Will also help with #592
Documentation