
Implement Python Schema in Rust #684

Merged (10 commits into delta-io:main, Aug 10, 2022)
Conversation

wjones127 (Collaborator) commented Jul 11, 2022

Description

These changes move the schema code down into Rust, removing (most of) the JSON serialization and making the Python bindings closer to the Rust implementation. This will make it easier to add methods onto the Rust DeltaSchema and expose them in Python.

It also adds to_pyarrow() and from_pyarrow() methods to each of the classes, since this was easy to implement.

It does mean some API changes, such as removing the DataType base class, because we can't yet properly do inheritance in PyO3. I had to rewrite the unit tests for schemas, but I left the unit tests for all other modules unchanged (minus moving away from a deprecated function).

Also, I moved the main type stubs into the module, which I think means we will now ship them with the code. I haven't confirmed that yet, though.
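
As a rough sketch of how the new bindings look in use (the constructor and conversion calls are taken from snippets later in this thread; the import path and everything else is an assumption, not a definitive API):

import pyarrow as pa
from deltalake.schema import ArrayType, Field, StructType

# Delta schema types are now Rust-backed classes constructed directly in Python
nested = ArrayType(StructType([Field("x", "integer", True)]), True)

# ...and each class converts to and from the corresponding PyArrow type
pa_type = pa.list_(pa.int32())
delta_type = ArrayType.from_pyarrow(pa_type)
pa_type_back = delta_type.to_pyarrow()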

Related Issue(s)

Will also help with #592

Documentation

houqp (Member) commented Jul 11, 2022

This is a really nice improvement, thanks @wjones127 for picking it up!

wjones127 (Collaborator, Author) commented:

We will not be able to roundtrip field metadata until apache/arrow-rs#478 is addressed.
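
A hypothetical illustration of the limitation, assuming the top-level Schema class exposes the same from_pyarrow()/to_pyarrow() pair as the type classes:

import pyarrow as pa
from deltalake.schema import Schema  # assumed import path

pa_schema = pa.schema([pa.field("id", pa.int64(), metadata={"comment": "primary key"})])
delta_schema = Schema.from_pyarrow(pa_schema)
# Until apache/arrow-rs#478 lands, the field-level metadata does not survive the round trip:
delta_schema.to_pyarrow().field("id").metadata  # comes back empty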

@@ -42,7 +57,7 @@ impl TryFrom<&schema::SchemaTypeArray> for ArrowField {

     fn try_from(a: &schema::SchemaTypeArray) -> Result<Self, ArrowError> {
         Ok(ArrowField::new(
-            "element",
+            "item",
wjones127 (Collaborator, Author) commented:

This and the changes below are necessary to bring the names in line with the default field names in Rust and C++ Arrow, allowing easy round tripping between Arrow types and Delta Lake types.
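
For reference, the Arrow defaults this aligns with can be checked from PyArrow (a quick sketch, not delta-rs code):

import pyarrow as pa

pa.list_(pa.int32()).value_field.name    # "item" -- default list element name
m = pa.map_(pa.string(), pa.int32())
m.key_field.name, m.item_field.name      # ("key", "value") under an "entries" struct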

Tom-Newton (Contributor) commented Aug 6, 2022:

I think this change might cause issues when reading complex nested types. For example a map of arrays gives

pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: string>> from map<string, list<element: string> ('map_arrays')>

It seems PySpark always uses element and PyArrow cannot cast element -> item for some complex types. I have not seen any example where entries vs key_value causes problems.
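
A rough sketch of reproducing that failure (the construction is my own; behaviour reflects the PyArrow releases current at the time and may differ on newer versions):

import pyarrow as pa

# Map whose values are lists using the Spark-style element name "element"
spark_style = pa.map_(pa.string(), pa.list_(pa.field("element", pa.string())))
# Logically the same type with PyArrow's default list field name "item"
arrow_style = pa.map_(pa.string(), pa.list_(pa.string()))

arr = pa.array([[("k", ["a", "b"])]], type=spark_style)
arr.cast(arrow_style)  # ArrowNotImplementedError: Unsupported cast to map<...>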

Tom-Newton (Contributor) commented:

I created a test to reproduce this issue: https://github.com/Tom-Newton/delta-rs/pull/13

roeap (Collaborator) commented Aug 8, 2022:

@wjones127 @Tom-Newton - is this something blocking us from merging, or would this conflict somehow with #714?

Tom-Newton (Contributor) commented:

It's unrelated to #714. But it relates to making map types work correctly.

wjones127 (Collaborator, Author) commented:

I have reported an upstream issue: https://issues.apache.org/jira/browse/ARROW-17349

For now, I guess I'll see if I can align things with using element, since Spark doesn't seem to allow customizing that field name.

wjones127 (Collaborator, Author) commented:

I tried two approaches and neither worked very well:

  1. First, I tried changing back to always converting Delta Array types to ListTypes with "element" as the field name. With that change, we cannot roundtrip from Arrow -> Delta -> Arrow without re-mapping the field names, which I think will be annoying beyond just our test cases.
  2. Second, I tried changing it so that we always use the "element" field name whenever we read and write. But PyArrow errors when writing a table whose schema does not match the output schema, including field names. And if we try to cast, we run into exactly the same error you are getting now when reading.

Since this case isn't something we already support, I'm thinking the best course of action is to fix the issue upstream. Does that seem acceptable, @Tom-Newton?

Tom-Newton (Contributor) commented:

From my perspective this PR is a big improvement and does not introduce any regression, so I certainly think it is good to merge. However, it would have been even more helpful for making map types work if this particular change had not been needed.

I would like to understand why always using element doesn't work, though.

wjones127 (Collaborator, Author) commented Aug 10, 2022:

> I would like to understand why always using element doesn't work, though.

First, there is the annoyance that it breaks roundtripping:

pa_type = pa.list_(pa.int32()) # uses `item` as field name
delta_type = deltalake.schema.ArrayType.from_pyarrow(pa_type) # drops the field name
pa_type_back = delta_type.to_pyarrow() # adds `element` as field name
pa_type == pa_type_back # is False

But more importantly, it means that when writing to a Delta table, we need to make sure we use the field name "element" for list types. Otherwise when reading, we'll get the same error you were getting, but now complaining it can't cast from "item" to "element" instead of the other way around. That means when we write data, we have to cast any other field name to "element". But casting the field names is what's broken in the first place! So instead the writer would be broken in this edge case.

I guess ultimately it's a tradeoff: if we use "item", we break the reader in a certain edge case. If we use "element", we break the writer in that edge case and we break pyarrow type roundtripping. The latter outcome seems slightly worse to me (plus it's more code to write and maintain). Either way, we should have this fixed by PyArrow 10.0.0, which should be released in October.

Tom-Newton (Contributor) commented Aug 10, 2022:

Thanks for explaining 🙂. So if I understand correctly, the issue is that write_deltalake supports arbitrary pyarrow schemas, so depending on whether the user's data uses item or element, a cast may be needed.

For the round tripping, I guess you could make the opposite argument that pa_type = pa.list_(pa.field("element", pa.int32())) cannot be round-tripped with the current implementation.
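
Continuing the snippet above with the field names swapped (same assumed API), that opposite case looks like:

pa_type = pa.list_(pa.field("element", pa.int32()))  # Spark-style field name
delta_type = deltalake.schema.ArrayType.from_pyarrow(pa_type)  # drops the field name
pa_type_back = delta_type.to_pyarrow()  # comes back with `item` as field name
pa_type == pa_type_back  # is False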

My personal opinion is that I would rather the writer were broken in this edge case, but I'm only one user with a vested interest 😄.

@wjones127 wjones127 marked this pull request as ready for review August 2, 2022 04:19
@wjones127 wjones127 requested review from fvaleye and houqp August 2, 2022 04:19
roeap previously approved these changes Aug 2, 2022
roeap (Collaborator) left a comment:

looks great overall! left some minor comments.

python/src/lib.rs (outdated review thread, resolved)
("z", ArrayType(StructType([Field("x", "integer", True)]), True), True, None),
]

# TODO: are there field names we should reject?
roeap (Collaborator) commented:

The only thing I can think of is when we use the field names for partitioning. We have some tests for special characters. I'm not sure, though, whether we could (or should be able to) handle an "=" appearing in the field name.
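
A toy illustration of the "=" ambiguity for Hive-style partition paths (not delta-rs code; the column name here is hypothetical):

# Partition directories encode values as <column>=<value>, so a literal "="
# in the column name makes splitting the path segment ambiguous.
segment = "event=type=click"            # column hypothetically named "event=type", value "click"
key, _, value = segment.partition("=")
# key == "event", value == "type=click" -> the wrong column name is recovered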

wjones127 (Collaborator, Author) commented:

I think I may leave this for later. There may be complications with column mapping.

python/tests/test_schema.py (outdated review thread, resolved)
roeap previously approved these changes Aug 4, 2022
houqp previously approved these changes Aug 4, 2022
houqp (Member) left a comment:

This is really high-quality work, thanks @wjones127!

@wjones127 wjones127 dismissed stale reviews from houqp and roeap via eef3fb7 August 9, 2022 01:10
@wjones127 wjones127 force-pushed the python-schema branch 2 times, most recently from eef3fb7 to 8cdd31b on August 10, 2022 03:21
@wjones127 wjones127 enabled auto-merge (squash) August 10, 2022 03:58
@wjones127 wjones127 requested review from roeap and houqp August 10, 2022 03:58
@wjones127 wjones127 merged commit b885b7d into delta-io:main Aug 10, 2022

Successfully merging this pull request may close these issues: Create and bind Delta Schema structures with pyo3 for Python.