
Implement Python Schema in Rust #684

Merged (10 commits into delta-io:main, Aug 10, 2022)
Conversation

wjones127 (Collaborator) commented Jul 11, 2022

Description

These changes move the schema code down into Rust, removing (most of) the JSON serialization and making the Python bindings closer to the Rust implementation. This will make it easier to add methods onto the Rust DeltaSchema and expose them in Python.

It also adds to_pyarrow() and from_pyarrow() methods to each of the classes, since this was easy to implement.

It does mean some API changes, such as removing the DataType base class, because we can't yet properly do inheritance in PyO3. I had to rewrite the unit tests for schemas, but I left the unit tests for all other modules unchanged (minus moving away from a deprecated function).

Also, I moved the main type stubs into the module, which I think means we will now ship them with the code. I haven't confirmed that yet, though.
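
As a rough sketch of how the new bindings look in use (the constructor and conversion calls are taken from snippets later in this thread; the import path and everything else is an assumption, not a definitive API):

import pyarrow as pa
from deltalake.schema import ArrayType, Field, StructType

# Delta schema types are now Rust-backed classes constructed directly in Python
nested = ArrayType(StructType([Field("x", "integer", True)]), True)

# ...and each class converts to and from the corresponding PyArrow type
pa_type = pa.list_(pa.int32())
delta_type = ArrayType.from_pyarrow(pa_type)
pa_type_back = delta_type.to_pyarrow()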

Related Issue(s)

Will also help with #592

Documentation

houqp (Member) commented Jul 11, 2022

This is a really nice improvement, thanks @wjones127 for picking it up!

wjones127 (Collaborator, Author) commented:

We will not be able to roundtrip field metadata until apache/arrow-rs#478 is addressed.
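
A hypothetical illustration of the limitation, assuming the top-level Schema class exposes the same from_pyarrow()/to_pyarrow() pair as the type classes:

import pyarrow as pa
from deltalake.schema import Schema  # assumed import path

pa_schema = pa.schema([pa.field("id", pa.int64(), metadata={"comment": "primary key"})])
delta_schema = Schema.from_pyarrow(pa_schema)
# Until apache/arrow-rs#478 lands, the field-level metadata does not survive the round trip:
delta_schema.to_pyarrow().field("id").metadata  # comes back empty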

@@ -42,7 +57,7 @@ impl TryFrom<&schema::SchemaTypeArray> for ArrowField {

     fn try_from(a: &schema::SchemaTypeArray) -> Result<Self, ArrowError> {
         Ok(ArrowField::new(
-            "element",
+            "item",
wjones127 (Collaborator, Author) commented:

This and the changes below are necessary to bring the names in line with the default field names in Rust and C++ Arrow, allowing easy round tripping between Arrow types and Delta Lake types.
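
For reference, the Arrow defaults this aligns with can be checked from PyArrow (a quick sketch, not delta-rs code):

import pyarrow as pa

pa.list_(pa.int32()).value_field.name    # "item" -- default list element name
m = pa.map_(pa.string(), pa.int32())
m.key_field.name, m.item_field.name      # ("key", "value") under an "entries" struct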

Tom-Newton (Contributor) commented Aug 6, 2022:

I think this change might cause issues when reading complex nested types. For example a map of arrays gives

pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: string>> from map<string, list<element: string> ('map_arrays')>

It seems PySpark always uses element and PyArrow cannot cast element -> item for some complex types. I have not seen any example where entries vs key_value causes problems.
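
A rough sketch of reproducing that failure (the construction is my own; behaviour reflects the PyArrow releases current at the time and may differ on newer versions):

import pyarrow as pa

# Map whose values are lists using the Spark-style element name "element"
spark_style = pa.map_(pa.string(), pa.list_(pa.field("element", pa.string())))
# Logically the same type with PyArrow's default list field name "item"
arrow_style = pa.map_(pa.string(), pa.list_(pa.string()))

arr = pa.array([[("k", ["a", "b"])]], type=spark_style)
arr.cast(arrow_style)  # ArrowNotImplementedError: Unsupported cast to map<...>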

Tom-Newton (Contributor) commented:

I created a test to reproduce this issue: https://github.com/Tom-Newton/delta-rs/pull/13

roeap (Collaborator) commented Aug 8, 2022:

@wjones127 @Tom-Newton - is this something blocking us from merging, or would this conflict somehow with #714?

Tom-Newton (Contributor) commented:

It's unrelated to #714. But it relates to making map types work correctly.

wjones127 (Collaborator, Author) commented:

I have reported an upstream issue: https://issues.apache.org/jira/browse/ARROW-17349

For now, I guess I'll see if I can align things with using element, since Spark doesn't seem to allow customizing that field name.

wjones127 (Collaborator, Author) commented:

I tried two approaches and neither worked very well:

  1. First, I tried changing back to always converting Delta Array types to ListTypes with "element" as the field name. With that change, we cannot roundtrip from Arrow -> Delta -> Arrow without re-mapping the field names, which I think will be annoying beyond just our test cases.
  2. Second, I tried changing it so that we always use the "element" field name whenever we read and write. But PyArrow errors when writing a table whose schema does not match the output schema, including field names. And if we try to cast, we run into exactly the same error you are getting now when reading.

Since this case isn't something we already support, I'm thinking the best course of action is to fix the issue upstream. Does that seem acceptable, @Tom-Newton?

Tom-Newton (Contributor) commented:

From my perspective this PR is a big improvement and does not introduce any regression, so I certainly think it is good to merge. However, it would have been even more helpful for making map types work if this particular change had not been needed.

I would like to understand why always using element doesn't work, though.

wjones127 (Collaborator, Author) commented Aug 10, 2022:

> I would like to understand why always using element doesn't work, though.

First, there is the annoyance that it breaks roundtripping:

pa_type = pa.list_(pa.int32()) # uses `item` as field name
delta_type = deltalake.schema.ArrayType.from_pyarrow(pa_type) # drops the field name
pa_type_back = delta_type.to_pyarrow() # adds `element` as field name
pa_type == pa_type_back # is False

But more importantly, it means that when writing to a Delta table, we need to make sure we use the field name "element" for list types. Otherwise when reading, we'll get the same error you were getting, but now complaining it can't cast from "item" to "element" instead of the other way around. That means when we write data, we have to cast any other field name to "element". But casting the field names is what's broken in the first place! So instead the writer would be broken in this edge case.

I guess ultimately it's a tradeoff: if we use "item", we break the reader in a certain edge case. If we use "element", we break the writer in that edge case and we break pyarrow type roundtripping. The latter outcome seems slightly worse to me (plus it's more code to write and maintain). Either way, we should have this fixed by PyArrow 10.0.0, which should be released in October.

Tom-Newton (Contributor) commented Aug 10, 2022:

Thanks for explaining 🙂. So if I understand correctly, the issue is that write_deltalake supports arbitrary pyarrow schemas, so depending on whether the user's data uses item or element, a cast may be needed.

For the round tripping, I guess you could make the opposite argument that pa_type = pa.list_(pa.field("element", pa.int32())) cannot be round-tripped with the current implementation.
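
Continuing the snippet above with the field names swapped (same assumed API), that opposite case looks like:

pa_type = pa.list_(pa.field("element", pa.int32()))  # Spark-style field name
delta_type = deltalake.schema.ArrayType.from_pyarrow(pa_type)  # drops the field name
pa_type_back = delta_type.to_pyarrow()  # comes back with `item` as field name
pa_type == pa_type_back  # is False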

My personal opinion is that I would rather the writer were broken in this edge case, but I'm only one user with a vested interest 😄.

@wjones127 wjones127 marked this pull request as ready for review August 2, 2022 04:19
@wjones127 wjones127 requested review from fvaleye and houqp August 2, 2022 04:19
roeap previously approved these changes Aug 2, 2022
roeap (Collaborator) left a comment:

looks great overall! left some minor comments.

python/src/lib.rs (outdated review thread, resolved)
("z", ArrayType(StructType([Field("x", "integer", True)]), True), True, None),
]

# TODO: are there field names we should reject?
roeap (Collaborator) commented:

The only thing I can think of is when we use the field names for partitioning. We have some tests for special characters. I'm not sure, though, whether we could (or should be able to) handle an "=" appearing in the field name.
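
A toy illustration of the "=" ambiguity for Hive-style partition paths (not delta-rs code; the column name here is hypothetical):

# Partition directories encode values as <column>=<value>, so a literal "="
# in the column name makes splitting the path segment ambiguous.
segment = "event=type=click"            # column hypothetically named "event=type", value "click"
key, _, value = segment.partition("=")
# key == "event", value == "type=click" -> the wrong column name is recovered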

wjones127 (Collaborator, Author) commented:

I think I may leave this for later. There may be complications with column mapping.

python/tests/test_schema.py (outdated review thread, resolved)
roeap previously approved these changes Aug 4, 2022
houqp previously approved these changes Aug 4, 2022
houqp (Member) left a comment:

This is really high-quality work, thanks @wjones127!

@wjones127 wjones127 dismissed stale reviews from houqp and roeap via eef3fb7 August 9, 2022 01:10
@wjones127 wjones127 force-pushed the python-schema branch 2 times, most recently from eef3fb7 to 8cdd31b on August 10, 2022 03:21
@wjones127 wjones127 enabled auto-merge (squash) August 10, 2022 03:58
@wjones127 wjones127 requested review from roeap and houqp August 10, 2022 03:58
@wjones127 wjones127 merged commit b885b7d into delta-io:main Aug 10, 2022

Successfully merging this pull request may close these issues: Create and bind Delta Schema structures with pyo3 for Python.