Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Initial PyArrow writer #566

Merged
merged 28 commits into from
Mar 20, 2022
Merged

Conversation

wjones127
Copy link
Collaborator

@wjones127 wjones127 commented Feb 27, 2022

Description

Implements a write_to_deltalake function, for creating, appending to, and overwriting delta tables. All other operations will be done in future PRs. This will only support writer protocol version 1; my impression is most existing tables are at least version 2, so upgrading protocol support will be a major priority in follow-up work.

⚠️ I tried to make this a pretty minimal writer, so the PR wouldn't be so huge. It's marked in the docs as "experimental" for now, so people shouldn't use this for any production use yet.

In order to create this function, I had to make a handful of related changes:

  • Added python as an optional feature to pass through the pyarrow feature of Arrow. This gives us access to the conversions between arrow-rs schema and pyarrow schema.
  • Added add_actions parameter to DeltaTable.create().
  • Changed the list field name from "element" to "item" and the map main field name from "element" to "entries". This aligns with how the Rust and C++ implementations default to, and it's important for lists right now because equality checks on arrays care about the field names.
  • Added intersphinx extension so that type signatures referencing other modules (e.g. Pandas, PySpark, PyArrow) will link to respective API docs.
  • Added file python/tests/__init__.py. I found I wasn't seeing any coverage data collected without that file present.

TODO:

  • Remove extra sketch work
  • Add round trip and protocol tests
  • Write file statistics
  • Update user guide
  • Update feature support on readme
  • Test robustness of partition column parsing
  • Test accuracy of file stats
  • Check min writer version

Some follow-up work for future PRs below, in order of importance. I'll create separate issues for these soon.

  • Enable passing in filesystems
  • Test writer with all filesystems
  • Writer protocol 2:
    • Add support for enforcing delta.appendOnly
    • Add support for enforcing column invariants
  • Enable passing in metadata (like table name and description)
  • Enable writing statistics for date32 and decimal columns

Related Issue(s)

Documentation

@wjones127 wjones127 marked this pull request as ready for review March 12, 2022 05:16
@wjones127 wjones127 changed the title [Python] Initial PyArrow writer [WIP] [Python] Initial PyArrow writer Mar 12, 2022
@houqp
Copy link
Member

houqp commented Mar 12, 2022

sweet, @wjones127 this PR is ready for review now?

@wjones127
Copy link
Collaborator Author

@houqp I have a few issues to fix with mypy and some other unit tests to get the CI green, but besides that it is ready for review. I expect to get those green sometime tomorrow, but don't feel like you need to wait for that.

@wjones127
Copy link
Collaborator Author

@houqp Actually looks like there might be a blocking issue in the filesystem implementation. We seem to be hitting this unimplemented here: https://github.com/delta-io/delta-rs/blob/main/rust/src/storage/file/rename.rs#L104 in the python_build on Ubuntu 20.04. Sounds like this was expected, given #148.

@houqp houqp requested a review from fvaleye March 15, 2022 06:32
@houqp
Copy link
Member

houqp commented Mar 15, 2022

Oh yeah, totally forgot about that part :( If the tests works fine on your local machine, we could skip these write tests in CI for now and tackle it as a follow up issue.

The rest of the code change looks good to me 👍 This is a big milestone, I believe the first native delta write from python?

@wjones127
Copy link
Collaborator Author

If the tests works fine on your local machine, we could skip these write tests in CI for now and tackle it as a follow up issue.

Yeah it does all pass on Mac OS (intel), which makes sense. I guess I can add skips to the tests that checks for Windows or Linux with old glibc.

@wjones127
Copy link
Collaborator Author

I have gated those writer tests behind an OS / glibc version check.

I'm seeing there's some error in the CI for a warning in Sphinx I don't have in my local. I think it's likely related to the old Python version (the CI is running 3.6), but haven't yet been able to successfully setup that version of Python on my local. As an aside, PyArrow dropped support (or at least stopped releasing wheels on PyPI) for Python 3.6 in version 7.0.0.

@houqp
Copy link
Member

houqp commented Mar 16, 2022

I think it's fair to bump ci python version to 3.6. I took a quick look at the error and the code, it's also not obvious to me why it's throwing that error 🤔 Maybe something to do with type annotation. Let's give the bump version bump (3.7?) a try and see how it goes.



def _is_old_glibc_version():
if "CS_GNU_LIBC_VERSION" in os.confstr_names:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL :D

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dug that snippet up from this obscure thread: https://bugs.python.org/issue35389

@wjones127
Copy link
Collaborator Author

I reproduced the same error as in CI on Python 3.7. I think Sphinx might just not support typing_extensions.Literal, given their tests skip on versions before 3.8: https://github.com/tk0miya/sphinx/blob/52787deb32305fad40569758e5d17180d729862a/tests/test_domain_py.py#L353

My inclination is to do the docs in a newer Python. For testing purposes alone, it would be nice to have one build in Python 3.6 with PyArrow 4.0 and one with latest Python and latest PyArrow. I think I'll submit a second PR with just those CI changes, and can rebase this one once that's done.

@houqp
Copy link
Member

houqp commented Mar 17, 2022

ha, good catch on the upstream test! I agree with you it would be nice to test multiple python versions in CI and use >=3.8 for doc build to workaround this issue.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@houqp houqp merged commit 346f51a into delta-io:main Mar 20, 2022
@houqp
Copy link
Member

houqp commented Mar 20, 2022

Epic work @wjones127 :D

@matthewmturner
Copy link
Contributor

Epic work @wjones127 :D

+1

@dblairski
Copy link

dblairski commented Mar 22, 2022

Does this release not work on Windows OS environments yet?
noWindowsSupportYet

@wjones127
Copy link
Collaborator Author

@dblairski I haven't tested on Windows yet, but in the platform_specific_rename function we use for writing commits doesn't seem implemented for Windows:

unsafe fn platform_specific_rename(from: *const libc::c_char, to: *const libc::c_char) -> i32 {

I've just started on #570, in which I think I should address this.

@wjones127 wjones127 deleted the pyarrow-writer branch March 22, 2022 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants