[Python] Initial PyArrow writer #566
Force-pushed from 6396d4e to 5c57a59.
Sweet! @wjones127, is this PR ready for review now?
@houqp I have a few issues to fix with mypy and some other unit tests to get the CI green, but besides that it is ready for review. I expect to get those green sometime tomorrow, but don't feel like you need to wait for that.
@houqp Actually looks like there might be a blocking issue in the filesystem implementation. We seem to be hitting this unimplemented here: https://github.com/delta-io/delta-rs/blob/main/rust/src/storage/file/rename.rs#L104 in the python_build on Ubuntu 20.04. Sounds like this was expected, given #148.
Oh yeah, totally forgot about that part :( If the tests work fine on your local machine, we could skip these write tests in CI for now and tackle it as a follow-up issue. The rest of the code change looks good to me 👍 This is a big milestone, I believe the first native delta write from Python?
Yeah, it does all pass on macOS (Intel), which makes sense. I guess I can add skips to the tests that check for Windows or Linux with old glibc.
Force-pushed from 842eb63 to af8d966.
I have gated those writer tests behind an OS / glibc version check. I'm seeing an error in the CI for a Sphinx warning that I don't get locally. I think it's likely related to the old Python version (the CI is running 3.6), but I haven't yet been able to successfully set up that version of Python locally. As an aside, PyArrow dropped support (or at least stopped releasing wheels on PyPI) for Python 3.6 in version 7.0.0.
I think it's fair to bump the CI Python version from 3.6. I took a quick look at the error and the code, and it's also not obvious to me why it's throwing that error 🤔 Maybe it has something to do with type annotations. Let's give the version bump (3.7?) a try and see how it goes.
    def _is_old_glibc_version():
        if "CS_GNU_LIBC_VERSION" in os.confstr_names:
TIL :D
Dug that snippet up from this obscure thread: https://bugs.python.org/issue35389
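For readers following along, the gate discussed in this thread could look roughly like the sketch below. It is not the code from this PR: the helper names, the `raw` parameter, and the 2.28 threshold are assumptions made for illustration. Only `os.confstr` and the `CS_GNU_LIBC_VERSION` key come from the snippet under review.

```python
import os


def parse_glibc_version(confstr_value):
    """Parse a string such as 'glibc 2.31' into a (major, minor) tuple.

    Hypothetical helper; returns None when the value is missing or does not
    look like a glibc version string.
    """
    if not confstr_value:
        return None
    parts = confstr_value.split()
    if len(parts) != 2 or parts[0] != "glibc":
        return None
    try:
        major, minor = parts[1].split(".")[:2]
        return int(major), int(minor)
    except ValueError:
        return None


def is_old_glibc(minimum=(2, 28), raw=None):
    """Return True if the running glibc is older than `minimum`.

    The (2, 28) default is an assumed threshold, not taken from the PR.
    `raw` lets callers inject a version string, which keeps this testable.
    """
    if raw is None:
        # os.confstr_names only exists on Unix-like platforms.
        if "CS_GNU_LIBC_VERSION" not in getattr(os, "confstr_names", {}):
            return False
        try:
            raw = os.confstr("CS_GNU_LIBC_VERSION")
        except (OSError, ValueError):
            return False
    version = parse_glibc_version(raw)
    return version is not None and version < minimum
```

A test module could then mark the writer tests with something like `pytest.mark.skipif(is_old_glibc(), reason="rename not implemented on old glibc")`. Comparing integer tuples rather than raw strings avoids ordering mistakes such as "2.9" sorting after "2.28".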
I reproduced the same error as in CI on Python 3.7. I think Sphinx might just not support My inclination is to do the docs in a newer Python. For testing purposes alone, it would be nice to have one build in Python 3.6 with PyArrow 4.0 and one with latest Python and latest PyArrow. I think I'll submit a second PR with just those CI changes, and can rebase this one once that's done.
Ha, good catch on the upstream test! I agree with you that it would be nice to test multiple Python versions in CI and use >=3.8 for the doc build to work around this issue.
Force-pushed from af8d966 to a4ebf7f.
LGTM
Epic work @wjones127 :D
+1
@dblairski I haven't tested on Windows yet, but the platform_specific_rename function we use for writing commits doesn't seem to be implemented for Windows: delta-rs/rust/src/storage/file/rename.rs Line 96 in 346f51a
I've just started on #570, in which I think I should address this.
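To illustrate why the rename matters for commits: a Delta commit has to be published under a version number that no other writer has taken, so the rename must fail instead of overwriting an existing file. The sketch below shows that semantics in Python using a hard link followed by an unlink. This is a swapped-in technique for illustration only, not what platform_specific_rename does in Rust, and `commit_path`, `rename_noreplace`, and `try_commit` are hypothetical names.

```python
import os
import tempfile


def commit_path(log_dir, version):
    """Path of the commit file for a table version (zero-padded to 20 digits)."""
    return os.path.join(log_dir, f"{version:020d}.json")


def rename_noreplace(src, dst):
    """Move src to dst, raising FileExistsError if dst already exists.

    Assumption: the filesystem supports hard links. os.link refuses to
    overwrite an existing destination, unlike os.replace.
    """
    os.link(src, dst)  # fails if dst exists
    os.unlink(src)     # drop the temporary name once the link is in place


def try_commit(log_dir, version, payload):
    """Write payload to a temp file, then publish it as the given version.

    Returns True on success, False if another writer already took the version.
    """
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    try:
        rename_noreplace(tmp_path, commit_path(log_dir, version))
        return True
    except FileExistsError:
        os.unlink(tmp_path)  # lost the race; clean up our temp file
        return False
```

A writer that gets `False` back would normally re-read the log and retry with the next version number.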
Description
Implements a write_to_deltalake function for creating, appending to, and overwriting delta tables. All other operations will be done in future PRs. This will only support writer protocol version 1; my impression is most existing tables are at least version 2, so upgrading protocol support will be a major priority in follow-up work.

In order to create this function, I had to make a handful of related changes:

- Added python as an optional feature to pass through the pyarrow feature of Arrow. This gives us access to the conversions between arrow-rs schema and pyarrow schema.
- Added an add_actions parameter to DeltaTable.create().
- Changed the list field name from "element" to "item" and the map main field name from "element" to "entries". This aligns with the defaults of the Rust and C++ implementations, and it's important for lists right now because equality checks on arrays care about the field names.
- Added python/tests/__init__.py. I found I wasn't seeing any coverage data collected without that file present.

TODO:
Some follow-up work for future PRs below, in order of importance. I'll create separate issues for these soon.
delta.appendOnly
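The writer protocol restriction in the description can be pictured as a guard that runs before any write. The sketch below is a minimal illustration and not the check implemented in this PR: it assumes the log is available as newline-delimited JSON actions and that the protocol action carries a minWriterVersion field, and every function and class name in it is hypothetical.

```python
import json

# Per the description, only writer protocol version 1 is supported for now.
MAX_SUPPORTED_WRITER_VERSION = 1


class UnsupportedWriterVersion(Exception):
    """Raised when a table requires a newer writer protocol than we support."""


def min_writer_version(log_lines):
    """Return the most recent minWriterVersion found in the log actions.

    Returns None when no protocol action is present.
    """
    version = None
    for line in log_lines:
        action = json.loads(line)
        if "protocol" in action:
            version = action["protocol"]["minWriterVersion"]
    return version


def check_can_write(log_lines):
    """Raise if the table demands a writer version above what we support."""
    required = min_writer_version(log_lines)
    if required is not None and required > MAX_SUPPORTED_WRITER_VERSION:
        raise UnsupportedWriterVersion(
            f"table requires writer version {required}, "
            f"but only {MAX_SUPPORTED_WRITER_VERSION} is supported"
        )
    return required
```

Failing early like this keeps a version 1 writer from appending to a table whose newer protocol features it cannot honor.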
Related Issue(s)
Documentation