-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(python/adbc_driver_manager): export handles and ingest data through python Arrow PyCapsule interface #1346
Conversation
memcpy(allocated, &self.schema, sizeof(CArrowSchema)) | ||
self.schema.release = NULL | ||
return capsule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are "moving" the schema here, while in nanoarrow I opted for a hard copy for the schema (using nanoarrow's ArrowSchemaDeepCopy
).
But I think the only advantage of a hard copy is that this means you can consume it multiple times? (or in the case of nanoarrow-python, that the nanoarrow Schema object is still valid and inspectable after it has been converted to eg a pyarrow.Schema)
For ADBC, I think the use case will be much more "receive handle and convert it directly once", given that the Handle object itself isn't useful at all (in contrast to nanoarrow.Schema), so moving here is probably fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving makes sense here.
def __arrow_c_array__(self, requested_schema=None) -> object: | ||
"""Consume this object to get a PyCapsule.""" | ||
if requested_schema is not None: | ||
raise NotImplementedError("requested_schema") | ||
|
||
cdef CArrowArray* allocated = <CArrowArray*> malloc(sizeof(CArrowArray)) | ||
allocated.release = NULL | ||
capsule = PyCapsule_New( | ||
<void*>allocated, "arrow_array", pycapsule_array_deleter, | ||
) | ||
memcpy(allocated, &self.array, sizeof(CArrowArray)) | ||
self.array.release = NULL | ||
return capsule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not being used at the moment, because I think none of the ADBC APIs are returning an ArrowArray (only ArrowSchema or ArrowArrayStream).
This handle is currently only used internally for ingesting data (bind
).
So I could also remove __arrow_c_array__
, given it is unused. If we require pyarrow >= 14, I could probably also remove this class entirely, because then we can use the capsule interface for ingesting data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And realizing now, this implementation is actually also wrong -> it needs to return two capsules, one for the ArrowArray but also one for the ArrowSchema. And this handle only has the array. So I don't think we can add this dunder here.
@@ -25,8 +25,8 @@ requires-python = ">=3.9" | |||
dynamic = ["version"] | |||
|
|||
[project.optional-dependencies] | |||
dbapi = ["pandas", "pyarrow>=8.0.0"] | |||
test = ["duckdb", "pandas", "pyarrow>=8.0.0", "pytest"] | |||
dbapi = ["pandas", "pyarrow>=14.0.1"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we OK with bumping this requirement? (I don't know who are already users of the python adbc packages that might be affected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured we should bump it just because our official guidance is to upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a huge deal now but if usage grows over time this could be a pain point for pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove it. I'm just not sure if we can express that you need the fix package if you're < 14.0.1 in the requirements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this change to update the minimum requirement. This PR doesn't strictly speaking need it, so let's keep the discussion to bump the minimum requirement separate.
|
||
|
||
@pytest.mark.sqlite | ||
def test_pycapsule(sqlite): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I further expanded the specific test that David started, but in addition (at least if requiring pyarrow>=14 for testing) I could also update the other tests above to do the import/export using capsules instead of the current calls to _import_from_c
), that then will also automatically give some more test coverage.
8d95ec8
to
3856a6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just one small question
@@ -25,8 +25,8 @@ requires-python = ">=3.9" | |||
dynamic = ["version"] | |||
|
|||
[project.optional-dependencies] | |||
dbapi = ["pandas", "pyarrow>=8.0.0"] | |||
test = ["duckdb", "pandas", "pyarrow>=8.0.0", "pytest"] | |||
dbapi = ["pandas", "pyarrow>=14.0.1"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured we should bump it just because our official guidance is to upgrade.
memcpy(allocated, &self.schema, sizeof(CArrowSchema)) | ||
self.schema.release = NULL | ||
return capsule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving makes sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Addresses #70
This PR adds the dunder methods to the Handle classes of the low-level interface (which already enables using the low-level interface without pyarrow and with the capsule protocol).
And secondly, in the places that accept data (eg ingest/bind), it now also accepts objects that implement the dunders in addition to hardcoded support for pyarrow.