-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-42222: [Python] Add bindings for CopyTo on RecordBatch and Array classes #42223
GH-42222: [Python] Add bindings for CopyTo on RecordBatch and Array classes #42223
Conversation
|
I am not entirely sold on the API on the python side. While this currently 1:1 exposes the C++ That requested functionality does not yet exist, though. But assuming we add this in the near future, I am not sure it would be great to have both a |
Overall, the current changes LGTM! I do like your proposal of having one API. Would your alternative option be an API in the MemoryManager interface e.g. |
python/pyarrow/tests/test_device.py
Outdated
arr = pa.array([0, 1, 2]) | ||
arr_copied = arr.copy_to(mm) | ||
assert arr.equals(arr_copied) | ||
assert arr.buffers()[1].address != arr_copied.buffers()[1].address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also check that the device and memory manager are as expected?
python/pyarrow/table.pxi
Outdated
@@ -3549,6 +3549,29 @@ cdef class RecordBatch(_Tabular): | |||
row_major, pool)) | |||
return pyarrow_wrap_tensor(c_tensor) | |||
|
|||
def copy_to(self, MemoryManager memory_manager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could perhaps also accept a Device and copy to the Device's default memory manager, for usability.
python/pyarrow/table.pxi
Outdated
shared_ptr[CRecordBatch] c_batch | ||
|
||
with nogil: | ||
c_batch = GetResultValue(self.batch.CopyTo(memory_manager.unwrap())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can memory_manager
be None here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, with the current design it is a required keyword and so I should protect it from not being None (add a not None
in the signature).
But we might want to allow it to be None, and in that case use the source's memory manager as destination device (i.e. copy to the same device)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think copying to the same memory manager is really useful, so we needn't allow None IMHO.
In addition, on the C++ side there is also |
@pitrou do you have any thoughts on the API design here?
|
As another idea based on the Array API (although then deviating much more from the C++ API) could be a |
My preference would be on adding |
@github-actions crossbow submit test-cuda-python |
Revision: 576d20d Submitted crossbow builds: ursacomputing/crossbow @ actions-221d0f9993
|
OK, updated the PR and keeping the I don't have a strong opinion about the names so happy to go with those. I think my main concern is that IMO it is likely we are going to want to add a |
LGTM! There are just some numpydoc errors left |
@@ -64,6 +64,9 @@ cdef class Device(_Weakrefable): | |||
self.init(device) | |||
return self | |||
|
|||
cdef inline shared_ptr[CDevice] unwrap(self) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my curiousity, why did you choose to set the unwrap
with nogil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice it's what we do for almost all our unwrap
functions (and the few that don't use it probably should for consistency)
But the reason is 1) that this function does not need to GIL (there is no Python interaction), so it can be marked with nogil
, and 2) actually marking it as such ensures that we can call this method from within a with nogil: ...
block (and given that calling unwrap()
is typically needed when calling some C++ function, and we typically use with nogil: ...
blocks when we are calling a C++ function, in practice there are quite some cases where we indeed use unwrap()
inside such a block)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first section in this doc really helped me understand when/why to release the gil: https://cython.readthedocs.io/en/latest/src/userguide/nogil.html
I will leave |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ffee537. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
We have added bindings for the Device and MemoryManager classes (#41126), and as a next step we can expose the functionality to copy a full Array or RecordBatch to a specific memory manager.
What changes are included in this PR?
This adds a
copy_to
method on pyarrow Array and RecordBatch.Are these changes tested?
Yes
CopyTo
on RecordBatch and Array classes #42222