Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-42222: [Python] Add bindings for CopyTo on RecordBatch and Array classes #42223

Merged
merged 7 commits into from
Aug 21, 2024

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jun 20, 2024

Rationale for this change

We have added bindings for the Device and MemoryManager classes (#41126), and as a next step we can expose the functionality to copy a full Array or RecordBatch to a specific memory manager.

What changes are included in this PR?

This adds a copy_to method on pyarrow Array and RecordBatch.

Are these changes tested?

Yes

Copy link

⚠️ GitHub issue #42222 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 20, 2024

I am not entirely sold on the API on the python side. While this currently 1:1 exposes the C++ CopyTo method as a copy_to() method, on the Python side we could also have just a copy() method (and for example the default memory manager could be the one of the calling object). However, ideally such a method could (optionally?) truncate sliced buffers and child arrays (xref #37878, #38806).

That requested functionality does not yet exist, though. But assuming we add this in the near future, I am not sure it would be great to have both a copy and copy_to method, as it feels those could be integrated into a single method on the python side.

@danepitkin
Copy link
Member

Overall, the current changes LGTM!

I do like your proposal of having one API. Would your alternative option be an API in the MemoryManager interface e.g. new_arr = MemoryManager.copy(array)?

arr = pa.array([0, 1, 2])
arr_copied = arr.copy_to(mm)
assert arr.equals(arr_copied)
assert arr.buffers()[1].address != arr_copied.buffers()[1].address
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check that the device and memory manager are as expected?

@@ -3549,6 +3549,29 @@ cdef class RecordBatch(_Tabular):
row_major, pool))
return pyarrow_wrap_tensor(c_tensor)

def copy_to(self, MemoryManager memory_manager):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could perhaps also accept a Device and copy to the Device's default memory manager, for usability.

shared_ptr[CRecordBatch] c_batch

with nogil:
c_batch = GetResultValue(self.batch.CopyTo(memory_manager.unwrap()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can memory_manager be None here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, with the current design it is a required keyword and so I should protect it from not being None (add a not None in the signature).

But we might want to allow it to be None, and in that case use the source's memory manager as destination device (i.e. copy to the same device)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think copying to the same memory manager is really useful, so we needn't allow None IMHO.

@jorisvandenbossche
Copy link
Member Author

I am not entirely sold on the API on the python side. While this currently 1:1 exposes the C++ CopyTo method as a copy_to() method, on the Python side we could also have just a copy() method

In addition, on the C++ side there is also ViewOrCopyTo that we should expose in Python, and so the question becomes if that should be a separate method or if that could also be combined in a single API.

@jorisvandenbossche
Copy link
Member Author

@pitrou do you have any thoughts on the API design here?

  • Either we simply mimic the C++ API exactly, i.e. add a copy_to() and view_or_copy_to() on Array/RecordBatch (although then on the Buffer class it is copy() and not copy_to() if we follow C++ .., which is a bit inconsistent)
  • or add a bit more typical python copy() method with an optional device argument (and default would then be to copy to the same device). But question is then if we would include the functionality of ViewOrCopyTo in that same method? (at which point the result could be only a "shallow" copy, i.e. same buffers in new Array/RecordBatch object) In any case we can't add a view() method as that is already taken on Array (for viewing as another data type)

@jorisvandenbossche
Copy link
Member Author

As another idea based on the Array API (although then deviating much more from the C++ API) could be a to_device() method (https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.to_device.html), which could quite naturally combine the functionality of CopyTo and ViewOrCopyTo, I think.

@pitrou
Copy link
Member

pitrou commented Aug 7, 2024

My preference would be on adding copy_to() and view_or_copy_to() methods.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 8, 2024 16:01
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit test-cuda-python

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 8, 2024
Copy link

github-actions bot commented Aug 8, 2024

Revision: 576d20d

Submitted crossbow builds: ursacomputing/crossbow @ actions-221d0f9993

Task Status
test-cuda-python GitHub Actions

@jorisvandenbossche
Copy link
Member Author

My preference would be on adding copy_to() and view_or_copy_to() methods.

OK, updated the PR and keeping the copy_to name then (and will add view_or_copy_to()) as well).

I don't have a strong opinion about the names so happy to go with those. I think my main concern is that IMO it is likely we are going to want to add a Array.copy() method at some point as well (that copies with truncation of sliced buffers), and then we have three similar method names.

@danepitkin
Copy link
Member

LGTM! There are just some numpydoc errors left

@apache apache deleted a comment from github-actions bot Aug 14, 2024
@apache apache deleted a comment from github-actions bot Aug 14, 2024
@@ -64,6 +64,9 @@ cdef class Device(_Weakrefable):
self.init(device)
return self

cdef inline shared_ptr[CDevice] unwrap(self) nogil:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my curiousity, why did you choose to set the unwrap with nogil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice it's what we do for almost all our unwrap functions (and the few that don't use it probably should for consistency)

But the reason is 1) that this function does not need to GIL (there is no Python interaction), so it can be marked with nogil, and 2) actually marking it as such ensures that we can call this method from within a with nogil: ... block (and given that calling unwrap() is typically needed when calling some C++ function, and we typically use with nogil: ... blocks when we are calling a C++ function, in practice there are quite some cases where we indeed use unwrap() inside such a block)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first section in this doc really helped me understand when/why to release the gil: https://cython.readthedocs.io/en/latest/src/userguide/nogil.html

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 20, 2024
@jorisvandenbossche
Copy link
Member Author

I will leave copy_or_view_to for a follow-up issue/PR, so I think this one is ready now

@jorisvandenbossche jorisvandenbossche merged commit ffee537 into apache:main Aug 21, 2024
16 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting changes Awaiting changes label Aug 21, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ffee537.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants