Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Zero-copy nested types with other GPU libraries (like Awkward array) #14959

Open
shwina opened this issue Feb 2, 2024 · 14 comments
Open
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@shwina
Copy link
Contributor

shwina commented Feb 2, 2024

In a conversation with @martindurant and @jpivarski, it came up that there's no supported way to exchange data zero copy between cuDF and Awkward Array (which has GPU support).

The standard 0-copy mechanisms like dlpack and __cuda_array_interface__ don't support nested types like lists or structs. And our to/from_arrow() methods convert to and from host data so they're not useful when we want to 0-copy device data.

Option 1

We support a gpu=True (or similar) keyword argument in to_arrow() which would then return a PyArrow array backed by device data. Now, PyArrow does not seemingly support it, but it's possible to create a PyArrow array backed by device data:

In [5]: a = cp.asarray([1, 2, 3])

In [6]: buf = pa.foreign_buffer(a.data.ptr, a.nbytes, a)

In [7]: type(buf)
Out[7]: pyarrow.lib.Buffer

In [8]: print(buf)
<pyarrow.Buffer address=0x7f2f6fa00200 size=24 is_cpu=True is_mutable=False>

The problem (as can be seen above) is that PyArrow thinks this is a CPU-backed buffer. So attempting to do anything with it segfaults:

In [9]: arr = pa.Array.from_buffers(pa.int64(), len(a), buffers=[None, buf])

In [10]: print(arr)  # segfault

Option 2

We could expose new Series.to_buffers() and Series.from_buffers() functions that would produce and consume GPU buffers (along with a schema), presumably in the same order as arrow's from_buffers and buffers methods. We could use CuPy arrays to represent the buffers.


Curious what folks think? Interested also in @kkraus14's thoughts here if any.

@shwina shwina added feature request New feature or request Python Affects Python cuDF API. labels Feb 2, 2024
@shwina shwina self-assigned this Feb 2, 2024
@vyasr
Copy link
Contributor

vyasr commented Feb 2, 2024

I think #14926 is pretty relevant here.

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 2, 2024

I agree that this is kinda the exact use case that #14926 is designed for. Along with something like a PyCapsule based protocol.

@jpivarski
Copy link

I should add here that, from the Awkward Array side, any format that preserves all of the information is equally good. If given CuPy arrays (option 2), we might internally convert them to a format that follows a pyarrow array's Buffers so that we can reuse code that makes the adjustments between Arrow and Awkward, but that's our business.

I suggested option 1, making a pyarrow array that would segfault if you touch it, because this works for us (we'll be careful to not dereference the GPU pointers) and if pyarrow ever does add the infrastructure to interpret it correctly, the same interface on cuDF will work for both Awkward and Arrow.

@martindurant
Copy link

Ping on this, @shwina ; I gather work is ongoing in the linked issue, but I would appreciate a brief summary here of status and what we can expect for awkward integration.

@shwina
Copy link
Contributor Author

shwina commented Feb 9, 2024

Thanks, @martindurant - I believe we should see a PR up for #14926 soon. At that point, we would be very grateful if you could provide feedback or perhaps do some early testing!

@martindurant
Copy link

Certainly, just let us know

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 9, 2024

Just a note that #14926 will first yield the C++ level functions and C structs, and there would likely need to be a follow up in implementing the Python protocol around it. The issue tracking that work in Arrow is here: apache/arrow#38325

@shwina
Copy link
Contributor Author

shwina commented Feb 10, 2024

Wouldn't nanoarrow provide a way to access the DeviceArray from Cython?

@shwina
Copy link
Contributor Author

shwina commented Feb 16, 2024

Wouldn't nanoarrow provide a way to access the DeviceArray from Cython?

Would be very grateful if @paleolimbot could advise here!

@paleolimbot
Copy link

It's been touched on here, but I think the intention is ( apache/arrow#38325 ) to add a protocol __arrow_c_device_array__() to mirror how __arrow_c_array__() works but with explicit non-CPU support ( https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html ).

When nanoarrow for Python has matured a bit it might be able to help export (and test), but the Cython needed to make the required Capsule is pretty compact and any library doing exporting should probably just copy it (or translate it to pybind11 or nanobind): https://github.com/apache/arrow-nanoarrow/blob/main/python/src/nanoarrow/_lib.pyx#L112-L127 .

@shwina
Copy link
Contributor Author

shwina commented Feb 26, 2024

@martindurant just an update here that I'm waiting for #15047 to take some shape before I try and kick the tires with accessing from Python.

@martindurant
Copy link

@jpivarski , can you please link the experimental conversions code you wrote in awkward?

@jpivarski
Copy link

This is the script that I used to test conversion of CuDF's Arrow data into Awkward. (The other direction should be even easier.)

https://github.com/scikit-hep/awkward/blob/main/studies/cudf-to-awkward.py

@vyasr
Copy link
Contributor

vyasr commented Mar 21, 2024

Just wanted to provide a quick status update here. I've put together a prototype of the device data capsule protocols in #15370. It's not usable yet for a number of reasons, largely boiling down to the need for a D2D copy at the moment (although that may still be enough of an improvement over the current D2H2D that our existing to/from_arrow methods do that you'd still find it useful for testing), but we should be able to make some progress on that soon. I've started the discussion on how best to proceed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
Status: Todo
Development

No branches or pull requests

6 participants