-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Zero-copy nested types with other GPU libraries (like Awkward array) #14959
Comments
I think #14926 is pretty relevant here. |
I agree that this is kinda the exact use case that #14926 is designed for. Along with something like a PyCapsule based protocol. |
I should add here that, from the Awkward Array side, any format that preserves all of the information is equally good. If given CuPy arrays (option 2), we might internally convert them to a format that follows a pyarrow array's Buffers so that we can reuse code that makes the adjustments between Arrow and Awkward, but that's our business. I suggested option 1, making a pyarrow array that would segfault if you touch it, because this works for us (we'll be careful to not dereference the GPU pointers) and if pyarrow ever does add the infrastructure to interpret it correctly, the same interface on cuDF will work for both Awkward and Arrow. |
Ping on this, @shwina ; I gather work is ongoing in the linked issue, but I would appreciate a brief summary here of status and what we can expect for awkward integration. |
Thanks, @martindurant - I believe we should see a PR up for #14926 soon. At that point, we would be very grateful if you could provide feedback or perhaps do some early testing! |
Certainly, just let us know |
Just a note that #14926 will first yield the C++ level functions and C structs, and there would likely need to be a follow up in implementing the Python protocol around it. The issue tracking that work in Arrow is here: apache/arrow#38325 |
Wouldn't |
Would be very grateful if @paleolimbot could advise here! |
It's been touched on here, but I think the intention is ( apache/arrow#38325 ) to add a protocol When nanoarrow for Python has matured a bit it might be able to help export (and test), but the Cython needed to make the required Capsule is pretty compact and any library doing exporting should probably just copy it (or translate it to pybind11 or nanobind): https://github.com/apache/arrow-nanoarrow/blob/main/python/src/nanoarrow/_lib.pyx#L112-L127 . |
@martindurant just an update here that I'm waiting for #15047 to take some shape before I try and kick the tires with accessing from Python. |
@jpivarski , can you please link the experimental conversions code you wrote in awkward? |
This is the script that I used to test conversion of CuDF's Arrow data into Awkward. (The other direction should be even easier.) https://github.com/scikit-hep/awkward/blob/main/studies/cudf-to-awkward.py |
Just wanted to provide a quick status update here. I've put together a prototype of the device data capsule protocols in #15370. It's not usable yet for a number of reasons, largely boiling down to the need for a D2D copy at the moment (although that may still be enough of an improvement over the current D2H2D that our existing to/from_arrow methods do that you'd still find it useful for testing), but we should be able to make some progress on that soon. I've started the discussion on how best to proceed here. |
In a conversation with @martindurant and @jpivarski, it came up that there's no supported way to exchange data zero copy between cuDF and Awkward Array (which has GPU support).
The standard 0-copy mechanisms like dlpack and
__cuda_array_interface__
don't support nested types like lists or structs. And ourto/from_arrow()
methods convert to and from host data so they're not useful when we want to 0-copy device data.Option 1
We support a
gpu=True
(or similar) keyword argument into_arrow()
which would then return a PyArrow array backed by device data. Now, PyArrow does not seemingly support it, but it's possible to create a PyArrow array backed by device data:The problem (as can be seen above) is that PyArrow thinks this is a CPU-backed buffer. So attempting to do anything with it segfaults:
Option 2
We could expose new
Series.to_buffers()
andSeries.from_buffers()
functions that would produce and consume GPU buffers (along with a schema), presumably in the same order as arrow's from_buffers and buffers methods. We could use CuPy arrays to represent the buffers.Curious what folks think? Interested also in @kkraus14's thoughts here if any.
The text was updated successfully, but these errors were encountered: