
[FEA] Allow libraries that implement __array_ufunc__ to override CUDAUFunc #37

Open
ianna opened this issue Aug 1, 2024 · 10 comments
Labels
feature request New feature or request

Comments

ianna commented Aug 1, 2024

Is your feature request related to a problem? Please describe.
Array-like objects that define an __array_ufunc__ method (NEP-13) can be used with ufuncs created by numba.vectorize, as follows:

>>> import numba as nb
>>> @nb.vectorize()
... def _square(x):
...     return x * x
... 
>>> _square        
<numba._DUFunc '_square'>
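For context, NEP-13 dispatch works because NumPy hands the call to the first input that defines __array_ufunc__. A minimal pure-NumPy sketch of the protocol (the Squareable class is a hypothetical illustration, not part of awkward or numba):

```python
import numpy as np

class Squareable:
    """Minimal array-like that intercepts ufunc calls via NEP-13."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any Squareable inputs, apply the underlying ufunc,
        # then re-wrap the result in the same container type.
        unwrapped = [x.data if isinstance(x, Squareable) else x for x in inputs]
        return Squareable(getattr(ufunc, method)(*unwrapped, **kwargs))

s = Squareable([1.0, 2.0, 3.0])
out = np.multiply(s, s)  # NumPy defers to Squareable.__array_ufunc__
print(out.data)          # [1. 4. 9.]
```

A numba DUFunc participates in this same protocol on the CPU, which is what makes the _square example above work with array-like containers.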

We would like similar functionality on CUDA, to allow the following:

>>> @nb.vectorize(
...     target="cuda",
... )
... def _square_cuda(x):
...     return x * x
... 
>>> _square_cuda
<numba.cuda.vectorizers.CUDAUFuncDispatcher object at 0x7259f82fa790>

Describe the solution you'd like
Maybe this function is missing __array_ufunc__ handling?
https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/deviceufunc.py#L241-L329
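One possible shape for that handling (purely a hypothetical sketch, not the actual numba-cuda code): before the mechanism coerces arguments with np.asarray, it could check each input for __array_ufunc__ and defer to it, mirroring what NumPy's own ufuncs do under NEP-13. The maybe_defer helper and Wrapped class below are invented names for illustration:

```python
import numpy as np

def maybe_defer(ufunc_like, method, *inputs, **kwargs):
    """Hypothetical pre-dispatch hook: let NEP-13-aware inputs handle the call."""
    for arg in inputs:
        handler = getattr(type(arg), "__array_ufunc__", None)
        # Skip plain ndarrays: their default __array_ufunc__ is not an override.
        if handler is not None and not isinstance(arg, np.ndarray):
            result = handler(arg, ufunc_like, method, *inputs, **kwargs)
            if result is not NotImplemented:
                return result
    return NotImplemented  # caller falls back to the existing np.asarray path

class Wrapped:
    """Stand-in for an array-like (e.g. an awkward Array) that overrides ufuncs."""
    def __init__(self, data):
        self.data = np.asarray(data)
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        vals = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*vals, **kwargs))

square = lambda x: x * x  # stand-in for the compiled CUDA ufunc
res = maybe_defer(square, "__call__", Wrapped([1.0, 2.0, 3.0]))
print(res.data)  # [1. 4. 9.]
```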

Describe alternatives you've considered
If we wrap the call in a flatten/unflatten pair we can get this to work, but it is a bit clunky:

import awkward as ak
import cupy as cp
import numba as nb

ak.numba.register_and_check()

@nb.vectorize()
def _square(x):
    return x * x

@nb.vectorize(
    target="cuda",
)
def _square_cuda(x):
    return x * x

def square_cuda_wrapped(x):
    counts = x.layout.offsets.data[1:] - x.layout.offsets.data[:-1]
    return ak.unflatten(cp.array(_square_cuda(ak.flatten(x))), counts)

counts = cp.random.poisson(lam=3, size=5000000)
flat_values = cp.random.normal(size=int(counts.sum()))

values = ak.unflatten(flat_values, counts)

values2_cpu = _square(ak.to_backend(values, "cpu"))

print(values2_cpu)

values2 = square_cuda_wrapped(values)

print(values2)

Additional info
Version of Awkward Array is 2.6.6
Code to reproduce:

import awkward as ak
import cupy as cp
import numba as nb

ak.numba.register_and_check()

@nb.vectorize()
def _square(x):
    return x * x

@nb.vectorize(
    target="cuda",
)
def _square_cuda(x):
    return x * x

counts = cp.random.poisson(lam=3, size=50)
flat_values = cp.random.normal(size=int(counts.sum()))

values = ak.unflatten(flat_values, counts)

values2_cpu = _square(ak.to_backend(values, "cpu"))

print(values2_cpu)

values2 = _square_cuda(values)

print(values2)

resulting in the output:

[[0.045, 1.83], [1.55, 0.224, 0.621, ..., 1.24, 3.87], ..., [0.153, 0.0017]]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 33
     29 values2_cpu = _square(ak.to_backend(values, "cpu"))
     31 print(values2_cpu)
---> 33 values2 = _square_cuda(values)
     35 print(values2)

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/cuda/vectorizers.py:28, in CUDAUFuncDispatcher.__call__(self, *args, **kws)
     17 def __call__(self, *args, **kws):
     18     """
     19     *args: numpy arrays or DeviceArrayBase (created by cuda.to_device).
     20            Cannot mix the two types in one call.
   (...)
     26                   the input arguments.
     27     """
---> 28     return CUDAUFuncMechanism.call(self.functions, args, kws)

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:254, in UFuncMechanism.call(cls, typemap, args, kws)
    252 # Begin call resolution
    253 cr = cls(typemap, args)
--> 254 args = cr.get_arguments()
    255 resty, func = cr.get_function()
    257 outshape = args[0].shape

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:202, in UFuncMechanism.get_arguments(self)
    198 def get_arguments(self):
    199     """Prepare and return the arguments for the ufunc.
    200     Does not call to_device().
    201     """
--> 202     self._fill_arrays()
    203     self._fill_argtypes()
    204     self._resolve_signature()

File ~/.conda/envs/coffea-gpu/lib/python3.11/site-packages/numba/np/ufunc/deviceufunc.py:100, in UFuncMechanism._fill_arrays(self)
     98     self.scalarpos.append(i)
     99 else:
--> 100     self.arrays[i] = np.asarray(arg)

File ~/coffea-gpu/awkward/src/awkward/highlevel.py:1434, in Array.__array__(self, dtype)
   1429 with ak._errors.OperationErrorContext(
   1430     "numpy.asarray", (self,), {"dtype": dtype}
   1431 ):
   1432     from awkward._connect.numpy import convert_to_array
-> 1434     return convert_to_array(self._layout, dtype=dtype)

File ~/coffea-gpu/awkward/src/awkward/_connect/numpy.py:481, in convert_to_array(layout, dtype)
    480 def convert_to_array(layout, dtype=None):
--> 481     out = ak.operations.to_numpy(layout, allow_missing=False)
    482     if dtype is None:
    483         return out

File ~/coffea-gpu/awkward/src/awkward/_dispatch.py:64, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
     62 # Failed to find a custom overload, so resume the original function
     63 try:
---> 64     next(gen_or_result)
     65 except StopIteration as err:
     66     return err.value

File ~/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py:48, in to_numpy(array, allow_missing)
     45 yield (array,)
     47 # Implementation
---> 48 return _impl(array, allow_missing)

File ~/coffea-gpu/awkward/src/awkward/operations/ak_to_numpy.py:60, in _impl(array, allow_missing)
     57 backend = NumpyBackend.instance()
     58 numpy_layout = layout.to_backend(backend)
---> 60 return numpy_layout.to_backend_array(allow_missing=allow_missing)

File ~/coffea-gpu/awkward/src/awkward/contents/content.py:1020, in Content.to_backend_array(self, allow_missing, backend)
   1018 else:
   1019     backend = regularize_backend(backend)
-> 1020 return self._to_backend_array(allow_missing, backend)

File ~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:2072, in ListOffsetArray._to_backend_array(self, allow_missing, backend)
   2070     return buffer.view(np.dtype(("S", max_count)))
   2071 else:
-> 2072     return self.to_RegularArray()._to_backend_array(allow_missing, backend)

File ~/coffea-gpu/awkward/src/awkward/contents/listoffsetarray.py:283, in ListOffsetArray.to_RegularArray(self)
    278 _size = Index64.empty(1, self._backend.index_nplike)
    279 assert (
    280     _size.nplike is self._backend.index_nplike
    281     and self._offsets.nplike is self._backend.index_nplike
    282 )
--> 283 self._backend.maybe_kernel_error(
    284     self._backend[
    285         "awkward_ListOffsetArray_toRegularArray",
    286         _size.dtype.type,
    287         self._offsets.dtype.type,
    288     ](
    289         _size.data,
    290         self._offsets.data,
    291         self._offsets.length,
    292     )
    293 )
    294 size = self._backend.index_nplike.index_as_shape_item(_size[0])
    295 length = self._offsets.length - 1

File awkward/src/awkward/_backends/backend.py:67, in Backend.maybe_kernel_error(self, error)
     65     return
     66 else:
---> 67     raise ValueError(self.format_kernel_error(error))

ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-35/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    numpy.asarray(
        <Array [[0.2120250193289826, ...], ...] type='50 * var * float64'>
        dtype = None
    )
ianna added the feature request label on Aug 1, 2024
ianna commented Aug 1, 2024

duplicate of #36

ianna closed this as completed on Aug 1, 2024
gmarkall commented Aug 1, 2024

Reopening this as it's a much better-written request than mine in #36. My notes from #36:

lgray commented Aug 1, 2024

Awesome! Thank you both for getting this started.

lgray commented Aug 1, 2024

An off-the-cuff two cents: wouldn't __cuda_array_interface__ + __array_ufunc__ together imply __cuda_array_ufunc__, so a new interface may not be entirely necessary?

Dispatch to other ufuncs is already handled quite well by awkward, i.e. an awkward array with the CUDA backend can be passed straightforwardly to np.abs (or similar) and it correctly dispatches to cp.abs.
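That dispatch pattern can be mimicked in pure NumPy: an __array_ufunc__ override that looks up the incoming ufunc in a backend table, the way awkward maps np.abs to cp.abs on the CUDA backend. BackendArray and its table are hypothetical stand-ins, and the "backend" here is just NumPy again so the example runs without a GPU:

```python
import numpy as np

class BackendArray:
    """Sketch of backend-aware ufunc dispatch via NEP-13."""

    # In the real awkward/CuPy case this table would map np.abs -> cp.abs, etc.
    _backend_ufuncs = {np.absolute: np.absolute, np.multiply: np.multiply}

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        impl = self._backend_ufuncs.get(ufunc)
        if impl is None or method != "__call__":
            return NotImplemented  # unknown ufunc: let NumPy raise a TypeError
        unwrapped = [x.data if isinstance(x, BackendArray) else x for x in inputs]
        return BackendArray(impl(*unwrapped, **kwargs))

a = BackendArray([-1.0, 2.0, -3.0])
print(np.abs(a).data)  # [1. 2. 3.]
```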

lgray commented Aug 19, 2024

Just to bump this - has there been any further thought in this direction?

@gmarkall

@lgray - thanks for the bump - I haven't been able to look into this further yet, as I don't have enough of a grasp of the concepts to sketch out an implementation plan without spending time on research... I think you might be a bit ahead of me in your thinking about this - do you have some thoughts about what the implementation should / could look like?

lgray commented Aug 22, 2024

@gmarkall I don't really have recommendations on low level implementation, but I do know how we would like things to operate from a high level.

Essentially we'd like our data scientists (high energy particle physics experimenters and PhD students) to be able to design analyses on their laptops for CPU and redeploy them on GPU with a few configuration changes, using awkward array.

We can detect when data is on GPU vs. CPU and switch between kernels automatically and easily with awkward, so that's largely a matter of user interface.

What we need from numba-cuda is for it to interact seamlessly with awkward arrays that are on-device as well as it already does with host-side arrays and regular numba.

Training users to write effective CUDA kernels with numba is a different matter entirely that we will not touch here. I'm just considering pretty simple ufuncs that you get through @vectorize.

So really, on the backend we just need it to be able to identify ufuncs and then to distinguish when those ufuncs accept device-side arrays. I think some scaffolding is missing, and not much else - essentially to smooth the experience on the user side. Off the top of my head, I'm not quite aware of the entry points we'd need to change to give it a shot.

lgray commented Sep 10, 2024

@gmarkall I was talking to @jpivarski and @ianna last week, and I hadn't realized that cupy itself doesn't implement a NEP-13-like protocol when calling the cupy version of a ufunc. That makes it clear why this had problems working in the first place, and now I agree that we need something like NEP-13 so that awkward can detect and override the application of cupy-specific ufuncs. Then we can use that with numba and we're in a much better place.

Is this a more accurate understanding of the situation from my side?

@gmarkall

It sounds like you've mapped out the issue and what we need to do to resolve it a bit further - I still don't have any expertise in this area so I can't comment definitively, but what you said makes sense and it seems to give us more understanding of the situation.

Do we need to file a feature request in CuPy for NEP-13 or NEP-13-like support?


ianna commented Sep 11, 2024

Yes, I think we need to ask @leofang. I will open an issue on the CuPy GitHub.
