[FEA] Implementing __array_function__ #1728
Comments
Just thinking about this in light of some conversations about moving data back and forth between cuDF and CuPy: we might be able to implement this in a somewhat simple way. Namely, we do a zero-copy conversion to CuPy, then leverage CuPy's
We can't do zero copy in the situation where there are nulls, though.
To make sure I understand correctly: your concern is that we have to replace the nulls somehow before going to CuPy, correct?
Yes, and in the case of integers and booleans there isn't a value like
CuPy doesn't appear to have a concept of nulls in arrays of integer or boolean dtype, though float arrays can represent them with NaN. This is similar to how NumPy and pandas generally behave (or, for pandas, used to behave before the nullable integer dtype was introduced). Despite this limitation, leveraging pandas/NumPy interop is useful. If there is a null, what would otherwise be an integer or boolean array could be cast to float so that NaNs can stand in for the nulls.
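The cast-to-float idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: NumPy stands in for CuPy so the snippet runs without a GPU, the validity mask is a plain boolean array rather than cuDF's null mask, and `to_float_with_nan` is a hypothetical helper name, not a cuDF or CuPy function.

```python
import numpy as np


def to_float_with_nan(values, valid_mask):
    """Return a float64 copy of `values` with NaN wherever `valid_mask` is False.

    Hypothetical helper for illustration; not part of cuDF or CuPy.
    """
    # astype() copies by default, so the original integer buffer is untouched.
    out = np.asarray(values).astype(np.float64)
    out[~np.asarray(valid_mask, dtype=bool)] = np.nan
    return out


ints = np.array([1, 2, 3, 4], dtype=np.int64)
valid = np.array([True, False, True, True])  # second entry is "null"

as_float = to_float_with_nan(ints, valid)
total = np.nansum(as_float)  # NaN-aware reductions skip the null slot
```

Note that this necessarily produces a copy, so it gives up the zero-copy property discussed earlier in the thread whenever nulls are present.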
Yep, this is a great point. So how should we handle that case? Here are some random thoughts (though please feel free to add others).
The problem with 2-4 is that you can no longer reference the same data that underlies the cuDF column, which is a common use case for the array protocol in pandas. This is very similar to the discussion happening in #1824.
Yep, 2 & 3 definitely introduce copies. I would think 4 avoids this issue, as we are referring to subsections of the same data, though it is admittedly more complicated. TBH I'm less worried about producing a copy for Given the simplicity of 1 and the fact that it leaves our options relatively open, I'd be curious what your thoughts are on just starting there. Though if you have different thoughts on where to start, it would be great to hear them as well. 🙂
I'm definitely on board with 1 to start, but ideally we just implement the functions needed in libcudf and handle nulls properly, then build out the Python APIs + hook up to
One tweak on my original suggestion could be we have a list of functions that cuDF implements that
We might not even need an explicit list. Perhaps we can just check that a function with the right name is defined in cuDF and fall back if not, like so.
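One way such a name-based lookup with a fallback could look is sketched below. It is an independent illustration of the NEP 18 protocol, not code from this thread or from cuDF: `ToyColumn`, `_native`, `_native_concatenate`, `_unwrap`, and `calls` are all hypothetical names, a plain namespace stands in for the cuDF module, and unwrapping to a NumPy array stands in for the conversion to CuPy.

```python
import types

import numpy as np

calls = []  # records which path handled each call (illustration only)


def _native_concatenate(columns, axis=0):
    """Hypothetical 'native' implementation that returns a ToyColumn."""
    calls.append("native")
    return ToyColumn(np.concatenate([c.data for c in columns], axis=axis))


# Stand-in for the library namespace that is searched by function name.
_native = types.SimpleNamespace(concatenate=_native_concatenate)


def _unwrap(obj):
    """Recursively replace ToyColumn instances with their underlying arrays."""
    if isinstance(obj, ToyColumn):
        return obj.data
    if isinstance(obj, (list, tuple)):
        return type(obj)(_unwrap(o) for o in obj)
    if isinstance(obj, dict):
        return {k: _unwrap(v) for k, v in obj.items()}
    return obj


class ToyColumn:
    """Hypothetical stand-in for a cuDF object; wraps a NumPy array."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types_, args, kwargs):
        # Use a same-named native implementation if one is defined...
        impl = getattr(_native, func.__name__, None)
        if impl is not None:
            return impl(*args, **kwargs)
        # ...otherwise fall back to calling the function on the raw arrays.
        calls.append("fallback")
        return func(*_unwrap(args), **_unwrap(kwargs))


a = ToyColumn([1, 2])
b = ToyColumn([3, 4])
joined = np.concatenate([a, b])  # dispatched to _native_concatenate
average = np.mean(a)             # no native "mean", so the fallback runs
```

A lookup by `func.__name__` alone is a simplification: functions in submodules such as `np.linalg` would need the module path taken into account as well, and a real fallback would have to decide what type to return.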
Thanks for tackling this! 😄
Is your feature request related to a problem? Please describe.
It would be nice for many NumPy functions that are not ufuncs to just work on cuDF objects (e.g. DataFrame, Series). This would allow users to switch between Pandas and cuDF more easily.
Describe the solution you'd like
NumPy's NEP 18 adds __array_function__ to allow array libraries to override how NumPy functions will behave on their data. This is incredibly useful as it makes swapping between different array libraries like NumPy and CuPy trivial for end users. It would be useful to implement __array_function__ on cuDF objects (e.g. DataFrame, Series) to allow NumPy functions (where they make sense) to just work.
Describe alternatives you've considered
NEP 18 thoroughly lays out the alternatives (some of which the community has tried).