[FEA] Implementing __array_function__ #1728

Closed
jakirkham opened this issue May 14, 2019 · 12 comments · Fixed by #2157
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@jakirkham
Member

Is your feature request related to a problem? Please describe.

It would be nice for many NumPy functions that are not ufuncs to just work on cuDF objects (e.g. DataFrame, Series). This would allow users to switch between Pandas and cuDF more easily.

Describe the solution you'd like

NumPy's NEP 18 adds __array_function__ to allow array libraries to override how NumPy functions will behave on their data. This is incredibly useful as it makes swapping between different array libraries like NumPy and CuPy trivial for end users. It would be useful to implement __array_function__ on cuDF objects (e.g. DataFrame, Series) to allow NumPy functions (where they make sense) to just work.
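
For readers unfamiliar with the protocol, here is a minimal sketch of how NEP 18 dispatch works. `ToySeries`, `implements`, and `HANDLED_FUNCTIONS` are hypothetical names used only for illustration (the registry pattern follows the example in NEP 18); this is not cuDF code, and it wraps a plain NumPy array rather than GPU memory.

```python
import numpy as np

HANDLED_FUNCTIONS = {}  # maps a NumPy function to our override


def implements(numpy_function):
    """Register an override for `numpy_function` (pattern from NEP 18)."""
    def decorator(func):
        HANDLED_FUNCTIONS[numpy_function] = func
        return func
    return decorator


class ToySeries:
    """Hypothetical stand-in for a Series-like object; NOT the cuDF class."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation whenever
        # a ToySeries appears among the relevant arguments of `func`.
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented  # NumPy then raises TypeError
        return HANDLED_FUNCTIONS[func](*args, **kwargs)


@implements(np.sum)
def toy_sum(series, **kwargs):
    return series.data.sum()


@implements(np.concatenate)
def toy_concatenate(seq, **kwargs):
    return ToySeries(np.concatenate([s.data for s in seq]))
```

With this in place, `np.sum(ToySeries([1, 2, 3]))` is routed to `toy_sum`, while an unregistered function such as `np.mean` raises `TypeError`.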

Describe alternatives you've considered

NEP 18 thoroughly lays out the alternatives (some of which the community has tried).

@jakirkham jakirkham added Needs Triage Need team to review and classify feature request New feature or request labels May 14, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 14, 2019
@jakirkham
Member Author

Just thinking about this in light of some conversations about moving data back and forth between cuDF and CuPy, maybe we might be able to implement this in a somewhat simple way. Namely we do a zero-copy conversion to CuPy then leverage CuPy's __array_function__ support and do a zero-copy conversion back to the appropriate cuDF type. This would let us leverage many functions from CuPy without needing to reimplement things here.
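
A rough sketch of that round trip, under loud assumptions: NumPy is used here as a stand-in for CuPy so the example runs without a GPU, and `ToyColumn`, `_unwrap`, and `_wrap` are hypothetical helpers, not cuDF or CuPy APIs. The idea is only to show the shape of "unwrap to the array library, call the function, wrap the result back".

```python
import numpy as np

backend = np  # ASSUMPTION: stand-in for CuPy; no GPU is involved here


class ToyColumn:
    """Hypothetical column type that holds a backend array."""

    def __init__(self, data):
        # asarray does not copy when `data` is already a backend array.
        self.data = backend.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Hand the underlying arrays to the backend, then wrap the result.
        result = func(*_unwrap(args), **_unwrap(kwargs))
        return _wrap(result)


def _unwrap(obj):
    """Replace every ToyColumn (also inside lists, tuples, dicts) by its array."""
    if isinstance(obj, ToyColumn):
        return obj.data
    if isinstance(obj, (list, tuple)):
        return type(obj)(_unwrap(item) for item in obj)
    if isinstance(obj, dict):
        return {key: _unwrap(value) for key, value in obj.items()}
    return obj


def _wrap(result):
    """Wrap 1-D array results back into ToyColumn; pass everything else through."""
    if isinstance(result, backend.ndarray) and result.ndim == 1:
        return ToyColumn(result)
    return result
```

Here `np.cumsum(ToyColumn([1, 2, 3]))` comes back as a `ToyColumn`, while a reduction such as `np.sum` returns a scalar unchanged. With real GPU arrays the inner call would be dispatched by the array library's own `__array_function__`.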

@kkraus14
Collaborator

Just thinking about this in light of some conversations about moving data back and forth between cuDF and CuPy, maybe we might be able to implement this in a somewhat simple way. Namely we do a zero-copy conversion to CuPy then leverage CuPy's __array_function__ support and do a zero-copy conversion back to the appropriate cuDF type. This would let us leverage many functions from CuPy without needing to reimplement things here.

We can't do zero-copy when there are nulls, though.

@jakirkham
Member Author

To make sure I understand correctly, your concern is that we have to replace the nulls somehow before going to CuPy, correct?

@kkraus14
Collaborator

To make sure I understand correctly, your concern is that we have to replace the nulls somehow before going to CuPy, correct?

Yes, and in the case of integers and booleans there isn't a value like NaN that we can replace nulls with.

@beckernick
Member

beckernick commented May 30, 2019

Yes, and in the case of integers and booleans there isn't a value like NaN that we can replace nulls with.

CuPy doesn't appear to have a concept of nulls in arrays of integer/boolean dtype, but it does for float dtypes (via NaN). This is similar to how NumPy and pandas generally behave (or, for pandas, used to behave before the nullable integer dtype was added). Despite this limitation, leveraging pandas/NumPy interop is useful. If there is a null, what would otherwise be an integer or boolean array could be cast to float so that NaN can stand in for the nulls.
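
A small sketch of that cast, assuming the nulls are described by a separate boolean validity mask. `nulls_to_nan` is a hypothetical helper and NumPy stands in for the GPU array library. Note that this makes a copy, and that integers larger than 2**53 in magnitude cannot all be represented exactly in float64.

```python
import numpy as np


def nulls_to_nan(values, valid):
    """Return a float64 copy of `values` with NaN wherever `valid` is False."""
    values = np.asarray(values)
    valid = np.asarray(valid, dtype=bool)
    out = values.astype(np.float64)  # always allocates a new array
    out[~valid] = np.nan
    return out
```

NaN-aware functions such as `np.nansum` can then skip the former nulls, while ordinary functions such as `np.sum` will propagate NaN.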

@jakirkham
Member Author

Yep, this is a great point. So how should we handle that case? Here are some random thoughts (though please feel free to add others).

  1. Error and ask the user to replace the nulls
  2. Try to do some smart substitution for the user
  3. Drop out the problematic values on the CuPy side (using a mask)
  4. Pass smaller null-free chunks of data to CuPy and consolidate the results
  5. Work with CuPy to create a null aware array type (kind of like NumPy masked arrays)
  6. ?

@kkraus14
Collaborator

Yep, this is a great point. So how should we handle that case? Here are some random thoughts (though please feel free to add others).

  1. Error and ask the user to replace the nulls
  2. Try to do some smart substitution for the user
  3. Drop out the problematic values on the CuPy side (using a mask)
  4. Pass smaller null-free chunks of data to CuPy and consolidate the results
  5. Work with CuPy to create a null aware array type (kind of like NumPy masked arrays)
  6. ?

The problem with 2-4 is that the result no longer references the same data that sits underneath the cuDF column, and referencing that data is a common reason for using the array protocol in the case of pandas. This is very similar to the discussion happening in #1824.

@jakirkham
Member Author

Yep, 2 & 3 definitely introduce copies. I would think 4 avoids this issue, since we are referring to subsections of the same data, though it is admittedly more complicated.
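
To illustrate why option 4 need not copy, here is a sketch that splits an array into its maximal null-free runs and yields each run as a view. `null_free_chunks` is a hypothetical helper, NumPy stands in for CuPy, and the nulls are assumed to be given as a boolean validity mask.

```python
import numpy as np


def null_free_chunks(values, valid):
    """Yield a view over each maximal run of consecutive valid entries."""
    valid = np.asarray(valid, dtype=bool)
    # Pad with False so every run has a detectable start and end.
    padded = np.concatenate(([False], valid, [False]))
    # Indices where validity flips: even positions start a run, odd ones end it.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    for start, stop in zip(edges[::2], edges[1::2]):
        yield values[start:stop]  # basic slicing returns a view, not a copy
```

Consolidating the per-chunk results is the complicated part: it is straightforward for reductions that decompose (a total sum is the sum of chunk sums), but not for functions whose result depends on the whole array at once.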

TBH I'm less worried about producing a copy for __array_function__ than with something like .values, as the intent of __array_function__ is to eventually produce a new object with a result (not view the same data). I would also hope this is a great way for us to get a large number of array functions for relatively little work.

Given the simplicity of 1 and the fact that it leaves our options relatively open, I'd be curious what your thoughts are on just starting there. Though if you have different thoughts on where to start, it would be great to hear them as well. 🙂
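
Option 1 could look roughly like the sketch below. `ToyNullableColumn` and its `valid` mask are hypothetical, NumPy stands in for CuPy, and only top-level positional arguments are inspected to keep the example short.

```python
import numpy as np


class ToyNullableColumn:
    """Hypothetical column with a validity mask; NOT the cuDF class."""

    def __init__(self, data, valid=None):
        self.data = np.asarray(data)
        if valid is None:
            valid = np.ones(self.data.shape, dtype=bool)
        self.valid = np.asarray(valid, dtype=bool)

    @property
    def has_nulls(self):
        return not self.valid.all()

    def __array_function__(self, func, types, args, kwargs):
        columns = [a for a in args if isinstance(a, ToyNullableColumn)]
        if any(col.has_nulls for col in columns):
            # Option 1: refuse, and ask the user to deal with nulls first.
            raise ValueError(
                f"{func.__name__} does not support nulls; "
                "replace or drop them before calling it"
            )
        plain = tuple(
            a.data if isinstance(a, ToyNullableColumn) else a for a in args
        )
        return func(*plain, **kwargs)
```

Because the check happens before any data is handed over, this leaves the other options open: the `raise` can later be replaced by substitution, masking, or a null-aware implementation without changing the calling code.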

@kkraus14
Collaborator

Yep, 2 & 3 definitely introduce copies. I would think 4 avoids this issue, since we are referring to subsections of the same data, though it is admittedly more complicated.

TBH I'm less worried about producing a copy for __array_function__ than with something like .values, as the intent of __array_function__ is to eventually produce a new object with a result (not view the same data). I would also hope this is a great way for us to get a large number of array functions for relatively little work.

Given the simplicity of 1 and the fact that it leaves our options relatively open, I'd be curious what your thoughts are on just starting there. Though if you have different thoughts on where to start, it would be great to hear them as well. 🙂

I'm definitely on board with 1 to start, but ideally we just implement the functions needed in libcudf and handle nulls properly, then build out the Python APIs + hook up to __array_function__.

@jakirkham
Member Author

One tweak on my original suggestion could be we have a list of functions that cuDF implements that __array_function__ tries first and then it falls back to CuPy for everything else. When cuDF implements new functions that work with __array_function__, we can just add them to this list. Maybe that strikes the balance between having a large variety of functions and handling nulls correctly. Thoughts?

@jakirkham
Member Author

One tweak on my original suggestion could be we have a list of functions that cuDF implements that __array_function__ tries first and then it falls back to CuPy for everything else.

We might not even need an explicit list. Perhaps we can just check whether a function with the right name is defined in cuDF and fall back to CuPy if not.
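
The snippet that originally accompanied this comment is not preserved here. What follows is only a sketch of the general idea, not the author's code: `toy_lib` is a hypothetical namespace standing in for the cuDF module, `ToySeries` is a hypothetical class, and NumPy stands in for CuPy as the fallback.

```python
from types import SimpleNamespace

import numpy as np


class ToySeries:
    """Hypothetical Series-like object with a validity mask."""

    def __init__(self, data, valid=None):
        self.data = np.asarray(data)
        if valid is None:
            valid = np.ones(self.data.shape, dtype=bool)
        self.valid = np.asarray(valid, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        # Look for a same-named function in the library's own namespace.
        native = getattr(toy_lib, func.__name__, None)
        if callable(native):
            return native(*args, **kwargs)
        # Otherwise fall back to the array library on the raw data.
        plain = tuple(a.data if isinstance(a, ToySeries) else a for a in args)
        return func(*plain, **kwargs)


def _null_aware_sum(series):
    """Native implementation that skips nulls instead of failing on them."""
    return series.data[series.valid].sum()


# ASSUMPTION: stand-in for the library's module-level namespace.
toy_lib = SimpleNamespace(sum=_null_aware_sum)
```

One caveat with matching on the name alone: functions in different NumPy submodules can share a name, so a real implementation would probably also want to check where `func` comes from.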

@jakirkham
Member Author

Thanks for tackling this! 😄

4 participants