[FEA] Implementing __array_function__ #1728

jakirkham · 2019-05-14T00:25:16Z

Is your feature request related to a problem? Please describe.

It would be nice for many NumPy functions that are not ufuncs to just work on cuDF objects (e.g. DataFrame, Series). This would allow users to switch between Pandas and cuDF more easily.

Describe the solution you'd like

NumPy's NEP 18 adds __array_function__ to allow array libraries to override how NumPy functions will behave on their data. This is incredibly useful as it makes swapping between different array libraries like NumPy and CuPy trivial for end users. It would be useful to implement __array_function__ on cuDF objects (e.g. DataFrame, Series) to allow NumPy functions (where they make sense) to just work.

Describe alternatives you've considered

NEP 18 thoroughly lays out the alternatives (some of which the community has tried).


jakirkham · 2019-05-30T14:34:12Z

Just thinking about this in light of some conversations about moving data back and forth between cuDF and CuPy, maybe we might be able to implement this in a somewhat simple way. Namely we do a zero-copy conversion to CuPy then leverage CuPy's __array_function__ support and do a zero-copy conversion back to the appropriate cuDF type. This would let us leverage many functions from CuPy without needing to reimplement things here.
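The unwrap, delegate, rewrap idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `ToySeries` and `to_backend` are hypothetical names, and NumPy stands in for CuPy so the example runs without a GPU. It is not cuDF's actual implementation.

```python
import numpy as np


class ToySeries:
    """Hypothetical stand-in for a cuDF Series; NumPy plays the role of CuPy."""

    def __init__(self, data):
        # np.asarray returns an existing ndarray unchanged, so wrapping is copy-free.
        self._data = np.asarray(data)

    def to_backend(self):
        # Hand out the underlying buffer without copying.
        return self._data

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap top-level ToySeries arguments (nested containers are not handled here).
        unwrapped = tuple(
            a.to_backend() if isinstance(a, ToySeries) else a for a in args
        )
        result = func(*unwrapped, **kwargs)
        # Rewrap 1-D array results; scalars and other shapes pass through as-is.
        if isinstance(result, np.ndarray) and result.ndim == 1:
            return ToySeries(result)
        return result


s = ToySeries([3, 1, 2])
sorted_s = np.sort(s)  # dispatched through ToySeries.__array_function__
total = np.sum(s)      # reductions come back as plain scalars
```

With real cuDF and CuPy objects the unwrap and rewrap steps would be the zero-copy conversions mentioned above; here they are plain attribute access.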

kkraus14 · 2019-05-30T14:36:19Z

> Just thinking about this in light of some conversations about moving data back and forth between cuDF and CuPy, maybe we might be able to implement this in a somewhat simple way. Namely we do a zero-copy conversion to CuPy then leverage CuPy's __array_function__ support and do a zero-copy conversion back to the appropriate cuDF type. This would let us leverage many functions from CuPy without needing to reimplement things here.

We can't do zero copy in the situation where there are nulls, though.

jakirkham · 2019-05-30T14:43:25Z

To make sure I understand correctly, your concern is that we have to replace the nulls somehow before going to CuPy, correct?

kkraus14 · 2019-05-30T14:46:03Z

> To make sure I understand correctly, your concern is that we have to replace the nulls somehow before going to CuPy, correct?

Yes, and in the case of integers and booleans there isn't a value like NaN that we can replace nulls with.

beckernick · 2019-05-30T15:00:12Z

> Yes, and in the case of integers and booleans there isn't a value like NaN that we can replace nulls with.

CuPy doesn't appear to have a concept of nulls in arrays of integer/boolean dtype, though its float dtypes can use NaN. This is similar to how NumPy and pandas generally behave (for pandas, how it behaved before the nullable integer dtype was introduced). Despite this limitation, pandas/NumPy interop is useful. If there is a null, what would otherwise be an integer or boolean array could be cast to float so that NaN can represent the nulls.
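As a rough sketch of that cast, assuming the nulls are described by a separate boolean validity mask (the helper name `to_float_with_nan` is made up for illustration, and NumPy is used in place of CuPy):

```python
import numpy as np


def to_float_with_nan(values, valid_mask):
    """Return a float64 copy of `values` with NaN wherever `valid_mask` is False."""
    values = np.asarray(values)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    out = values.astype(np.float64)  # astype allocates a new array, so this is a copy
    out[~valid_mask] = np.nan
    return out


ints = np.array([1, 2, 0, 4], dtype=np.int64)
valid = np.array([True, True, False, True])  # third entry is null

as_float = to_float_with_nan(ints, valid)
mean_skipping_nulls = np.nanmean(as_float)   # NaN-aware reductions ignore the nulls
```

Two caveats: the cast always copies, and float64 cannot represent every int64 value exactly (anything beyond 2**53 may be rounded).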

jakirkham · 2019-05-30T15:01:24Z

Yep, this is a great point. So how should we handle that case? Here are some random thoughts (though please feel free to add others).

1. Error and ask the user to replace the nulls
2. Try to do some smart substitution for the user
3. Drop out the problematic values on the CuPy side (using a mask)
4. Pass smaller null-free chunks of data to CuPy and consolidate the results
5. Work with CuPy to create a null aware array type (kind of like NumPy masked arrays)
6. ?

kkraus14 · 2019-05-30T15:07:11Z

> Yep, this is a great point. So how should we handle that case? Here are some random thoughts (though please feel free to add others).
>
> 1. Error and ask the user to replace the nulls
> 2. Try to do some smart substitution for the user
> 3. Drop out the problematic values on the CuPy side (using a mask)
> 4. Pass smaller null-free chunks of data to CuPy and consolidate the results
> 5. Work with CuPy to create a null aware array type (kind of like NumPy masked arrays)
> 6. ?

The problem with 2-4 is that the result no longer references the same data that backs the cuDF column, and sharing that data is a common reason for using the array protocol in the case of pandas. This is very similar to the discussion happening in #1824.

jakirkham · 2019-05-30T15:23:15Z

Yep, 2 & 3 definitely introduce copies. Would think 4 avoids this issue as we are referring to subsections of the same data. Though it is admittedly more complicated.
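To make option 4 a bit more concrete, here is a minimal pure-Python sketch of finding the null-free runs from a validity mask and consolidating a reduction over them. The helper `null_free_runs` is hypothetical, and plain lists stand in for device arrays; slicing a list copies, whereas slicing an array along these ranges would give views of the same data.

```python
def null_free_runs(valid_mask):
    """Yield half-open (start, stop) index ranges of consecutive valid entries."""
    start = None
    for i, ok in enumerate(valid_mask):
        if ok and start is None:
            start = i            # a new null-free run begins
        elif not ok and start is not None:
            yield (start, i)     # the run ended just before this null
            start = None
    if start is not None:
        yield (start, len(valid_mask))


values = [5, 7, None, 1, None, None, 4, 6]
mask = [v is not None for v in values]

runs = list(null_free_runs(mask))
# Reduce each null-free chunk separately, then consolidate the partial results.
total = sum(sum(values[a:b]) for a, b in runs)
```

This works naturally for reductions like sums; functions whose result depends on element positions (sorting, cumulative operations) would need more care when stitching the chunks back together.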

TBH I'm less worried about producing a copy for __array_function__ than with something like .values, as the intent of __array_function__ is to eventually produce a new object with a result (not view the same data). Also would hope this is a great way for us to get a large number of array functions for relatively little work.

Given the simplicity of 1 and the fact that it leaves our options relatively open, I'd be curious what your thoughts are on just starting there. Though if you have different thoughts on where to start, it would be great to hear them as well. 🙂
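Option 1 could look something like the guard below, run before any data is handed to CuPy. The exception class and function name are hypothetical, and the validity mask is assumed to be an iterable of booleans; this is a sketch, not cuDF's actual error handling.

```python
class NullsNotSupportedError(TypeError):
    """Hypothetical error raised when null-containing data would be delegated."""


def require_no_nulls(valid_mask, func_name):
    """Raise if any entry of `valid_mask` is False; otherwise return None."""
    null_count = sum(1 for ok in valid_mask if not ok)
    if null_count:
        raise NullsNotSupportedError(
            f"{func_name} cannot be applied: column contains {null_count} "
            "null(s); fill or drop them first"
        )


require_no_nulls([True, True, True], "sum")  # passes silently

try:
    require_no_nulls([True, False, False], "sum")
except NullsNotSupportedError as exc:
    message = str(exc)
```

Starting with a hard error keeps the other options open, since relaxing an error later does not break existing user code.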

kkraus14 · 2019-05-30T15:25:03Z

> Yep, 2 & 3 definitely introduce copies. Would think 4 avoids this issue as we are referring to subsections of the same data. Though it is admittedly more complicated.
>
> TBH I'm less worried about producing a copy for __array_function__ than with something like .values, as the intent of __array_function__ is to eventually produce a new object with a result (not view the same data). Also would hope this is a great way for us to get a large number of array functions for relatively little work.
>
> Given the simplicity of 1 and the fact that it leaves our options relatively open, I'd be curious what your thoughts are on just starting there. Though if you have different thoughts on where to start, it would be great to hear them as well. 🙂

I'm definitely on board with 1 to start, but ideally we just implement the functions needed in libcudf and handle nulls properly, then build out the Python APIs + hook up to __array_function__.

jakirkham · 2019-05-30T15:35:50Z

One tweak on my original suggestion could be we have a list of functions that cuDF implements that __array_function__ tries first and then it falls back to CuPy for everything else. When cuDF implements new functions that work with __array_function__, we can just add them to this list. Maybe that strikes the balance between having a large variety of functions and handling nulls correctly. Thoughts?

jakirkham · 2019-06-03T18:00:36Z

> One tweak on my original suggestion could be we have a list of functions that cuDF implements that __array_function__ tries first and then it falls back to CuPy for everything else.

We might not even need an explicit list. Perhaps we can just check that a function with the right name is defined in cuDF and fall back if not, like so.
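A minimal sketch of that name-based lookup, under stated assumptions: `toy_lib` is a hypothetical namespace standing in for the cuDF module, `ToySeries` stands in for a cuDF Series, and NumPy stands in for the CuPy fallback so the example runs without a GPU.

```python
import types

import numpy as np


class ToySeries:
    """Hypothetical stand-in for a cuDF Series."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types_, args, kwargs):
        # Prefer a native implementation with the same name, if one exists.
        native = getattr(toy_lib, func.__name__, None)
        if native is not None:
            return native(*args, **kwargs)
        # Otherwise unwrap and fall back to the array library.
        unwrapped = [a.data if isinstance(a, ToySeries) else a for a in args]
        return func(*unwrapped, **kwargs)


def _native_sum(series):
    # Tagged result so the example shows which path was taken.
    return ("native", int(series.data.sum()))


# Hypothetical "library namespace" that only implements `sum`.
toy_lib = types.SimpleNamespace(sum=_native_sum)

s = ToySeries([1, 2, 3])
native_result = np.sum(s)       # found by name in toy_lib
fallback_result = np.cumsum(s)  # not in toy_lib, so NumPy handles it
```

One caveat with matching on `func.__name__` alone is that it ignores which submodule the function lives in, so functions with the same name in different namespaces would collide.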

jakirkham · 2019-07-11T04:06:24Z

Thanks for tackling this! 😄

jakirkham added the Needs Triage and feature request labels May 14, 2019

This was referenced May 14, 2019

[FEA] Implementing __array_wrap__ #1724

Closed

Check for __array_function__ when skipping __array_wrap__ dask/dask#4797

Open

kkraus14 added the Python label and removed the Needs Triage label May 14, 2019

quasiben mentioned this issue Jun 10, 2019

Var calculation fails with non-pandas Dataframe dask/dask#4910

Closed

randerzander assigned VibhuJawa Jun 18, 2019

This was referenced Jun 29, 2019

[WIP] Implement __array_function__ #2148

Closed

[REVIEW] Add __array_function__ to DataFrame and Series #2157

Merged

kkraus14 closed this as completed in #2157 Jul 11, 2019

VibhuJawa mentioned this issue Sep 3, 2019

[FEA] Using cupy with __array_function__ for functions not currently supported #2718

Closed
