Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add casting operators to masked UDFs #11578

Conversation

brandon-b-miller
Copy link
Contributor

While working through #11031 it was discovered that we were missing the ability to "cast" between python "classes" (int, float, and bool) within UDFs. This PR introduces the equivalent syntax into masked UDFs. These operations shall be interpreted as mapping to int64, float64 and bool types, following numpy and numba's existing handling for scalar types.

@brandon-b-miller brandon-b-miller requested a review from a team as a code owner August 22, 2022 18:20
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 22, 2022
@brandon-b-miller brandon-b-miller added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Aug 22, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments.

"""
Typing for float(Masked)
returns the result of calling "float" on the input
TODO: retains the validity of the input rather than
Copy link
Contributor

@bdice bdice Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO something we can act on, or should we instead mark it as an difference in from the pandas behavior? e.g. "Unlike pandas, this cast retains the validity of the input instead of raising an error in the case of float(pd.NA)."

Also does this behavior in pandas vary depending on the selected engine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a longstanding issue around doing this that it might be time we tackle. The short answer is yes, we it seems there's a way of raising in this case although it could come with the price of some performance. It's probably out of scope of this PR.

By engine I assume you are referring to the ability to pass engine="numba" to several pandas APIs. TLDR I wouldn't expect this to work at all. AFAIK pandas doesn't do any extending of numba, so I wouldn't expect to be able to use pandas to JIT any python code that involves any objects that aren't pre-built into numba (basically scalars and arrays on the CPU). This includes pd.NA.

For context, here is the NAType we use to back instances of pd.NA and cudf.NA in our numba extension for scalar UDFs. Without this, numba would error and complain there's something unidentified in the UDF if NA appeared anywhere. Pandas would also have to implement the kinds of operations and statements that would commonly appear in a UDF, such as x is NA like we have here.

It turns out that straight apply doesn't accept engine=, surprisingly, but Rolling.apply does so we can use it to see what happens in this case.

First the baseline case which works both with and without engine="numba":

>>> data = pd.Series(range(1000))
>>> def f(window):
...     sum = 0
...     for element in window:
...         sum += element
...     return float(sum)
... 
>>> data.rolling(10).apply(f)
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
        ...  
995    9905.0
996    9915.0
997    9925.0
998    9935.0
999    9945.0
Length: 1000, dtype: float64
>>> data.rolling(10).apply(f, engine='numba', raw=True)
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
        ...  
995    9905.0
996    9915.0
997    9925.0
998    9935.0
999    9945.0
Length: 1000, dtype: float64

Let's add some nulls to the data

data[::2] = pd.NA
data[:10]
>>> data[::2] = pd.NA
>>> data[:10]
0    <NA>
1       1
2    <NA>
3       3
4    <NA>
5       5
6    <NA>
7       7
8    <NA>
9       9
dtype: object

In both cases, invoking the numba engine will cause a failure when the null is encountered. It is notable that this happens during the data prep phase as well, so the function never even makes it to the compilation stage.

data.rolling(10).apply(f)
TypeError: float() argument must be a string or a number, not 'NAType'
data.rolling(10).apply(f, engine='numba', raw=True)
TypeError: float() argument must be a string or a number, not 'NAType'

One can play around with it in several different ways but I wasn't able to obtain a sensible result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic answer. Thank you, I learned a lot from the linked issue and your description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm that pandas doesn't extend numba at all and just throws the UDF at numba to JIT.

It turns out that straight apply doesn't accept engine=, surprisingly, but Rolling.apply does so we can use it to see what happens in this case.

Related pandas-dev/pandas#31845. Rolling.apply acts essentially like DataFrame.transform so it was easier to implement since the shape of the result was predictable. I didn't end up implementing engine="numba" for Groupby.apply or DataFrame.apply because those functions can either act like a transform or agg so the unpredictability of the resulting shape was difficult to emulate. (That's why Groupby.agg and Groupby.transform only support engine="numba")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah, that makes a lot of sense. Thanks for the context, I had wondered about that.

python/cudf/cudf/tests/test_udf_masked_ops.py Outdated Show resolved Hide resolved
Comment on lines +685 to +686
def func(x):
return operator(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just pass operator directly as the function, or is this level of indirection (wrapped in a function) necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it to be a function so that it can have a __code__ attribute, this is used to generate the cache key later down the line. We could maybe add some special casing for this but it would be basically enabling this:

sr = cudf.Series([1,2,3])
sr.apply(float)

which feels like just a way more complicated way of calling astype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I think I've asked about this kind of thing before, relating to the reliance on __code__ attributes in the UDF machinery. Not all functions have a __code__ attribute, and I think this should be permissible/supported behavior. It's fine if we don't do it in this PR and file an issue instead, but it should work one way or another, even if it means that we simply skip using the cache or have to wrap it.

@codecov
Copy link

codecov bot commented Aug 22, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@8410fc1). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11578   +/-   ##
===============================================
  Coverage                ?   86.42%           
===============================================
  Files                   ?      145           
  Lines                   ?    22981           
  Branches                ?        0           
===============================================
  Hits                    ?    19861           
  Misses                  ?     3120           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandon-b-miller I'd say this can be merged if you're ready. Please file an issue about the __code__ issue affecting series.apply(float) before merging.

@brandon-b-miller
Copy link
Contributor Author

rerun tests

@bdice
Copy link
Contributor

bdice commented Aug 25, 2022

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants