-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add casting operators to masked UDFs #11578
Add casting operators to masked UDFs #11578
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments.
""" | ||
Typing for float(Masked) | ||
returns the result of calling "float" on the input | ||
TODO: retains the validity of the input rather than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO something we can act on, or should we instead mark it as an difference in from the pandas behavior? e.g. "Unlike pandas, this cast retains the validity of the input instead of raising an error in the case of float(pd.NA)
."
Also does this behavior in pandas vary depending on the selected engine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a longstanding issue around doing this that it might be time we tackle. The short answer is yes, we it seems there's a way of raising in this case although it could come with the price of some performance. It's probably out of scope of this PR.
By engine I assume you are referring to the ability to pass engine="numba"
to several pandas APIs. TLDR I wouldn't expect this to work at all. AFAIK pandas doesn't do any extending of numba, so I wouldn't expect to be able to use pandas to JIT any python code that involves any objects that aren't pre-built into numba (basically scalars and arrays on the CPU). This includes pd.NA
.
For context, here is the NAType we use to back instances of pd.NA and cudf.NA in our numba extension for scalar UDFs. Without this, numba would error and complain there's something unidentified in the UDF if NA
appeared anywhere. Pandas would also have to implement the kinds of operations and statements that would commonly appear in a UDF, such as x is NA
like we have here.
It turns out that straight apply
doesn't accept engine=
, surprisingly, but Rolling.apply
does so we can use it to see what happens in this case.
First the baseline case which works both with and without engine="numba"
:
>>> data = pd.Series(range(1000))
>>> def f(window):
... sum = 0
... for element in window:
... sum += element
... return float(sum)
...
>>> data.rolling(10).apply(f)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 9905.0
996 9915.0
997 9925.0
998 9935.0
999 9945.0
Length: 1000, dtype: float64
>>> data.rolling(10).apply(f, engine='numba', raw=True)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 9905.0
996 9915.0
997 9925.0
998 9935.0
999 9945.0
Length: 1000, dtype: float64
Let's add some nulls to the data
data[::2] = pd.NA
data[:10]
>>> data[::2] = pd.NA
>>> data[:10]
0 <NA>
1 1
2 <NA>
3 3
4 <NA>
5 5
6 <NA>
7 7
8 <NA>
9 9
dtype: object
In both cases, invoking the numba engine will cause a failure when the null is encountered. It is notable that this happens during the data prep phase as well, so the function never even makes it to the compilation stage.
data.rolling(10).apply(f)
TypeError: float() argument must be a string or a number, not 'NAType'
data.rolling(10).apply(f, engine='numba', raw=True)
TypeError: float() argument must be a string or a number, not 'NAType'
One can play around with it in several different ways but I wasn't able to obtain a sensible result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic answer. Thank you, I learned a lot from the linked issue and your description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can confirm that pandas doesn't extend numba at all and just throws the UDF at numba to JIT.
It turns out that straight apply doesn't accept engine=, surprisingly, but Rolling.apply does so we can use it to see what happens in this case.
Related pandas-dev/pandas#31845. Rolling.apply
acts essentially like DataFrame.transform
so it was easier to implement since the shape of the result was predictable. I didn't end up implementing engine="numba"
for Groupby.apply
or DataFrame.apply
because those functions can either act like a transform
or agg
so the unpredictability of the resulting shape was difficult to emulate. (That's why Groupby.agg
and Groupby.transform
only support engine="numba"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah, that makes a lot of sense. Thanks for the context, I had wondered about that.
def func(x): | ||
return operator(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just pass operator
directly as the function, or is this level of indirection (wrapped in a function) necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it to be a function so that it can have a __code__
attribute, this is used to generate the cache key later down the line. We could maybe add some special casing for this but it would be basically enabling this:
sr = cudf.Series([1,2,3])
sr.apply(float)
which feels like just a way more complicated way of calling astype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I think I've asked about this kind of thing before, relating to the reliance on __code__
attributes in the UDF machinery. Not all functions have a __code__
attribute, and I think this should be permissible/supported behavior. It's fine if we don't do it in this PR and file an issue instead, but it should work one way or another, even if it means that we simply skip using the cache or have to wrap it.
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #11578 +/- ##
===============================================
Coverage ? 86.42%
===============================================
Files ? 145
Lines ? 22981
Branches ? 0
===============================================
Hits ? 19861
Misses ? 3120
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Co-authored-by: Bradley Dice <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brandon-b-miller I'd say this can be merged if you're ready. Please file an issue about the __code__
issue affecting series.apply(float)
before merging.
rerun tests |
@gpucibot merge |
While working through #11031 it was discovered that we were missing the ability to "cast" between python "classes" (
int
,float
, andbool
) within UDFs. This PR introduces the equivalent syntax into masked UDFs. These operations shall be interpreted as mapping toint64
,float64
andbool
types, following numpy and numba's existing handling for scalar types.