Add casting operators to masked UDFs #11578

brandon-b-miller · 2022-08-22T18:20:11Z

While working through #11031 it was discovered that we were missing the ability to "cast" between python "classes" (int, float, and bool) within UDFs. This PR introduces the equivalent syntax into masked UDFs. These operations shall be interpreted as mapping to int64, float64 and bool types, following numpy and numba's existing handling for scalar types.

bdice

A few minor comments.

bdice · 2022-08-22T18:39:41Z

python/cudf/cudf/core/udf/typing.py

+    """
+    Typing for float(Masked)
+    returns the result of calling "float" on the input
+    TODO: retains the validity of the input rather than


Is this TODO something we can act on, or should we instead mark it as an difference in from the pandas behavior? e.g. "Unlike pandas, this cast retains the validity of the input instead of raising an error in the case of float(pd.NA)."

Also does this behavior in pandas vary depending on the selected engine?

There's a longstanding issue around doing this that it might be time we tackle. The short answer is yes, we it seems there's a way of raising in this case although it could come with the price of some performance. It's probably out of scope of this PR.

By engine I assume you are referring to the ability to pass engine="numba" to several pandas APIs. TLDR I wouldn't expect this to work at all. AFAIK pandas doesn't do any extending of numba, so I wouldn't expect to be able to use pandas to JIT any python code that involves any objects that aren't pre-built into numba (basically scalars and arrays on the CPU). This includes pd.NA.

For context, here is the NAType we use to back instances of pd.NA and cudf.NA in our numba extension for scalar UDFs. Without this, numba would error and complain there's something unidentified in the UDF if NA appeared anywhere. Pandas would also have to implement the kinds of operations and statements that would commonly appear in a UDF, such as x is NA like we have here.

It turns out that straight apply doesn't accept engine=, surprisingly, but Rolling.apply does so we can use it to see what happens in this case.

First the baseline case which works both with and without engine="numba":

>>> data = pd.Series(range(1000)) >>> def f(window): ... sum = 0 ... for element in window: ... sum += element ... return float(sum) ... >>> data.rolling(10).apply(f) 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN ... 995 9905.0 996 9915.0 997 9925.0 998 9935.0 999 9945.0 Length: 1000, dtype: float64 >>> data.rolling(10).apply(f, engine='numba', raw=True) 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN ... 995 9905.0 996 9915.0 997 9925.0 998 9935.0 999 9945.0 Length: 1000, dtype: float64

Let's add some nulls to the data

data[::2] = pd.NA data[:10] >>> data[::2] = pd.NA >>> data[:10] 0 <NA> 1 1 2 <NA> 3 3 4 <NA> 5 5 6 <NA> 7 7 8 <NA> 9 9 dtype: object

In both cases, invoking the numba engine will cause a failure when the null is encountered. It is notable that this happens during the data prep phase as well, so the function never even makes it to the compilation stage.

data.rolling(10).apply(f) TypeError: float() argument must be a string or a number, not 'NAType' data.rolling(10).apply(f, engine='numba', raw=True) TypeError: float() argument must be a string or a number, not 'NAType'

One can play around with it in several different ways but I wasn't able to obtain a sensible result.

Fantastic answer. Thank you, I learned a lot from the linked issue and your description.

Can confirm that pandas doesn't extend numba at all and just throws the UDF at numba to JIT.

It turns out that straight apply doesn't accept engine=, surprisingly, but Rolling.apply does so we can use it to see what happens in this case.

Related pandas-dev/pandas#31845. Rolling.apply acts essentially like DataFrame.transform so it was easier to implement since the shape of the result was predictable. I didn't end up implementing engine="numba" for Groupby.apply or DataFrame.apply because those functions can either act like a transform or agg so the unpredictability of the resulting shape was difficult to emulate. (That's why Groupby.agg and Groupby.transform only support engine="numba")

I see, yeah, that makes a lot of sense. Thanks for the context, I had wondered about that.

python/cudf/cudf/tests/test_udf_masked_ops.py

bdice · 2022-08-22T18:40:57Z

python/cudf/cudf/tests/test_udf_masked_ops.py

+    def func(x):
+        return operator(x)


Can we just pass operator directly as the function, or is this level of indirection (wrapped in a function) necessary?

We need it to be a function so that it can have a __code__ attribute, this is used to generate the cache key later down the line. We could maybe add some special casing for this but it would be basically enabling this:

sr = cudf.Series([1,2,3]) sr.apply(float)

which feels like just a way more complicated way of calling astype.

Ah! I think I've asked about this kind of thing before, relating to the reliance on __code__ attributes in the UDF machinery. Not all functions have a __code__ attribute, and I think this should be permissible/supported behavior. It's fine if we don't do it in this PR and file an issue instead, but it should work one way or another, even if it means that we simply skip using the cache or have to wrap it.

codecov · 2022-08-22T20:15:58Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@8410fc1). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11578   +/-   ##
===============================================
  Coverage                ?   86.42%           
===============================================
  Files                   ?      145           
  Lines                   ?    22981           
  Branches                ?        0           
===============================================
  Hits                    ?    19861           
  Misses                  ?     3120           
  Partials                ?        0

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Co-authored-by: Bradley Dice <[email protected]>

bdice

@brandon-b-miller I'd say this can be merged if you're ready. Please file an issue about the __code__ issue affecting series.apply(float) before merging.

brandon-b-miller · 2022-08-25T16:58:15Z

rerun tests

bdice · 2022-08-25T22:01:23Z

@gpucibot merge

brandon-b-miller added 3 commits August 18, 2022 11:03

broken impl

71e35f4

always allocate return value

11271e2

remove unwanted files

6913f7d

brandon-b-miller requested a review from a team as a code owner August 22, 2022 18:20

brandon-b-miller requested review from shwina and skirui-source August 22, 2022 18:20

github-actions bot added the Python Affects Python cuDF API. label Aug 22, 2022

brandon-b-miller added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Aug 22, 2022

bdice reviewed Aug 22, 2022

View reviewed changes

Update python/cudf/cudf/tests/test_udf_masked_ops.py

fdd4b49

Co-authored-by: Bradley Dice <[email protected]>

bdice approved these changes Aug 23, 2022

View reviewed changes

brandon-b-miller mentioned this pull request Aug 25, 2022

[FEA] Handle selected functions that don't have bytecode in Series.apply #11595

Open

Merge branch 'branch-22.10' into fea-maskedtype-casting-operators

0ea15c5

mroeschke approved these changes Aug 25, 2022

View reviewed changes

rapids-bot bot merged commit 096bbc4 into rapidsai:branch-22.10 Aug 25, 2022

brandon-b-miller mentioned this pull request May 26, 2023

[FEA] Support for builtin operators in apply functions #13462

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add casting operators to masked UDFs #11578

Add casting operators to masked UDFs #11578

brandon-b-miller commented Aug 22, 2022

bdice left a comment

bdice Aug 22, 2022 •

edited

Loading

brandon-b-miller Aug 23, 2022

bdice Aug 23, 2022

mroeschke Aug 24, 2022

brandon-b-miller Aug 24, 2022

bdice Aug 22, 2022

brandon-b-miller Aug 23, 2022

bdice Aug 23, 2022

codecov bot commented Aug 22, 2022 •

edited

Loading

bdice left a comment •

edited

Loading

brandon-b-miller commented Aug 25, 2022

bdice commented Aug 25, 2022

Add casting operators to masked UDFs #11578

Add casting operators to masked UDFs #11578

Conversation

brandon-b-miller commented Aug 22, 2022

bdice left a comment

Choose a reason for hiding this comment

bdice Aug 22, 2022 • edited Loading

Choose a reason for hiding this comment

brandon-b-miller Aug 23, 2022

Choose a reason for hiding this comment

bdice Aug 23, 2022

Choose a reason for hiding this comment

mroeschke Aug 24, 2022

Choose a reason for hiding this comment

brandon-b-miller Aug 24, 2022

Choose a reason for hiding this comment

bdice Aug 22, 2022

Choose a reason for hiding this comment

brandon-b-miller Aug 23, 2022

Choose a reason for hiding this comment

bdice Aug 23, 2022

Choose a reason for hiding this comment

codecov bot commented Aug 22, 2022 • edited Loading

Codecov Report

bdice left a comment • edited Loading

Choose a reason for hiding this comment

brandon-b-miller commented Aug 25, 2022

bdice commented Aug 25, 2022

bdice Aug 22, 2022 •

edited

Loading

codecov bot commented Aug 22, 2022 •

edited

Loading

bdice left a comment •

edited

Loading