Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable #13113

brandon-b-miller · 2023-04-11T14:24:59Z

Closes #13103

python/cudf/cudf/core/udf/groupby_utils.py

brandon-b-miller · 2023-04-24T15:30:40Z

Hi all, this should be ready for review.

vyasr

One minor suggestion and a question, but LGTM.

python/cudf/cudf/core/udf/groupby_utils.py

wence-

Two minor quibbles.

wence- · 2023-04-26T15:51:23Z

python/cudf/cudf/core/groupby/groupby.py

-                raise ValueError(
-                    "Nulls not yet supported with groupby JIT engine"
-                )
+        # TODO: don't check this twice under `engine='auto'`


Is there (will there be?) a bug for this? Or do you intend to fix it here?

I thought this might be a perf hit, but since each column caches its null count, I think it's actually OK to just leave it.
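To illustrate why a repeated null check can be cheap when the count is cached, here is a minimal sketch. The `Column` class below is a toy stand-in written for this example, not the cuDF column class, and `scans` is a hypothetical counter added only to show how often the data is actually traversed.

```python
from functools import cached_property


class Column:
    """Toy stand-in for a column; this is not the cuDF class."""

    def __init__(self, values):
        self._values = list(values)
        self.scans = 0  # hypothetical counter: full passes over the data

    @cached_property
    def null_count(self):
        # Computed on first access, then stored on the instance.
        self.scans += 1
        return sum(v is None for v in self._values)


col = Column([1, None, 3, None])
first = col.null_count   # scans the data
second = col.null_count  # served from the cached value
```

Under this assumption, checking for nulls twice costs one scan plus one attribute lookup.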

wence- · 2023-04-26T15:52:01Z

python/cudf/cudf/core/groupby/groupby.py

@@ -1198,7 +1199,7 @@ def _iterative_groupby_apply(
                result.index = cudf.MultiIndex._from_data(index_data)
        return result

-    def apply(self, function, *args, engine="cudf"):
+    def apply(self, function, *args, engine="auto"):


The docstring needs to be updated to discuss the new "auto" option.

Added some docs here.

wence-

Thanks!

python/cudf/cudf/core/groupby/groupby.py

bdice · 2023-05-11T19:43:58Z

python/cudf/cudf/core/groupby/groupby.py

@@ -1252,7 +1249,7 @@ def apply(self, function, *args, engine="cudf"):
          on the grouped chunk.
        args : tuple
            Optional positional arguments to pass to the function.
-        engine: {'cudf', 'jit'}, default 'cudf'
+        engine: {'cudf', 'jit'}, default 'auto'


For my knowledge: Is there a performance cost to the fallback? i.e. Does the JIT attempt have measurable overhead?

Yes, it does have measurable overhead. One way of measuring it is with

    import cProfile

    import cudf

    df = cudf.DataFrame({'a': [0, 1, 1], 'b': [1, 2, 3]})

    def func(grp):
        # binops can't be jitted without refcounting
        return grp + grp

    grouped = df.groupby('a')
    cProfile.run('grouped.apply(func)', sort='cumtime')

For this I get

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.213    0.213 groupby_utils.py:207(_can_be_jitted)

That's about 0.2 s cumulative in `_can_be_jitted` for this call, so it's quite impactful. However, if this becomes a problem, users should be able to get the old behavior by just passing `engine='cudf'`.
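A rough sketch of the dispatch being discussed, to make it clear where the overhead is paid. Only the engine names (`'auto'`, `'cudf'`, `'jit'`) come from this PR; `can_be_jitted`, `choose_engine`, and the `jittable` marker attribute are hypothetical names for illustration and do not reflect the actual cuDF implementation.

```python
def can_be_jitted(func):
    """Hypothetical probe: pretend only functions tagged `jittable` compile."""
    return getattr(func, "jittable", False)


def choose_engine(func, engine="auto"):
    """Return the name of the engine that would run `func`."""
    if engine not in {"auto", "cudf", "jit"}:
        raise ValueError(f"Unsupported engine '{engine}'")
    if engine == "auto":
        # Only 'auto' pays for the probe; explicit engines skip it.
        return "jit" if can_be_jitted(func) else "cudf"
    return engine


def add_self(grp):
    return grp + grp


def group_max(grp):
    return grp.max()


group_max.jittable = True  # toy marker, stands in for a successful compile

auto_fallback = choose_engine(add_self)               # probe fails, falls back
auto_jit = choose_engine(group_max)                   # probe succeeds
explicit = choose_engine(group_max, engine="cudf")    # probe never runs
```

In this sketch, passing an explicit engine bypasses the probe entirely, which is the escape hatch described above.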

Full disclosure, we anticipated this and I was OK with it. I think the tradeoff is generally worthwhile. If we think it isn't then I think we'd just stop this work and remove 'auto' altogether.

Also I just noticed that 'auto' is not listed in the set of valid engine arguments here.

Just to clarify: your snippet above is saying the JIT attempt costs ~200 ms? That sounds right to me. I would think a cache could also be used here if needed, so that repeated failed attempts for the same function don't each pay the overhead.

I am supportive of this change, because when it does pay off, it’s a big win. Just want to make sure we’re putting in the appropriate amount of engineering effort to mitigate the downside risk.
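One way the caching idea could look, as a minimal sketch rather than a proposal for the actual code. The probe here is a hypothetical stand-in that always fails, with a short sleep in place of the compile attempt; `probe_calls` is a counter added only for the example. A real cache key would likely also need whatever else the check depends on (for instance the input types), which is an assumption not covered here.

```python
import functools
import time

probe_calls = 0  # hypothetical counter: how often the expensive probe runs


@functools.lru_cache(maxsize=None)
def can_be_jitted_cached(func):
    """Hypothetical probe; the sleep stands in for a failed compile attempt."""
    global probe_calls
    probe_calls += 1
    time.sleep(0.01)
    return False


def func(grp):
    return grp + grp


# Three apply-style calls with the same function: only the first one probes.
results = [can_be_jitted_cached(func) for _ in range(3)]
```

Note that `lru_cache` keeps a strong reference to each function it has seen, so a real implementation might prefer a bounded or weak-keyed cache.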

> Also I just noticed that 'auto' is not listed in the set of valid engine arguments here.

Fixed this.

python/cudf/cudf/core/groupby/groupby.py

brandon-b-miller · 2023-05-16T15:19:10Z

/merge

brandon-b-miller added 2 commits April 10, 2023 11:56

initial untested

b3a736a

cleanup

da6659f

brandon-b-miller added feature request New feature or request numba Numba issue non-breaking Non-breaking change python labels Apr 11, 2023

brandon-b-miller self-assigned this Apr 11, 2023

github-actions bot added the Python label (Affects Python cuDF API) Apr 11, 2023

brandon-b-miller commented Apr 11, 2023


brandon-b-miller added 2 commits April 21, 2023 10:27

Merge branch 'branch-23.06' into fea-groupby-apply-fallback

acc966e

fixes

7795929

brandon-b-miller marked this pull request as ready for review April 24, 2023 14:06

brandon-b-miller requested a review from a team as a code owner April 24, 2023 14:06

brandon-b-miller requested review from wence- and bdice April 24, 2023 14:06

vyasr approved these changes Apr 24, 2023


wence- requested changes Apr 26, 2023

brandon-b-miller added 5 commits April 26, 2023 09:41

Merge branch 'branch-23.06' into fea-groupby-apply-fallback

9ecd4d4

merge latest and resolve conflicts

bb8e894

update docs

9dadf35

remove todo

064f52d

Merge branch 'branch-23.06' into fea-groupby-apply-fallback

c110e20

brandon-b-miller requested a review from wence- May 11, 2023 14:28

inline function

ad4dbaa

wence- approved these changes May 11, 2023

bdice reviewed May 11, 2023

wence- reviewed May 15, 2023

Mention auto as a valid engine type

11d4f73

bdice approved these changes May 15, 2023

Merge branch 'branch-23.06' into fea-groupby-apply-fallback

6f2cebc

rapids-bot bot merged commit 89feac7 into rapidsai:branch-23.06 May 16, 2023

brandon-b-miller mentioned this pull request May 25, 2023

[BUG] groupby apply on a functools.partial func fails #13426

Closed


brandon-b-miller commented Apr 11, 2023

brandon-b-miller commented Apr 24, 2023

vyasr left a comment

wence- left a comment

wence- Apr 26, 2023

brandon-b-miller Apr 26, 2023

wence- Apr 26, 2023

brandon-b-miller Apr 26, 2023

wence- left a comment

bdice May 11, 2023

brandon-b-miller May 12, 2023

vyasr May 13, 2023

bdice May 13, 2023

wence- May 15, 2023

brandon-b-miller commented May 16, 2023
