Cache JIT `GroupBy.apply` functions #12802
Conversation
Can we merge the big move PR #12669 before doing feature work on these areas of the code?
With the merge of #12669, this should be ready for review.
Is it possible to test the caching behavior?
What do you think about the approach here?
Looks good to me; that would be a good test if you copied and altered it for the groupby case.
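A hedged sketch of what such a test could look like for the groupby case (the `precompiled` import path, the test shape, and the assertion style are assumptions, not cudf's actual test):

```python
import cudf
from cudf.core.udf.utils import precompiled  # assumed location of the shared cache


def test_groupby_apply_caching():
    precompiled.clear()
    df = cudf.DataFrame({"a": [1, 1, 2], "b": [1.0, 2.0, 3.0]})

    def func(group):
        return group["b"].max() - group["b"].min()

    # First call compiles the kernel and populates the cache.
    df.groupby("a").apply(func, engine="jit")
    assert len(precompiled) == 1

    # Second call with identical dtypes should hit the cache, not recompile.
    df.groupby("a").apply(func, engine="jit")
    assert len(precompiled) == 1
```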
    grouped_values, function, args
)
return_type = numpy_support.as_dtype(return_type)
cache_key = _generate_cache_key(grouped_values, function)
Is there a reason we don't use `lru_cache` and instead manually track cache keys? I assume it has to do with types being supported in `lru_cache` keys?
In this context, `precompiled` is a `cachetools.LRUCache`. Are you asking why we don't do the following from `functools`?
@functools.lru_cache
def _get_groupby_apply_kernel(...)
If so, the reason was that I wanted different UDF pipelines (`apply`, `groupby.apply`, etc.) to share the same cache.
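As a rough illustration (module and variable names assumed), a single module-level `cachetools.LRUCache` lets every pipeline share one lookup path:

```python
import cachetools

# One cache shared by apply, groupby.apply, and any future UDF pipeline,
# instead of a separate functools.lru_cache per compile function.
precompiled = cachetools.LRUCache(maxsize=32)


def _compile_or_get(cache_key, compile_func):
    # Only the construction of cache_key differs between pipelines;
    # the lookup and insertion logic is identical for all of them.
    if cache_key not in precompiled:
        precompiled[cache_key] = compile_func()
    return precompiled[cache_key]
```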
Never mind. 😄 I didn't look closely enough at `precompiled`. But to clarify, how do you distinguish the type of UDF? Could an apply function and a groupby apply function reuse the exact same kernel? If not, how are the cache keys distinguished (for functions of the same data types)?
The cache key is based on the bytecode of the UDF; the particulars are found here. I suppose you could get a cache hit erroneously if you:
- wrote a function `f` and executed it using `DataFrame.apply`
- applied the exact same `f` on a `groupby` result whose columns had the exact same dtypes as the dataframe it was first applied to
However, I would expect the above to cause a crash in the pandas case as well, since each API enforces a different syntax for the kinds of UDFs it accepts, so using one kind of function with the other's `apply` API probably wouldn't work in most cases.
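To illustrate the collision scenario, here is a simplified sketch of a bytecode-plus-dtypes key (not the actual helper, whose particulars live in the linked code):

```python
def _generate_cache_key(frame, function):
    # Simplified: the real key includes more than this.
    return (
        function.__code__.co_code,               # the UDF's bytecode
        function.__code__.co_consts,             # constants the UDF references
        tuple(str(dt) for dt in frame.dtypes),   # input column dtypes
    )

# Same function + same dtypes => same key, regardless of whether the
# entry point was DataFrame.apply or GroupBy.apply.
```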
I'll defer to your judgment here, but distinguishing keys clearly would be a plus in my eyes. An erroneous cache hit would be bad.
I added an extra tuple element to the cache keys of UDFs that go through `GroupBy.apply`, which should break this degeneracy.
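A sketch of that change, with the exact marker literal assumed:

```python
# Tag keys produced by the groupby path so they can never collide with
# keys produced by the Series/DataFrame apply path.
cache_key = _generate_cache_key(grouped_values, function) + ("__GROUPBY_APPLY_UDF",)
```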
Nice. Thanks!
FWIW, I think I would have preferred an approach like `_generate_cache_key(grouped_values, function, suffix="__GROUPBY_APPLY_UDF")`, where you provide a suffix to the function making the key. Not a dealbreaker, but worth considering if we have more JIT code paths with separate JIT caches.
I agree with you! The cache key should be generated within `_generate_cache_key`, not half in `_generate_cache_key` and half outside of the function. I updated this.
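A sketch of the updated shape (the suffix parameter follows the suggestion above; the key contents and default value are assumptions):

```python
def _generate_cache_key(grouped_values, function, suffix="__APPLY_UDF"):
    # Build the entire key, including the pipeline-distinguishing suffix,
    # in one place rather than appending elements at call sites.
    return (
        function.__code__.co_code,
        function.__code__.co_consts,
        tuple(str(dt) for dt in grouped_values.dtypes),
        suffix,
    )


# Groupby call site:
cache_key = _generate_cache_key(
    grouped_values, function, suffix="__GROUPBY_APPLY_UDF"
)
```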
Test looks good but I think we can improve the cache itself.
Approving pending @bdice's improvement
Co-authored-by: Bradley Dice <[email protected]>
…udf into groupby-apply-caching
/merge
This PR sends incoming UDFs that go through the `engine='jit'` codepath through the main UDF cache. This should avoid recompiling if a user reuses the same UDF on different input data, so long as the types of that data are the same.
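For illustration (a sketch of the intended behavior, not a test from this PR), reusing the same UDF on two frames with identical dtypes should now compile only once:

```python
import cudf


def udf(group):
    return group["y"].max() - group["y"].min()


df1 = cudf.DataFrame({"x": [1, 1, 2], "y": [1.0, 2.0, 3.0]})
df2 = cudf.DataFrame({"x": [3, 4, 4], "y": [4.0, 5.0, 6.0]})

df1.groupby("x").apply(udf, engine="jit")  # compiles the kernel
df2.groupby("x").apply(udf, engine="jit")  # same dtypes: cache hit, no recompile
```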