Cache JIT GroupBy.apply functions #12802

Merged

Conversation

brandon-b-miller
Contributor

This PR sends incoming UDFs that go through the engine='jit' codepath through the main UDF cache. This should avoid recompiling if a user reuses the same UDF on different input data, so long as the types of that data are the same.
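
For example (an illustrative sketch only, assuming a reduction-style UDF of the kind the JIT groupby engine accepts): reusing the same function on a second frame with identical column dtypes should hit the cache rather than trigger a second compilation.

import cudf

def udf(group):
    # Toy reduction-style UDF operating on the grouped columns.
    return group["a"].max() + group["b"].min()

df1 = cudf.DataFrame({"key": [0, 0, 1], "a": [1, 2, 3], "b": [4, 5, 6]})
df2 = cudf.DataFrame({"key": [1, 1, 2], "a": [7, 8, 9], "b": [1, 2, 3]})

# First call compiles a kernel; the second should reuse the cached one,
# since both the UDF and the column dtypes are unchanged.
res1 = df1.groupby("key").apply(udf, engine="jit")
res2 = df2.groupby("key").apply(udf, engine="jit")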

@brandon-b-miller added the feature request, 3 - Ready for Review, numba, Python, and non-breaking labels Feb 17, 2023
@brandon-b-miller requested a review from a team as a code owner February 17, 2023 14:21
@bdice
Contributor

bdice commented Feb 17, 2023

Can we merge the big move PR #12669 before doing feature work on these areas of the code?

@brandon-b-miller
Contributor Author

With the merge of #12669, this should be ready for review.

Contributor

@bdice left a comment


Is it possible to test the caching behavior?

@brandon-b-miller
Contributor Author

> Is it possible to test the caching behavior?

What do you think about the approach here?

@bdice
Contributor

bdice commented Feb 23, 2023

> Is it possible to test the caching behavior?
>
> What do you think about the approach here?

Looks good to me; that would be a good test if you copied and adapted it for the groupby case.
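
A hedged sketch of what such a test could look like, assuming the shared cache is the precompiled object discussed below and that it is importable from cuDF's UDF utilities (both are assumptions about internals, not verified paths):

import cudf
from cudf.core.udf.utils import precompiled  # assumed location of the shared UDF cache

def test_groupby_apply_jit_caching():
    # Two frames with different data but identical dtypes.
    df1 = cudf.DataFrame({"key": [0, 0, 1], "val": [1, 2, 3]})
    df2 = cudf.DataFrame({"key": [1, 2, 2], "val": [4, 5, 6]})

    def udf(group):
        return group["val"].max()

    precompiled.clear()
    df1.groupby("key").apply(udf, engine="jit")
    first_size = precompiled.currsize  # one kernel compiled and cached

    df2.groupby("key").apply(udf, engine="jit")
    # Same UDF, same dtypes: no new cache entry means no recompilation.
    assert precompiled.currsize == first_size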

The inline review thread below refers to this excerpt from the diff:

grouped_values, function, args
)
return_type = numpy_support.as_dtype(return_type)
cache_key = _generate_cache_key(grouped_values, function)
Contributor


Is there a reason we don't use lru_cache and instead manually track cache keys? I assume it has to do with which types are supported as lru_cache keys?

Contributor Author


In this context, precompiled is a cachetools.LRUCache. Are you asking why we don't do the following from functools?

@functools.lru_cache
def _get_groupby_apply_kernel(...)

If so, the reason is that I wanted different UDF pipelines (apply, groupby.apply, etc.) to share the same cache.
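
A minimal sketch of the distinction (illustrative names, not cuDF's actual code): a module-level cachetools.LRUCache can be keyed into by every UDF pipeline, whereas functools.lru_cache attaches a private cache to each decorated function.

import functools
from cachetools import LRUCache

# One shared store that several UDF pipelines can key into.
precompiled = LRUCache(maxsize=32)

def get_kernel(cache_key, compile_fn):
    # Series/DataFrame.apply and GroupBy.apply can all route through this
    # shared cache, as long as their keys cannot collide.
    if cache_key not in precompiled:
        precompiled[cache_key] = compile_fn()
    return precompiled[cache_key]

# By contrast, each lru_cache-decorated getter would hold its own
# private cache, so the pipelines could not share compiled kernels.
@functools.lru_cache
def _get_groupby_apply_kernel_private(udf_code, frame_dtypes):
    ...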

Contributor


Nevermind. 😄 I didn't look closely enough at precompiled. But to clarify, how do you distinguish the type of UDF? Could an apply function and a groupby apply function reuse the same exact kernel? If not, how are the cache keys distinguished (for functions of the same data types)?

Contributor Author

@brandon-b-miller Mar 15, 2023


The cache key is based on the bytecode of the UDF; the particulars are found here. I suppose you could erroneously get a cache hit if you:

  1. wrote a function f and executed it using DataFrame.apply
  2. applied the exact same f on a groupby result whose columns had the exact same dtypes as the dataframe it was first applied to

However, I would expect the above to cause a crash in the pandas case as well, since each API enforces a different syntax for the kinds of UDFs it accepts, so using one kind of function with the other's apply API probably wouldn't work in most cases.
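
For illustration only, a simplified, hypothetical key built from bytecode plus dtypes shows the potential collision (these names are not cuDF's actual internals):

def _toy_cache_key(function, dtypes):
    # Hypothetical, simplified key: UDF bytecode and constants plus input dtypes.
    code = function.__code__
    return (code.co_code, code.co_consts, tuple(dtypes))

def f(x):
    return x + 1

key_via_dataframe_apply = _toy_cache_key(f, ("int64", "float64"))
key_via_groupby_apply = _toy_cache_key(f, ("int64", "float64"))
assert key_via_dataframe_apply == key_via_groupby_apply  # the degeneracy described above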

Contributor


I'll defer to your judgment here, but clearly distinguishing the keys would be a plus in my eyes. An erroneous cache hit would be bad.

Contributor Author


I added an extra tuple element to the cache keys of UDFs that go through GroupBy.apply, which should break this degeneracy (425a912).

Contributor


Nice. Thanks!

Contributor


FWIW, I think I would have preferred an approach like _generate_cache_key(grouped_values, function, suffix="__GROUPBY_APPLY_UDF"), where you provide a suffix to the function making the key. Not a dealbreaker, but worth considering if we add more JIT code paths with separate JIT caches.
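
A sketch of the suggested suffix-based keying (a simplified stand-in, not cuDF's actual _generate_cache_key):

def _generate_cache_key_sketch(dtypes, function, suffix="__APPLY_UDF"):
    # Simplified: the real key would also fold in bytecode, constants, closures, etc.
    return (function.__code__.co_code, tuple(str(d) for d in dtypes), suffix)

# The groupby path passes its own suffix, so the same function applied to
# same-dtype data via DataFrame.apply cannot produce an identical key:
# _generate_cache_key_sketch(dtypes, f, suffix="__GROUPBY_APPLY_UDF")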

Contributor Author


I agree with you! The cache key should be generated within _generate_cache_key, not half in _generate_cache_key and half outside of the function. I updated this.

Contributor

@bdice left a comment


The test looks good, but I think we can improve the cache itself.

Contributor

@shwina left a comment


Approving, pending @bdice's improvement.

@brandon-b-miller
Contributor Author

/merge

@rapids-bot (bot) merged commit a0473cf into rapidsai:branch-23.04 Mar 24, 2023