Cache JIT `GroupBy.apply` functions
#12802 (Merged)
rapids-bot merged 10 commits into rapidsai:branch-23.04 from brandon-b-miller:groupby-apply-caching on Mar 24, 2023
Commits (10), all by brandon-b-miller:
- 6e7af41 add caching to groupby jit
- 416589a Merge branch 'branch-23.04' into groupby-apply-caching
- d0ef1dc add tests
- e41b72c Merge branch 'branch-23.04' into groupby-apply-caching
- 6866975 Update python/cudf/cudf/core/udf/groupby_utils.py
- 4858af2 Merge branch 'branch-23.04' into groupby-apply-caching
- 197bce6 Merge branch 'groupby-apply-caching' of github.com:brandon-b-miller/c…
- 425a912 denote groupby apply udfs in cache key
- 1fbc8f0 Merge branch 'branch-23.04' into groupby-apply-caching
- 136bc11 generate all of the cache key in _generate_cache_key
Is there a reason we don't use `lru_cache` and instead manually track cache keys? I assume it has to do with the types that are supported as `lru_cache` keys?
In this context, `precompiled` is a `cachetools.LRUCache`. Are you asking why we don't use `functools.lru_cache`? If so, the reason was that I wanted different UDF pipelines (`apply`, `groupby.apply`, etc.) to share the same cache.
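To illustrate the distinction being discussed: `functools.lru_cache` attaches one private cache to each decorated function, whereas an explicit LRU mapping can be shared by several compilation pipelines. The sketch below is illustrative only (the names `SharedLRU`, `precompiled`, and the key tuples are assumptions, not cudf's actual code); cudf uses `cachetools.LRUCache`, which this stdlib `OrderedDict` version mimics.

```python
from collections import OrderedDict


class SharedLRU:
    """Minimal LRU cache that several code paths can share, in the spirit
    of cachetools.LRUCache (sketch only, not cudf's implementation)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used


# One shared cache: distinct pipelines store kernels under distinct keys,
# which lru_cache (one cache per decorated function) cannot express.
precompiled = SharedLRU(maxsize=32)
precompiled.put(("apply", "f-bytecode", "int64"), "kernel-A")
precompiled.put(("groupby_apply", "f-bytecode", "int64"), "kernel-B")
```

The design benefit is a single bounded pool of compiled kernels: memory pressure from one pipeline evicts the globally least-recently-used entry rather than being managed per pipeline.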
Nevermind. 😄 I didn't look closely enough at `precompiled`. But to clarify, how do you distinguish the type of UDF? Could an apply function and a groupby apply function reuse the same exact kernel? If not, how are the cache keys distinguished (for functions of the same data types)?
The cache key is based on the bytecode of the UDF; the particulars are found here. I suppose you could get a cache hit erroneously if you:

1. wrote a function `f` and executed it using `DataFrame.apply`, then
2. applied the same `f` to a `groupby` result whose columns were the exact same dtypes as the dataframe it was first applied to.

However, I would expect the above to cause a crash in the pandas case as well, since each API enforces a different syntax for the kinds of UDFs it accepts, so using one kind of function with the other's `apply` API probably wouldn't work in most cases.
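The degeneracy described above can be made concrete. The sketch below (an assumption-laden illustration, not cudf's actual key function; `make_cache_key` is a hypothetical name) keys on the function's bytecode plus input dtypes: two UDFs with identical bytecode and identical dtypes collide unless something else distinguishes the pipeline.

```python
def make_cache_key(func, dtypes):
    """Hypothetical bytecode-plus-dtypes cache key: functions compiled
    from identical source produce identical keys for identical dtypes."""
    code = func.__code__
    return (code.co_code, code.co_consts, tuple(dtypes))


def f(x):
    return x + 1


def g(x):
    return x + 1


# f and g have identical bytecode, so their keys collide for the same
# dtypes -- the erroneous-cache-hit scenario discussed in this thread.
key_f = make_cache_key(f, ("int64",))
key_g = make_cache_key(g, ("int64",))
```

Nothing in such a key records whether the UDF was compiled for `DataFrame.apply` or `GroupBy.apply`, which is exactly what the extra tuple element added later in this PR addresses.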
I'll defer to your judgment here -- but distinguishing keys clearly would be a plus, in my eyes. An erroneous cache hit would be bad.
I added an extra tuple element to the cache keys of UDFs that go through `GroupBy.apply`, which should break this degeneracy: 425a912
Nice. Thanks!
Fwiw, I think I would have preferred an approach like `_generate_cache_key(grouped_values, function, suffix="__GROUPBY_APPLY_UDF")`, where you provide a suffix to the function making the key. Not a dealbreaker, but worth considering if we have more JIT code paths with separate JIT caches.
I agree with you! The cache key should be generated entirely within `_generate_cache_key`, not half in `_generate_cache_key` and half outside of the function. I updated this.