Cache JIT `GroupBy.apply` functions #12802
Conversation
Can we merge the big move PR #12669 before doing feature work on these areas of the code?
With the merge of #12669, this should be ready for review.
Is it possible to test the caching behavior?
What do you think about the approach here?
Looks good to me; that would be a good test if you copied and altered it for the groupby case.
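A hedged sketch of what such a test could look like for the groupby case (the `precompiled` import path, the test shape, and the assertion style are assumptions, not cudf's actual test):

```python
import cudf
from cudf.core.udf.utils import precompiled  # assumed location of the shared cache


def test_groupby_apply_caching():
    precompiled.clear()
    df = cudf.DataFrame({"a": [1, 1, 2], "b": [1.0, 2.0, 3.0]})

    def func(group):
        return group["b"].max() - group["b"].min()

    # First call compiles the kernel and populates the cache.
    df.groupby("a").apply(func, engine="jit")
    assert len(precompiled) == 1

    # Second call with identical dtypes should hit the cache, not recompile.
    df.groupby("a").apply(func, engine="jit")
    assert len(precompiled) == 1
```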
    grouped_values, function, args
)
return_type = numpy_support.as_dtype(return_type)
cache_key = _generate_cache_key(grouped_values, function)
Is there a reason we don't use `lru_cache` and instead manually track cache keys? I assume it has to do with types being supported in `lru_cache` keys?
In this context, `precompiled` is a `cachetools.LRUCache`. Are you asking why we don't do the following from `functools`?
@functools.lru_cache
def _get_groupby_apply_kernel(...)
If so, the reason was that I wanted different UDF pipelines (`apply`, `groupby.apply`, etc.) to share the same cache.
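As a rough illustration (module and variable names assumed), a single module-level `cachetools.LRUCache` lets every pipeline share one lookup path:

```python
import cachetools

# One cache shared by apply, groupby.apply, and any future UDF pipeline,
# instead of a separate functools.lru_cache per compile function.
precompiled = cachetools.LRUCache(maxsize=32)


def _compile_or_get(cache_key, compile_func):
    # Only the construction of cache_key differs between pipelines;
    # the lookup and insertion logic is identical for all of them.
    if cache_key not in precompiled:
        precompiled[cache_key] = compile_func()
    return precompiled[cache_key]
```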
Never mind. 😄 I didn't look closely enough at `precompiled`. But to clarify, how do you distinguish the type of UDF? Could an apply function and a groupby apply function reuse the exact same kernel? If not, how are the cache keys distinguished (for functions of the same data types)?
The cache key is based on the bytecode of the UDF; the particulars are found here. I suppose you could get a cache hit erroneously if you:
- wrote a function `f` and executed it using `DataFrame.apply`
- applied the exact same `f` on a `groupby` result whose columns had the exact same dtypes as the dataframe it was first applied to
However, I would expect the above to cause a crash in the pandas case as well, since each API enforces a different syntax for the kinds of UDFs it accepts, so using one kind of function with the other's `apply` API probably wouldn't work in most cases.
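To illustrate the collision scenario, here is a simplified sketch of a bytecode-plus-dtypes key (not the actual helper, whose particulars live in the linked code):

```python
def _generate_cache_key(frame, function):
    # Simplified: the real key includes more than this.
    return (
        function.__code__.co_code,               # the UDF's bytecode
        function.__code__.co_consts,             # constants the UDF references
        tuple(str(dt) for dt in frame.dtypes),   # input column dtypes
    )

# Same function + same dtypes => same key, regardless of whether the
# entry point was DataFrame.apply or GroupBy.apply.
```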
I'll defer to your judgment here, but distinguishing keys clearly would be a plus in my eyes. An erroneous cache hit would be bad.
I added an extra tuple element to the cache keys of UDFs that go through `GroupBy.apply`, which should break this degeneracy.
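A sketch of that change, with the exact marker literal assumed:

```python
# Tag keys produced by the groupby path so they can never collide with
# keys produced by the Series/DataFrame apply path.
cache_key = _generate_cache_key(grouped_values, function) + ("__GROUPBY_APPLY_UDF",)
```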
Nice. Thanks!
FWIW, I think I would have preferred an approach like `_generate_cache_key(grouped_values, function, suffix="__GROUPBY_APPLY_UDF")`, where you provide a suffix to the function making the key. Not a dealbreaker, but worth considering if we have more JIT code paths with separate JIT caches.
I agree with you! The cache key should be generated within `_generate_cache_key`, not half in `_generate_cache_key` and half outside of the function. I updated this.
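A sketch of the updated shape (the suffix parameter follows the suggestion above; the key contents and default value are assumptions):

```python
def _generate_cache_key(grouped_values, function, suffix="__APPLY_UDF"):
    # Build the entire key, including the pipeline-distinguishing suffix,
    # in one place rather than appending elements at call sites.
    return (
        function.__code__.co_code,
        function.__code__.co_consts,
        tuple(str(dt) for dt in grouped_values.dtypes),
        suffix,
    )


# Groupby call site:
cache_key = _generate_cache_key(
    grouped_values, function, suffix="__GROUPBY_APPLY_UDF"
)
```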
Test looks good but I think we can improve the cache itself.
Approving pending @bdice's improvement
Co-authored-by: Bradley Dice <[email protected]>
…udf into groupby-apply-caching
/merge
This PR sends incoming UDFs that go through the `engine='jit'` codepath through the main UDF cache. This should avoid recompiling if a user reuses the same UDF on different input data, so long as the types of that data are the same.
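For illustration (a sketch of the intended behavior, not a test from this PR), reusing the same UDF on two frames with identical dtypes should now compile only once:

```python
import cudf


def udf(group):
    return group["y"].max() - group["y"].min()


df1 = cudf.DataFrame({"x": [1, 1, 2], "y": [1.0, 2.0, 3.0]})
df2 = cudf.DataFrame({"x": [3, 4, 4], "y": [4.0, 5.0, 6.0]})

df1.groupby("x").apply(udf, engine="jit")  # compiles the kernel
df2.groupby("x").apply(udf, engine="jit")  # same dtypes: cache hit, no recompile
```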