[FEA] Attempt to JIT GroupBy.apply functions by default and fall back to iterative algorithm #13103

Closed
brandon-b-miller opened this issue Apr 10, 2023 · 3 comments · Fixed by #13113
Assignees
Labels
feature request New feature or request numba Numba issue

Comments

@brandon-b-miller
Contributor

With #11452 we introduced a framework for JIT-compiling groupby UDFs with Numba, along with the engine='jit' kwarg for GroupBy.apply. This is an acceptable approach, since we are generally fine with introducing features that are a superset of the pandas API.

Recently we've discussed changing things so that when a user calls GroupBy.apply, we try to JIT the UDF first and, if that doesn't work, fall back to the iterative method. This would provide a unified API with less for users to learn, and they would no longer have to wonder whether a UDF conforms to the restrictions on JIT apply. It also provides an easier internal interface for features that build on top of GroupBy.apply, such as filter. However, it introduces JIT overhead to workflows that ultimately won't use the JIT path. This is not ideal, but iterative groupby apply is already pretty slow.
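To make the proposed control flow concrete, here is a minimal pure-Python sketch of "try JIT first, fall back to iterative". It is not cuDF code: `JITUnsupported`, `jit_apply`, `iterative_apply`, `apply_with_fallback`, and the `jittable` attribute are all hypothetical stand-ins, and groups are modeled as a plain dict of lists rather than a real GroupBy object.

```python
class JITUnsupported(Exception):
    """Hypothetical error: the UDF does not meet the JIT restrictions."""


def jit_apply(groups, udf):
    # Stand-in for the JIT path. As an assumption for this sketch, only
    # UDFs carrying a truthy `jittable` attribute count as compilable.
    if not getattr(udf, "jittable", False):
        raise JITUnsupported(udf.__name__)
    return {key: udf(values) for key, values in groups.items()}


def iterative_apply(groups, udf):
    # Stand-in for the existing method: call the UDF once per group.
    return {key: udf(values) for key, values in groups.items()}


def apply_with_fallback(groups, udf):
    """Try the JIT path first; on failure, use the iterative path.

    Returns the per-group results and the name of the path that ran.
    """
    try:
        return jit_apply(groups, udf), "jit"
    except JITUnsupported:
        # The failed attempt above is the overhead paid by workflows
        # that end up on the iterative path anyway.
        return iterative_apply(groups, udf), "iterative"


groups = {"a": [1, 2], "b": [3]}


def total(values):
    return sum(values)


total.jittable = True  # pretend this UDF satisfies the JIT restrictions


def largest(values):
    return max(values)  # no `jittable` attribute, so JIT is refused


jit_result = apply_with_fallback(groups, total)
fallback_result = apply_with_fallback(groups, largest)
```

In this sketch the caller never chooses a path; the only visible difference is the cost of the failed JIT attempt, which matches the tradeoff described above.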

@brandon-b-miller brandon-b-miller added feature request New feature or request numba Numba issue python labels Apr 10, 2023
@brandon-b-miller brandon-b-miller self-assigned this Apr 10, 2023
@vyasr
Contributor

vyasr commented Apr 10, 2023

I'm fine with introducing the extra overhead in the cases that won't JIT in order to help the cases that do. We could mitigate the issue by introducing a new engine='auto' mode that does this, allowing users to opt into engine='cudf' if they know their UDF won't compile successfully and want to avoid the overhead.
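As a rough illustration of how the three engine values could be dispatched, the sketch below uses a hypothetical helper, `resolve_engine`, that takes the requested engine and a flag saying whether the UDF can be compiled. It assumes that engine='cudf' means the iterative method and that engine='jit' raises when compilation is not possible; neither the helper nor its error messages come from cuDF itself.

```python
def resolve_engine(engine, udf_is_jittable):
    """Pick the execution path for a groupby UDF (hypothetical helper)."""
    if engine == "cudf":
        # Explicit opt-out: never attempt JIT, so no JIT overhead.
        return "iterative"
    if engine == "jit":
        # Strict mode: a UDF that cannot be compiled is an error.
        if not udf_is_jittable:
            raise ValueError("UDF cannot be JIT compiled")
        return "jit"
    if engine == "auto":
        # Attempt JIT, then quietly fall back to the iterative path.
        return "jit" if udf_is_jittable else "iterative"
    raise ValueError(f"unsupported engine: {engine!r}")


auto_jittable = resolve_engine("auto", udf_is_jittable=True)
auto_fallback = resolve_engine("auto", udf_is_jittable=False)
cudf_forced = resolve_engine("cudf", udf_is_jittable=True)
```

Under these assumptions, 'auto' is the only mode that both benefits from JIT and never fails on an unsupported UDF.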

@brandon-b-miller
Contributor Author

So would engine='auto' be the default?

@vyasr
Contributor

vyasr commented Apr 10, 2023

I think so. If we wanted to err on the side of performance we could default to jit, but that seems likely to break many workflows.
