[BUG] Groupby Mean Performance Regression #6228
Comments
From the profile @quasiben shared, it shows how the @devavret confirmed that currently calling I believe this explains the performance regression as we used to compute the |
With PR #6392:
In [10]: %timeit _ = cdf.groupby('id').agg(['min', 'max', 'count', 'mean'])
886 ms ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit _ = (cdf.groupby('id').min(), cdf.groupby('id').max(), cdf.groupby('id').count(), cdf.groupby('id').mean())
909 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@karthikeyann what are all those tiny ranges before and after the |
before & after first Added a TODO for |
closes #6228
closes #4400
- [x] added groupby hash mean aggregation (multi-pass method).
- [x] added multi-pass method (collated second pass)
- [x] enabled MEAN, STD, VARIANCE, ~SUM_OF_SQUARES~
- [x] unit tests

Implemented a 2-pass approach for compound aggregations. Compound aggregations are aggregations that can be computed from the results of simple aggregations. Simple aggregations need only 1 pass through the grouped values. `aggregation::get_simple_aggregations()` will return the simple aggregations required for an aggregation.
- find the required simple aggregations for the compound aggregations and add them to the list.
- the first pass calculates the list of simple aggregations. (1 kernel launch)
- the second pass takes the results of the simple aggregations and computes the results of the compound aggregations. (1 kernel launch)

Authors:
- Karthikeyan Natarajan
- Karthikeyan

Approvers:
- Devavret Makkar
- Ashwin Srinath
- Jake Hemstad

URL: #6392
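The two-pass idea described in the PR can be illustrated with a small pure-Python sketch. This is not the libcudf implementation: the `SIMPLE_REQUIREMENTS` table and the `first_pass`, `second_pass`, and `groupby_agg` helpers are hypothetical names introduced here, and each "pass" is an ordinary Python loop rather than a GPU kernel launch. Only the name `get_simple_aggregations` is borrowed from the PR text, and its behavior here is an assumption.

```python
# Hypothetical table: the simple aggregations each requested aggregation needs.
# "mean" is a compound aggregation built from "sum" and "count".
SIMPLE_REQUIREMENTS = {
    "sum": ["sum"],
    "count": ["count"],
    "min": ["min"],
    "max": ["max"],
    "mean": ["sum", "count"],
}


def get_simple_aggregations(requested):
    """Collect the de-duplicated simple aggregations needed by `requested`."""
    needed = []
    for agg in requested:
        for simple in SIMPLE_REQUIREMENTS[agg]:
            if simple not in needed:
                needed.append(simple)
    return needed


def first_pass(keys, values, simple_aggs):
    """Single pass over the data, updating every simple aggregation per group."""
    results = {}
    for key, value in zip(keys, values):
        row = results.setdefault(key, {})
        for agg in simple_aggs:
            if agg == "sum":
                row["sum"] = row.get("sum", 0) + value
            elif agg == "count":
                row["count"] = row.get("count", 0) + 1
            elif agg == "min":
                row["min"] = value if "min" not in row else min(row["min"], value)
            elif agg == "max":
                row["max"] = value if "max" not in row else max(row["max"], value)
    return results


def second_pass(simple_results, requested):
    """Derive compound results from the simple results; no pass over the raw data."""
    out = {}
    for key, row in simple_results.items():
        out[key] = {
            agg: (row["sum"] / row["count"] if agg == "mean" else row[agg])
            for agg in requested
        }
    return out


def groupby_agg(keys, values, requested):
    simple_aggs = get_simple_aggregations(requested)
    return second_pass(first_pass(keys, values, simple_aggs), requested)


result = groupby_agg(
    ["a", "b", "a", "b", "a"],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    ["min", "max", "count", "mean"],
)
```

In this sketch, requesting `mean` alongside `count` adds only `sum` to the first pass, so the raw values are still traversed once no matter how many aggregations are requested.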
There may be a performance regression when running groupby mean in cuDF with many columns. When cuDF performs multiple aggregations, it should be more performant to do this with the `agg` operator than to express each aggregation individually. This is not the case in the reproducer below (note this is built with 260 columns and 2592001 rows).
cc @pentschev @zronaghi @shwina @kkraus14
I'm collecting NVTX info now in hopes that this helps