
[BUG] Groupby Mean Performance Regression #6228

Closed
quasiben opened this issue Sep 14, 2020 · 4 comments · Fixed by #6392
Labels: bug (Something isn't working), Performance (Performance related issue), Python (Affects Python cuDF API)

@quasiben (Member) commented:

There may be a performance regression when running groupby mean in cuDF with many columns. When cuDF performs multiple aggregations, it should be more performant to express them together with the agg operator rather than calling each aggregation individually. This is not the case in the reproducer below (note: the frame is built with 260 columns and 2,592,001 rows):

In [1]: import cudf

In [2]: alphabets = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

In [3]: prefixes = alphabets[:10]

In [4]: coll_dict = dict()
   ...: for prefix in prefixes:
   ...:     for this_name in alphabets:
   ...:         coll_dict[prefix + this_name] = float
   ...:

In [5]: coll_dict['id'] = int

In [6]: cdf = cudf.datasets.timeseries(start='2000',
   ...:                      end='2000-01-31',
   ...:                      dtypes=coll_dict,
   ...:                      freq='1s',
   ...:                      seed=1,
   ...:                     ).reset_index(drop=True)

In [7]: %timeit _ = cdf.groupby('id').agg(['min', 'max', 'count', 'mean'])
3.5 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit _ = (cdf.groupby('id').min(), cdf.groupby('id').max(), cdf.groupby('id').count(), cdf.groupby('id').mean())
1.99 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit _ = cdf.groupby('id').mean()
1.18 s ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cc @pentschev @zronaghi @shwina @kkraus14

I'm collecting NVTX profiling info now in the hope that it helps.

@quasiben added the bug and Needs Triage labels on Sep 14, 2020
@kkraus14 added the Python and Performance labels and removed the Needs Triage label on Sep 14, 2020
@jrhemstad (Contributor) commented:

(screenshot: NVTX profile of the groupby aggregations)

The profile @quasiben shared shows that the MEAN aggregation takes much longer than the others. Based on the kernel activity, the MEAN aggregation is taking the sort-based path, which explains why it is so much slower.

@devavret confirmed that calling cudf::groupby with MEAN currently results in the sort-based path.

I believe this explains the performance regression: we used to compute the MEAN aggregation using the hash-based path. That is still possible today; it just wasn't reimplemented when groupby was ported to libcudf++.
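For illustration (not something proposed in this thread), here is a minimal Python-level sketch of the same idea, assuming the cdf frame from the reproducer above and pandas-style index alignment for DataFrame division:

# Hypothetical workaround sketch: derive the mean from the SUM and COUNT
# aggregations, which are expected to take the faster hash-based path.
sums = cdf.groupby('id').sum()      # simple aggregation
counts = cdf.groupby('id').count()  # simple aggregation
means = sums / counts               # same values as cdf.groupby('id').mean()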

@karthikeyann (Contributor) commented:

With PR #6392:

In [10]: %timeit _ = cdf.groupby('id').agg(['min', 'max', 'count', 'mean'])
886 ms ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit _ = (cdf.groupby('id').min(), cdf.groupby('id').max(), cdf.groupby('id').count(), cdf.groupby('id').mean())
909 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(screenshot: NVTX profile after PR #6392)

@jrhemstad (Contributor) commented:

@karthikeyann what are all those tiny ranges before and after the cudaStreamSynchronize?

@karthikeyann (Contributor) commented:

The small ranges before and after the first cudaStreamSynchronize are make_fixed_width_column calls (for the columns pushed into sparse_cache).
Those after the last cudaStreamSynchronize are null_count calls.
The red ones are cudaFree calls.

I've now added a TODO comment marking the make_fixed_width_column calls as room for improvement.
It could use a similar approach to PR #6605 and PR #6615.

rapids-bot closed this as completed in #6392 on Dec 2, 2020
rapids-bot pushed a commit that referenced this issue on Dec 2, 2020
closes #6228
closes #4400 

- [x] added groupby hash mean aggregation. (multi-pass method).
- [x] added multi-pass method (collated second pass)
- [x] enabled MEAN, STD, VARIANCE, ~SUM_OF_SQUARES~
- [x] unit tests

Implemented a 2-pass approach for compound aggregations.
Compound aggregations are aggregations that can be computed from the results of simple aggregations; simple aggregations need only one pass through the grouped values (see the sketch after the list below).
`aggregation::get_simple_aggregations()` returns the simple aggregations required for a given aggregation.

- find the simple aggregations required by the compound aggregations and add them to a list
- the first pass computes that list of simple aggregations (1 kernel launch)
- the second pass takes the results of the simple aggregations and computes the results of the compound aggregations (1 kernel launch)
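
For illustration only, a minimal CPU-side sketch of the two-pass scheme described above (hypothetical helper names, plain Python dicts standing in for grouped device columns; the variance uses the population form for brevity and is not the libcudf implementation):

SIMPLE_FOR = {
    'mean':     ['sum', 'count'],
    'variance': ['sum', 'count', 'sum_of_squares'],
}

def simple_agg(groups, agg):
    # Pass-1 primitive: one sweep over the grouped values per simple aggregation.
    if agg == 'sum':
        return {k: sum(v) for k, v in groups.items()}
    if agg == 'count':
        return {k: len(v) for k, v in groups.items()}
    if agg == 'sum_of_squares':
        return {k: sum(x * x for x in v) for k, v in groups.items()}
    raise ValueError(f"not a simple aggregation: {agg}")

def two_pass_groupby(groups, requested):
    # Pass 1: collect every simple aggregation the requests depend on and
    # compute them (a single kernel launch in the real implementation).
    needed = {s for agg in requested for s in SIMPLE_FOR.get(agg, [agg])}
    simple = {s: simple_agg(groups, s) for s in needed}

    # Pass 2: derive each compound result from the simple results
    # (the collated second pass, i.e. the second kernel launch).
    out = {}
    for agg in requested:
        if agg == 'mean':
            out[agg] = {k: simple['sum'][k] / simple['count'][k] for k in groups}
        elif agg == 'variance':
            out[agg] = {k: simple['sum_of_squares'][k] / simple['count'][k]
                           - (simple['sum'][k] / simple['count'][k]) ** 2
                        for k in groups}
        else:
            out[agg] = simple[agg]
    return out

# Example: one compound and one simple aggregation, still only two passes.
groups = {'a': [1.0, 2.0, 3.0], 'b': [4.0, 6.0]}
print(two_pass_groupby(groups, ['mean', 'count']))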

Authors:
  - Karthikeyan Natarajan <[email protected]>
  - Karthikeyan <[email protected]>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Jake Hemstad

URL: #6392