
[BUG] Groupby Mean Performance Regression #6228

Closed
quasiben opened this issue Sep 14, 2020 · 4 comments · Fixed by #6392
Labels: bug (Something isn't working), Performance (Performance related issue), Python (Affects Python cuDF API)

@quasiben (Member) commented:

There may be a performance regression when running groupby mean in cuDF with many columns. When cuDF performs multiple aggregations, it should be more performant to express them together with the agg operator rather than calling each aggregation individually. This is not the case in the reproducer below (note: the frame is built with 260 columns and 2,592,001 rows):

In [1]: import cudf

In [2]: alphabets = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

In [3]: prefixes = alphabets[:10]

In [4]: coll_dict = dict()
   ...: for prefix in prefixes:
   ...:     for this_name in alphabets:
   ...:         coll_dict[prefix + this_name] = float
   ...:

In [5]: coll_dict['id'] = int

In [6]: cdf = cudf.datasets.timeseries(start='2000',
   ...:                      end='2000-01-31',
   ...:                      dtypes=coll_dict,
   ...:                      freq='1s',
   ...:                      seed=1,
   ...:                     ).reset_index(drop=True)

In [7]: %timeit _ = cdf.groupby('id').agg(['min', 'max', 'count', 'mean'])
3.5 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit _ = (cdf.groupby('id').min(), cdf.groupby('id').max(), cdf.groupby('id').count(), cdf.groupby('id').mean())
1.99 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit _ = cdf.groupby('id').mean()
1.18 s ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cc @pentschev @zronaghi @shwina @kkraus14

I'm collecting NVTX profiling info now in the hope that it helps.

@quasiben added the bug and Needs Triage labels on Sep 14, 2020
@kkraus14 added the Python and Performance labels and removed the Needs Triage label on Sep 14, 2020
@jrhemstad (Contributor) commented:

(screenshot: NVTX profile of the groupby aggregations)

The profile @quasiben shared shows that the MEAN aggregation takes much longer than the others. Based on the kernel activity, the MEAN aggregation is taking the sort-based path, which explains why it is so much slower.

@devavret confirmed that calling cudf::groupby with MEAN currently results in the sort-based path.

I believe this explains the performance regression: we used to compute the MEAN aggregation using the hash-based path. That is still possible today; it just wasn't reimplemented when groupby was ported to libcudf++.
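For illustration (not something proposed in this thread), here is a minimal Python-level sketch of the same idea, assuming the cdf frame from the reproducer above and pandas-style index alignment for DataFrame division:

# Hypothetical workaround sketch: derive the mean from the SUM and COUNT
# aggregations, which are expected to take the faster hash-based path.
sums = cdf.groupby('id').sum()      # simple aggregation
counts = cdf.groupby('id').count()  # simple aggregation
means = sums / counts               # same values as cdf.groupby('id').mean()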

@karthikeyann (Contributor) commented:

With PR #6392:

In [10]: %timeit _ = cdf.groupby('id').agg(['min', 'max', 'count', 'mean'])
886 ms ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit _ = (cdf.groupby('id').min(), cdf.groupby('id').max(), cdf.groupby('id').count(), cdf.groupby('id').mean())
909 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(screenshot: NVTX profile after PR #6392)

@jrhemstad (Contributor) commented:

@karthikeyann what are all those tiny ranges before and after the cudaStreamSynchronize?

@karthikeyann (Contributor) commented:

The small ranges before and after the first cudaStreamSynchronize are make_fixed_width_column calls (for the columns pushed into sparse_cache).
Those after the last cudaStreamSynchronize are null_count calls.
The red ones are cudaFree calls.

I've now added a TODO comment marking the make_fixed_width_column calls as room for improvement.
It could use a similar approach to PR #6605 and PR #6615.

rapids-bot closed this as completed in #6392 on Dec 2, 2020
rapids-bot pushed a commit that referenced this issue on Dec 2, 2020
closes #6228
closes #4400 

- [x] added groupby hash mean aggregation. (multi-pass method).
- [x] added multi-pass method (collated second pass)
- [x] enabled MEAN, STD, VARIANCE, ~SUM_OF_SQUARES~
- [x] unit tests

Implemented a 2-pass approach for compound aggregations.
Compound aggregations are aggregations that can be computed from the results of simple aggregations; simple aggregations need only one pass through the grouped values (see the sketch after the list below).
`aggregation::get_simple_aggregations()` returns the simple aggregations required for a given aggregation.

- find the simple aggregations required by the compound aggregations and add them to a list
- the first pass computes that list of simple aggregations (1 kernel launch)
- the second pass takes the results of the simple aggregations and computes the results of the compound aggregations (1 kernel launch)
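
For illustration only, a minimal CPU-side sketch of the two-pass scheme described above (hypothetical helper names, plain Python dicts standing in for grouped device columns; the variance uses the population form for brevity and is not the libcudf implementation):

SIMPLE_FOR = {
    'mean':     ['sum', 'count'],
    'variance': ['sum', 'count', 'sum_of_squares'],
}

def simple_agg(groups, agg):
    # Pass-1 primitive: one sweep over the grouped values per simple aggregation.
    if agg == 'sum':
        return {k: sum(v) for k, v in groups.items()}
    if agg == 'count':
        return {k: len(v) for k, v in groups.items()}
    if agg == 'sum_of_squares':
        return {k: sum(x * x for x in v) for k, v in groups.items()}
    raise ValueError(f"not a simple aggregation: {agg}")

def two_pass_groupby(groups, requested):
    # Pass 1: collect every simple aggregation the requests depend on and
    # compute them (a single kernel launch in the real implementation).
    needed = {s for agg in requested for s in SIMPLE_FOR.get(agg, [agg])}
    simple = {s: simple_agg(groups, s) for s in needed}

    # Pass 2: derive each compound result from the simple results
    # (the collated second pass, i.e. the second kernel launch).
    out = {}
    for agg in requested:
        if agg == 'mean':
            out[agg] = {k: simple['sum'][k] / simple['count'][k] for k in groups}
        elif agg == 'variance':
            out[agg] = {k: simple['sum_of_squares'][k] / simple['count'][k]
                           - (simple['sum'][k] / simple['count'][k]) ** 2
                        for k in groups}
        else:
            out[agg] = simple[agg]
    return out

# Example: one compound and one simple aggregation, still only two passes.
groups = {'a': [1.0, 2.0, 3.0], 'b': [4.0, 6.0]}
print(two_pass_groupby(groups, ['mean', 'count']))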

Authors:
  - Karthikeyan Natarajan <[email protected]>
  - Karthikeyan <[email protected]>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Jake Hemstad

URL: #6392