Add support for tdigest and merge_tdigest aggregations through cudf::reduce #10433

nvdbaranec · 2022-03-14T22:41:20Z

Previously, these aggregations only worked with groupby. Now they can be invoked through cudf::reduce, producing scalar tdigest values (which under the hood are simply struct columns with 1 row).

The difference between the groupby and reduce versions is minimal. They are both fundamentally reduce_by_key operations, where the keys represent the bucketing of input values to merged centroids. In the case of groupby, the keys are further partitioned by the specific input group. So the bulk of the changes are simply adding a few extra template parameters to various internal functions to allow the reduce path to behave as if it were just a constant group.

Similarly, many of the groupby tests which involved single groups have been refactored/repurposed for the reduce tests.

Most of the important changes are in tdigest_aggregation.cu.
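
To illustrate the constant-group idea from the description, here is a minimal host-side C++ sketch. It is not the code in tdigest_aggregation.cu: all names (centroid_sum, grouped_index, constant_group, merge_by_key) are hypothetical, and a sequential std::map stands in for the GPU reduce_by_key. The only point is that the reduce path can reuse the grouped code by supplying a functor that always returns group 0.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical accumulator for one merged centroid.
struct centroid_sum {
  double weighted_sum = 0.0;  // sum of the values merged into this centroid
  double weight       = 0.0;  // number of values merged (unit weights assumed)
};

// Grouped path (hypothetical): look up the group label of each input value.
struct grouped_index {
  std::vector<int> const& group_labels;
  int operator()(std::size_t i) const { return group_labels[i]; }
};

// Reduce path (hypothetical): every input value belongs to the single group 0.
struct constant_group {
  int operator()(std::size_t) const { return 0; }
};

// Sequential stand-in for reduce_by_key, keyed on (group, bucket).
// The template parameter is the only difference between the two paths.
template <typename GroupFn>
std::map<std::pair<int, int>, centroid_sum> merge_by_key(std::vector<double> const& values,
                                                         std::vector<int> const& buckets,
                                                         GroupFn group_of)
{
  assert(values.size() == buckets.size());
  std::map<std::pair<int, int>, centroid_sum> out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    auto& c = out[std::make_pair(group_of(i), buckets[i])];
    c.weighted_sum += values[i];
    c.weight += 1.0;
  }
  return out;
}
```

Calling merge_by_key(values, buckets, constant_group{}) yields one entry per bucket, while passing grouped_index{labels} yields one entry per (group, bucket) pair.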

codecov · 2022-03-15T00:20:59Z

Codecov Report

Merging #10433 (8a68f3e) into branch-22.04 (0be0b00) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head 8a68f3e differs from pull request most recent head eb6a0c4. Consider uploading reports for the commit eb6a0c4 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.04   #10433      +/-   ##
================================================
+ Coverage         86.13%   86.18%   +0.04%     
================================================
  Files               139      139              
  Lines             22438    22468      +30     
================================================
+ Hits              19328    19363      +35     
+ Misses             3110     3105       -5

Impacted Files	Coverage Δ
python/cudf/cudf/core/tools/numeric.py	`89.24% <100.00%> (+0.11%)`	⬆️
python/dask_cudf/dask_cudf/backends.py	`86.44% <100.00%> (+1.47%)`	⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py	`100.00% <100.00%> (ø)`
python/cudf/cudf/core/column/string.py	`88.39% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.57% <0.00%> (+0.22%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`95.28% <0.00%> (+0.29%)`	⬆️
python/cudf/cudf/core/tools/datetimes.py	`84.49% <0.00%> (+0.30%)`	⬆️
python/cudf/cudf/core/column/lists.py	`90.56% <0.00%> (+0.47%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cf936b6...eb6a0c4.

vyasr

CMake LGTM.

devavret · 2022-03-18T17:47:42Z

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

+struct make_centroid_no_nulls {
+  column_device_view const col;
+
+  centroid operator() __device__(size_type index) const


I've heard we need to replace thrust::tuple with cuda::std::tuple.

devavret · 2022-03-18T17:49:27Z

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

  offset_type const* group_offsets;

-  thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index)
+  thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index) const


The same applies to thrust::pair.

devavret · 2022-03-18T17:52:10Z

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

 *
 * This functor assumes the weight for all scalars is simply 1. Under this assumption,
 * the nearest weight that will be <= the next limit is simply the nearest integer < the limit,
 * which we can get by just taking floor(next_limit).  For example if our next limit is 3.56, the
 * nearest whole number <= it is floor(3.56) == 3.
 */
-struct nearest_value_scalar_weights {
+struct nearest_value_scalar_weights_grouped {
  offset_type const* group_offsets;


Prefer device_span.
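
The doc comment in this hunk can be checked with a small host-side sketch. It assumes unit weights, as the comment does. The function name nearest_unit_weight, the group_size parameter, and the clamping of the returned index are assumptions made for illustration; this is not the body of the functor under review.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical stand-in for the scalar-weight lookup: every input value has weight 1,
// so the cumulative weight after k values is exactly k.
inline std::pair<double, int> nearest_unit_weight(double next_limit, int group_size)
{
  assert(next_limit >= 0.0 && group_size > 0);
  // Largest whole cumulative weight that does not exceed the limit.
  double const nearest = std::floor(next_limit);
  // Assumption: cumulative weight k is reached at zero-based index k - 1,
  // clamped to the valid range [0, group_size - 1].
  int const index = std::min(std::max(static_cast<int>(nearest) - 1, 0), group_size - 1);
  return {nearest, index};
}
```

With a limit of 3.56 and a group of 10 values, the sketch returns a weight of 3 at index 2, matching the floor(3.56) == 3 example in the comment.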

devavret

LGTM

nvdbaranec · 2022-03-21T20:32:09Z

rerun tests

nvdbaranec · 2022-03-21T22:23:31Z

@gpucibot merge

nvdbaranec added 22 commits February 22, 2022 11:44

Add scan_aggregation and reduce_aggregations. C++ side only.

245e68c

Java bindings.

c884d5c

Merge branch 'branch-22.04' into scan_reduce_aggregations

321c9b2

Python bindings.

900d55c

Copyright updates.

0398a0d

PR review comments.

a3a71b8

Formatting

56a6c0f

Centralize tdigest aggregation code to quantiles/tdigest.

8917445

Clean up some test code.

e693562

Merge branch 'scan_reduce_aggregations' into tdigest_code_move

f49e2c9

Small test tweak.

23cae44

Merge branch 'scan_reduce_aggregations' into tdigest_code_move

7fdc9f5

tdigest reduce_aggregation functionality and tests.

3088ec8

Merge branch 'branch-22.04' into scan_reduce_aggregations

6f940fd

Merge branch 'scan_reduce_aggregations' into tdigest_code_move

13c776a

Merge branch 'tdigest_code_move' into tdigest_reduction

3140f5f

Merge branch 'branch-22.04' into tdigest_code_move

27a854e

Copyright update.

6827e8f

cmake format fixes.

25c1849

Merge branch 'tdigest_code_move' into tdigest_reduction

b86b3db

Merge branch 'branch-22.04' into tdigest_reduction

83f4d31

Merge tdigest aggregation for cudf::reduce

6a2d50e

nvdbaranec added the feature request, libcudf, DO NOT MERGE, and non-breaking labels Mar 14, 2022

nvdbaranec requested review from a team as code owners March 14, 2022 22:41

nvdbaranec requested a review from cwharris March 14, 2022 22:41

nvdbaranec requested a review from devavret March 14, 2022 22:41

github-actions bot added the CMake label Mar 14, 2022

Formatting fixes.

0fdd74e

hyperbolic2346 reviewed Mar 15, 2022

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

Simplified the conversion in to_tdigest_scalar.

98e76ef

andygrove mentioned this pull request Mar 16, 2022

Add Java bindings for t-digest reduction #10446

Merged

Add enforcement that the output_dtype parameter passed to reduce for tdigest aggregations is STRUCT.

eb6a0c4

vyasr approved these changes Mar 16, 2022

devavret reviewed Mar 18, 2022

cpp/include/cudf_test/tdigest_utilities.cuh

devavret reviewed Mar 18, 2022

devavret approved these changes Mar 18, 2022

nvdbaranec removed the DO NOT MERGE label Mar 21, 2022

cwharris approved these changes Mar 21, 2022

rapids-bot bot merged commit 037fe87 into rapidsai:branch-22.04 Mar 21, 2022
