Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for tdigest and merge_tdigest aggregations through cudf::reduce #10433

Merged
merged 25 commits into from
Mar 21, 2022

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Mar 14, 2022

Previously, these aggregations only worked with groupby. Now they can be invoked through cudf::reduce, producing scalar tdigest values (which under the hood are simply struct columns with 1 row).

The difference between the groupby and reduce versions is minimal. They are both fundamentally reduce_by_key operations, where the keys represent the bucketing of input values to merged centroids. In the case of groupby, the keys are further partitioned by the specific input group. So the bulk of the changes are simply adding a few extra template parameters to various internal functions to allow the reduce path to behave as if it were just a constant group.

Similarly, many of the groupby tests which involved single groups have been refactored/repurposed for the reduce tests.

Most of the important changes are in tdigest_aggregation.cu

@nvdbaranec nvdbaranec added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels Mar 14, 2022
@nvdbaranec nvdbaranec requested review from a team as code owners March 14, 2022 22:41
@nvdbaranec nvdbaranec requested a review from cwharris March 14, 2022 22:41
@nvdbaranec nvdbaranec requested a review from devavret March 14, 2022 22:41
@github-actions github-actions bot added the CMake CMake build issue label Mar 14, 2022
@codecov
Copy link

codecov bot commented Mar 15, 2022

Codecov Report

Merging #10433 (8a68f3e) into branch-22.04 (0be0b00) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head 8a68f3e differs from pull request most recent head eb6a0c4. Consider uploading reports for the commit eb6a0c4 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.04   #10433      +/-   ##
================================================
+ Coverage         86.13%   86.18%   +0.04%     
================================================
  Files               139      139              
  Lines             22438    22468      +30     
================================================
+ Hits              19328    19363      +35     
+ Misses             3110     3105       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/tools/numeric.py 89.24% <100.00%> (+0.11%) ⬆️
python/dask_cudf/dask_cudf/backends.py 86.44% <100.00%> (+1.47%) ⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/column/string.py 88.39% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.57% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py 95.28% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 90.56% <0.00%> (+0.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf936b6...eb6a0c4. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake LGTM.

struct make_centroid_no_nulls {
column_device_view const col;

centroid operator() __device__(size_type index) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard we need to replace thrust::tuple with cuda::std::tuple.

offset_type const* group_offsets;

thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index)
thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same thing about thrust::pair

*
* This functor assumes the weight for all scalars is simply 1. Under this assumption,
* the nearest weight that will be <= the next limit is simply the nearest integer < the limit,
* which we can get by just taking floor(next_limit). For example if our next limit is 3.56, the
* nearest whole number <= it is floor(3.56) == 3.
*/
struct nearest_value_scalar_weights {
struct nearest_value_scalar_weights_grouped {
offset_type const* group_offsets;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer device_span.

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 21, 2022
@nvdbaranec
Copy link
Contributor Author

rerun tests

@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 037fe87 into rapidsai:branch-22.04 Mar 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants