Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Covariance, Pearson correlation for sort groupby (libcudf) #9154

Merged
merged 81 commits into from
Oct 18, 2021

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Aug 31, 2021

Add sort-groupby covariance and Pearson correlation in libcudf
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as 2 child columns of non-nullable struct column (aggregation_request::values = struct_column_view{x, y})

covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance/ xstddev / ystddev

x, y values both should be non-null.
mean, stddev, count should be calculated on only common non-null values of both columns.

mean, stddev, count of child columns are cached.
One limitation is when both null columns has non-identical null masks, the cached result (mean, stddev, count) of common valid rows can not be reused because bitmask_and result nullmask goes out of scope and new nullmask is created for another set of columns (even if they are same).

Unit tests for covariance and pearson correlation added.

@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Aug 31, 2021
@karthikeyann karthikeyann added 2 - In Progress Currently a work in progress feature request New feature or request non-breaking Non-breaking change and removed CMake CMake build issue labels Aug 31, 2021
@codecov
Copy link

codecov bot commented Aug 31, 2021

Codecov Report

Merging #9154 (8426f56) into branch-21.12 (ab4bfaa) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9154      +/-   ##
================================================
- Coverage         10.79%   10.75%   -0.04%     
================================================
  Files               116      116              
  Lines             18869    19482     +613     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17386     +553     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/timedelta.py 0.00% <0.00%> (ø)
... and 77 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1424a2d...8426f56. Read the comment docs.

@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. labels Sep 4, 2021
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Sep 6, 2021
@karthikeyann karthikeyann added the 3 - Ready for Review Ready for review by team label Oct 4, 2021
@karthikeyann karthikeyann changed the title Add Pearson correlation for sort groupby (libcudf) Add Covariance, Pearson correlation for sort groupby (libcudf) Oct 4, 2021
@karthikeyann karthikeyann marked this pull request as ready for review October 4, 2021 19:47
@karthikeyann karthikeyann requested review from a team as code owners October 4, 2021 19:47
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

@karthikeyann
Copy link
Contributor Author

rerun tests

rapids-bot bot pushed a commit that referenced this pull request Oct 5, 2021
…ed on column_view instead of requests. (#9195)

This PR updates groupby result_cache to use `pair<column_view, aggregation>` as key to unordered_map.
This allows to cache intermediate results based on the column view. So, it is possible to cache children column_view results and can be resused in other aggregation_request.

Depends on #9185
shallow_hash and is_shallow_equivalent are used for column_view.

Additional context:
This change is required to cache children column intermediate results in #9154 and allows to be shared across multiple aggregation requests.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9195
Comment on lines +165 to +170
std::vector<std::unique_ptr<aggregation>> aggs;
aggs.push_back(make_sum_aggregation());
// COUNT_VALID
aggs.push_back(make_count_aggregation());

return aggs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be fancy.

Suggested change
std::vector<std::unique_ptr<aggregation>> aggs;
aggs.push_back(make_sum_aggregation());
// COUNT_VALID
aggs.push_back(make_count_aggregation());
return aggs;
return {make_sum_aggregation(), make_count_aggregation()};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::initializer_list only allows access to const elements. When unique_ptr is given in initializer list, std::vector tries to copy unique_ptr. copy constructor of unique_ptr is deleted.
So, this fails to compile.

Copy link
Contributor

@jrhemstad jrhemstad Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sure, that makes sense. This should work though?

Suggested change
std::vector<std::unique_ptr<aggregation>> aggs;
aggs.push_back(make_sum_aggregation());
// COUNT_VALID
aggs.push_back(make_count_aggregation());
return aggs;
return std::vector({make_sum_aggregation(), make_count_aggregation()});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has same effect as above.
Anything involving {} initializer_list on a object cannot be moved. (probably const_cast inside constructor might work, but that's not useful here.)

https://stackoverflow.com/a/8469002/1550940
Surprisingly array works.

     std::unique_ptr<aggregation> init[] = {make_sum_aggregation(), make_count_aggregation()};
     return std::vector<std::unique_ptr<aggregation>>{std::make_move_iterator(std::begin(init)), std::make_move_iterator(std::end(init))};

@karthikeyann karthikeyann requested a review from jrhemstad October 6, 2021 17:21
cpp/src/groupby/sort/aggregate.cpp Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_correlation.cu Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_correlation.cu Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_correlation.cu Outdated Show resolved Hide resolved
@beckernick beckernick removed their request for review October 14, 2021 13:05
@karthikeyann karthikeyann requested a review from a team October 15, 2021 03:14
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 410efd9 into rapidsai:branch-21.12 Oct 18, 2021
rapids-bot bot pushed a commit that referenced this pull request Feb 17, 2022
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268

Related issue: #1268
Related PRs: #9154, #9166, #9492 

Next steps:

- [ ] Fix Symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/ correlation between the same colums twice
- [ ] 	Consolidate  both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov `bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
  - Mayank Anand (https://github.com/mayankanand007)
  - Michael Wang (https://github.com/isVoid)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Groupby correlation (Pearson)
6 participants