-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Covariance, Pearson correlation for sort groupby (libcudf) #9154
Add Covariance, Pearson correlation for sort groupby (libcudf) #9154
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9154 +/- ##
================================================
- Coverage 10.79% 10.75% -0.04%
================================================
Files 116 116
Lines 18869 19482 +613
================================================
+ Hits 2036 2096 +60
- Misses 16833 17386 +553
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes LGTM
rerun tests |
…ed on column_view instead of requests. (#9195) This PR updates groupby result_cache to use `pair<column_view, aggregation>` as key to unordered_map. This allows to cache intermediate results based on the column view. So, it is possible to cache children column_view results and can be resused in other aggregation_request. Depends on #9185 shallow_hash and is_shallow_equivalent are used for column_view. Additional context: This change is required to cache children column intermediate results in #9154 and allows to be shared across multiple aggregation requests. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9195
std::vector<std::unique_ptr<aggregation>> aggs; | ||
aggs.push_back(make_sum_aggregation()); | ||
// COUNT_VALID | ||
aggs.push_back(make_count_aggregation()); | ||
|
||
return aggs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be fancy.
std::vector<std::unique_ptr<aggregation>> aggs; | |
aggs.push_back(make_sum_aggregation()); | |
// COUNT_VALID | |
aggs.push_back(make_count_aggregation()); | |
return aggs; | |
return {make_sum_aggregation(), make_count_aggregation()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::initializer_list
only allows access to const
elements. When unique_ptr
is given in initializer list, std::vector
tries to copy unique_ptr
. copy constructor of unique_ptr
is deleted.
So, this fails to compile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sure, that makes sense. This should work though?
std::vector<std::unique_ptr<aggregation>> aggs; | |
aggs.push_back(make_sum_aggregation()); | |
// COUNT_VALID | |
aggs.push_back(make_count_aggregation()); | |
return aggs; | |
return std::vector({make_sum_aggregation(), make_count_aggregation()}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has same effect as above.
Anything involving {}
initializer_list on a object cannot be moved. (probably const_cast inside constructor might work, but that's not useful here.)
https://stackoverflow.com/a/8469002/1550940
Surprisingly array works.
std::unique_ptr<aggregation> init[] = {make_sum_aggregation(), make_count_aggregation()};
return std::vector<std::unique_ptr<aggregation>>{std::make_move_iterator(std::begin(init)), std::make_move_iterator(std::end(init))};
Co-authored-by: nvdbaranec <[email protected]>
@gpucibot merge |
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268 Related issue: #1268 Related PRs: #9154, #9166, #9492 Next steps: - [ ] Fix Symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/ correlation between the same colums twice - [ ] Consolidate both `cov()` and `corr()` - [ ] Fix #10303 - [ ] Add `cov `bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment)) - [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment)) Authors: - Mayank Anand (https://github.com/mayankanand007) - Michael Wang (https://github.com/isVoid) - Sheilah Kirui (https://github.com/skirui-source) Approvers: - Bradley Dice (https://github.com/bdice) - Michael Wang (https://github.com/isVoid) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9889
Add sort-groupby covariance and Pearson correlation in libcudf
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195
For both covariance and Pearson correlation, the input column pair should be represented as 2 child columns of non-nullable struct column (
aggregation_request::values
=struct_column_view{x, y}
)x, y values both should be non-null.
mean, stddev, count should be calculated on only common non-null values of both columns.
mean, stddev, count of child columns are cached.
One limitation is when both null columns has non-identical null masks, the cached result (mean, stddev, count) of common valid rows can not be reused because bitmask_and result nullmask goes out of scope and new nullmask is created for another set of columns (even if they are same).
Unit tests for covariance and pearson correlation added.