Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. #9195
Co-authored-by: Jake Hemstad <[email protected]>
…/cudf into fea-shallow_equal_columnview
Co-authored-by: Jake Hemstad <[email protected]>
…/cudf into enh-groupby_cache_hashed
…low_hash_columnview
…/cudf into enh-groupby_cache_hashed
Co-authored-by: Jake Hemstad <[email protected]>
…nview' of github.com:karthikeyann/cudf into enh-groupby_cache_hashed
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9195 +/- ##
================================================
- Coverage 10.79% 10.76% -0.03%
================================================
Files 116 116
Lines 18869 19467 +598
================================================
+ Hits 2036 2096 +60
- Misses 16833 17371 +538
…/cudf into enh-groupby_cache_hashed
Co-authored-by: David Wendt <[email protected]>
…/cudf into enh-groupby_cache_hashed
One small change request and one question.
Co-authored-by: Vyas Ramasubramani <[email protected]>
@gpucibot merge
Add sort-groupby covariance and Pearson correlation in libcudf

Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
Depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as 2 child columns of a non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`).
```
covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance / xstddev / ystddev
```
Both x and y values should be non-null: mean, stddev, and count should be calculated only on the rows that are non-null in both columns. The mean, stddev, and count of the child columns are cached. One limitation: when both columns are nullable and have non-identical null masks, the cached results (mean, stddev, count) for the common valid rows cannot be reused, because the null mask produced by bitmask_and goes out of scope and a new null mask is created for another set of columns (even if they are the same).

Unit tests for covariance and Pearson correlation are added.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Sheilah Kirui (https://github.com/skirui-source)
Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec
URL: #9154
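As a rough illustration of the two formulas quoted in that commit message, here is a minimal host-side sketch for a single group. It is not the libcudf implementation: the function names (`group_mean`, `group_covariance`, `group_stddev`, `group_pearson`) are hypothetical, and it assumes that both inputs have the same length, contain no nulls, and that `group_size > ddof`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of one group's values (assumes a non-empty group).
double group_mean(const std::vector<double>& v) {
  double sum = 0.0;
  for (double e : v) sum += e;
  return sum / static_cast<double>(v.size());
}

// covariance = Sum((x - mean_x) * (y - mean_y)) / (group_size - ddof)
// Assumes x.size() == y.size() and x.size() > ddof.
double group_covariance(const std::vector<double>& x,
                        const std::vector<double>& y,
                        std::size_t ddof) {
  const double mean_x = group_mean(x);
  const double mean_y = group_mean(y);
  double acc = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    acc += (x[i] - mean_x) * (y[i] - mean_y);
  }
  return acc / static_cast<double>(x.size() - ddof);
}

// Standard deviation expressed as sqrt(covariance(x, x)), using the same ddof.
double group_stddev(const std::vector<double>& x, std::size_t ddof) {
  return std::sqrt(group_covariance(x, x, ddof));
}

// Pearson correlation = covariance / xstddev / ystddev
double group_pearson(const std::vector<double>& x,
                     const std::vector<double>& y,
                     std::size_t ddof) {
  return group_covariance(x, y, ddof) / group_stddev(x, ddof) / group_stddev(y, ddof);
}
```

In this sketch the mean and stddev of each child column are recomputed on every call; those are the kind of per-column intermediates that the commit message describes as cached.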
This PR updates the groupby result_cache to use
pair<column_view, aggregation>
as the key of its unordered_map. This allows intermediate results to be cached per column view, so results computed for child column_views can be cached and reused by other aggregation_requests.
Depends on #9185
shallow_hash and is_shallow_equivalent are used to hash and compare the column_view part of the key.
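The keying scheme described above can be sketched without any libcudf dependency. Everything in this block is a hypothetical stand-in: `view` plays the role of column_view, `agg_kind` the role of the aggregation, `key_hash` and `key_equal` mimic what shallow_hash and is_shallow_equivalent are described as doing (looking at the view's pointer and metadata, not its contents), and `result_cache_sketch` with its `insert`, `contains`, and `lookup` members is not the real result_cache interface.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for cudf::column_view: a non-owning view over a buffer.
struct view {
  const double* data;
  std::size_t size;
  std::size_t offset;
};

// Hypothetical stand-in for the aggregation part of the key.
enum class agg_kind { SUM, MEAN };

using cache_key = std::pair<view, agg_kind>;

// Mix one hash value into a running seed.
inline std::size_t hash_combine(std::size_t seed, std::size_t h) {
  return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Shallow hash: uses only the pointer and metadata, never the element values.
struct key_hash {
  std::size_t operator()(const cache_key& k) const {
    std::size_t h = std::hash<const double*>{}(k.first.data);
    h = hash_combine(h, std::hash<std::size_t>{}(k.first.size));
    h = hash_combine(h, std::hash<std::size_t>{}(k.first.offset));
    return hash_combine(h, std::hash<int>{}(static_cast<int>(k.second)));
  }
};

// Shallow equivalence: two keys match if they view the same buffer region
// and request the same aggregation.
struct key_equal {
  bool operator()(const cache_key& a, const cache_key& b) const {
    return a.first.data == b.first.data && a.first.size == b.first.size &&
           a.first.offset == b.first.offset && a.second == b.second;
  }
};

// Toy cache that stores one double per (view, aggregation) key.
class result_cache_sketch {
 public:
  bool contains(const view& v, agg_kind k) const {
    return cache_.count(cache_key{v, k}) != 0;
  }
  void insert(const view& v, agg_kind k, double result) {
    cache_.emplace(cache_key{v, k}, result);
  }
  double lookup(const view& v, agg_kind k) const {
    return cache_.at(cache_key{v, k});
  }

 private:
  std::unordered_map<cache_key, double, key_hash, key_equal> cache_;
};
```

With this keying, two separate view objects over the same buffer hit the same cache entry, while a view over a copy of the data does not, even when the values are identical. That is the trade-off of a shallow comparison: lookups never touch the data, but equal contents in different buffers are treated as different keys.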
Additional context:
This change is required to cache the intermediate results of child columns in #9154, and it allows those results to be shared across multiple aggregation requests.