Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. #9195

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Sep 8, 2021

This PR updates groupby result_cache to use pair<column_view, aggregation> as key to unordered_map.
This allows to cache intermediate results based on the column view. So, it is possible to cache children column_view results and can be resused in other aggregation_request.

Depends on #9185
shallow_hash and is_shallow_equivalent are used for column_view.

Additional context:
This change is required to cache children column intermediate results in #9154 and allows to be shared across multiple aggregation requests.

@karthikeyann karthikeyann added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 8, 2021
@github-actions github-actions bot added the CMake CMake build issue label Sep 8, 2021
@karthikeyann karthikeyann changed the title Update groupby result_cache to use shallow_hash and is_shallow_equal Update groupby result_cache to allow sharing intermediate results based on column_view instead of requests. Sep 8, 2021
karthikeyann and others added 17 commits September 8, 2021 12:42
…nview' of github.com:karthikeyann/cudf into enh-groupby_cache_hashed
@codecov
Copy link

codecov bot commented Sep 20, 2021

Codecov Report

Merging #9195 (7d7eda5) into branch-21.12 (ab4bfaa) will decrease coverage by 0.02%.
The diff coverage is 1.97%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9195      +/-   ##
================================================
- Coverage         10.79%   10.76%   -0.03%     
================================================
  Files               116      116              
  Lines             18869    19467     +598     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17371     +538     
Impacted Files Coverage Δ
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/timedelta.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d68e626...7d7eda5. Read the comment docs.

@karthikeyann karthikeyann changed the base branch from branch-21.10 to branch-21.12 September 24, 2021 21:03
@github-actions github-actions bot removed the CMake CMake build issue label Sep 27, 2021
@vyasr vyasr requested review from vyasr and removed request for cwharris September 27, 2021 23:47
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change request and one question.

cpp/include/cudf/detail/aggregation/result_cache.hpp Outdated Show resolved Hide resolved
cpp/src/groupby/common/utils.hpp Show resolved Hide resolved
@karthikeyann karthikeyann requested a review from vyasr October 4, 2021 18:03
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1424a2d into rapidsai:branch-21.12 Oct 5, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 18, 2021
Add sort-groupby covariance and Pearson correlation in libcudf 
Addresses part of #1268 (groupby covariance)
Addresses part of #8691 (groupby Pearson correlation)
depends on PR #9195

For both covariance and Pearson correlation, the input column pair should be represented as 2 child columns of non-nullable struct column (`aggregation_request::values` = `struct_column_view{x, y}`)

```
covariance = Sum((x-mean_x)*(y-mean_y)) / (group_size-ddof)
Pearson correlation = covariance/ xstddev / ystddev
```

x, y values both should be non-null. 
mean, stddev, count should be calculated on only common non-null values of both columns.

mean, stddev, count of child columns are cached.
One limitation is when both null columns has non-identical null masks, the cached result (mean, stddev, count) of common valid rows can not be reused because bitmask_and result nullmask goes out of scope and new nullmask is created for another set of columns (even if they are same).

Unit tests for covariance and pearson correlation added.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec

URL: #9154
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants