Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support collect aggregations in reduction #10353

Merged
merged 4 commits into from
Mar 10, 2022

Conversation

sperlingxx
Copy link
Contributor

Closes #7807

Curreent PR is to support the collect aggregation family in reduction context, which includes collect_list, collect_set, merge_lists, and merge_sets.
The implementations are inspired by corresponding collect aggregations in groupby context.

@sperlingxx sperlingxx added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Feb 24, 2022
@sperlingxx sperlingxx requested review from a team as code owners February 24, 2022 10:32
@github-actions github-actions bot added the CMake CMake build issue label Feb 24, 2022
@sperlingxx sperlingxx requested a review from ttnghia February 24, 2022 10:33
Comment on lines 48 to 50
auto not_null_pred = [mask = col.null_mask(), offset = col.offset()] __device__(auto i) {
return bit_is_set(mask, offset + i);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use make_validity_iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Comment on lines 52 to 54
column* null_purged_col = null_purged_table->release().front().release();
null_purged_col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0);
return std::make_unique<list_scalar>(*null_purged_col, true, stream, mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a copy of null_purged_col and then leaks it.

Suggested change
column* null_purged_col = null_purged_table->release().front().release();
null_purged_col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0);
return std::make_unique<list_scalar>(*null_purged_col, true, stream, mr);
auto null_purged_col = null_purged_table->release().front();
null_purged_col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0);
return std::make_unique<list_scalar>(std::move(*null_purged_col), true, stream, mr);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad Change done. Thank you for the catch!

@codecov
Copy link

codecov bot commented Feb 25, 2022

Codecov Report

Merging #10353 (5926736) into branch-22.04 (c163886) will decrease coverage by 0.11%.
The diff coverage is 0.00%.

❗ Current head 5926736 differs from pull request most recent head aea1704. Consider uploading reports for the commit aea1704 to get more accurate results

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10353      +/-   ##
================================================
- Coverage         10.62%   10.50%   -0.12%     
================================================
  Files               122      127       +5     
  Lines             20961    21200     +239     
================================================
  Hits               2228     2228              
- Misses            18733    18972     +239     
Impacted Files Coverage Δ
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_compat.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_internals/where.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/decimal.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical_base.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c163886...aea1704. Read the comment docs.

@sperlingxx sperlingxx requested a review from jrhemstad February 26, 2022 07:54
@sperlingxx
Copy link
Contributor Author

Hi @jrhemstad, could you help to take another look at this PR? Thank you

Comment on lines 82 to 89
std::unique_ptr<scalar> merge_sets(column_view const& col,
null_equality nulls_equal,
nan_equality nans_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
CUDF_EXPECTS(col.type().id() == type_id::LIST,
"input column of merge_lists must be a list column");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more descriptive to just make the parameter a lists_column_view and eliminates the extra error checking.

Suggested change
std::unique_ptr<scalar> merge_sets(column_view const& col,
null_equality nulls_equal,
nan_equality nans_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
CUDF_EXPECTS(col.type().id() == type_id::LIST,
"input column of merge_lists must be a list column");
std::unique_ptr<scalar> merge_sets(lists_column_view const& col,
null_equality nulls_equal,
nan_equality nans_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx sperlingxx requested a review from jrhemstad March 9, 2022 02:26
@sperlingxx
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 46ac622 into rapidsai:branch-22.04 Mar 10, 2022
@sperlingxx sperlingxx deleted the reduction_collect_ops branch March 10, 2022 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] support collect aggregations in reduction
2 participants