Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmented apply_boolean_mask for LIST columns #10773

Merged
merged 20 commits into from
May 5, 2022

Conversation

mythrocks
Copy link
Contributor

Fixes #10650.

This commit introduces an apply_boolean_mask() method that interprets a boolean LIST column as a filter, to select elements from an arbitrary LIST input column.
E.g.

auto const input    = lcw<int32_t>{ {0,1,2}, {3,4}, {5,6,7}, {8,9} };
auto const selector = lcw<bool>   { {0,1,1}, {1,0}, {1,1,1}, {0,0} };
auto const results  = apply_boolean_mask( lists_column_view{input}, lists_column_view{selector} );
// results          == { {1,2}, {3}, {5,6,7}, {} };

The input and the boolean_mask should both have the same number of rows, and each row should have the same number of elements.
Each output row copies the elements from the input where the boolean mask is non-null and true.

@mythrocks mythrocks requested review from a team as code owners May 3, 2022 05:50
@mythrocks mythrocks requested review from vyasr and ttnghia May 3, 2022 05:50
@mythrocks mythrocks marked this pull request as draft May 3, 2022 05:50
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels May 3, 2022
@mythrocks mythrocks self-assigned this May 3, 2022
@mythrocks mythrocks added feature request New feature or request 2 - In Progress Currently a work in progress 5 - DO NOT MERGE Hold off on merging; see PR for details labels May 3, 2022
@mythrocks
Copy link
Contributor Author

mythrocks commented May 3, 2022

This is still a work in progress. Some work remains to be done:

  • API docs for apply_boolean_mask()
  • Tests for deeper nesting, null rows, empties, etc.

@codecov
Copy link

codecov bot commented May 3, 2022

Codecov Report

Merging #10773 (4da3ff9) into branch-22.06 (dc1435b) will increase coverage by 0.06%.
The diff coverage is 93.24%.

❗ Current head 4da3ff9 differs from pull request most recent head 18210c3. Consider uploading reports for the commit 18210c3 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10773      +/-   ##
================================================
+ Coverage         86.37%   86.43%   +0.06%     
================================================
  Files               142      143       +1     
  Lines             22306    22448     +142     
================================================
+ Hits              19266    19403     +137     
- Misses             3040     3045       +5     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/utils.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 79.76% <ø> (ø)
python/cudf/cudf/core/column/string.py 89.21% <ø> (+0.12%) ⬆️
python/cudf/cudf/core/frame.py 93.41% <ø> (ø)
python/cudf/cudf/core/series.py 95.16% <ø> (ø)
python/custreamz/custreamz/kafka.py 29.16% <ø> (ø)
python/dask_cudf/dask_cudf/io/parquet.py 92.39% <81.08%> (-1.40%) ⬇️
python/cudf/cudf/core/_internals/expressions.py 92.85% <92.85%> (ø)
python/cudf/cudf/core/dataframe.py 93.74% <94.44%> (-0.01%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f8a03e...18210c3. Read the comment docs.

@mythrocks mythrocks marked this pull request as ready for review May 3, 2022 22:11
@mythrocks mythrocks removed 2 - In Progress Currently a work in progress 5 - DO NOT MERGE Hold off on merging; see PR for details labels May 3, 2022
@mythrocks mythrocks requested a review from ttnghia May 3, 2022 22:12
@mythrocks mythrocks added the non-breaking Non-breaking change label May 3, 2022
@mythrocks
Copy link
Contributor Author

I have removed the WIP label, and opened this PR up for review. Thank you for taking an early look at it, @ttnghia, @jrhemstad, @vyasr.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just some minor suggestions.

cpp/include/cudf/lists/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/stream_compaction.hpp Show resolved Hide resolved
cpp/src/lists/apply_boolean_mask.cu Outdated Show resolved Hide resolved

auto constexpr offset_data_type = data_type{type_id::INT32};

auto filtered_child = [&] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the lambdas in this function are for. Are you trying to exploit RVO from them to avoid needing additional std::move calls? I'm not sure why this code has this extra level of indirection from apply_boolean_mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not RVO, but close. The alternative would have been:

  auto output_offsets = [&] { ... } ();
  auto filtered_child = [&] { ... } ();
  // ...
  return cudf::make_lists_column(input.size(),
                                 std::move(output_offsets),
                                 std::move(filtered_child),
                                 input.null_count(),
                                 cudf::detail::copy_bitmask(input.parent(), stream, mr),
                                 stream,
                                 mr);

This would have been what I have already, with more steps. By not immediately invoking the IILE, one avoids having to create-then-std-move those expressions.

Copy link
Contributor Author

@mythrocks mythrocks May 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I should make those lambdas const.
Edit: These are now const. Please let me know if you'd prefer we use the lambda as an IILE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you perhaps missed the point of my question. I'm not asking why you didn't immediately invoke the lambdas. I'm asking why you defined them as lambdas at all. Perhaps that's a silly question for some obvious reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Sorry I didn't follow earlier.
It's only for the "packaging", like helper functions. For instance, the temporaries in the construction of offsets aren't really relevant to the rest of the function. I'm hoping to avoid clutter in the rest of the function.

Copy link
Contributor

@vyasr vyasr May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. To be honest I'm not convinced that restricting the scopes here really helps all that much relative to the boilerplate that it adds (the lambda declarations, the returns, releasing a unique pointer to control scope), not to mention additional cognitive overhead in reading it to figure out scopes, but I don't mind it much so I'm fine leaving it if you like it that way.

cpp/src/lists/apply_boolean_mask.cu Outdated Show resolved Hide resolved
cpp/tests/lists/apply_boolean_mask_test.cpp Outdated Show resolved Hide resolved
cpp/tests/lists/apply_boolean_mask_test.cpp Show resolved Hide resolved
cpp/tests/lists/apply_boolean_mask_test.cpp Show resolved Hide resolved
cpp/tests/lists/apply_boolean_mask_test.cpp Show resolved Hide resolved
cpp/tests/lists/apply_boolean_mask_test.cpp Show resolved Hide resolved
1. Better names for some test cases.
2. Cached common expressions across lambdas.
3. Removed unnecessary @copydoc.
4. Switch references to `selector` to instead use `boolean_mask`.
5. Miscellaneous minor improvements.
@mythrocks mythrocks requested a review from vyasr May 4, 2022 20:55
@mythrocks
Copy link
Contributor Author

@ttnghia, @vyasr, @jrhemstad: Thank you for your reviews.

@vyasr: I think I've addressed most of your concerns, save for two:

  1. I have retained the comments in the tests, in the hope that they visualize the contents of the input lists. Removing them might make it harder to decipher the operation.
  2. I have retained the lambdas in apply_boolean_mask.cu instead of using IILEs, because it saves a std::move() later on.

I'd be happy to change both, if you'd prefer.

@vyasr
Copy link
Contributor

vyasr commented May 4, 2022

@mythrocks I'm good with the comments. I have a follow-up on the lambdas that we can discuss tomorrow if you'd like, then I think we should be set here.

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e52a1eb into rapidsai:branch-22.06 May 5, 2022
@mythrocks
Copy link
Contributor Author

Thanks for the reviews, all.

rapids-bot bot pushed a commit that referenced this pull request May 12, 2022
Contributes to #10650

Add JNI support for `apply_boolean_mask`

Refer to the descriptions of PR #10773

Signed-off-by: Chong Gao <[email protected]>

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Liangcai Li (https://github.com/firestarman)
  - Jason Lowe (https://github.com/jlowe)

URL: #10812
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support segmented apply_boolean_mask
4 participants