Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support segmented apply_boolean_mask #10650

Closed
revans2 opened this issue Apr 13, 2022 · 6 comments · Fixed by #10773
Closed

[FEA] Support segmented apply_boolean_mask #10650

revans2 opened this issue Apr 13, 2022 · 6 comments · Fixed by #10773
Assignees
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Apr 13, 2022

Is your feature request related to a problem? Please describe.
Apache Spark has a list remove (array_remove in Spark) operator that will take a list column and a search column. There is also a map_filter operator that takes a higher order function and filters out entries in the map that match it. In order to implement both of these we would love to have a segmented apply_boolean_mask operator that would take two list columns. One that is being filtered and another that is a list of booleans that would be the boolean mask. The dimensions of both columns must match (lengths of each list in the column).

Describe the solution you'd like
It feels rather simple to implement. But it feels like something that is basic enough it should be a part of CUDF.

Describe alternatives you've considered
We probably could write our own segmented filter using segmented gather, or a regular apply_boolean_mask along with some aggregations and segmented scan to create the output offsets. But it feels like something that should be a part of CUDF.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 13, 2022
@revans2 revans2 changed the title [FEA] Supporting operators for list_remove [FEA] Support segmented apply_boolean_mask Apr 25, 2022
@revans2
Copy link
Contributor Author

revans2 commented Apr 25, 2022

@jrhemstad I rewrote this to just be about segmented apply_boolean_mask like you asked.

@mythrocks
Copy link
Contributor

I haven't thought it through yet, but it sounds like this might be achievable via cudf::segmented_gather(), if the list<bool> is transformed into a gather map.

@mythrocks mythrocks self-assigned this Apr 27, 2022
@mythrocks
Copy link
Contributor

I'll take a quick stab at it.

@jrhemstad
Copy link
Contributor

jrhemstad commented Apr 27, 2022

cudf::segmented_gather(), if the list<bool> is transformed into a gather map.

Doing the stream compaction step of generating the gather map would be the hard part.

bool mask: { [T, T, F], [F, F], [F, T, F, T] }
// maps to:
gather map: { [0, 1], [], [1, 3] }

Typically stream compaction algorithms like copy_if are implemented via a sum scan + scatter_if where the sum scan is over the predicate condition and will then tell you for each element that passes the filter where it should be scattered to. The column level apply_boolean_mask does this but with a custom kernel largely to handle the null masks.

I like @revans2 idea of just using the regular apply_boolean_mask against the child columns with a post-processing step to figure out how many elements in each list passed the filter to calculate what the offsets should be for the resulting list column.

@jrhemstad
Copy link
Contributor

jrhemstad commented Apr 27, 2022

I like @revans2 idea of just using the regular apply_boolean_mask against the child columns with a post-processing step to figure out how many elements in each list passed the filter to calculate what the offsets should be for the resulting list column.

Alternatively, you could run the normal apply_boolean_mask with the filter extracted from the segmented filter column against a segmented sequence to generate the gather map.

E.g.,

bool mask: { T, T, F, F, F, F, T, F, T } // extract segmented mask child, zero copy

segmented sequence: { 0, 1, 2, 0, 1, 0, 1, 2, 3 }

apply_boolean_mask result:  {0, 1, 1, 3}

I think the only benefit of doing it this way is it defers any support for arbitrary nesting to the implementation of gather. Unless apply_boolean_mask supports arbitrary lists/structs already?

@mythrocks
Copy link
Contributor

run the normal apply_boolean_mask with the filter extracted from the segmented filter column against a segmented sequence to generate the gather map.

This is what I was considering. But my view is coloured by my familiarity with segmented_gather(). It could well be true that apply_boolean_mask handles arbitrary nesting already.

rapids-bot bot pushed a commit that referenced this issue May 5, 2022
Fixes #10650.

This commit introduces an `apply_boolean_mask()` method that interprets a boolean `LIST` column as a filter, to select elements from an arbitrary `LIST` input column.
E.g.
```c++
auto const input    = lcw<int32_t>{ {0,1,2}, {3,4}, {5,6,7}, {8,9} };
auto const selector = lcw<bool>   { {0,1,1}, {1,0}, {1,1,1}, {0,0} };
auto const results  = apply_boolean_mask( lists_column_view{input}, lists_column_view{selector} );
// results          == { {1,2}, {3}, {5,6,7}, {} };
```

The `input` and the `boolean_mask` should both have the same number of rows, and each row should have the same number of elements. 
Each output row copies the elements from the input where the boolean mask is non-null and true.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10773
rapids-bot bot pushed a commit that referenced this issue May 12, 2022
Contributes to #10650

Add JNI support for `apply_boolean_mask`

Refer to the descriptions of PR #10773

Signed-off-by: Chong Gao <[email protected]>

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Liangcai Li (https://github.com/firestarman)
  - Jason Lowe (https://github.com/jlowe)

URL: #10812
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants