[FEA] Support segmented apply_boolean_mask #10650
Comments
@jrhemstad I rewrote this to just be about segmented apply_boolean_mask like you asked.
I haven't thought it through yet, but it sounds like this might be achievable via …
I'll take a quick stab at it.
Doing the stream compaction step of generating the gather map would be the hard part.
Typically stream compaction algorithms like …
I like @revans2's idea of just using the regular …
Alternatively, you could run the normal … E.g.,
I think the only benefit of doing it this way is it defers any support for arbitrary nesting to the implementation of …
This is what I was considering. But my view is coloured by my familiarity with …
Fixes #10650.

This commit introduces an `apply_boolean_mask()` method that interprets a boolean `LIST` column as a filter, to select elements from an arbitrary `LIST` input column. E.g.

```c++
auto const input    = lcw<int32_t>{ {0,1,2}, {3,4}, {5,6,7}, {8,9} };
auto const selector = lcw<bool>   { {0,1,1}, {1,0}, {1,1,1}, {0,0} };
auto const results  = apply_boolean_mask(lists_column_view{input}, lists_column_view{selector});
// results == { {1,2}, {3}, {5,6,7}, {} };
```

The `input` and the `boolean_mask` should both have the same number of rows, and each row of `input` should have the same number of elements as the corresponding row of `boolean_mask`. Each output row copies the elements from the input where the boolean mask is non-null and true.

Authors:
- MithunR (https://github.com/mythrocks)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10773
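To make the row semantics described in that commit message concrete, here is a minimal host-side sketch in plain C++. It is not the cuDF implementation and does not use cuDF types: `list_column`, `mask_column`, and `apply_boolean_mask_reference` are hypothetical names chosen for illustration, and a null mask entry is modelled as an empty `std::optional<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side aliases for illustration; these are not cuDF types.
using list_column = std::vector<std::vector<int>>;
using mask_column = std::vector<std::vector<std::optional<bool>>>;

// Row-by-row reference: keep input[r][i] only when mask[r][i] is
// non-null and true. A null mask entry drops the element.
list_column apply_boolean_mask_reference(list_column const& input,
                                         mask_column const& mask)
{
  assert(input.size() == mask.size());  // same number of rows
  list_column output(input.size());
  for (std::size_t r = 0; r < input.size(); ++r) {
    assert(input[r].size() == mask[r].size());  // matching row lengths
    for (std::size_t i = 0; i < input[r].size(); ++i) {
      if (mask[r][i].value_or(false)) { output[r].push_back(input[r][i]); }
    }
  }
  return output;
}
```

Fed the `input` and `selector` rows from the example above, this sketch yields `{ {1,2}, {3}, {5,6,7}, {} }`. A row whose mask is entirely false or null becomes an empty list rather than being removed, so the row count is preserved.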
Contributes to #10650.

Add JNI support for `apply_boolean_mask`. Refer to the description of PR #10773.

Signed-off-by: Chong Gao <[email protected]>

Authors:
- Chong Gao (https://github.com/res-life)

Approvers:
- Liangcai Li (https://github.com/firestarman)
- Jason Lowe (https://github.com/jlowe)

URL: #10812
Is your feature request related to a problem? Please describe.
Apache Spark has a list remove operator (`array_remove` in Spark) that takes a list column and a search column. There is also a `map_filter` operator that takes a higher-order function and filters out entries in the map that match it. In order to implement both of these we would love to have a segmented `apply_boolean_mask` operator that takes two list columns: one that is being filtered, and another that is a list of booleans serving as the boolean mask. The dimensions of both columns must match (the lengths of each list in the column).
Describe the solution you'd like
It feels rather simple to implement. But it feels like something that is basic enough it should be a part of CUDF.
Describe alternatives you've considered
We probably could write our own segmented filter using segmented gather, or a regular apply_boolean_mask along with some aggregations and segmented scan to create the output offsets. But it feels like something that should be a part of CUDF.
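The second alternative can be sketched on the host in plain C++ as a segmented count per row, a scan of those counts to build the output offsets, and an ordinary (unsegmented) compaction of the flattened child. This is only an illustration under stated assumptions: `flat_list_column` and `segmented_filter` are hypothetical names rather than cuDF APIs, the mask is assumed to share the input's offsets, nulls are ignored, and a real implementation would run these steps on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a LIST column (not a cuDF type):
// row r spans child[offsets[r]] up to, but excluding, child[offsets[r + 1]].
struct flat_list_column {
  std::vector<std::int32_t> offsets;  // size == number of rows + 1
  std::vector<std::int32_t> child;    // flattened elements of all rows
};

// Assumes flat_mask is the flattened mask child and shares input.offsets.
flat_list_column segmented_filter(flat_list_column const& input,
                                  std::vector<bool> const& flat_mask)
{
  assert(!input.offsets.empty());
  assert(flat_mask.size() == input.child.size());
  std::size_t const num_rows = input.offsets.size() - 1;

  // Step 1 (segmented aggregation): count the selected elements in each row.
  std::vector<std::int32_t> counts(num_rows, 0);
  for (std::size_t r = 0; r < num_rows; ++r) {
    for (auto i = input.offsets[r]; i < input.offsets[r + 1]; ++i) {
      if (flat_mask[static_cast<std::size_t>(i)]) { ++counts[r]; }
    }
  }

  // Step 2 (scan): running sum of the counts gives the output offsets.
  flat_list_column output;
  output.offsets.assign(num_rows + 1, 0);
  for (std::size_t r = 0; r < num_rows; ++r) {
    output.offsets[r + 1] = output.offsets[r] + counts[r];
  }

  // Step 3 (regular stream compaction): filter the flattened child once,
  // ignoring row boundaries, since the new offsets already describe them.
  output.child.reserve(static_cast<std::size_t>(output.offsets.back()));
  for (std::size_t i = 0; i < input.child.size(); ++i) {
    if (flat_mask[i]) { output.child.push_back(input.child[i]); }
  }
  return output;
}
```

The design point is that only the offsets computation needs to know about segments; the element filtering itself is the existing flat operation applied to the child column.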