[FEA] index_in_list() to return the index of a search-key in a list row #9164

Closed
mythrocks opened this issue Sep 1, 2021 · 2 comments · Fixed by #9510
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.)

Comments

@mythrocks
Contributor

#7039 added a lists::contains() function to test whether each list row contains a specified search-key. This works with scalar search-keys and with columns, and returns a column of BOOL8.

It would be good to have a lists::index_in_list() that functions similarly, to return the index of the specified search-key in each list row. E.g.

using lcw         = tests::lists_column_wrapper<int32_t>;
auto lists        = lcw{{0,1,2,3}, {4,5,6,7}, {8,9,0,1}};
auto s_key        = make_int_scalar(1);
auto index_column = lists::index_in_list(lists, s_key); // {1, -1, 3};

Such a function would be useful in implementing a GPU accelerated array_position() in SparkSQL.
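
As a rough illustration of the SparkSQL use case: `array_position()` reports 1-based positions and uses 0 for "not found", whereas the example above uses zero-based positions with -1 for "not found". A minimal host-side sketch of that conversion follows. The helper name `to_array_position` is hypothetical (it is not a libcudf or Spark API), and null handling is deliberately left out.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration only (not part of libcudf or Spark).
// Converts zero-based indices (-1 meaning "not found") into
// array_position()-style results (1-based, 0 meaning "not found").
// Null rows are not modeled here.
std::vector<int64_t> to_array_position(std::vector<int32_t> const& zero_based)
{
  std::vector<int64_t> out;
  out.reserve(zero_based.size());
  for (auto idx : zero_based) {
    // Adding one maps -1 to 0 and shifts every real index to 1-based.
    out.push_back(static_cast<int64_t>(idx) + 1);
  }
  return out;
}
```

Applied to the index column `{1, -1, 3}` from the example, this yields `{2, 0, 4}`.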

@mythrocks mythrocks added feature request New feature or request Needs Triage Need team to review and classify labels Sep 1, 2021
@mythrocks mythrocks self-assigned this Sep 1, 2021
@mythrocks
Contributor Author

One might also use this to look up a key in a map column (implemented, say, as a LIST<STRUCT<key,value>>):

  1. Extract keys as a LIST<key>, and values as a LIST<value>, (accounting for nulls, etc.)
  2. Compute index_in_list(search_key) on the keys column. This returns the position of search_key in each list row.
  3. Compute extract_list_element(values, indices) to extract the indices[i]th element from each list row.
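
The steps above can be sketched on the host with plain standard-library containers. This is only a model of the idea, not libcudf code: `index_in_list_model` and `extract_model` are hypothetical names, the keys and values are assumed to be already split into two parallel lists per row (step 1), and treating a missing key as a null result is an assumption, since the thread does not say how a -1 index should be handled during extraction.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using key_lists   = std::vector<std::vector<int32_t>>;
using value_lists = std::vector<std::vector<std::string>>;

// Step 2 (model): position of search_key in each row of keys, or -1 if absent.
std::vector<int32_t> index_in_list_model(key_lists const& keys, int32_t search_key)
{
  std::vector<int32_t> indices;
  for (auto const& row : keys) {
    int32_t found = -1;
    for (std::size_t j = 0; j < row.size(); ++j) {
      if (row[j] == search_key) {
        found = static_cast<int32_t>(j);
        break;
      }
    }
    indices.push_back(found);
  }
  return indices;
}

// Step 3 (model): pick values[i][indices[i]] for each row.
// Assumption: a row whose index is -1 (key not found) yields std::nullopt.
std::vector<std::optional<std::string>> extract_model(value_lists const& values,
                                                      std::vector<int32_t> const& indices)
{
  std::vector<std::optional<std::string>> out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    auto const idx = indices[i];
    if (idx >= 0 && static_cast<std::size_t>(idx) < values[i].size()) {
      out.emplace_back(values[i][static_cast<std::size_t>(idx)]);
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

For keys `{{10, 20}, {30}, {20, 40}}`, values `{{"a", "b"}, {"c"}, {"d", "e"}}`, and search key 20, the model produces the indices `{1, -1, 0}` and the looked-up values `{"b", null, "d"}`.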

@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 7, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Dec 20, 2021
Fixes #9164.

### Prelude
`lists::contains()` (introduced in #7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

### `index_of()`
This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column.

The search can use either `FIND_FIRST` (which returns the position of the first occurrence) or `FIND_LAST` (which returns the position of the last occurrence). Both `column_view` and scalar search keys are supported.

As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.

If the search key cannot be found in a non-null list row, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases (the key is present in a non-null list row), `output[i]` contains a non-negative position.
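
The rules above can be modeled on the host as follows. This is a sketch of the described semantics for a scalar search key, not the libcudf implementation: `index_of_model`, the `find_option` enum, and the `std::optional`-based representation of nulls are all assumptions made for illustration. The sketch also assumes that a null element inside a list row never matches the key.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Names mirror the text; the enum definition itself is an assumption.
enum class find_option { FIND_FIRST, FIND_LAST };

using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// Hypothetical host-side model of the described index_of() semantics.
std::vector<std::optional<int32_t>> index_of_model(std::vector<list_row> const& lists,
                                                   std::optional<int32_t> search_key,
                                                   find_option opt)
{
  std::vector<std::optional<int32_t>> out;
  for (auto const& row : lists) {
    // Rules 1 and 2: a null key or a null list row gives a null output row.
    if (!search_key.has_value() || !row.has_value()) {
      out.emplace_back(std::nullopt);
      continue;
    }
    int32_t pos = -1;  // stays -1 when the key is not found
    for (std::size_t j = 0; j < row->size(); ++j) {
      auto const& e = (*row)[j];
      if (e.has_value() && *e == *search_key) {
        pos = static_cast<int32_t>(j);
        if (opt == find_option::FIND_FIRST) { break; }  // FIND_LAST keeps scanning
      }
    }
    out.emplace_back(pos);
  }
  return out;
}
```

With the rows `{0, 1, 2, 1}`, `{4, null, 6}`, and a null row, searching for 1 gives `{1, -1, null}` under `FIND_FIRST` and `{3, -1, null}` under `FIND_LAST`. A null search key makes every output row null.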

### Semantic changes for `lists::contains()`
This commit also modifies the semantics of `lists::contains()`: it will now return nulls only for the following cases:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases, a non-null bool is returned. Specifically, `lists::contains()` no longer conforms to the SQL semantics of returning `NULL` for list rows that do not contain the search key but do contain null elements. In this case, `false` is now returned.
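
A small host-side model may make the new behavior concrete. It is a sketch of the semantics described above for a scalar key, not the libcudf implementation; `contains_model` and the `std::optional`-based null representation are assumptions for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// Hypothetical host-side model of the revised contains() semantics.
std::vector<std::optional<bool>> contains_model(std::vector<list_row> const& lists,
                                                std::optional<int32_t> search_key)
{
  std::vector<std::optional<bool>> out;
  for (auto const& row : lists) {
    // Null is returned only for a null key or a null list row.
    if (!search_key.has_value() || !row.has_value()) {
      out.emplace_back(std::nullopt);
      continue;
    }
    bool found = false;
    for (auto const& e : *row) {
      if (e.has_value() && *e == *search_key) {
        found = true;
        break;
      }
    }
    // A row with null elements but no match yields false, not null.
    out.emplace_back(found);
  }
  return out;
}
```

For the rows `{1, null, 3}`, `{null, 5}`, a null row, and an empty row, searching for 3 gives `{true, false, null, false}`. Under the earlier SQL-style behavior the second row would have been null.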

### `lists::contains_null_elements()`
A new function has been introduced to check if each list row contains null elements. The semantics are similar to `lists::contains()`, in that the column returned is BOOL8 typed:
  1. If at least one element in a list row is null, the returned row is `true`.
  2. If no element is null, the returned row is `false`.
  3. If the list row is null, the returned row is `null`.
  4. If the list row is empty, the returned row is `false`.

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from `lists::contains()`.
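
The sketch below models the four rules on the host and shows one way the two results might be combined to recover the SQL-style answer. The function names and the `std::optional`-based null representation are hypothetical, and the combination rule in `sql_contains_model` is an inference from the description above rather than something the commit spells out.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row
using bool_col = std::vector<std::optional<bool>>;     // nullable BOOL8-like column

// Hypothetical host-side model of contains_null_elements().
bool_col contains_null_elements_model(std::vector<list_row> const& lists)
{
  bool_col out;
  for (auto const& row : lists) {
    if (!row.has_value()) {  // rule 3: null list row -> null
      out.emplace_back(std::nullopt);
      continue;
    }
    bool has_null = false;  // rules 2 and 4: no nulls, or an empty row -> false
    for (auto const& e : *row) {
      if (!e.has_value()) {  // rule 1: any null element -> true
        has_null = true;
        break;
      }
    }
    out.emplace_back(has_null);
  }
  return out;
}

// Assumed combination: SQL-style contains() is null when the key was not found
// but the row holds null elements; otherwise it matches the contains() result.
bool_col sql_contains_model(bool_col const& contains_result, bool_col const& has_nulls)
{
  bool_col out;
  for (std::size_t i = 0; i < contains_result.size(); ++i) {
    auto const& c = contains_result[i];
    if (!c.has_value()) {
      out.emplace_back(std::nullopt);  // null key or null row stays null
    } else if (*c) {
      out.emplace_back(true);          // key found
    } else if (has_nulls[i].has_value() && *has_nulls[i]) {
      out.emplace_back(std::nullopt);  // not found, but nulls present
    } else {
      out.emplace_back(false);         // not found, no nulls
    }
  }
  return out;
}
```

For the rows `{1, null}`, `{2, 3}`, a null row, and an empty row, `contains_null_elements_model` gives `{true, false, null, false}`. Combining that with a `contains()` result of `{false, true, null, false}` for key 2 gives `{null, true, null, false}`.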

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jason Lowe (https://github.com/jlowe)
  - Mark Harris (https://github.com/harrism)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9510
2 participants