Implement lists::index_of() to find positions in list rows (#9510)
Fixes #9164.

### Prelude
`lists::contains()` (introduced in #7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

### `index_of()`
This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST` (which finds the position of the first occurrence), or `FIND_LAST` (which finds the last occurrence). Both column_view and scalar search keys are supported.

As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.

If the search key cannot be found in a non-null list row, that output row is set to `-1`. The row `output[i]` is set to null if:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases, `output[i]` contains the non-negative, 0-based position of the match.
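
The rules above can be sketched for a single list row on the host. This is only a model of the described semantics, not the libcudf implementation: `std::optional` stands in for nullable values, the names `element`, `list_row` and `index_of_row` are hypothetical, and the sketch assumes that a null element inside a list never matches a search key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

// Host-side stand-ins (hypothetical, not libcudf types):
using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// Local copy of the option enum described in this commit.
enum class duplicate_find_option : int32_t { FIND_FIRST = 0, FIND_LAST };

// Returns null if the row or the key is null, -1 if the key is absent,
// otherwise the 0-based position of the first or last match.
std::optional<int32_t> index_of_row(
  list_row const& row,
  element const& key,
  duplicate_find_option option = duplicate_find_option::FIND_FIRST)
{
  if (!row.has_value() || !key.has_value()) { return std::nullopt; }

  // Assumption: null elements never compare equal to the key.
  auto const matches = [&](element const& e) { return e.has_value() && *e == *key; };

  if (option == duplicate_find_option::FIND_FIRST) {
    auto const it = std::find_if(row->begin(), row->end(), matches);
    return it == row->end() ? int32_t{-1}
                            : static_cast<int32_t>(std::distance(row->begin(), it));
  }
  // FIND_LAST: scan from the back and convert the reverse position to a forward index.
  auto const rit = std::find_if(row->rbegin(), row->rend(), matches);
  return rit == row->rend() ? int32_t{-1}
                            : static_cast<int32_t>(std::distance(rit, row->rend()) - 1);
}
```

For a row `[1, 2, null, 2]` and key `2`, this model yields position 1 with `FIND_FIRST` and position 3 with `FIND_LAST`.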

### Semantic changes for `lists::contains()`
This commit also modifies the semantics of `lists::contains()`: it now returns null only in the following cases:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases, a non-null bool is returned. Specifically, `lists::contains()` no longer follows the SQL semantics of returning `NULL` for a list row that contains nulls but not the search key. In that case, `false` is now returned.
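
A minimal host-side sketch of the new behaviour for one list row follows. It models the semantics described above and is not the libcudf code; `element`, `list_row` and `contains_row` are hypothetical names, with `std::optional` standing in for nullable values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Host-side stand-ins (hypothetical, not libcudf types):
using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// New semantics: null only when the row or the key is null.
// A row that holds nulls but not the key yields false, not null.
std::optional<bool> contains_row(list_row const& row, element const& key)
{
  if (!row.has_value() || !key.has_value()) { return std::nullopt; }
  return std::any_of(row->begin(), row->end(), [&](element const& e) {
    return e.has_value() && *e == *key;
  });
}
```

With the row `[1, null, 3]`, looking up `2` gives `false` under this model, where the earlier SQL-style behaviour described above would have produced null.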

### `lists::contains_nulls()`
A new function has been introduced to check if each list row contains null elements. The semantics are similar to `lists::contains()`, in that the column returned is BOOL8 typed:
  1. If even 1 element in a list row is null, the returned row is `true`.
  2. If no element is null, the returned row is `false`.
  3. If the list row is null, the returned row is `null`.
  4. If the list row is empty, the returned row is `false`.

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from `lists::contains()`.
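
The four rules can be modelled on the host as below, together with one way a caller might recombine the two results into the SQL-style answer. Both functions are illustrative sketches: `contains_nulls_row` and `sql_style_contains` are hypothetical names, and the recombination logic is an assumption based on the SQL behaviour described in this message, not code taken from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Host-side stand-ins (hypothetical, not libcudf types):
using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// Null row -> null; otherwise true if any element is null (an empty row gives false).
std::optional<bool> contains_nulls_row(list_row const& row)
{
  if (!row.has_value()) { return std::nullopt; }
  return std::any_of(row->begin(), row->end(),
                     [](element const& e) { return !e.has_value(); });
}

// Assumed recombination: a "not found" answer becomes null when the row holds nulls.
std::optional<bool> sql_style_contains(std::optional<bool> found,
                                       std::optional<bool> has_nulls)
{
  if (!found.has_value() || *found) { return found; }  // null or true pass through
  if (has_nulls.has_value() && *has_nulls) { return std::nullopt; }
  return false;
}
```

Here `found` plays the role of a `lists::contains()` output row and `has_nulls` the matching row from the null check.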

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jason Lowe (https://github.com/jlowe)
  - Mark Harris (https://github.com/harrism)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9510
mythrocks authored Dec 20, 2021
1 parent ce02856 commit a4dc42d
Showing 7 changed files with 1,283 additions and 470 deletions.
102 changes: 100 additions & 2 deletions cpp/include/cudf/lists/contains.hpp
@@ -27,7 +27,7 @@ namespace lists {
*/

/**
* @brief Create a column of bool values indicating whether the specified scalar
* @brief Create a column of `bool` values indicating whether the specified scalar
* is an element of each row of a list column.
*
* The output column has as many elements as the input `lists` column.
@@ -51,7 +51,7 @@ std::unique_ptr<column> contains(
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a column of bool values indicating whether the list rows of the first
* @brief Create a column of `bool` values indicating whether the list rows of the first
* column contain the corresponding values in the second column
*
* The output column has as many elements as the input `lists` column.
@@ -74,6 +74,104 @@ std::unique_ptr<column> contains(
cudf::column_view const& search_keys,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a column of `bool` values indicating whether each row in the `lists` column
* contains at least one null element.
*
* The output column has as many elements as the input `lists` column.
* Output `column[i]` is set to null if the list row `lists[i]` is null.
* Otherwise, `column[i]` is set to a non-null boolean value, depending on whether that list
* contains a null element.
* (Empty list rows are considered *NOT* to contain a null element.)
*
* @param lists Lists column whose `n` rows are to be searched
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains_nulls(
cudf::lists_column_view const& lists,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Option to choose whether `index_of()` returns the first or last match
* of a search key in a list row
*/
enum class duplicate_find_option : int32_t {
FIND_FIRST = 0, ///< Finds first instance of a search key in a list row.
FIND_LAST ///< Finds last instance of a search key in a list row.
};

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* within each list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
* Output `column[i]` contains a 0-based index indicating the position of the search key
* in each list, counting from the beginning of the list.
* Note:
* 1. If the `search_key` is null, all output rows are set to null.
* 2. If the row `lists[i]` is null, `output[i]` is also null.
* 3. If the row `lists[i]` does not contain the `search_key`, `output[i]` is set to `-1`.
* 4. In all other cases, `output[i]` is set to a non-negative `size_type` index.
*
* If the `find_option` is set to `FIND_FIRST`, the position of the first match for
* `search_key` is returned.
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_key` type does not match the element type in `lists`
* @throw cudf::logic_error If `search_key` is of a nested type, or `lists` contains nested
* elements (LIST, STRUCT)
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
cudf::scalar const& search_key,
duplicate_find_option find_option = duplicate_find_option::FIND_FIRST,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* row within the corresponding list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
* Output `column[i]` contains a 0-based index indicating the position of each search key
* row in its corresponding list row, counting from the beginning of the list.
* Note:
* 1. If `search_keys[i]` is null, `output[i]` is also null.
* 2. If the row `lists[i]` is null, `output[i]` is also null.
* 3. If the row `lists[i]` does not contain `search_keys[i]`, `output[i]` is set to `-1`.
* 4. In all other cases, `output[i]` is set to a non-negative `size_type` index.
*
* If the `find_option` is set to `FIND_FIRST`, the position of the first match for
* `search_keys[i]` is returned.
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys A column of search keys to be looked up in each corresponding row of
* `lists`
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::logic_error If `search_keys` type does not match the element type in `lists`
* @throw cudf::logic_error If `lists` or `search_keys` contains nested elements (LIST, STRUCT)
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
cudf::column_view const& search_keys,
duplicate_find_option find_option = duplicate_find_option::FIND_FIRST,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
} // namespace lists
} // namespace cudf
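
To illustrate the row-wise pairing documented for the `column_view` overload, here is a host-side model in which `search_keys[i]` is looked up only in `lists[i]`. It is a sketch and not the libcudf implementation: `index_of_rows`, `element` and `list_row` are hypothetical names, `std::optional` models nulls, `std::logic_error` stands in for `cudf::logic_error`, and null elements are assumed never to match.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Host-side stand-ins (hypothetical, not libcudf types):
using element  = std::optional<int32_t>;               // nullopt models a null element
using list_row = std::optional<std::vector<element>>;  // nullopt models a null list row

// Local copy of the enum declared in the header above.
enum class duplicate_find_option : int32_t { FIND_FIRST = 0, FIND_LAST };

std::vector<std::optional<int32_t>> index_of_rows(
  std::vector<list_row> const& lists,
  std::vector<element> const& search_keys,
  duplicate_find_option find_option = duplicate_find_option::FIND_FIRST)
{
  // Mirrors the documented error for mismatched row counts.
  if (lists.size() != search_keys.size()) {
    throw std::logic_error("search_keys must have as many rows as lists");
  }
  std::vector<std::optional<int32_t>> output(lists.size());  // every row starts as null
  for (std::size_t i = 0; i < lists.size(); ++i) {
    // A null list row or a null key leaves output[i] null.
    if (!lists[i].has_value() || !search_keys[i].has_value()) { continue; }
    int32_t position = -1;  // -1 means "not found"
    auto const& row  = *lists[i];
    for (std::size_t j = 0; j < row.size(); ++j) {
      if (row[j].has_value() && *row[j] == *search_keys[i]) {
        position = static_cast<int32_t>(j);
        if (find_option == duplicate_find_option::FIND_FIRST) { break; }
      }
    }
    output[i] = position;
  }
  return output;
}
```

The scalar overload behaves like this model with the same key repeated for every row.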