[FEA] extract item from a list by index #5742

Closed
revans2 opened this issue Jul 22, 2020 · 6 comments · Fixed by #5753
Labels: feature request, libcudf, Python, Spark

Comments

@revans2 (Contributor) commented Jul 22, 2020

Is your feature request related to a problem? Please describe.
I would like to be able to pull out the nth element from a list column and return a new column.

Describe the solution you'd like
I would love an API that lets me take a list column, and pull out a single entry from each list in the column.

Something like the following:

input = [[1, 2, 3], [1], null, [null, 3]];
extract_list_element(input, 0);
[1, 1, null, null]
extract_list_element(input, 1);
[2, null, null, 3]
extract_list_element(input, -1);
[null, null, null, null]

Describe alternatives you've considered
There really are not any except writing it ourselves.

Additional context
This is for Spark. At a minimum we need an API that would look something like:

std::unique_ptr<cudf::column> extract_list_element(cudf::column_view list_column, cudf::scalar const& index);

But Spark supports the full gamut of options:

std::unique_ptr<cudf::column> extract_list_element(cudf::column_view list_column, cudf::column_view index_column);
std::unique_ptr<cudf::column> extract_list_element(cudf::scalar const& list_data, cudf::column_view index_column);

For Spark, a null is returned if the index is out of bounds for the list, if the list itself is null, or if the value in the list is null. We really want this to work for lists of strings.
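
For illustration, here is a minimal host-side sketch of the semantics described above. The names extract_list_element_ref and list_row are hypothetical and only restate the requested behaviour; they are not an existing cudf API.

#include <cstddef>
#include <optional>
#include <vector>

// One row of list<int>: a disengaged optional means a null list,
// disengaged elements mean null entries inside the list.
using list_row = std::optional<std::vector<std::optional<int>>>;

// Reference behaviour: return null when the row is null, the index is
// out of bounds (including negative), or the element itself is null.
std::vector<std::optional<int>> extract_list_element_ref(
  std::vector<list_row> const& input, int index)
{
  std::vector<std::optional<int>> out;
  out.reserve(input.size());
  for (auto const& row : input) {
    if (!row || index < 0 || static_cast<std::size_t>(index) >= row->size()) {
      out.push_back(std::nullopt);
    } else {
      out.push_back((*row)[static_cast<std::size_t>(index)]);
    }
  }
  return out;
}

// With input = [[1, 2, 3], [1], null, [null, 3]]:
//   extract_list_element_ref(input, 0)  -> [1, 1, null, null]
//   extract_list_element_ref(input, 1)  -> [2, null, null, 3]
//   extract_list_element_ref(input, -1) -> [null, null, null, null]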

revans2 added the feature request and Needs Triage labels on Jul 22, 2020
kkraus14 added the Python, libcudf, and Spark labels and removed the Needs Triage label on Jul 22, 2020
@kkraus14 (Collaborator) commented:

cuDF Python could use this as well. I think the expectation from our side would be to throw if the element is out of bounds, but we'll take whatever makes sense from the C++ perspective 😄

@davidwendt (Contributor) commented:

There is precedent for a check_bounds parameter, at least in the cudf::gather() and cudf::scatter() APIs.

The closest analogy I can think of is the cudf.str.get() function, which calls into cudf::strings::slice_strings(). str.get() asks for a specific character within a set of variable-length rows. If the index value falls outside the range of an individual string, an empty string is returned (not a null string).

IMO, returning null for out-of-bounds within a list makes sense here.

@davidwendt (Contributor) commented:

Just a clarification. From your last example in the description:

extract_list_element(input, -1);
[null, null, null, null]

Should negative index values be considered out-of-bounds?

@kkraus14 (Collaborator) commented:

cc @shwina for this discussion

@shwina (Contributor) commented Jul 22, 2020

According to #5505, that check_bounds parameter may be going away. It's still totally possible to figure out if the element is out of bounds, though -- all we have to do is compare it with the offsets child.

We'd love to have support for negative index values to avoid additional pre-processing. So maybe negative-index behaviour should be an input parameter?
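
For illustration, a rough host-side sketch of that offsets comparison, with the negative-index behaviour expressed as a parameter. The names negative_index_policy and out_of_bounds_rows are hypothetical, and a real libcudf implementation would do this on the device rather than on host vectors.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical policy switch for how negative indices are treated.
enum class negative_index_policy { out_of_bounds, wrap_from_end };

// Given the offsets child of a list column (num_rows + 1 entries) and a
// requested index, mark which rows would be out of bounds for that index.
std::vector<bool> out_of_bounds_rows(std::vector<int32_t> const& offsets,
                                     int32_t index,
                                     negative_index_policy policy)
{
  if (offsets.empty()) { return {}; }
  std::vector<bool> oob(offsets.size() - 1);
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    auto const length = offsets[row + 1] - offsets[row];
    auto resolved     = index;
    if (index < 0 && policy == negative_index_policy::wrap_from_end) {
      resolved = length + index;  // e.g. -1 resolves to the last element
    }
    oob[row] = resolved < 0 || resolved >= length;
  }
  return oob;
}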

@revans2 (Contributor, Author) commented Jul 22, 2020

Yes, negative values are considered out of bounds for Spark. We can work around a lot of the bounds-checking issues ourselves so long as we have a list-length API too, so the bounds checking is somewhat optional.
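
For reference, a hedged sketch of how the per-row lengths fall out of the offsets child (list_lengths is a hypothetical helper, not an existing API; null rows would additionally need the parent column's null mask):

#include <cstddef>
#include <cstdint>
#include <vector>

// Per-row list lengths recovered from a list column's offsets child:
// row i has length offsets[i + 1] - offsets[i].
std::vector<int32_t> list_lengths(std::vector<int32_t> const& offsets)
{
  std::vector<int32_t> lengths;
  if (offsets.empty()) { return lengths; }
  lengths.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    lengths.push_back(offsets[i + 1] - offsets[i]);
  }
  return lengths;
}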
