[FEA] Generate element labels from offsets #10905

Closed
ttnghia opened this issue May 19, 2022 · 1 comment · Fixed by #10945
Labels: feature request, libcudf, non-breaking, Spark

Comments

@ttnghia
Contributor

ttnghia commented May 19, 2022

In some cases, for a list column, we want to generate a label for each element in the child column, identifying the list that the element belongs to.

For example, given a list column [ [1, 2, 3], [4, 5], [6, 7, 8] ], we want to generate a label column like [0, 0, 0, 1, 1, 2, 2, 2].

With such a label column, we can combine the child column (i.e., [1, 2, 3, 4, 5, 6, 7, 8]) with the label column for further processing. A use case for such a label column already exists in drop_list_duplicates (link).
The next use case would be set-like operations (#10409), where we want to process all elements in the child column in parallel (i.e., one element per thread) instead of one list per thread.
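To illustrate the idea, here is a minimal host-side sketch in plain C++ (not libcudf code). The helper names `make_labels` and `sum_per_list` are hypothetical, and the list offsets `[0, 3, 5, 8]` are assumed from the example column above. It expands the offsets into labels, then walks the child column one element at a time, using each element's label to find its list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expands list offsets into one label per child element.
// `offsets` has (number of lists + 1) entries, is assumed to start at 0 and
// to be non-decreasing.
std::vector<int> make_labels(const std::vector<int>& offsets) {
  std::vector<int> labels;
  if (offsets.size() < 2) return labels;
  assert(offsets.front() == 0);
  labels.reserve(static_cast<std::size_t>(offsets.back()));
  for (std::size_t list = 0; list + 1 < offsets.size(); ++list) {
    assert(offsets[list] <= offsets[list + 1]);
    const std::size_t count =
        static_cast<std::size_t>(offsets[list + 1] - offsets[list]);
    // Append `count` copies of this list's index.
    labels.insert(labels.end(), count, static_cast<int>(list));
  }
  return labels;
}

// Hypothetical per-element pass: every child element is handled on its own
// and uses its label to locate the list it belongs to. On a GPU each loop
// iteration would correspond to one thread (with atomics or a reduce-by-key
// in place of the plain `+=`).
std::vector<int> sum_per_list(const std::vector<int>& child,
                              const std::vector<int>& labels,
                              std::size_t num_lists) {
  assert(child.size() == labels.size());
  std::vector<int> sums(num_lists, 0);
  for (std::size_t i = 0; i < child.size(); ++i) {
    sums[static_cast<std::size_t>(labels[i])] += child[i];
  }
  return sums;
}
```

Calling `make_labels({0, 3, 5, 8})` yields the label column from the example, and `sum_per_list` then needs only the flat child and label arrays, not the nested list structure.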

@ttnghia added the feature request, libcudf blocker, libcudf, Spark, and non-breaking labels on May 19, 2022
@ttnghia self-assigned this on May 19, 2022
@jrhemstad
Contributor

This sounds like something that could be an iterator instead of, or in addition to, actually materializing the column of labels. I'd make the iterator first and then, if needed, use it to implement materializing the labels.
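One way to picture the lazy approach is a function object that computes a single element's label on demand with a binary search over the offsets. The sketch below is host-only C++ with hypothetical names (`label_lookup`, `materialize_labels`). It is not the libcudf implementation, and a real GPU version would more likely wrap such a functor in a transform iterator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical functor: returns the label of one element without storing
// a label column. `offsets` is assumed to be non-decreasing.
struct label_lookup {
  const std::vector<int>& offsets;

  int operator()(int element_index) const {
    // Shift the index into the coordinate space of the offsets.
    const int target = offsets.front() + element_index;
    assert(target >= offsets.front() && target < offsets.back());
    // Find the first offset strictly greater than `target`; the element
    // belongs to the segment that starts just before it.
    const auto it = std::upper_bound(offsets.begin(), offsets.end(), target);
    return static_cast<int>(it - offsets.begin()) - 1;
  }
};

// Materializing the labels then becomes a thin layer over the lazy lookup.
std::vector<int> materialize_labels(const std::vector<int>& offsets) {
  std::vector<int> labels;
  if (offsets.size() < 2) return labels;
  const label_lookup lookup{offsets};
  const int n = offsets.back() - offsets.front();
  labels.reserve(static_cast<std::size_t>(n));
  for (int i = 0; i < n; ++i) labels.push_back(lookup(i));
  return labels;
}
```

The lookup costs a logarithmic search per access but needs no extra memory, so whether to materialize depends on how many times each label is read.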

@ttnghia changed the title from "[FEA] Generate list labels for each elements in the list child column" to "[FEA] Generate element labels from offsets" on May 24, 2022
rapids-bot bot pushed a commit that referenced this issue Jun 1, 2022
This PR adds a small (detail) API that generates group labels from a given offset array `offsets`. The output is an array of consecutive groups of identical labels, where the number of elements in group `i` is `offsets[i+1] - offsets[i]`.

Examples:
```
offsets = [ 0, 4, 6, 10 ]
output  = [ 0, 0, 0, 0, 1, 1, 2, 2, 2, 2 ]

offsets = [ 5, 10, 12 ]
output  = [ 0, 0, 0, 0, 0, 1, 1 ]
```

Note that the label values always start from `0`. We could add a parameter to allow specifying a different starting value, but we don't need it for now.
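The mapping in the examples above can be sketched on the host with a scatter followed by a prefix sum, a common data-parallel formulation. This is only an illustration under stated assumptions: the function name `labels_from_offsets` is hypothetical, and the sketch does not claim to mirror how the PR implements the API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: labels start at 0 even when offsets[0] != 0.
std::vector<int> labels_from_offsets(const std::vector<int>& offsets) {
  if (offsets.size() < 2) return {};
  assert(std::is_sorted(offsets.begin(), offsets.end()));

  const int base = offsets.front();
  const std::size_t n = static_cast<std::size_t>(offsets.back() - base);
  std::vector<int> labels(n, 0);

  // Scatter step: add 1 at the position where each interior segment starts.
  // Using += (not =) lets repeated offsets, i.e. empty segments, skip a label.
  for (std::size_t i = 1; i + 1 < offsets.size(); ++i) {
    const std::size_t pos = static_cast<std::size_t>(offsets[i] - base);
    if (pos < n) labels[pos] += 1;
  }

  // Scan step: an inclusive prefix sum turns the markers into labels.
  std::partial_sum(labels.begin(), labels.end(), labels.begin());
  return labels;
}
```

For `offsets = [5, 10, 12]`, subtracting the first offset places the single marker at position 5, which gives the seven-element output shown in the second example.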

Several places in cudf have been updated to adopt the new API immediately. These places are already tested extensively, so no unit tests for the new API are needed. In addition, I ran a benchmark for groupby aggregations and found no performance difference after adopting it.

Closes #10905 and unblocks #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Devavret Makkar (https://github.com/devavret)

URL: #10945