[FEA] Add support for explode_outer #7466

Closed
hyperbolic2346 opened this issue Feb 26, 2021 · 3 comments · Fixed by #7499
Labels: feature request, libcudf, Python, Spark

Comments

@hyperbolic2346
Contributor

hyperbolic2346 commented Feb 26, 2021

Is your feature request related to a problem? Please describe.
The Spark plugin would like to support explode_outer, which requires support from libcudf. Unlike explode, explode_outer does not drop rows whose list is null or empty; each such row instead produces a single output row with a null in the exploded column.

a           b
null        100
[5, 6]      200
[]          300
[1, 2, 3]   400

would result in:

a           b
null        100
5           200
6           200
null        300
1           400
2           400
3           400
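
The requested semantics can be sketched in plain Python. This is only a reference model of the behavior shown in the tables above, not libcudf or cuDF code; the function names `explode` and `explode_outer` here are illustrative helpers, and each row is assumed to be a `(list_or_None, b)` tuple.

```python
def explode(rows):
    """Plain explode: rows whose list is None or empty are dropped."""
    out = []
    for a, b in rows:
        if a:  # skips both None and []
            out.extend((x, b) for x in a)
    return out


def explode_outer(rows):
    """Outer explode: a None or empty list yields one row with a null element."""
    out = []
    for a, b in rows:
        if a is None or len(a) == 0:
            out.append((None, b))  # keep the row, null in the exploded column
        else:
            out.extend((x, b) for x in a)
    return out


# The example table from this issue.
rows = [(None, 100), ([5, 6], 200), ([], 300), ([1, 2, 3], 400)]
inner_result = explode(rows)
outer_result = explode_outer(rows)
```

With the input above, `outer_result` keeps the rows for `b = 100` and `b = 300`, while `inner_result` contains only the five rows that come from non-empty lists.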

@hyperbolic2346 added the feature request and Needs Triage labels on Feb 26, 2021
@hyperbolic2346 self-assigned this on Feb 26, 2021
@kkraus14 added the libcudf, Python, and Spark labels and removed the Needs Triage label on Feb 28, 2021
@kkraus14
Collaborator

kkraus14 commented Mar 1, 2021

cc @marlenezw @shwina for visibility. This is the behavior we'd want by default for Series.explode / DataFrame.explode.

@jrhemstad
Contributor

What does "outer" mean in this context?

@hyperbolic2346
Contributor Author

I'm not sure of the origin of the nomenclature, but the goal is to bring the nulls and empty lists out into the result.

rapids-bot bot pushed a commit that referenced this issue Mar 17, 2021
This code adds support for explode_outer and explode_outer_position. These differ from explode and explode_position in how they handle null and empty lists. Explode discards null and empty lists, so it can lift the child column directly out of the list column. Explode_outer must find these null and empty lists and make space for a null entry in the child column, which means we need to gather both the table and the exploded column. We must also make an initial pass over the exploded column to count these entries, because the required size of the gather maps is not known until then; the null count alone is not enough, since empty lists also need an output row.

If the input contains no null or empty lists, the normal explode function is called because it is simpler. Detecting this case does cost a pass over the offsets looking for duplicates, which indicate null or empty lists.
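
The counting pass and gather maps described above can be sketched in plain Python. This is a sequential illustration under stated assumptions, not the libcudf implementation: it assumes an Arrow-style offsets array in which a null or empty list shows up as two equal adjacent offsets, and the names `count_null_or_empty` and `build_gather_maps` are hypothetical. A `None` entry in the child map stands for "emit a null here".

```python
def count_null_or_empty(offsets):
    """Count rows whose start and end offsets are equal (null or empty lists)."""
    return sum(1 for i in range(len(offsets) - 1) if offsets[i] == offsets[i + 1])


def build_gather_maps(offsets):
    """Return (parent_map, child_map) for an outer explode.

    parent_map[k] is the input row that output row k comes from.
    child_map[k] is the child element index, or None for a null entry.
    """
    num_rows = len(offsets) - 1
    child_size = offsets[-1] - offsets[0]
    # The output size is only known after the counting pass.
    out_size = child_size + count_null_or_empty(offsets)

    parent_map, child_map = [], []
    for row in range(num_rows):
        start, end = offsets[row], offsets[row + 1]
        if start == end:
            parent_map.append(row)
            child_map.append(None)  # make space for a null entry
        else:
            for j in range(start, end):
                parent_map.append(row)
                child_map.append(j)
    assert len(parent_map) == out_size
    return parent_map, child_map


# Column a = [null, [5, 6], [], [1, 2, 3]] from the issue, in offsets/child form.
offsets = [0, 0, 2, 2, 5]
child = [5, 6, 1, 2, 3]
b = [100, 200, 300, 400]

num_special = count_null_or_empty(offsets)
parent_map, child_map = build_gather_maps(offsets)
exploded = [None if j is None else child[j] for j in child_map]
gathered_b = [b[i] for i in parent_map]
```

When `count_null_or_empty` returns 0, the child map is just the identity over the child column, which is why the simpler explode path can be used in that case.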

closes #7466

Authors:
  - Mike Wilson (@hyperbolic2346)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Jake Hemstad (@jrhemstad)
  - Nghia Truong (@ttnghia)

URL: #7499