[FEA] Add support for explode_outer #7466

hyperbolic2346 · 2021-02-26T22:51:05Z

Is your feature request related to a problem? Please describe.
The Spark plugin would like to support explode_outer, which requires support from libcudf. This is an explode operation that does not remove empty lists or null entries, but instead represents them with nulls.

  a          b
null        100
[5, 6]      200
[]          300
[1, 2, 3]   400

would result in:

  a          b
null        100
5           200
6           200
null        300
1           400
2           400
3           400
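To make the requested semantics concrete, here is a minimal pure-Python sketch that reproduces the example above. It is not libcudf or Spark code; the function name `explode_outer_rows` and the row-tuple representation are assumptions made purely for illustration.

```python
def explode_outer_rows(rows):
    """Explode (list_or_None, value) pairs, keeping null and empty lists.

    Hypothetical helper for illustration only: a null (None) or empty
    list produces a single output row whose element is None.
    """
    out = []
    for lst, b in rows:
        if not lst:
            # None or []: keep the row and represent the element as null
            out.append((None, b))
        else:
            # Non-empty list: one output row per element
            for element in lst:
                out.append((element, b))
    return out


table = [(None, 100), ([5, 6], 200), ([], 300), ([1, 2, 3], 400)]
result = explode_outer_rows(table)
for a, b in result:
    print(a, b)
```

A plain explode would instead skip the rows for b = 100 and b = 300 entirely.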


kkraus14 · 2021-03-01T20:18:56Z

cc @marlenezw @shwina for visibility. This is the behavior we'd want by default for Series.explode / DataFrame.explode.

jrhemstad · 2021-03-01T20:40:21Z

What does "outer" mean in this context?

hyperbolic2346 · 2021-03-01T22:18:51Z

I'm not sure of the origin of the nomenclature, but the goal is to bring the nulls and empty lists out into the result.

@hyperbolic2346

This code adds support for explode_outer and explode_outer_position. These differ from explode and explode_position in how null and empty lists are handled. Explode discards null and empty lists and, as such, lifts the child column directly out of the list column. Explode_outer must find these null and empty lists and make space for a null entry in the child column. This means we need to gather both the table and the exploded column. Further, we must first make a pass over the exploded column to count these entries, because we do not know the required size of the gather maps until we have this information, and it isn't just the null count. If there are no null or empty lists in the input, the normal explode function is called as it is simpler, but it does come at the cost of marching the offsets looking for duplicates, which indicate null or empty lists.

closes #7466

Authors:
- Mike Wilson (@hyperbolic2346)

Approvers:
- AJ Schmidt (@ajschmidt8)
- Jake Hemstad (@jrhemstad)
- Nghia Truong (@ttnghia)

URL: #7499
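The counting pass and the two gather maps described in the commit message can be sketched sequentially in Python. This is only an illustration under stated assumptions, not the libcudf implementation (which runs on the GPU): the list column is assumed to use an offsets array of length n + 1, null lists are assumed to have zero length so they appear as duplicate offsets, and `None` stands in for a gather index that produces a null. The function names are hypothetical.

```python
def explode_outer_gather_maps(offsets):
    """Build (table_map, child_map) for an outer explode.

    Hypothetical sketch: offsets[i]..offsets[i + 1] delimits list i in
    the child column. Duplicate offsets mark a null or empty list.
    """
    n = len(offsets) - 1

    # Pass 1: count null/empty lists so the output size is known up front.
    num_null_or_empty = sum(1 for i in range(n) if offsets[i] == offsets[i + 1])
    out_size = (offsets[-1] - offsets[0]) + num_null_or_empty

    # Pass 2: fill both gather maps.
    table_map = []  # which input row each output row copies from
    child_map = []  # which child element each output row takes (None -> null)
    for i in range(n):
        start, end = offsets[i], offsets[i + 1]
        if start == end:
            table_map.append(i)
            child_map.append(None)
        else:
            for j in range(start, end):
                table_map.append(i)
                child_map.append(j)

    assert len(table_map) == len(child_map) == out_size
    return table_map, child_map


def gather(column, gather_map):
    """Gather by index; a None index yields a null (None) entry."""
    return [None if idx is None else column[idx] for idx in gather_map]


# The example from the issue: a = [null, [5, 6], [], [1, 2, 3]]
offsets = [0, 0, 2, 2, 5]
child = [5, 6, 1, 2, 3]
b = [100, 200, 300, 400]

table_map, child_map = explode_outer_gather_maps(offsets)
exploded_a = gather(child, child_map)
exploded_b = gather(b, table_map)
```

Note that the output size is the child length plus the number of null or empty lists, which is why the null count alone is not enough: empty lists are valid rows and also need a slot.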

hyperbolic2346 added the "feature request" (New feature or request) and "Needs Triage" (Need team to review and classify) labels Feb 26, 2021

hyperbolic2346 self-assigned this Feb 26, 2021

kkraus14 added the "libcudf" (Affects libcudf (C++/CUDA) code.), "Python" (Affects Python cuDF API.), and "Spark" (Functionality that helps Spark RAPIDS) labels and removed the "Needs Triage" (Need team to review and classify) label Feb 28, 2021

hyperbolic2346 mentioned this issue Mar 3, 2021

Add explode_outer and explode_outer_position #7499

Merged

rapids-bot bot closed this as completed in #7499 Mar 17, 2021
