Add explode_outer and explode_outer_position #7499

hyperbolic2346 · 2021-03-03T06:18:42Z

This code adds support for explode_outer and explode_outer_position. These differ from explode and explode_position by the way null and empty lists are handled. Explode discards null and empty lists and as such, lifts the child column directly out of the list column. Explode_outer must find these null and empty lists and make space for a null entry in the child column. This means we need to gather both the table and the exploded column. Further, we must make a pass on the exploded column to count these entries initially as we do not know the required size of the gather maps until we have this information and it isn't just the null count.

If there are no null or empty lists in the input, the normal explode function is called as it is simpler, but it does come at the cost of marching the offsets looking for duplicates, which indicate null or empty lists.

closes #7466

…ode_outer

codecov · 2021-03-03T09:18:38Z

Codecov Report

Merging #7499 (9bd1dd5) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 93.75%.

❗ Current head 9bd1dd5 differs from pull request most recent head 44066f7. Consider uploading reports for the commit 44066f7 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7499      +/-   ##
===============================================
+ Coverage        81.86%   82.39%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17350     +466     
===============================================
+ Hits             13822    14295     +473     
+ Misses            3062     3055       -7

Impacted Files	Coverage Δ
python/cudf/cudf/core/index.py	`93.34% <ø> (+0.48%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`94.83% <87.50%> (-0.20%)`	⬇️
python/cudf/cudf/core/frame.py	`89.09% <89.47%> (+0.08%)`	⬆️
python/cudf/cudf/core/column/column.py	`87.86% <90.00%> (+0.10%)`	⬆️
python/cudf/cudf/core/column/decimal.py	`92.75% <90.32%> (-2.12%)`	⬇️
python/cudf/cudf/core/dataframe.py	`90.58% <95.65%> (+0.11%)`	⬆️
python/cudf/cudf/core/series.py	`91.57% <95.83%> (+0.79%)`	⬆️
python/cudf/cudf/core/column/categorical.py	`91.97% <100.00%> (+0.58%)`	⬆️
python/cudf/cudf/core/column/datetime.py	`89.63% <100.00%> (+0.54%)`	⬆️
python/cudf/cudf/core/column/string.py	`86.76% <100.00%> (+0.26%)`	⬆️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1c60ba...44066f7. Read the comment docs.

jrhemstad · 2021-03-03T14:50:45Z

Can we find a better name than "outer"? I understand it comes from Spark, but libcudf is not Spark. "Outer" is not descriptive. Perhaps instead of separate functions this should be some kind of enum flag to explode.

It also feels like explode is making an arbitrary decision to drop null and empty lists. Should it be making that decision? Now we need two more functions in order to not drop null and empty lists. Alternatively, it would be more flexible/generic if explode nullified null/empty lists and if someone wants those filtered then they could do drop_nulls. Obviously that would have worse performance, but there are plenty of places where we don't fuse operations for the sake of flexibility and a smaller binary size. These are the kind of things we need to think about when adding new functionality to libcudf.

revans2 · 2021-03-03T15:15:53Z

outer is not a Spark term. It is a SQL term. Or perhaps more appropriately it is a relational algebra term. Just like inner joins vs outer joins, an explode can be inner or outer.

That said if we can get the same result with a drop nulls after a more generic explode, I would really like to see the performance difference so we can make an informed decision on it.

hyperbolic2346 · 2021-03-03T16:51:16Z

Perhaps instead of separate functions this should be some kind of enum flag to explode.

I was in this boat for a while and even coded it as such with an enum. What I ended up with though was an enum coming in and a switch statement that had calls to different functions because the code was just different enough between the ways to result in either a hugely complex function or a different function for each operation.

I could merge explode and explode_position again, but the code share is fairly minimal and the split functions seem much easier to understand and are not as daunting to approach. I am certainly open to suggestions.

jrhemstad · 2021-03-03T18:17:38Z

outer is not a Spark term. It is a SQL term. Or perhaps more appropriately it is a relational algebra term. Just like inner joins vs outer joins, an explode can be inner or outer.

Sure, I understand "outer" its well-defined and commonplace for Joins. I suspect the naming comes from the notion of a join as a
Venn diagram and that an "outer join" includes the "outer" (non-overlapping) portions of the diagram.

But the meaning here seems quite a bit different. Granted, I know nothing about SQL nor relational algebra.

I was curious, and the only place I can find the term "outer" used in relational algebra is in the context of joins: https://en.wikipedia.org/wiki/Relational_algebra

I just think we should use a name/descriptor that is more intuitive and descriptive.

revans2 · 2021-03-03T19:34:22Z

I am fine with separate APIs, I am fine with a flag that is passed in to the API, I am also fine with calling drop_nulls after the original explode so long as it still ends up beating the CPU version by a significant amount.

hyperbolic2346 · 2021-03-03T21:42:27Z

I'm not a big fan of having to call drop_nulls mainly because of the performance.

To run explode we call lower_bound for every row index, gather, and then lift the child column directly out of the exploded column.

To do explode_outer we require inclusive_scan to count the null or empty lists, then for_each on every row index, and then do a for_each over the nulls/empty list to fill in the holes in the gather map, then call gather for the rest of the table to duplicate rows, then call gather on the child column to insert spacing for the null entries.

I think I can roll the for_each calls into a single one, but there is still a pile of work to do in the explode_outer case that you simply don't need to do for explode due to how the data is arranged. To then turn around and do more work to remove that extra work seems very unfortunate.

jrhemstad · 2021-03-03T22:18:10Z

To then turn around and do more work to remove that extra work seems very unfortunate.

It is unfortunate, but it is sometimes unavoidable in order for libcudf to continue to serve diverse end users. For example, Pandas wants division by 0 to return 0 #7492 (comment). Pretty sure Spark doesn't want that behavior. So that requires doing extra work or implementing a new custom function into libcudf with that behavior.

Not everyone wants the same behavior and we can't always provide a unique implementation that does exactly what you want with a minimal amount of work. libcudf would become a mess of 99% redundant duplicated functions that differ slightly in behavior and it would be a nightmare for binary size and library maintenance.

My goal is when developers are considering adding new functionality to libcudf that they think about these things. Instead of thinking "I'll implement this exactly for what Spark wants" or "I'll implement this for exactly what Pandas wants", we should be thinking "What can I provide that is generic enough to satisfy all of our users, even if it requires them to do more work".

In the case of explode and explode_outer it seems okay to have separate functions, but I hope I've impressed the importance of thinking through decisions like this.

…ode_outer

cpp/src/reshape/explode.cu

hyperbolic2346 · 2021-03-12T17:50:41Z

Do the explode* functions handle empty inputs?

Yes, this is one of the tests.

hyperbolic2346 · 2021-03-12T21:29:02Z

rerun tests

ttnghia · 2021-03-15T15:03:52Z

I feel like reshape group is not appropriate for explode (and even some other functions in that group), as you are generating something new from the input, not modifying the input's shape. So, instead of reshape group, it should be something like product group.

cpp/tests/reshape/explode_tests.cpp

cpp/include/cudf/lists/explode.hpp

cpp/tests/lists/explode_tests.cpp

…ode_outer

kkraus14 · 2021-03-16T21:28:56Z

@gpucibot merge

hyperbolic2346 · 2021-03-17T01:54:43Z

rerun tests

hyperbolic2346 added 2 commits March 3, 2021 06:00

First pass at explode_outer and explode_outer_position

9541e27

Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/expl…

566a8d1

…ode_outer

hyperbolic2346 requested a review from a team as a code owner March 3, 2021 06:18

hyperbolic2346 requested review from jrhemstad and vuule March 3, 2021 06:18

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 3, 2021

hyperbolic2346 added feature request New feature or request non-breaking Non-breaking change labels Mar 3, 2021

merging two for_each calls into one

557fcfc

hyperbolic2346 added the 3 - Ready for Review Ready for review by team label Mar 4, 2021

hyperbolic2346 added 5 commits March 4, 2021 20:36

linting

7bcede0

Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/expl…

869c27c

…ode_outer

Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/expl…

552e43c

…ode_outer

Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/expl…

23f4abc

…ode_outer

Merge remote-tracking branch 'upstream/branch-0.19' into mwilson/expl…

01cdde2

…ode_outer

jrhemstad reviewed Mar 9, 2021

View reviewed changes