Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explode_outer and explode_outer_position #7499

Merged
merged 18 commits into from
Mar 17, 2021

Conversation

hyperbolic2346
Copy link
Contributor

This code adds support for explode_outer and explode_outer_position. These differ from explode and explode_position by the way null and empty lists are handled. Explode discards null and empty lists and as such, lifts the child column directly out of the list column. Explode_outer must find these null and empty lists and make space for a null entry in the child column. This means we need to gather both the table and the exploded column. Further, we must make a pass on the exploded column to count these entries initially as we do not know the required size of the gather maps until we have this information and it isn't just the null count.

If there are no null or empty lists in the input, the normal explode function is called as it is simpler, but it does come at the cost of marching the offsets looking for duplicates, which indicate null or empty lists.

closes #7466

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner March 3, 2021 06:18
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 3, 2021
@hyperbolic2346 hyperbolic2346 added feature request New feature or request non-breaking Non-breaking change labels Mar 3, 2021
@codecov
Copy link

codecov bot commented Mar 3, 2021

Codecov Report

Merging #7499 (9bd1dd5) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 93.75%.

❗ Current head 9bd1dd5 differs from pull request most recent head 44066f7. Consider uploading reports for the commit 44066f7 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7499      +/-   ##
===============================================
+ Coverage        81.86%   82.39%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17350     +466     
===============================================
+ Hits             13822    14295     +473     
+ Misses            3062     3055       -7     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 93.34% <ø> (+0.48%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.83% <87.50%> (-0.20%) ⬇️
python/cudf/cudf/core/frame.py 89.09% <89.47%> (+0.08%) ⬆️
python/cudf/cudf/core/column/column.py 87.86% <90.00%> (+0.10%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <90.32%> (-2.12%) ⬇️
python/cudf/cudf/core/dataframe.py 90.58% <95.65%> (+0.11%) ⬆️
python/cudf/cudf/core/series.py 91.57% <95.83%> (+0.79%) ⬆️
python/cudf/cudf/core/column/categorical.py 91.97% <100.00%> (+0.58%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.63% <100.00%> (+0.54%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1c60ba...44066f7. Read the comment docs.

@jrhemstad
Copy link
Contributor

jrhemstad commented Mar 3, 2021

Can we find a better name than "outer"? I understand it comes from Spark, but libcudf is not Spark. "Outer" is not descriptive. Perhaps instead of separate functions this should be some kind of enum flag to explode.

It also feels like explode is making an arbitrary decision to drop null and empty lists. Should it be making that decision? Now we need two more functions in order to not drop null and empty lists. Alternatively, it would be more flexible/generic if explode nullified null/empty lists and if someone wants those filtered then they could do drop_nulls. Obviously that would have worse performance, but there are plenty of places where we don't fuse operations for the sake of flexibility and a smaller binary size. These are the kind of things we need to think about when adding new functionality to libcudf.

@revans2
Copy link
Contributor

revans2 commented Mar 3, 2021

outer is not a Spark term. It is a SQL term. Or perhaps more appropriately it is a relational algebra term. Just like inner joins vs outer joins, an explode can be inner or outer.

That said if we can get the same result with a drop nulls after a more generic explode, I would really like to see the performance difference so we can make an informed decision on it.

@hyperbolic2346
Copy link
Contributor Author

Perhaps instead of separate functions this should be some kind of enum flag to explode.

I was in this boat for a while and even coded it as such with an enum. What I ended up with though was an enum coming in and a switch statement that had calls to different functions because the code was just different enough between the ways to result in either a hugely complex function or a different function for each operation.

I could merge explode and explode_position again, but the code share is fairly minimal and the split functions seem much easier to understand and are not as daunting to approach. I am certainly open to suggestions.

@jrhemstad
Copy link
Contributor

outer is not a Spark term. It is a SQL term. Or perhaps more appropriately it is a relational algebra term. Just like inner joins vs outer joins, an explode can be inner or outer.

Sure, I understand "outer" its well-defined and commonplace for Joins. I suspect the naming comes from the notion of a join as a
Venn diagram and that an "outer join" includes the "outer" (non-overlapping) portions of the diagram.

image

But the meaning here seems quite a bit different. Granted, I know nothing about SQL nor relational algebra.

I was curious, and the only place I can find the term "outer" used in relational algebra is in the context of joins: https://en.wikipedia.org/wiki/Relational_algebra

I just think we should use a name/descriptor that is more intuitive and descriptive.

@revans2
Copy link
Contributor

revans2 commented Mar 3, 2021

I am fine with separate APIs, I am fine with a flag that is passed in to the API, I am also fine with calling drop_nulls after the original explode so long as it still ends up beating the CPU version by a significant amount.

@hyperbolic2346
Copy link
Contributor Author

I'm not a big fan of having to call drop_nulls mainly because of the performance.

To run explode we call lower_bound for every row index, gather, and then lift the child column directly out of the exploded column.

To do explode_outer we require inclusive_scan to count the null or empty lists, then for_each on every row index, and then do a for_each over the nulls/empty list to fill in the holes in the gather map, then call gather for the rest of the table to duplicate rows, then call gather on the child column to insert spacing for the null entries.

I think I can roll the for_each calls into a single one, but there is still a pile of work to do in the explode_outer case that you simply don't need to do for explode due to how the data is arranged. To then turn around and do more work to remove that extra work seems very unfortunate.

@jrhemstad
Copy link
Contributor

To then turn around and do more work to remove that extra work seems very unfortunate.

It is unfortunate, but it is sometimes unavoidable in order for libcudf to continue to serve diverse end users. For example, Pandas wants division by 0 to return 0 #7492 (comment). Pretty sure Spark doesn't want that behavior. So that requires doing extra work or implementing a new custom function into libcudf with that behavior.

Not everyone wants the same behavior and we can't always provide a unique implementation that does exactly what you want with a minimal amount of work. libcudf would become a mess of 99% redundant duplicated functions that differ slightly in behavior and it would be a nightmare for binary size and library maintenance.

My goal is when developers are considering adding new functionality to libcudf that they think about these things. Instead of thinking "I'll implement this exactly for what Spark wants" or "I'll implement this for exactly what Pandas wants", we should be thinking "What can I provide that is generic enough to satisfy all of our users, even if it requires them to do more work".

In the case of explode and explode_outer it seems okay to have separate functions, but I hope I've impressed the importance of thinking through decisions like this.

@hyperbolic2346 hyperbolic2346 added the 3 - Ready for Review Ready for review by team label Mar 4, 2021
@hyperbolic2346
Copy link
Contributor Author

Do the explode* functions handle empty inputs?

Yes, this is one of the tests.

@hyperbolic2346
Copy link
Contributor Author

rerun tests

@ttnghia
Copy link
Contributor

ttnghia commented Mar 15, 2021

I feel like reshape group is not appropriate for explode (and even some other functions in that group), as you are generating something new from the input, not modifying the input's shape. So, instead of reshape group, it should be something like product group.

cpp/tests/reshape/explode_tests.cpp Outdated Show resolved Hide resolved
cpp/tests/reshape/explode_tests.cpp Outdated Show resolved Hide resolved
cpp/tests/reshape/explode_tests.cpp Outdated Show resolved Hide resolved
cpp/tests/reshape/explode_tests.cpp Outdated Show resolved Hide resolved
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner March 15, 2021 18:53
@github-actions github-actions bot added the CMake CMake build issue label Mar 15, 2021
@hyperbolic2346 hyperbolic2346 requested a review from ttnghia March 15, 2021 20:30
cpp/include/cudf/lists/explode.hpp Outdated Show resolved Hide resolved
cpp/tests/lists/explode_tests.cpp Show resolved Hide resolved
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner March 16, 2021 02:54
@github-actions github-actions bot added the conda label Mar 16, 2021
This was referenced Mar 16, 2021
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 16, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@hyperbolic2346
Copy link
Contributor Author

rerun tests

@rapids-bot rapids-bot bot merged commit 0146f74 into rapidsai:branch-0.19 Mar 17, 2021
@hyperbolic2346 hyperbolic2346 deleted the mwilson/explode_outer branch March 17, 2021 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for explode_outer
6 participants