Add explode_outer and explode_outer_position #7499
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7499 +/- ##
===============================================
+ Coverage 81.86% 82.39% +0.52%
===============================================
Files 101 101
Lines 16884 17350 +466
===============================================
+ Hits 13822 14295 +473
+ Misses 3062 3055 -7
Can we find a better name than "outer"? I understand it comes from Spark, but libcudf is not Spark. "Outer" is not descriptive. Perhaps instead of separate functions this should be some kind of It also feels like |
That said, if we can get the same result by dropping nulls after a more generic explode, I would really like to see the performance difference so we can make an informed decision.
I was in this boat for a while and even coded it as such with an enum. What I ended up with though was an enum coming in and a switch statement that had calls to different functions because the code was just different enough between the ways to result in either a hugely complex function or a different function for each operation. I could merge |
Sure, I understand "outer" is well-defined and commonplace for joins. I suspect the naming comes from the notion of a join as a But the meaning here seems quite a bit different. Granted, I know nothing about SQL or relational algebra. I was curious, and the only place I can find the term "outer" used in relational algebra is in the context of joins: https://en.wikipedia.org/wiki/Relational_algebra I just think we should use a name/descriptor that is more intuitive and descriptive.
I am fine with separate APIs, I am fine with a flag that is passed in to the API, I am also fine with calling |
I'm not a big fan of having to call To run To do I think I can roll the |
It is unfortunate, but it is sometimes unavoidable in order for libcudf to continue to serve diverse end users. For example, Pandas wants division by 0 to return 0 #7492 (comment). Pretty sure Spark doesn't want that behavior. So that requires doing extra work or implementing a new custom function into libcudf with that behavior. Not everyone wants the same behavior and we can't always provide a unique implementation that does exactly what you want with a minimal amount of work. libcudf would become a mess of 99% redundant duplicated functions that differ slightly in behavior and it would be a nightmare for binary size and library maintenance. My goal is when developers are considering adding new functionality to libcudf that they think about these things. Instead of thinking "I'll implement this exactly for what Spark wants" or "I'll implement this for exactly what Pandas wants", we should be thinking "What can I provide that is generic enough to satisfy all of our users, even if it requires them to do more work". In the case of |
Yes, this is one of the tests.
rerun tests |
I feel like |
@gpucibot merge |
rerun tests |
This PR adds explode_outer and explode_outer_position. They differ from explode and explode_position in how null and empty lists are handled. explode discards null and empty lists, so it can lift the child column directly out of the list column. explode_outer must find these null and empty lists and make room for a null entry in the child column for each of them, which means we need to gather both the table and the exploded column. We must also make an initial pass over the exploded column to count these entries, because the required size of the gather maps is not known until then. The null count alone is not enough, since empty lists need an entry too.
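To make the difference concrete, here is a minimal reference model in plain Python. It is only a sketch of the semantics described above, not the libcudf implementation or API: rows are `(key, list)` tuples, `None` stands in for a null list, and the function names `explode`, `explode_outer`, and `outer_size` are illustrative.

```python
from typing import List, Optional, Tuple

# Hypothetical row model: (key from another column, list value or None for a null list).
Row = Tuple[int, Optional[List[int]]]


def explode(rows: List[Row]) -> List[Tuple[int, int]]:
    """Drop null and empty lists; emit one output row per list element."""
    return [(key, elem) for key, lst in rows if lst for elem in lst]


def explode_outer(rows: List[Row]) -> List[Tuple[int, Optional[int]]]:
    """Like explode, but keep null and empty lists as a single row with a null element."""
    out: List[Tuple[int, Optional[int]]] = []
    for key, lst in rows:
        if not lst:  # None (null list) or [] (empty list)
            out.append((key, None))
        else:
            out.extend((key, elem) for elem in lst)
    return out


def outer_size(rows: List[Row]) -> int:
    """Counting pass: output size is the element count plus one per null or empty list."""
    return sum(len(lst) if lst else 1 for _, lst in rows)


rows: List[Row] = [(10, [1, 2]), (20, None), (30, []), (40, [3])]
print(explode(rows))        # [(10, 1), (10, 2), (40, 3)]
print(explode_outer(rows))  # [(10, 1), (10, 2), (20, None), (30, None), (40, 3)]
print(outer_size(rows))     # 5
```

In this example the input has one null list but two extra output rows, which is why the null count alone does not give the gather map size.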
If the input has no null or empty lists, the regular explode function is called because it is simpler. Detecting that case does cost a pass over the offsets looking for consecutive duplicates, which indicate null or empty lists.
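The offsets check can be sketched the same way. This assumes the usual list-column layout in which row `i` spans `offsets[i]` to `offsets[i + 1]`, and that null lists occupy zero elements, as the description implies. The helper name `has_null_or_empty` is hypothetical.

```python
from typing import Sequence


def has_null_or_empty(offsets: Sequence[int]) -> bool:
    """Return True if any two consecutive offsets are equal, i.e. some row has length zero."""
    return any(start == end for start, end in zip(offsets, offsets[1:]))


# Rows [[1, 2], null, [], [3]] would have offsets [0, 2, 2, 2, 3] under the assumption above.
print(has_null_or_empty([0, 2, 2, 2, 3]))  # True
# Rows [[1, 2], [3]] have strictly increasing offsets, so the regular explode path applies.
print(has_null_or_empty([0, 2, 3]))        # False
```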
closes #7466