JNI Aggregation Type Changes #8919
Conversation
LGTM
Overall looks good to me. Is there utility in having an interface that all aggregations derive from that allows applications to track aggregations regardless of type? The use of unconstrained generics in the corresponding RAPIDS Accelerator change made me wonder if there's any utility in applications being able to tell something is an aggregation even though there are no common, public methods across aggregations. I suppose we can add this later if it's needed.
We do not have that in the Spark code right now. Each aggregation is tracked separately. Some of this comes from how distributed group by aggregations work in two phases, which requires a mapping from one Spark aggregation to at least two separate cudf aggregations. But there are further issues complicated by cudf. Semantically, a single aggregation like min or max does not behave the same way from one API to another. A lot of this ends up being corner cases for null handling and the like. They are all things that can be worked around, but it makes it difficult to have a single mapping or API to say that spark.Max maps to cudf.Max. Also, because some aggregations are only available under some aggregation APIs, we still need a way to know which are and are not supported. Generally it is simplest and more flexible in this case to just have separate APIs, which is what we ended up with for Spark. I'm also not sure how you would use this. The only thing I can think of is giving the user the option of runtime checking for aggregations instead of compile-time checking. I can do that, but I want to make sure that I am designing the APIs for the desired use case instead of adding in flexibility without knowing why.
The use case I had in mind was primarily to allow constraining the generic in the RAPIDS Accelerator code so you couldn't just pass anything to it at compile time - it needs to be some kind of aggregation. It's a very minor point, and I'm fine with it as-is.
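The constrained-generic idea discussed above could look something like the sketch below. This is purely illustrative: `Aggregation`, `RollingAggregation`, and `AggregationTracker` are hypothetical names for this example, not part of the actual cudf Java API.

```java
// Hypothetical marker interface; the real cudf Java bindings do not expose one.
interface Aggregation {}

// A concrete aggregation type implements the marker.
final class RollingAggregation implements Aggregation {
    private final String name;
    RollingAggregation(String name) { this.name = name; }
    @Override public String toString() { return "rolling:" + name; }
}

// With the marker, application code can constrain its generic so only
// aggregation types compile, while still treating them uniformly.
final class AggregationTracker<T extends Aggregation> {
    private final java.util.List<T> tracked = new java.util.ArrayList<>();
    void track(T agg) { tracked.add(agg); }
    int count() { return tracked.size(); }
}

public class Demo {
    static int trackTwo() {
        AggregationTracker<RollingAggregation> t = new AggregationTracker<>();
        t.track(new RollingAggregation("max"));
        t.track(new RollingAggregation("min"));
        // t.track("not an aggregation"); // would be a compile error
        return t.count();
    }
    public static void main(String[] args) {
        System.out.println(trackTwo());
    }
}
```

The trade-off is exactly the one described above: the marker interface has no common, public methods, so its only value is the compile-time constraint on `T`.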
Codecov Report
@@            Coverage Diff             @@
##       branch-21.10    #8919     +/-  ##
================================================
- Coverage       10.67%   10.59%   -0.09%
================================================
  Files             110      116       +6
  Lines           18271    19043     +772
================================================
+ Hits             1951     2017      +66
- Misses          16320    17026     +706
================================================
Continue to review full report at Codecov.
@gpucibot merge
CUDF is in the process of tagging aggregations with different classes to make it a compile error to use the wrong aggregation with the wrong API. The first set of changes was for rolling windows, and when I did the corresponding Java changes I ended up using generics and interfaces to try to replicate the same thing. This didn't work out as well as I had hoped, and the code to use them ended up being more cumbersome than I wanted.
This patch adjusts it so we have truly separate classes for each type of aggregation. It adds more code here, but it makes the code that uses these cleaner.
This is a breaking change and I will be putting up a corresponding change in the Spark plugin to deal with this.
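The "truly separate classes" approach described above might be sketched as follows. The class and factory names here are illustrative assumptions for this example, not the actual cudf Java bindings: each aggregation API gets its own class with its own factory methods, so passing an aggregation to the wrong API becomes a compile-time error.

```java
// Illustrative sketch of one class per aggregation API; names are
// hypothetical, not the actual cudf Java API.
final class RollingAggregation {
    private final String kind;
    private RollingAggregation(String kind) { this.kind = kind; }
    static RollingAggregation max() { return new RollingAggregation("max"); }
    static RollingAggregation min() { return new RollingAggregation("min"); }
    String kind() { return kind; }
}

final class GroupByAggregation {
    private final String kind;
    private GroupByAggregation(String kind) { this.kind = kind; }
    static GroupByAggregation max() { return new GroupByAggregation("max"); }
    // Some aggregations exist only for one API, so they simply have no
    // factory on the other class.
    static GroupByAggregation collectList() { return new GroupByAggregation("collect_list"); }
    String kind() { return kind; }
}

public class Demo {
    // An API that only accepts rolling aggregations; passing a
    // GroupByAggregation here would not compile.
    static String rollingWindow(RollingAggregation agg) {
        return "rolling(" + agg.kind() + ")";
    }
    public static void main(String[] args) {
        System.out.println(rollingWindow(RollingAggregation.max()));
        // rollingWindow(GroupByAggregation.collectList()); // compile error
    }
}
```

This duplicates factory methods across classes, which is the extra code the description mentions, but callers no longer need generics or casts to stay type-safe.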