Implement groupby::merge for collect_list and collect_set #8407
Conversation
/**
 * @brief Indicates whether the specified aggregation operation can be computed
 * with a hash-based implementation.
 *
 * @param t The aggregation operation to verify
 * @return true `t` is valid for a hash based groupby
 * @return false `t` is invalid for a hash based groupby
 */
bool is_hash_aggregation(aggregation::Kind t) { return array_contains(hash_aggregations, t); }
Separating the definition makes this no longer constexpr. If you want it available in the header, move its definition there as well and keep it constexpr.
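A minimal sketch of the suggested fix, assuming array_contains and hash_aggregations are themselves constexpr and visible in the header:

// Keeping the full definition in the header lets the compiler evaluate it
// at compile time (assumes `hash_aggregations` and `array_contains` are
// constexpr and declared in this same header).
constexpr bool is_hash_aggregation(aggregation::Kind t)
{
  return array_contains(hash_aggregations, t);
}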
The implementation looks good, but I have some concerns about the interface: how can we perform this merge op along with other aggregation ops (like Sum/Count/Max)? In the JNI wrapper of […] In addition, if this merge API is specialized for spark-rapids, the concatenation of multiple keys and values looks unnecessary, because spark-rapids will concatenate all partial results (also with […]).
I totally agree; this is why I wanted just a concat aggregation, not some new merge-op API. Just a concat that is a regular aggregation, like collect list or collect set.
Initially I did that. But then I found out that the new API will need to operate on (merge) multiple pairs of keys and values (which are the grouped keys and lists columns resulting from the previous […]). Do you guys have any suggestion for this issue (mismatching numbers of input rows of the […])?
To merge the outputs, we concat them into a single table and then call another aggregation on them. This is actually very simplified, because it involves multiple machines shuffling the data around. We don't need a new merge API, just concat_list and concat_set aggregations. You don't need to worry about how the data gets chopped up and redistributed; you just need to worry about adding in the desired aggregations, and we will handle the rest. We already do.
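To illustrate the flow described above, here is a rough sketch in libcudf terms. It is hypothetical: the factory name make_merge_lists_aggregation and the exact request layout are assumptions drawn from this discussion, not a confirmed API.

#include <cudf/aggregation.hpp>
#include <cudf/concatenate.hpp>
#include <cudf/groupby.hpp>

#include <memory>
#include <vector>

// Hypothetical sketch: each partition has already produced a partial
// (keys, lists) result via collect_list; we concatenate those partials
// and run an (assumed) merge aggregation to combine the lists per key.
std::unique_ptr<cudf::column> merge_partials(
  std::vector<cudf::table_view> const& partials)
{
  // 1. Concatenate all partial (keys, lists) tables into one table.
  auto combined = cudf::concatenate(partials);

  // 2. Group by the key column of the combined table.
  cudf::groupby::groupby gb(cudf::table_view({combined->view().column(0)}));

  // 3. Request the merge aggregation on the lists column
  //    (factory name assumed from this discussion).
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = combined->view().column(1);
  requests[0].aggregations.push_back(cudf::make_merge_lists_aggregation());

  auto result = gb.aggregate(requests);
  return std::move(result.second[0].results[0]);
}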
It's easier to think of this as its own, standalone type of aggregation, i.e.: given a table, aggregate the list columns associated with a particular key by concatenating the list values together into a new list (and removing duplicates if it's a collect_set merge). As @sperlingxx said, […] Therefore there's no need to consider multiple tables or multiple […].
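For example (hypothetical data): given keys [a, a, b] with list values [[1, 2], [2, 3], [4]], a collect_list merge produces a → [1, 2, 2, 3] and b → [4], while a collect_set merge additionally drops duplicates, producing a → [1, 2, 3] and b → [4].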
Got it. Thanks all, I'm starting a new PR according to your suggestions.
Groupby aggregations can be performed for distributed computing by the following approach:

1. Run a partial groupby aggregation (e.g. collect_list or collect_set) on each partition.
2. Shuffle and concatenate the partial results into a single keys-and-lists table.
3. Run a merge aggregation on the concatenated table to combine the partial results for each key.
This PR supports merging operations for collect_list and collect_set.

Closes #7839.
PS: This is a WIP and not ready for review.