[FEA] Support MERGE_LISTS and MERGE_SETS for groupby::aggregate #7839
Comments
I would like to pick up this task if no experienced C++ contributors are free to work on this issue.
Can't you just concatenate the partial results and do another collect_list or collect_set? Even if this needs a new feature, this isn't a groupby or an aggregation operation. This is like a "concat by key" or something.
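A minimal plain-Python sketch of the "concat by key" idea suggested above (this is not the cudf API; `collect_list` and `concat_by_key` are toy stand-in names): partial collect results computed per batch are merged by concatenating the lists that share a key.

```python
from collections import defaultdict

def collect_list(rows):
    """Group (key, value) rows into {key: [values]} -- a toy collect_list."""
    out = defaultdict(list)
    for key, value in rows:
        out[key].append(value)
    return dict(out)

def concat_by_key(*partials):
    """Merge partial results by concatenating the lists that share a key."""
    merged = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return dict(merged)

# Partial results computed independently on two batches.
batch_a = collect_list([("x", 1), ("y", 2), ("x", 3)])  # {'x': [1, 3], 'y': [2]}
batch_b = collect_list([("x", 4), ("z", 5)])            # {'x': [4], 'z': [5]}

print(concat_by_key(batch_a, batch_b))  # {'x': [1, 3, 4], 'y': [2], 'z': [5]}
```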
I think if we "concatenate the partial results and do another collect_list or collect_set", then we will get a result like: And I am not sure whether these concatenation ops should be regarded as groupby operations or not. But it would be nice if the concatenation ops could be used with table::group_by.
So we will have something called
@kkraus14 Is there any API in Pandas that is similar to this? I'm trying to steer the proposed API to satisfy any potential need.
No, there isn't a similar API. In general, Pandas doesn't have many built-in APIs for list handling.
There are a couple of changes since my last plan. The last plan was to execute the groupby operations (including
So, we will not have
PS: The above is my proposed idea, which is WIP. Things may change along the way until I have a final, working implementation.
I've no issue with the first point. AFAIK,
Why? For instance, there is no groupby aggregation to merge the intermediate results from any other groupby operation. The typical pattern is to concat intermediate results and do another aggregation. That would work fine in this case if you explode the intermediate results from the
That is sort of correct. In isolation this is 100% correct, but you have to look at how the group by API is used. The group by APIs have no guarantee on the order of the output rows. Also, columns that are not a part of a given aggregation cannot be preserved in the output of the aggregation; that is inherent in how group by aggregations work. As such, multiple aggregations are grouped together, not just for efficiency but so we can do them correctly. If we do what you ask, it does not just mean we have to explode the key columns along with the list/set operation, which is already memory- and performance-inefficient; for each list/set operation that we do, we would also have to do a join between the result of that list/set operation and the rest of the operations that were done. If all we care about is getting an answer, we can do it. But if you want something to actually be fast, we need a better solution.
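A toy sketch of the point above (plain Python, not the cudf API; `merge_grouped` and the tuple layout are invented for illustration): when several aggregations over the same keys are merged in one keyed pass, every output stays row-aligned per key, so no per-operation explode-and-join is needed.

```python
from collections import defaultdict

# Partial per-batch results carrying TWO aggregations over the same keys:
# a collected list and a count, stored together per key.
batch_a = {"x": ([1, 3], 2), "y": ([2], 1)}
batch_b = {"x": ([4], 1), "z": ([5], 1)}

def merge_grouped(*partials):
    """One keyed pass merges every aggregation at once, so all output
    columns stay aligned per key without a join per list/set operation."""
    lists = defaultdict(list)
    counts = defaultdict(int)
    for partial in partials:
        for key, (values, count) in partial.items():
            lists[key].extend(values)
            counts[key] += count
    return {key: (lists[key], counts[key]) for key in lists}

print(merge_grouped(batch_a, batch_b))
# {'x': ([1, 3, 4], 3), 'y': ([2], 1), 'z': ([5], 1)}
```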
Perhaps the name is just not great. What we really want is a
I see. I wasn't thinking of merging multiple distributed aggregations and wanting the order to be consistent. That makes more sense.
Adding a
Hmm, I still feel
There is no notion of "intermediate results" in libcudf, as libcudf is a single-GPU library. As such,
Why does
It shouldn't. I would propose a
Intuitively, the proposed (groupby) API here operates on values tables that are pairs of keys-values columns/tables, which are the results of previous groupby aggregations. You may find it a little bit confusing here, as typically one groupby operation has only one shared keys table for all the requests. In my draft implementation, the values attached to each request are a table of keys-values pairs output from previous aggregations.
Groupby aggregations can be performed for distributed computing by the following approach:
* Divide the dataset into batches
* Run separate (distributed) aggregations over those batches on the distributed nodes
* Merge the results of the step above into one final result by calling `groupby::aggregate` a final time on the master node

This PR supports merging operations for the lists resulting from the distributed aggregations `collect_list` and `collect_set`. Closes #7839.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Mark Harris (https://github.com/harrism)

URL: #8436
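The three steps above can be sketched end to end in plain Python (a toy model, not the cudf implementation; `local_collect` and `merge_lists` are illustrative names): divide the rows into batches, collect per batch, then merge the partial lists per key.

```python
from collections import defaultdict

def local_collect(rows):
    """Phase one: collect values into a list per key within one batch."""
    out = defaultdict(list)
    for key, value in rows:
        out[key].append(value)
    return dict(out)

def merge_lists(partials):
    """Phase two: merge the per-batch partial lists key by key."""
    merged = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return dict(merged)

dataset = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
# Step 1: divide the dataset into batches.
batches = [dataset[:2], dataset[2:4], dataset[4:]]
# Step 2: aggregate each batch separately (as if on distributed nodes).
partials = [local_collect(batch) for batch in batches]
# Step 3: merge into one final result (as if on the master node).
print(merge_lists(partials))  # {'a': [1, 3, 5], 'b': [2, 4]}
```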
Is your feature request related to a problem? Please describe.
To perform collect aggregations in Spark (like other aggregation functions), it needs to go through two phases. In phase one, we perform the collect aggregation within each node locally, which can be implemented by cuDF group_by with the AggOp collect_list or collect_set. In phase two, we need to merge the partial aggregation results from each node. In terms of collect aggregations, we concatenate multiple lists/sets into one. For instance, I think we need two additional AggOps (concatenate_list and concatenate_set) for the current feature.