How to handle an aggregation with functions of different steps? #4412
Comments
@rui-mo Rui, this query plan is surprising and, as you noticed, is not supported. Since partial aggregation may flush results when memory is low, the query may return incorrect results where some values of 'ws_order_number' are counted more than once. Presto uses a different query plan with a MarkDistinct plan node. I assume MarkDistinct acts like a "final" aggregation and doesn't support flushing, but may support spilling. Any chance you can share the full query plan from Spark? Also, I wonder if there is any documentation about how Spark runs such queries.
@mbasmanova Masha, thanks a lot for the answer. Below is the full Spark plan for the query:
(1) Scan parquet
Output [2]: [ws_order_number#747L, ws_ext_ship_cost#758]
Batched: true
Location: InMemoryFileIndex [file:/mnt/DP_disk2/tpcds_parquet_nopartition_with_date_with_decimal_1/web_sales]
ReadSchema: struct<ws_order_number:bigint,ws_ext_ship_cost:decimal(7,2)>
(2) ColumnarToRow [codegen id : 1]
Input [2]: [ws_order_number#747L, ws_ext_ship_cost#758]
(3) HashAggregate [codegen id : 1]
Input [2]: [ws_order_number#747L, ws_ext_ship_cost#758]
Keys [1]: [ws_order_number#747L]
Functions [1]: [partial_sum(UnscaledValue(ws_ext_ship_cost#758))]
Aggregate Attributes [1]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L]
Results [2]: [ws_order_number#747L, sum#863L]
(4) Exchange
Input [2]: [ws_order_number#747L, sum#863L]
Arguments: hashpartitioning(ws_order_number#747L, 288), ENSURE_REQUIREMENTS, [plan_id=112]
(5) ShuffleQueryStage
Output [2]: [ws_order_number#747L, sum#863L]
Arguments: 0
(6) AQEShuffleRead
Input [2]: [ws_order_number#747L, sum#863L]
Arguments: coalesced
(7) HashAggregate [codegen id : 2]
Input [2]: [ws_order_number#747L, sum#863L]
Keys [1]: [ws_order_number#747L]
Functions [1]: [merge_sum(UnscaledValue(ws_ext_ship_cost#758))]
Aggregate Attributes [1]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L]
Results [2]: [ws_order_number#747L, sum#863L]
(8) HashAggregate [codegen id : 2]
Input [2]: [ws_order_number#747L, sum#863L]
Keys: []
Functions [2]: [merge_sum(UnscaledValue(ws_ext_ship_cost#758)), partial_count(distinct ws_order_number#747L)]
Aggregate Attributes [2]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L, count(ws_order_number#747L)#852L]
Results [2]: [sum#863L, count#866L]
(9) Exchange
Input [2]: [sum#863L, count#866L]
Arguments: SinglePartition, ENSURE_REQUIREMENTS, [plan_id=139]
(10) ShuffleQueryStage
Output [2]: [sum#863L, count#866L]
Arguments: 1
(11) HashAggregate [codegen id : 3]
Input [2]: [sum#863L, count#866L]
Keys: []
Functions [2]: [sum(UnscaledValue(ws_ext_ship_cost#758)), count(distinct ws_order_number#747L)]
Aggregate Attributes [2]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L, count(ws_order_number#747L)#852L]
Results [2]: [cast(count(ws_order_number#747L)#852L as string) AS order count#858, cast(MakeDecimal(sum(UnscaledValue(ws_ext_ship_cost#758))#853L,17,2) as string) AS total shipping cost#859]
(12) HashAggregate
Input [2]: [ws_order_number#747L, ws_ext_ship_cost#758]
Keys [1]: [ws_order_number#747L]
Functions [1]: [partial_sum(UnscaledValue(ws_ext_ship_cost#758))]
Aggregate Attributes [1]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L]
Results [2]: [ws_order_number#747L, sum#863L]
(13) Exchange
Input [2]: [ws_order_number#747L, sum#863L]
Arguments: hashpartitioning(ws_order_number#747L, 288), ENSURE_REQUIREMENTS, [plan_id=89]
(14) HashAggregate
Input [2]: [ws_order_number#747L, sum#863L]
Keys [1]: [ws_order_number#747L]
Functions [1]: [merge_sum(UnscaledValue(ws_ext_ship_cost#758))]
Aggregate Attributes [1]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L]
Results [2]: [ws_order_number#747L, sum#863L]
(15) HashAggregate
Input [2]: [ws_order_number#747L, sum#863L]
Keys: []
Functions [2]: [merge_sum(UnscaledValue(ws_ext_ship_cost#758)), partial_count(distinct ws_order_number#747L)]
Aggregate Attributes [2]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L, count(ws_order_number#747L)#852L]
Results [2]: [sum#863L, count#866L]
(16) Exchange
Input [2]: [sum#863L, count#866L]
Arguments: SinglePartition, ENSURE_REQUIREMENTS, [plan_id=93]
(17) HashAggregate
Input [2]: [sum#863L, count#866L]
Keys: []
Functions [2]: [sum(UnscaledValue(ws_ext_ship_cost#758)), count(distinct ws_order_number#747L)]
Aggregate Attributes [2]: [sum(UnscaledValue(ws_ext_ship_cost#758))#853L, count(ws_order_number#747L)#852L]
Results [2]: [cast(count(ws_order_number#747L)#852L as string) AS order count#858, cast(MakeDecimal(sum(UnscaledValue(ws_ext_ship_cost#758))#853L,17,2) as string) AS total shipping cost#859]
(18) AdaptiveSparkPlan
Output [2]: [order count#858, total shipping cost#859]
Arguments: isFinalPlan=true
@rui-mo I found a description of this query plan in the "Aggregate with One Distinct" section of https://dataninjago.com/2022/01/04/spark-deep-dive-12-aggregation-strategy/. I'm still reading it.
@rui-mo Wei is currently working on enhancing aggregate function registration to automatically register a few companion functions for each aggregate function. Specifically, for an aggregate function xxx, we'll automatically generate and register the following companion functions: xxx_partial, xxx_merge, and xxx_extract.
Once we have that, the plan you describe above can be implemented using the sum_partial, sum_merge and sum_extract companion functions like so:
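A minimal sketch of that plan shape, assuming the companion functions are registered (column names are illustrative: c0 stands for ws_order_number, c1 for ws_ext_ship_cost):

auto plan = PlanBuilder()
    .values({rows})
    // Finish the group-by on c0; sum_partial keeps the sum in intermediate form.
    .singleAggregation({"c0"}, {"sum_partial(c1)"})
    // Global aggregation: merge the intermediate sums and count the grouping
    // keys, which are distinct at this point.
    .singleAggregation({}, {"sum_merge(a0)", "count(c0)"})
    // sum_extract is a scalar function that turns the intermediate sum into the
    // final value.
    .project({"sum_extract(a0) AS total_shipping_cost", "a1 AS order_count"})
    .planNode();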
CC: @kagamiori
@mbasmanova Thanks for your reply! It is very useful to us.
Based on this, do you think it's good for us to fix this issue in a similar way? If I understand correctly, that means changing the Spark plan by adding a partial_count before the merge stage. Then, the Velox plan would become something like below:

auto plan = PlanBuilder()
    .values({rows})
    .partialAggregation({"c0"}, {"sum(c1)", "count()"})
    .intermediateAggregation({"c0"}, {"sum(a0)", "count()"}, resultType)
    .intermediateAggregation({}, {"sum(a0)", "count(c0)"}, resultType)
    .planNode();
@kagamiori Wei, would it be possible to publish a draft PR of the companion aggregate functions? @rui-mo In the meantime, one option is to manually register the companion aggregate functions sum_partial, sum_merge and sum_extract, and then create a Velox plan like so:
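A sketch of such a plan, assuming sum_partial, sum_merge and sum_extract are registered; column names are illustrative, and localPartition stands in for the shuffles:

auto plan = PlanBuilder()
    .values({rows})
    // Group by c0 (ws_order_number); sum_partial keeps the sum in intermediate form.
    .partialAggregation({"c0"}, {"sum_partial(c1)"})
    .localPartition({"c0"})
    // Finish the group-by so that each distinct c0 appears exactly once.
    .finalAggregation()
    // Global aggregation: merge the intermediate sums and count the distinct keys.
    .singleAggregation({}, {"sum_merge(a0)", "count(c0)"})
    // Extract the final sum from the intermediate accumulator.
    .project({"sum_extract(a0) AS total_shipping_cost", "a1 AS order_count"})
    .planNode();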
Note that for the production use case you'll need to create a distributed plan with shuffles in place of "localPartition". Hope this helps.
@mbasmanova Thank you. I see. Basically, we need to change the plan instead of just offloading the Spark plan as-is to Velox. We'll try that. Thanks again!
@rui-mo Rui, just to clarify, my reading of the "Aggregate with One Distinct" section in https://dataninjago.com/2022/01/04/spark-deep-dive-12-aggregation-strategy/ suggests that Spark's plan contains 6 nodes:
(1) HashAggregate: group by ws_order_number, partial_sum(ws_ext_ship_cost)
(2) Exchange: hash-partition on ws_order_number
(3) HashAggregate: group by ws_order_number, merge_sum(ws_ext_ship_cost)
(4) HashAggregate: no grouping keys, merge_sum(ws_ext_ship_cost), partial_count(distinct ws_order_number)
(5) Exchange: single partition
(6) HashAggregate: no grouping keys, final_sum(ws_ext_ship_cost), final_count(distinct ws_order_number)
There are 2 shuffle nodes which break this plan into 3 fragments. Fragment 1: node (1), the partial group-by aggregation.
Fragment 2: nodes (3) and (4), the merge group-by aggregation followed by the global aggregation that mixes merge_sum with partial_count.
Fragment 3: node (6), the final global aggregation.
Each fragment is translated to a Velox plan and executed separately. The tricky part here is that the partial_sum, merge_sum and finalmerge_sum aggregate functions do not map to a single sum aggregate function in Velox. These can be derived from the implementation of "sum", but still need to be defined and registered as separate aggregate functions. Wei's work will make it easier to define and register these so-called companion functions, but we were thinking of defining them slightly differently. In our design partial_sum and merge_sum are the same, except that we planned to put 'partial' and 'merge' in the suffix rather than the prefix. We didn't plan to provide a finalmerge_sum function though. Instead, we offer a sum_extract scalar function that can be combined with sum_merge to implement "finalmerge_sum". Doing so requires translating Fragment 3 into a 2-node plan: an aggregation that computes sum_merge and count, followed by a projection that applies sum_extract.
I assume that by "we need to change the plan" you refer to the addition of the "project" node. Let me know if that's not the case. An alternative option is for Gluten to define Spark-compatible companion functions: partial_xxx, merge_xxx and finalmerge_xxx. This way the translation from the Spark plan to the Velox plan might be simpler. CC: @kagamiori
@kagamiori Wei, let's discuss whether it makes sense for us to add a "finalmerge" companion aggregate function.
@mbasmanova Masha, let me clarify a bit here.
This is how we translate the Spark plan to a Velox plan in this case. It's a one-to-one mapping from the Spark plan to the Velox plan:
Fragment 1: converts to a Velox partial aggregation (group by ws_order_number, sum).
Fragment 2: converts to a Velox intermediate aggregation (group by ws_order_number, sum), followed by the global aggregation that mixes merge_sum with partial_count.
Fragment 3: converts to a Velox final aggregation (sum, count).
That means the problem occurs in Fragment 2, i.e. HashAggregate (8) in the Spark plan above, where merge_sum and partial_count require different Velox steps but sit in the same aggregation node.
Based on the above discussion, I believe we need to change the plan to make it work in Velox. The purpose is to avoid putting partial_count and merge_sum together in a single aggregation node, for example by adding a partial count to Fragment 1. Does that make sense?
Also, I have a question here.
I believe merge_sum relies on the intermediate result as input, but partial_sum starts from zero. Why are they treated the same?
@rui-mo It sounds to me that the translation from the Spark plan to the Velox plan is incorrect. How do you derive the aggregation step during translation? I expect the aggregation steps partial, final, partial, final for that query plan.
I meant that in our design sum_partial matches Spark's partial_sum and sum_merge matches Spark's merge_sum.
@mbasmanova Masha, here are some details about Spark aggregation phases. We use the mapping relationship below:
Spark Partial => Velox AggregationNode::Step::kPartial
Spark PartialMerge => Velox AggregationNode::Step::kIntermediate
Spark Final => Velox AggregationNode::Step::kFinal
In Spark, merge_sum belongs to PartialMerge and partial_count belongs to Partial, so we are confused here.
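For illustration, a hypothetical helper along these lines could express that mapping; the helper itself and the Complete entry are assumptions based on this discussion, not Gluten's actual code:

#include <stdexcept>
#include <string>

#include "velox/core/PlanNode.h"

using facebook::velox::core::AggregationNode;

// Maps a Spark AggregateMode name to a Velox aggregation step.
AggregationNode::Step toVeloxStep(const std::string& sparkMode) {
  if (sparkMode == "Partial") {
    return AggregationNode::Step::kPartial;
  }
  if (sparkMode == "PartialMerge") {
    return AggregationNode::Step::kIntermediate;
  }
  if (sparkMode == "Final") {
    return AggregationNode::Step::kFinal;
  }
  if (sparkMode == "Complete") {
    // Assumed: Spark's Complete mode maps to a single-step Velox aggregation.
    return AggregationNode::Step::kSingle;
  }
  throw std::invalid_argument("Unsupported Spark aggregate mode: " + sparkMode);
}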
@rui-mo Thank you for the pointer. I see that in Spark, AggregateMode is attached to an aggregate expression, not to the Aggregate plan node. Is this the case? Do you have a pointer to the definition of the aggregation plan node? Are you seeing a single aggregation plan node with 2 aggregate expressions that use different AggregateModes?
@mbasmanova Appreciate your time on this issue.
Yes, that's the case. Spark's aggregate modes can differ between aggregate expressions, though that is not a common case.
Yes, the case described above is exactly like that, and so far we have only noticed this issue in this case. In that plan, merge_sum means PartialMerge mode in Spark, and partial_count means Partial mode. Here is the link showing how Spark translates a mode to its string representation: interfaces.scala#L161-L166.
This is the class in Spark for hash aggregation: HashAggregateExec.scala#L47-L56. The mode can be accessed from each aggregate expression.
@rui-mo Rui, thank you for the pointers. This is a bit tricky. I see a couple of issues: (1) Velox doesn't support mixed aggregations;
We should add support for (2), but I'm not sure about (1). In Velox, partial and intermediate aggregations differ from final aggregation in that they do not spill, but rather flush results when memory is tight. This means that it is possible for the results of a partial or intermediate aggregation to include rows with duplicate grouping keys. The Spark plan we are considering here assumes that the results of the intermediate aggregation do not have duplicate grouping keys. It might be possible to introduce a configuration flag to disable flushing in partial and intermediate aggregations and enable spilling instead, but this would be quite a complex change and it is not clear to me whether it makes sense. CC: @xiaoxmeng
I now understand what you meant here. Do you think it would be possible to do that in Gluten?
@mbasmanova Thanks for the comments. They're very useful to us.
We thought the distinct count was not a problem because Spark plans the distinct column as a grouping key in the previous aggregations. It was assumed that the input for this aggregation was already distinct, so a normal count could do the work. But you remind me of another thing:
This means the output of Velox's partial or intermediate aggregation may not be distinct, so a normal count would not work. Yes, disabling flush is one way to ensure the input for count is already distinct.
Yes, we can try that if Velox decides not to support mixed aggregations. In that case, maybe a change to the Spark plan like the one below could work. The original Spark plan:
The changed Spark plan:
@rui-mo Something like this should work. Just make sure to use merge_sum, not final_sum, in the 2nd and 3rd aggregations. For 'sum' it doesn't really matter, but if you replace 'sum' with 'avg' you'll notice the difference.
@mbasmanova You're right, avg is different. If we use merge_avg in the 2nd and 3rd aggregations, this issue comes back. We need to think further here.
Somehow, I think this should work. By 'merge_avg' I refer to a companion function of 'avg', not the original 'avg'.
@mbasmanova Understood. We will try to support the case described in this issue soon, and keep you updated. Thank you so much!
@rui-mo Thank you, Rui. Let us know how it goes and feel free to reach out in case you run into any issues.
Hi @rui-mo, we're designing an adapter to automatically make companion functions like sum_partial, sum_merge and sum_extract available for existing aggregate functions.
@mbasmanova @kagamiori I tried #4489 in a unit test, and found it could work for this case by using the companion functions. The unit test is:

TEST_F(AverageAggregationTest, companion) {
  auto rows = makeRowVector(
      {makeFlatVector<int32_t>(100, [&](auto row) { return row % 10; }),
       makeFlatVector<int32_t>(100, [&](auto row) { return row * 2; }),
       makeFlatVector<int32_t>(100, [&](auto row) { return row; })});
  createDuckDbTable("t", {rows});
  std::vector<TypePtr> resultType = {BIGINT(), ROW({DOUBLE(), BIGINT()})};
  auto plan =
      PlanBuilder()
          .values({rows})
          .partialAggregation({"c0"}, {"avg(c1)", "sum(c2)"})
          .intermediateAggregation(
              {"c0"}, {"avg(a0)", "sum(a1)"}, {ROW({DOUBLE(), BIGINT()}), BIGINT()})
          .aggregation(
              {},
              {"avg_merge(a0)", "sum_merge(a1)", "count(c0)"},
              {},
              core::AggregationNode::Step::kPartial,
              false,
              {ROW({DOUBLE(), BIGINT()}), BIGINT(), BIGINT()})
          .finalAggregation(
              {}, {"avg(a0)", "sum(a1)", "count(a2)"}, {DOUBLE(), BIGINT(), BIGINT()})
          .planNode();
  assertQuery(plan, "SELECT avg(c1), sum(c2), count(distinct c0) from t");
}

The Velox plan node is:
Thank you for all your support!
@rui-mo Thank you for looking into this. I think the Velox plan should look like this:
First, group-by on c0 and compute avg_partial and sum_partial. Then, global agg with avg_merge, sum_merge and count. Finally, a project using avg_extract and sum_extract.
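A sketch of that plan, using the companion-function names from this thread and the column layout of the unit test above (c0 is the distinct key; c1 and c2 feed avg and sum):

auto plan = PlanBuilder()
    .values({rows})
    // Group by c0; the *_partial companions keep the avg and sum accumulators
    // in intermediate form.
    .partialAggregation({"c0"}, {"avg_partial(c1)", "sum_partial(c2)"})
    // Finish the group-by so each distinct c0 appears exactly once, even if the
    // partial step flushed.
    .finalAggregation()
    // Global aggregation: merge the intermediate accumulators and count the
    // now-distinct grouping keys.
    .partialAggregation({}, {"avg_merge(a0)", "sum_merge(a1)", "count(c0)"})
    .finalAggregation()
    // Extract the final values from the intermediate accumulators.
    .project(
        {"avg_extract(a0) AS avg_c1",
         "sum_extract(a1) AS sum_c2",
         "a2 AS distinct_c0"})
    .planNode();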
Hi Masha @mbasmanova, why do we need the second finalAggregation after the first partialAggregation? Because for
With the above plan, because
Appreciate your help!
@rui-mo In this query, we first group by c0, then perform a global aggregation. Strictly speaking, it is incorrect to compute an intermediate aggregation for group-by-c0 and use its results to compute a different global aggregation. We need to finish group-by-c0 by running a final aggregation. To make sure that the final aggregation returns "intermediate" results, we use the avg_partial and sum_partial companion functions instead of avg and sum. This way, we do not need to disable flushing in partial or intermediate aggregations. (Disabling flushing is tricky because doing so would require enabling spilling; otherwise, some aggregations may run out of memory.)
@rui-mo In the above plan node, we should replace avg and sum with the avg_partial and sum_partial companion functions.
@mbasmanova Thank you for your reply! I finally understand your suggestion. A quick update: we implemented the related code in Gluten to use companion functions to solve this issue. It worked for us, but was based on the draft PR #4489.
@rui-mo Rui, thank you for the update. Happy to hear you are not blocked and making nice progress. Let us know if you'd like any further help.
@mbasmanova Thank you!
Description
Hi,
We noticed this issue when running TPC-DS q95 with Spark. For a query like the one below, Spark plans several aggregations to get the final result.
select count(distinct ws_order_number), sum(ws_ext_ship_cost), sum(ws_net_profit) from web_sales
Spark planned aggregations are:
(1) group by ws_order_number, partial_sum(ws_ext_ship_cost), partial_sum(ws_net_profit)
(2) group by ws_order_number, merge_sum(ws_ext_ship_cost), merge_sum(ws_net_profit)
(3) merge_sum(ws_ext_ship_cost), merge_sum(ws_net_profit), partial_count(ws_order_number)
(4) final_sum(ws_ext_ship_cost), final_sum(ws_net_profit), final_count(ws_order_number)
The problem occurs in the third part, because merge_sum and partial_count, which belong to different phases, are put together in one aggregation, while Velox currently does not support this behavior. If we ignore that and just use the merge phase, we get an incorrect result because the intermediate input for merge_count is not correct.
This issue can be reproduced with a unit test like the one below:
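A sketch of such a repro (fixture and column names are illustrative; the plan mirrors the four Spark steps above, and the assertion fails because the distinct count comes out wrong):

TEST_F(AggregationTest, mixedPhaseAggregation) {
  // c0 plays the role of ws_order_number, c1 the role of ws_ext_ship_cost.
  auto rows = makeRowVector(
      {makeFlatVector<int64_t>(100, [](auto row) { return row % 10; }),
       makeFlatVector<int64_t>(100, [](auto row) { return row; })});
  createDuckDbTable("t", {rows});

  auto plan =
      PlanBuilder()
          .values({rows})
          // (1) and (2): partial and merge group-by on c0 with sum.
          .partialAggregation({"c0"}, {"sum(c1)"})
          .intermediateAggregation()
          // (3): Spark wants merge_sum plus partial_count(distinct c0) here. A Velox
          // aggregation node has a single step, so using the intermediate step for
          // both makes count treat the raw c0 values as partial counts.
          .intermediateAggregation({}, {"sum(a0)", "count(c0)"}, {BIGINT(), BIGINT()})
          // (4): final global aggregation.
          .finalAggregation({}, {"sum(a0)", "count(a1)"}, {BIGINT(), BIGINT()})
          .planNode();

  // Returns an incorrect distinct count, reproducing the issue.
  assertQuery(plan, "SELECT sum(c1), count(distinct c0) FROM t");
}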
@mbasmanova @oerling Do you have any advice for us? Thanks for your help.