Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

pgandhi999 · 2024-08-20T22:24:09Z

Description

This optimization removes the redundant Aggregation operator introduced by Distinct clause when the query contains Grouped Aggregation operator.

For context, we noticed a query failing due to OOM error where the query plan contained a Grouped Aggregation Operator followed by the Distinct Aggregation Operator. The Distinct Aggregation Operator is redundant for Grouping Aggregations. After removing the DISTINCT clause, the query succeeded. Adding an Optimizer Rule for the same.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve performance for queries with a redundant `DISTINCT` over already-distinct `GROUP BY` aggregation ({issue}`23087`)

pgandhi999 · 2024-08-20T22:30:26Z

@pettyjamesm

...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java

.../io/trino/sql/planner/iterative/rule/TestRemoveDistinctAggregationForGroupedAggregation.java

...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java

.../io/trino/sql/planner/iterative/rule/TestRemoveDistinctAggregationForGroupedAggregation.java

...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java

...main/src/main/java/io/trino/sql/planner/iterative/rule/IsDistinctOverGroupingKeysResult.java

...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java

pettyjamesm · 2024-09-06T20:39:22Z

Latest iteration looks good to me, @martint - can you take another look and @pgandhi999 please resolve the review comments that have already been addressed?

martint · 2024-09-09T19:08:44Z

...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java

+            // between the distinct aggregation and the child aggregation nodes so long as all child aggregation keys
+            // remain present (without transformation by the project) in the distinct aggregation grouping keys
+            case ProjectNode projectNode -> {
+                List<Symbol> childGroupingKeys = isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys);


This is still not quite correct. It's technically not legal to compare the grouping keys from the leaf aggregation with the grouping keys passed from above without performing the reverse translations defined by this projection, as the meaning of symbols is local to each pair of operations and it may have been changed by the projection.

It may not locally be correct to perform the check: groupingKeys.containsAll(aggregationNode.getGroupingSets().getGroupingKeys()) in isolation, since we don't know at that point whether the grouping key symbols from the parent and child aggregations have the same meaning. However, when we return the leaf aggregation symbols back up the call stack, we verfiy that any ProjectNodes encountered only perform identity projections for all leaf aggregation symbols- at which point we can say that they do have the same meaning by the time we reach the parent aggregation again and the transform should be considered to be valid.

My concern is that it's too brittle and it relies on non-local assumptions. The call to isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys) is incorrect by construction, and whether childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity) is sufficient to compensate depends on knowledge that this specific callsite doesn't have. For example, adding support of another type of node might break these assumptions.

Also, in the call to childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity), it's incorrect to perform a lookup in the project node assignments using the symbols from the input scope as the key. The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.

Separately, the check is too specific. For example, it won't be able to catch plans like these ones:

- Aggregate [k] - Project k = x - Aggregate [x]

My concern is that it's too brittle and it relies on non-local assumptions

That’s fair, although I would point out that the assumptions are self contained to the single static private method, so they’re reasonably localized.

For example, adding support of another type of node might break these assumptions.

Agreed. Maybe we can add a comment to clarify the specific assumptions being made and warn against expanding the rule to accept new intervening node types?

The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.

It’s hard for me to guess what that might look like, but keys of interest by definition straddle the input and output scope, and the check for identity transforms guarantees that they are the same. (This is related to the next point, more on that there).

Separately, the check is too specific. For example, it won't be able to catch plans like these ones:

- Aggregate [k] - Project k = x - Aggregate [x]

It won’t detect that, true- but it’s a missed optimization opportunity at that point and not a correctness bug. I’m not sure whether there are any plan optimizers today that would create new symbols for an identity assignment, but if there are then I guess we should plan for that? It’s definitely going to be more expensive to find any identity mappings from all input symbols and consider the intersection of those with the parent grouping symbols, but should be doable at that higher cost. Is that the recommendation here?

Ok, fine. Let's add that documentation and get it in as is. We can fix those issue later when they come up. Also, please add a comment explaining that the lookups are not "proper" and that we're taking a shortcut.

Thank you for all the comments. I have rewritten the method to include translating the grouping set symbols for project nodes.

pettyjamesm · 2024-09-10T23:04:00Z

Merged, thanks @pgandhi999!

sopel39 · 2024-10-14T16:14:46Z

This should use io.trino.sql.planner.optimizations.ActualProperties#localProperties instead of implementing derivations itself as part of RemoveRedundantDistinctAggregation. It would be more generic then too because it would work for aggregations that are distinct within stream, but not necessarily globally.

cla-bot bot added the cla-signed label Aug 20, 2024

pettyjamesm reviewed Aug 21, 2024

View reviewed changes

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch 2 times, most recently from edf1bc0 to f6897e5 Compare August 22, 2024 21:23

pettyjamesm requested a review from martint August 22, 2024 21:32

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from f6897e5 to 70edd2d Compare August 23, 2024 17:38

martint requested changes Aug 29, 2024

View reviewed changes

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 70edd2d to 214b689 Compare August 30, 2024 18:42

martint reviewed Aug 30, 2024

View reviewed changes

...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java Outdated Show resolved Hide resolved

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 214b689 to 5610889 Compare August 30, 2024 19:18

pettyjamesm reviewed Sep 3, 2024

View reviewed changes

...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java Outdated Show resolved Hide resolved

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 5610889 to b36d707 Compare September 6, 2024 20:11

pettyjamesm reviewed Sep 6, 2024

View reviewed changes

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from b36d707 to e2fc4f5 Compare September 6, 2024 20:35

pettyjamesm requested a review from martint September 6, 2024 20:40

martint reviewed Sep 9, 2024

View reviewed changes

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from e2fc4f5 to 30cb814 Compare September 10, 2024 18:42

martint approved these changes Sep 10, 2024

View reviewed changes

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 30cb814 to 3688c0a Compare September 10, 2024 20:18

Remove redundant distinct aggregations

81bc98b

pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 3688c0a to 81bc98b Compare September 10, 2024 20:21

pettyjamesm merged commit 49f57ef into trinodb:master Sep 10, 2024
95 checks passed

github-actions bot added this to the 458 milestone Sep 10, 2024

colebow mentioned this pull request Sep 11, 2024

Add Trino 458 release notes #23367

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

pgandhi999 commented Aug 20, 2024 •

edited by pettyjamesm

Loading

pgandhi999 commented Aug 20, 2024

pettyjamesm commented Sep 6, 2024

martint Sep 9, 2024

pettyjamesm Sep 9, 2024

martint Sep 9, 2024

pettyjamesm Sep 9, 2024 •

edited

Loading

martint Sep 10, 2024

pgandhi999 Sep 10, 2024

pettyjamesm commented Sep 10, 2024

sopel39 commented Oct 14, 2024 •

edited

Loading

Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

Conversation

pgandhi999 commented Aug 20, 2024 • edited by pettyjamesm Loading

Description

Additional context and related issues

Release notes

pgandhi999 commented Aug 20, 2024

pettyjamesm commented Sep 6, 2024

martint Sep 9, 2024

Choose a reason for hiding this comment

pettyjamesm Sep 9, 2024

Choose a reason for hiding this comment

martint Sep 9, 2024

Choose a reason for hiding this comment

pettyjamesm Sep 9, 2024 • edited Loading

Choose a reason for hiding this comment

martint Sep 10, 2024

Choose a reason for hiding this comment

pgandhi999 Sep 10, 2024

Choose a reason for hiding this comment

pettyjamesm commented Sep 10, 2024

sopel39 commented Oct 14, 2024 • edited Loading

pgandhi999 commented Aug 20, 2024 •

edited by pettyjamesm

Loading

pettyjamesm Sep 9, 2024 •

edited

Loading

sopel39 commented Oct 14, 2024 •

edited

Loading