-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087
Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087
Conversation
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
.../io/trino/sql/planner/iterative/rule/TestRemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
...java/io/trino/sql/planner/iterative/rule/RemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
.../io/trino/sql/planner/iterative/rule/TestRemoveDistinctAggregationForGroupedAggregation.java
Outdated
Show resolved
Hide resolved
edf1bc0
to
f6897e5
Compare
f6897e5
to
70edd2d
Compare
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Show resolved
Hide resolved
70edd2d
to
214b689
Compare
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
214b689
to
5610889
Compare
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
5610889
to
b36d707
Compare
...main/src/main/java/io/trino/sql/planner/iterative/rule/IsDistinctOverGroupingKeysResult.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/RemoveRedundantDistinctAggregation.java
Outdated
Show resolved
Hide resolved
b36d707
to
e2fc4f5
Compare
Latest iteration looks good to me, @martint - can you take another look and @pgandhi999 please resolve the review comments that have already been addressed? |
// between the distinct aggregation and the child aggregation nodes so long as all child aggregation keys | ||
// remain present (without transformation by the project) in the distinct aggregation grouping keys | ||
case ProjectNode projectNode -> { | ||
List<Symbol> childGroupingKeys = isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not quite correct. It's technically not legal to compare the grouping keys from the leaf aggregation with the grouping keys passed from above without performing the reverse translations defined by this projection, as the meaning of symbols is local to each pair of operations and it may have been changed by the projection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not locally be correct to perform the check: groupingKeys.containsAll(aggregationNode.getGroupingSets().getGroupingKeys())
in isolation, since we don't know at that point whether the grouping key symbols from the parent and child aggregations have the same meaning. However, when we return the leaf aggregation symbols back up the call stack, we verfiy that any ProjectNode
s encountered only perform identity projections for all leaf aggregation symbols- at which point we can say that they do have the same meaning by the time we reach the parent aggregation again and the transform should be considered to be valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that it's too brittle and it relies on non-local assumptions. The call to isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys)
is incorrect by construction, and whether childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity)
is sufficient to compensate depends on knowledge that this specific callsite doesn't have. For example, adding support of another type of node might break these assumptions.
Also, in the call to childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity)
, it's incorrect to perform a lookup in the project node assignments using the symbols from the input scope as the key. The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.
Separately, the check is too specific. For example, it won't be able to catch plans like these ones:
- Aggregate [k]
- Project k = x
- Aggregate [x]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that it's too brittle and it relies on non-local assumptions
That’s fair, although I would point out that the assumptions are self contained to the single static private method, so they’re reasonably localized.
For example, adding support of another type of node might break these assumptions.
Agreed. Maybe we can add a comment to clarify the specific assumptions being made and warn against expanding the rule to accept new intervening node types?
The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.
It’s hard for me to guess what that might look like, but keys of interest by definition straddle the input and output scope, and the check for identity transforms guarantees that they are the same. (This is related to the next point, more on that there).
Separately, the check is too specific. For example, it won't be able to catch plans like these ones:
- Aggregate [k] - Project k = x - Aggregate [x]
It won’t detect that, true- but it’s a missed optimization opportunity at that point and not a correctness bug. I’m not sure whether there are any plan optimizers today that would create new symbols for an identity assignment, but if there are then I guess we should plan for that? It’s definitely going to be more expensive to find any identity mappings from all input symbols and consider the intersection of those with the parent grouping symbols, but should be doable at that higher cost. Is that the recommendation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fine. Let's add that documentation and get it in as is. We can fix those issue later when they come up. Also, please add a comment explaining that the lookups are not "proper" and that we're taking a shortcut.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for all the comments. I have rewritten the method to include translating the grouping set symbols for project nodes.
e2fc4f5
to
30cb814
Compare
30cb814
to
3688c0a
Compare
3688c0a
to
81bc98b
Compare
Merged, thanks @pgandhi999! |
This should use |
Description
This optimization removes the redundant Aggregation operator introduced by Distinct clause when the query contains Grouped Aggregation operator.
For context, we noticed a query failing due to OOM error where the query plan contained a Grouped Aggregation Operator followed by the Distinct Aggregation Operator. The Distinct Aggregation Operator is redundant for Grouping Aggregations. After removing the DISTINCT clause, the query succeeded. Adding an Optimizer Rule for the same.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: