Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new Optimizer Rule to Remove Distinct Aggregation for Queries containing Grouped Aggregation Operators #23087

Merged

Conversation

pgandhi999
Copy link
Member

@pgandhi999 pgandhi999 commented Aug 20, 2024

Description

This optimization removes the redundant Aggregation operator introduced by Distinct clause when the query contains Grouped Aggregation operator.

For context, we noticed a query failing due to OOM error where the query plan contained a Grouped Aggregation Operator followed by the Distinct Aggregation Operator. The Distinct Aggregation Operator is redundant for Grouping Aggregations. After removing the DISTINCT clause, the query succeeded. Adding an Optimizer Rule for the same.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve performance for queries with a redundant `DISTINCT` over already-distinct `GROUP BY` aggregation ({issue}`23087`)

@cla-bot cla-bot bot added the cla-signed label Aug 20, 2024
@pgandhi999
Copy link
Member Author

@pettyjamesm

@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch 2 times, most recently from edf1bc0 to f6897e5 Compare August 22, 2024 21:23
@pettyjamesm pettyjamesm requested a review from martint August 22, 2024 21:32
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from f6897e5 to 70edd2d Compare August 23, 2024 17:38
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 70edd2d to 214b689 Compare August 30, 2024 18:42
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 214b689 to 5610889 Compare August 30, 2024 19:18
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 5610889 to b36d707 Compare September 6, 2024 20:11
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from b36d707 to e2fc4f5 Compare September 6, 2024 20:35
@pettyjamesm
Copy link
Member

Latest iteration looks good to me, @martint - can you take another look and @pgandhi999 please resolve the review comments that have already been addressed?

@pettyjamesm pettyjamesm requested a review from martint September 6, 2024 20:40
// between the distinct aggregation and the child aggregation nodes so long as all child aggregation keys
// remain present (without transformation by the project) in the distinct aggregation grouping keys
case ProjectNode projectNode -> {
List<Symbol> childGroupingKeys = isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not quite correct. It's technically not legal to compare the grouping keys from the leaf aggregation with the grouping keys passed from above without performing the reverse translations defined by this projection, as the meaning of symbols is local to each pair of operations and it may have been changed by the projection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not locally be correct to perform the check: groupingKeys.containsAll(aggregationNode.getGroupingSets().getGroupingKeys()) in isolation, since we don't know at that point whether the grouping key symbols from the parent and child aggregations have the same meaning. However, when we return the leaf aggregation symbols back up the call stack, we verfiy that any ProjectNodes encountered only perform identity projections for all leaf aggregation symbols- at which point we can say that they do have the same meaning by the time we reach the parent aggregation again and the transform should be considered to be valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that it's too brittle and it relies on non-local assumptions. The call to isDistinctOverGroupingKeys(lookup.resolve(projectNode.getSource()), lookup, groupingKeys) is incorrect by construction, and whether childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity) is sufficient to compensate depends on knowledge that this specific callsite doesn't have. For example, adding support of another type of node might break these assumptions.

Also, in the call to childGroupingKeys.stream().allMatch(projectNode.getAssignments()::isIdentity), it's incorrect to perform a lookup in the project node assignments using the symbols from the input scope as the key. The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.

Separately, the check is too specific. For example, it won't be able to catch plans like these ones:

- Aggregate [k]
  - Project k = x
    - Aggregate [x]

Copy link
Member

@pettyjamesm pettyjamesm Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that it's too brittle and it relies on non-local assumptions

That’s fair, although I would point out that the assumptions are self contained to the single static private method, so they’re reasonably localized.

For example, adding support of another type of node might break these assumptions.

Agreed. Maybe we can add a comment to clarify the specific assumptions being made and warn against expanding the rule to accept new intervening node types?

The keys belong in the "output" scope. Today that works, but if in the future we make such lookups stricter by failing if the key is not from the proper scope, this will break.

It’s hard for me to guess what that might look like, but keys of interest by definition straddle the input and output scope, and the check for identity transforms guarantees that they are the same. (This is related to the next point, more on that there).

Separately, the check is too specific. For example, it won't be able to catch plans like these ones:

- Aggregate [k]
 - Project k = x
   - Aggregate [x]

It won’t detect that, true- but it’s a missed optimization opportunity at that point and not a correctness bug. I’m not sure whether there are any plan optimizers today that would create new symbols for an identity assignment, but if there are then I guess we should plan for that? It’s definitely going to be more expensive to find any identity mappings from all input symbols and consider the intersection of those with the parent grouping symbols, but should be doable at that higher cost. Is that the recommendation here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fine. Let's add that documentation and get it in as is. We can fix those issue later when they come up. Also, please add a comment explaining that the lookups are not "proper" and that we're taking a shortcut.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the comments. I have rewritten the method to include translating the grouping set symbols for project nodes.

@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from e2fc4f5 to 30cb814 Compare September 10, 2024 18:42
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 30cb814 to 3688c0a Compare September 10, 2024 20:18
@pgandhi999 pgandhi999 force-pushed the remove-distinct-grouped-aggregation branch from 3688c0a to 81bc98b Compare September 10, 2024 20:21
@pettyjamesm pettyjamesm merged commit 49f57ef into trinodb:master Sep 10, 2024
95 checks passed
@pettyjamesm
Copy link
Member

Merged, thanks @pgandhi999!

@github-actions github-actions bot added this to the 458 milestone Sep 10, 2024
@sopel39
Copy link
Member

sopel39 commented Oct 14, 2024

This should use io.trino.sql.planner.optimizations.ActualProperties#localProperties instead of implementing derivations itself as part of RemoveRedundantDistinctAggregation. It would be more generic then too because it would work for aggregations that are distinct within stream, but not necessarily globally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants