Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42851][SQL] Replace EquivalentExpressions with mutable map in PhysicalAggregation #40488

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Mar 20, 2023

What changes were proposed in this pull request?

This PR proposes to replace EquivalentExpressions to a simple mutable map in PhysicalAggregation, the only place where EquivalentExpressions.addExpr() is used. EquivalentExpressions is useful for common subexpression elimination but in PhysicalAggregation it is used only to deduplicate whole expressions which can be easily done with a simple map.

Why are the changes needed?

EquivalentExpressions.addExpr() is not guarded by supportedExpression() and so it can cause inconsistent results when used together with EquivalentExpressions.getExprState(). This PR proposes replacing .addExpr() with other alternatives as its boolean result is a bit counter-intuitive to other collections' .add() methods. It returns false if the expression was missing and either adds the expression or not depending on if the expression is deterministic.
After this PR we no longer use EquivalentExpressions.addExpr() so it can be deprecated or even removed...

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UTs from @rednaxelafx's PR: #40473. Please note that those UTs actually pass after #40475, but they are added here to make sure there will be no regression in the future.

@peter-toth
Copy link
Contributor Author

@rednaxelafx, @cloud-fan let me know it this PR is a viable alternative to #40473. Or maybe if I should do a little cleanup like peter-toth@90421cb in this or in a follow-up PR...

@peter-toth peter-toth force-pushed the SPARK-42851-replace-equivalentexpressions-with-mutable-map branch from ff974fa to 345b9b5 Compare March 20, 2023 19:06
@rednaxelafx
Copy link
Contributor

Before the recent rounds of changes to EquivalentExpressions, the old addExprTree used to call addExpr in its core:
https://github.com/apache/spark/blob/branch-2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala#L90
That was still the case when PhysicalAggregation started using this mechanism to deduplicate expressions. I guess it started becoming "detached" from the main path when the recent refactoring happened that allows updating a separate equivalence map instead of the "main" one.

Your proposed PR here further orphans that function from any actual use. Which is okay for keeping binary compatibility as much as possible.
The inlined dedup logic in PhysicalAggregation looks rather ugly though. I don't have a strong opinion about that as long as other reviewers are okay. I'd prefer still retaining some sort of encapsulated collection for the dedup usage.

BTW I updated my PR's test case because it makes more sense to check the return value from addExpr on the second invocation rather than on the first (both "not supported expression" and actual new expression cases would have gotten a false return value if we add that guard to the addExpr() function).
28d101e

case a
if AggregateExpression.isAggregate(a) && !equivalentAggregateExpressions.addExpr(a) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong with addExpr here? It does simplify the code IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line of thought would be: adding the supportedExpression guard to addExpr() would cause performance regression, so let's just close our eyes and make the only remaining use of addExpr break away and do its own deduplication in the old logic without taking things like NamedLambdaVariable into account -- which is the way it's been for quite a few releases. This PR essentially inlines the addExpr path of the old EquivalentExpressions into PhysicalAggregation to recover what it used to do.

Copy link
Contributor Author

@peter-toth peter-toth Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the above, although .addExpr() fits here well and does the job, isn't it a bit weird that an add-like method of a collection-like object doesn't return true when a new item was added, but actually it flips the meaning of the return value? If it was used at multiple places then I would keep it, but we use it only here. But maybe I'm just nitpicking...
Anyways, I'm ok with #40473 too.

@peter-toth peter-toth closed this Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants