-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop copying LogicalPlan and Exprs in SingleDistinctToGroupBy
#10527
Stop copying LogicalPlan and Exprs in SingleDistinctToGroupBy
#10527
Conversation
Since what I was working on overlapped a bit with the code changes here, I made a PR on this branch to remove some additional clones, |
There are still some clones but I think they are necessary here because we are building two aggregates from one |
* refactor: use HashSet<&Expr> instead of HashSet<String> * refactor: remove more cloning
Co-authored-by: Adam Curtis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @appletreeisyellow and @erratic-pattern
This is looking quite good. The only concern I have is with using the hash value of an expression to check for set membership. Other than that this I think this PR is ready to go
I found whitespace blind diff easier to review: https://github.com/apache/datafusion/pull/10527/files?w=1
let arg = args.swap_remove(0); | ||
|
||
let expr_id = distinct_aggr_exprs.hasher().hash_one(&arg); | ||
if distinct_aggr_exprs.insert(expr_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code appears to deduplicate distinct aggregate expressions by comparing their hash values rather than by equality. While expressions that hash to the same value are very likely the same expression, this isn't guaranteed I don't think.
I think for correctness the distinct_agg_exprs should hold the arguments by value -- e.g. distinct_agg_exprs.insert(arg.clone())
or use the display name as in the previous version
I realize this is another clone, but I think it is required for correctness
In general finding some way to compute a digest / fingerprint of an expression that can be used for equality would be great, but maybe it should be done as a separate project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the correctness is more important here 👍
Since arg.display_name
is a smaller clone than arg.clone
, so I updated the code to use the display name as in the previous version. Updated in 2ba102f
@alamb Thanks for the review! I have updated the code according to your feedback
Indeed! It greatly reduced my mental matching effort 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thanks again @appletreeisyellow
…he#10527) * chore: merge main and resolve conflict * chore: use less copy * chore: remove clone * remove more clones (apache#8) * refactor: use HashSet<&Expr> instead of HashSet<String> * refactor: remove more cloning * chore: reduce string allocation Co-authored-by: Adam Curtis <[email protected]> * chore: return internal error instead of panacing * chore: use arg display_name as hash key instead of a hashed value --------- Co-authored-by: Adam Curtis <[email protected]>
Which issue does this PR close?
Closes #10295
Rationale for this change
Make the optimizer faster by not copying
What changes are included in this PR?
OptimizerRule::rewrite
APISingleDistinctToGroupBy
Are these changes tested?
Existing CI
Are there any user-facing changes?
Functional change: No
Performance change: maybe 1-2% faster
Details