planner: support 3 stage aggregation for single scalar distinct agg #37203
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/0bf971eb06093ede0cf4dbabd30570915c83de65
"name": "TestMPPSingleDistinct3Stage",
"cases": [
"set @@tidb_allow_mpp=1;set @@tidb_enforce_mpp=1;",
"EXPLAIN select count(distinct c) from t;",
Why does the second stage in explain show count(distinct c) instead of group by c?
(json files cannot be commented, so I put the comments here)
See here:
e.g.
explain select count(distinct a), count(b) from foo
HashAgg sum(#1), sum(#2) -> final agg
+- Exchange Passthrough
+- HashAgg count(distinct a) #1, sum(#3) #2 -> middle agg
+- Exchange HashPartition by a
+- HashAgg count(b) #3, group by a -> partial agg
+- TableScan foo
After the 1st stage agg, we shuffle by the distinct key. So it is ok to compute count(distinct) in 2nd stage.
newArgs = append(newArgs, middleSchema.Columns[i])
} else {
for _, arg := range fun.Args {
newCol, err := arg.RemapColumn(schemaMap)
We could use ColumnSubstitute?
Ah, great. Will do.
It turned out ColumnSubstitute is not applicable here, because the two schemas are different.
planner/core/task.go
Outdated
@@ -1578,6 +1578,29 @@ func (p *basePhysicalAgg) newPartialAggregate(copTaskType kv.StoreType, isMPPTas
return partialAgg, finalAgg
}
// can this agg use 3 stage for distinct aggregation |
Better with the format "funcName does something".
" └─ExchangeSender_31 1.00 mpp[tiflash] ExchangeType: PassThrough",
" └─HashAgg_28 1.00 mpp[tiflash] funcs:count(distinct Column#13)->Column#15, funcs:sum(Column#14)->Column#16",
" └─ExchangeReceiver_30 1.00 mpp[tiflash] ",
" └─ExchangeSender_29 1.00 mpp[tiflash] ExchangeType: HashPartition, Hash Cols: ",
The "Hash Cols:" field here should be filled with Column#18.
Nice catch, I think it is caused by this issue: #35417
I will bail out for cases where the distinct column is an expression.
ok
Rest LGTM
Which part not LGTM? :)
/merge
This pull request has been accepted and is ready to merge. Commit hash: a5f7872
TiDB MergeCI notify: ✅ Well done! New fixed [1] after this PR merged.
What problem does this PR solve?
Issue Number: close #37202
Problem Summary:
Currently, count(distinct) in MPP mode still executes in a single worker (PassThrough mode), which is bad when there is a large number of distinct values.
What is changed and how it works?
In this patch, we generate a plan with 3 stage aggregation for a scalar distinct agg.
NB: this applies only to cases with exactly one distinct agg function.
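The single-distinct restriction above can be sketched as a simple eligibility check. This is an illustrative model, not the actual task.go code; the struct and function names are hypothetical:

```go
package main

import "fmt"

// aggFunc is a hypothetical stand-in for a planner aggregation function.
type aggFunc struct {
	name        string
	hasDistinct bool
}

// canUse3StageDistinctAgg reports whether the 3-stage rewrite applies:
// exactly one aggregate among the functions is distinct.
func canUse3StageDistinctAgg(funcs []aggFunc) bool {
	distinct := 0
	for _, f := range funcs {
		if f.hasDistinct {
			distinct++
		}
	}
	return distinct == 1
}

func main() {
	q1 := []aggFunc{{"count", true}, {"count", false}} // count(distinct a), count(b)
	q2 := []aggFunc{{"count", true}, {"sum", true}}    // two distinct aggs: not eligible
	fmt.Println(canUse3StageDistinctAgg(q1), canUse3StageDistinctAgg(q2)) // true false
}
```

With two or more distinct aggregates there is no single key to shuffle on in the middle stage, which is why the rewrite bails out in that case.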
Check List
Tests
Side effects
Documentation
Release note