-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reimplement push_down_projection
and prune_column
.
#4465
Conversation
d8e5d92
to
2965ce9
Compare
Prune column
push_down_projection
and prune_column
.
2965ce9
to
a4d49cb
Compare
a4d49cb
to
b78231e
Compare
e1ca364
to
1ff6168
Compare
Projection: customer.c_custkey, customer.c_name, customer.c_address, customer.c_phone, customer.c_acctbal, customer.c_comment, lineitem.l_extendedprice, lineitem.l_discount, nation.n_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can find some new projection happen above join. because I prune column for Join
For example:
table a [id] b [id] c [id]
select c.id from (select * from a join b on a.id = b.id) join c on a.id = c.id.
we need add projection above inside join (a b), because b.id is just used for condition.
We need projection prune `b.id`.
original:
project c.id
Join
Join c(id)
a(id) b(id)
->
project c.id
Join
project(a.id) c(id)
Join
a(id) b(id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is that the Projection
nodes above the join are added to make it clear what columns that come out of the join are actually needed above it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Projection
nodes above the join
is used to just get columns that we need to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is relative to build_join_schema
. Some optimizer rules call this function.
I think it is ok for calling it before pushdown projection, but I guess it is not correct after push down projection.
For the query:
select a.id from a join b on a.id = b.id
we call it after pushdown projection:
schema(a)
: a.idschema(b)
: b.id
build_join_schema
will merge left and right, the result is a.id
+ b.id
, but the expected result should be only a.id
.
Maybe we can fix it first, and then we will not need the projection any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for this comment is easily misunderstood. I have corrected it.
Marking as draft to signify that this PR is not quite ready for re-review -- please mark it ready for review when it is ready |
1ff6168
to
54b2be0
Compare
I plan to help review this over the next day or two |
54b2be0
to
8c57270
Compare
fb9cccb
to
f2cd1d8
Compare
f2cd1d8
to
66ad3a9
Compare
89b3ef8
to
f172e60
Compare
ef86e1e
to
ffdad2b
Compare
I will take a look this tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks epic @jackwener -- I have this on my list to review tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jackwener -- I reviewed the code and plan changes carefully
I found the addition of so many Projection
into the plans confusing as they obscure the key operations somewhat. It seems like the new Projection's are primarily introduced because you need some way to encode the currently used set of columns now that the recursion happens outside of the optimizer. Is that true?
It would be great to reduce the newly added ProjectionExec somehow. For example, can we introduce a pass that removes any that don't actually reduce the schema meaningfully (eg the input to an aggregate, for example)?
Basically I think this is a very nice change, and a great end to a long epic of work. Thank you so much for your contribution
cc @ygf11 and @liukun4515 who I think have been working on joins recently and may have some more thoughts on this
Projection: customer.c_custkey, customer.c_name, customer.c_address, customer.c_phone, customer.c_acctbal, customer.c_comment, lineitem.l_extendedprice, lineitem.l_discount, nation.n_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is that the Projection
nodes above the join are added to make it clear what columns that come out of the join are actually needed above it?
ffdad2b
to
883c951
Compare
This PR now sadly appears to have a bunch of conflicts. @jackwener is your idea that with #5366 the extra projections introduced by this pass will be removed in the final plans? |
Yes, this is one of the purposes. As you said above, we can make PR #5366 to eliminate the redundant projections that appear in this PR. Another purpose is to split one part (eliminate projection) of this PR into one new rule. This make rule keep it as simple as possible, don't do too many things at once in one rule. And this PR will be more simple than now. |
Awesome -- I think as long a the redundant projections are removed by the final plan it is fine to introduce them in an earlier pass 👍
❤️ |
2d6ada4
to
66b0557
Compare
@ygf11 @alamb @liukun4515 @mingmwang PTAL. I resolve all problem in this PR. now this PR just add some extra projection. This projection is used for prune-column. We can see Because these projection can make we just get columns that we use, prune-column will make Regarding these newly added projections, I originally intended to eliminate them in EliminateProjection. However, after careful consideration, I think we should not remove these projections because they are helpful for reducing memory overhead. For example Agg (just use table.id)
Filter table.age > 10
TableScan [id, age]
If we don't add projection to prune column, Agg input will be two column [table.id, table.age]
It will cause more cost.
After prune-column, plan will be following:
Agg (just use table.id)
Projection table.id
Filter table.age > 10
TableScan [id, age]
This projection will prune column, and make Agg input just one column. |
TableScan: supplier projection=[s_suppkey, s_nationkey] | ||
Projection: nation.n_nationkey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here Filter will output two columns, but we just need one column nation.n_nationkey
, so add projection to prune column
4ce085c
to
c0aec60
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @jackwener -- thank you. An epic sequence of pull requests
I think the question of where the COUNT
aggregate went is probably important to answer before merging.
Also, I think we can make the plans even better (maybe as a follow on PR) by avoiding even more redundant Projections
datafusion/core/tests/sql/window.rs
Outdated
@@ -1515,13 +1515,14 @@ async fn test_remove_unnecessary_sort_in_sub_query() -> Result<()> { | |||
" CoalescePartitionsExec", | |||
" AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))]", | |||
" RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=8", | |||
" AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(UInt8(1))]", | |||
" AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this change -- why is there no aggregate anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has resolved it , thanks @alamb
assert_optimized_plan_eq(&plan, expected); | ||
|
||
Ok(()) | ||
\n Projection: test.c, test.a, test.b\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These projections above a Filter that just pass the input to the output are also unecessary, right? Maybe we can add a rule to the remove unnecessary projections for these as well (if the schema of the projection's input is the same as the schema of its output)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb . Agree with it.
But In this example, the schema of the projection != the schema of its child, because the order already changed.
In the future, we indeed can remove these projection (if it just change order, and it isn't in the top of plan tree, which means that there must be a plannode above this projection that can determine the output schema like agg, other projection ....) We can enhance EliminateProject Rule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the order already changed.
In the future, we indeed can remove these projection (if it just change order, and it isn't in the top of plan tree, which means that there must be a plannode above this projection that can
That is an excellent point
We can enhance EliminateProject Rule
Good idea 👍
c0aec60
to
636e5fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it. Thanks @jackwener
Benchmark runs are scheduled for baseline = 25b4f67 and contender = 0000d4f. 0000d4f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4265
Closes #4267.
Rationale for this change
original
push_down_projection
useHashSet<Column>
storerequired column
and push it through plan top to down.Now, I remove it and use
Projection
Plan itself to require child to need to output some columns.What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?