reimplement push_down_projection and prune_column. #4465

Merged
merged 2 commits into from
Feb 28, 2023

Conversation

jackwener
Member

@jackwener jackwener commented Dec 1, 2022

Which issue does this PR close?

Closes #4265
Closes #4267.

Rationale for this change

The original push_down_projection uses a HashSet<Column> to store the required columns and pushes it through the plan from top to bottom.

Now I remove that set and use the Projection plan node itself to tell its child which columns it needs to output.
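
To illustrate the idea of letting a Projection describe what its child must output, here is a minimal sketch. It assumes a toy `Expr` type and the helper names `collect_columns` and `required_from_child`, all of which are hypothetical and not DataFusion's actual API.

```rust
use std::collections::BTreeSet;

// Toy expression type; hypothetical, not DataFusion's `Expr`.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, &'static str, Box<Expr>),
}

// Collect every column that an expression refers to.
fn collect_columns(expr: &Expr, out: &mut BTreeSet<String>) {
    match expr {
        Expr::Column(name) => {
            out.insert(name.clone());
        }
        Expr::Literal(_) => {}
        Expr::Binary(left, _op, right) => {
            collect_columns(left, out);
            collect_columns(right, out);
        }
    }
}

// The columns a Projection needs from its child are the columns referenced
// by its own expressions, so no external set has to be threaded through.
fn required_from_child(projection_exprs: &[Expr]) -> BTreeSet<String> {
    let mut required = BTreeSet::new();
    for expr in projection_exprs {
        collect_columns(expr, &mut required);
    }
    required
}
```

In this sketch, a projection over `a.id` and `a.age + 1` would require `a.age` and `a.id` from its child, and the rule can read that directly off the node it is visiting.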

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules labels Dec 1, 2022
@jackwener jackwener force-pushed the push_down_projection branch from d8e5d92 to 2965ce9 Compare December 9, 2022 12:19
@jackwener jackwener changed the title Draft: reimplement Prune column reimplement push_down_projection and prune_column. Dec 9, 2022
@jackwener jackwener force-pushed the push_down_projection branch from 2965ce9 to a4d49cb Compare December 9, 2022 12:26
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Dec 9, 2022
@jackwener jackwener force-pushed the push_down_projection branch from a4d49cb to b78231e Compare December 9, 2022 12:30
@jackwener jackwener marked this pull request as ready for review December 9, 2022 12:30
@jackwener jackwener marked this pull request as draft December 9, 2022 12:49
@jackwener jackwener force-pushed the push_down_projection branch from e1ca364 to 1ff6168 Compare December 9, 2022 14:23
@jackwener jackwener marked this pull request as ready for review December 9, 2022 14:26
Projection: customer.c_custkey, customer.c_name, customer.c_address, customer.c_phone, customer.c_acctbal, customer.c_comment, lineitem.l_extendedprice, lineitem.l_discount, nation.n_name
Member Author

@jackwener jackwener Dec 9, 2022

Some new projections now appear above joins, because I prune columns for Join.

For example:

table a [id] b [id] c [id]
select c.id from (select * from a join b on a.id = b.id) join c on a.id = c.id

We need to add a projection above the inner join (a, b), because b.id is only used in the join condition.
The projection prunes `b.id`.

original:
       project c.id
         Join
    Join    c(id)
a(id)  b(id)

->
        project c.id
            Join
project(a.id)  c(id)
    Join
a(id)  b(id)
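
The rewrite above can be sketched on a toy plan tree. This is only an illustration under stated assumptions: the `Plan` enum and the functions `required_by_join` and `prune_join_output` are hypothetical names, not DataFusion types, and columns are plain qualified strings.

```rust
// Toy logical plan; hypothetical types, not DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { columns: Vec<String> },
    Join { left: Box<Plan>, right: Box<Plan>, on: (String, String) },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

impl Plan {
    // Output columns of a node; a join outputs left columns, then right columns.
    fn schema(&self) -> Vec<String> {
        match self {
            Plan::Scan { columns } => columns.clone(),
            Plan::Join { left, right, .. } => {
                let mut cols = left.schema();
                cols.extend(right.schema());
                cols
            }
            Plan::Projection { columns, .. } => columns.clone(),
        }
    }
}

// Columns a join must read from its children: what the parent needs plus
// the columns used by its own join condition.
fn required_by_join(join: &Plan, needed_above: &[String]) -> Vec<String> {
    let mut required = needed_above.to_vec();
    if let Plan::Join { on, .. } = join {
        for col in [&on.0, &on.1] {
            if !required.contains(col) {
                required.push(col.clone());
            }
        }
    }
    required
}

// Wrap a join in a Projection when it outputs columns its parent never uses.
fn prune_join_output(join: Plan, needed_above: &[String]) -> Plan {
    let schema = join.schema();
    let kept: Vec<String> = schema
        .iter()
        .filter(|c| needed_above.contains(*c))
        .cloned()
        .collect();
    if kept.len() == schema.len() {
        join
    } else {
        Plan::Projection { columns: kept, input: Box::new(join) }
    }
}
```

For the inner join of a and b with only `a.id` needed above, the join still has to read `b.id` for its condition, but `prune_join_output` wraps it in a projection that keeps only `a.id`.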

Contributor

So the idea is that the Projection nodes above the join are added to make it clear what columns that come out of the join are actually needed above it?

Member Author

Yes, the Projection nodes above the join are there just to keep the columns that we actually need.

Contributor

@ygf11 ygf11 Feb 19, 2023

Maybe it is related to build_join_schema. Some optimizer rules call this function.
I think calling it before projection pushdown is fine, but I suspect the result is not correct after projection pushdown.

For the query:

select a.id from a join b on a.id = b.id

if we call it after projection pushdown:

  1. schema(a): a.id
  2. schema(b): b.id

build_join_schema merges left and right, so the result is a.id + b.id, but the expected result is only a.id.

Maybe we can fix that first, and then we will not need the projection any more.
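
The schema-merging behaviour described in this comment can be sketched as follows. This is a simplified stand-in, not the real build_join_schema: the functions `merge_join_schema` and `keep_columns` are hypothetical and operate on plain column names.

```rust
// Simplified stand-in for a join schema builder: the join output is the
// left columns followed by the right columns (hypothetical helper).
fn merge_join_schema(left: &[&str], right: &[&str]) -> Vec<String> {
    left.iter().chain(right.iter()).map(|c| c.to_string()).collect()
}

// A projection on top of the join then narrows the output to the columns
// that are actually wanted (hypothetical helper).
fn keep_columns(schema: &[String], wanted: &[&str]) -> Vec<String> {
    schema
        .iter()
        .filter(|c| wanted.contains(&c.as_str()))
        .cloned()
        .collect()
}
```

With `schema(a) = [a.id]` and `schema(b) = [b.id]`, merging yields both columns, which is why either the merge itself or a projection above the join has to drop `b.id`.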

Member Author

@jackwener jackwener Feb 19, 2023

Sorry, this comment was easy to misunderstand. I have corrected it.

@alamb alamb marked this pull request as draft December 12, 2022 12:08
@alamb
Contributor

alamb commented Dec 12, 2022

Marking as draft to signify that this PR is not quite ready for re-review -- please mark it ready for review when it is ready

@alamb
Contributor

alamb commented Jan 6, 2023

I plan to help review this over the next day or two

@jackwener jackwener marked this pull request as draft January 7, 2023 16:09
@jackwener jackwener force-pushed the push_down_projection branch from 54b2be0 to 8c57270 Compare February 13, 2023 03:40
@jackwener jackwener force-pushed the push_down_projection branch 2 times, most recently from fb9cccb to f2cd1d8 Compare February 13, 2023 04:20
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 13, 2023
@jackwener jackwener force-pushed the push_down_projection branch from f2cd1d8 to 66ad3a9 Compare February 13, 2023 05:05
@jackwener jackwener force-pushed the push_down_projection branch 2 times, most recently from 89b3ef8 to f172e60 Compare February 13, 2023 06:32
@jackwener jackwener force-pushed the push_down_projection branch 2 times, most recently from ef86e1e to ffdad2b Compare February 13, 2023 07:53
@liukun4515
Contributor

I will take a look at this tomorrow.

Contributor

@alamb alamb left a comment

This looks epic @jackwener -- I have this on my list to review tomorrow

Contributor

@alamb alamb left a comment

Thank you @jackwener -- I reviewed the code and plan changes carefully

I found the addition of so many Projections to the plans confusing, as they somewhat obscure the key operations. It seems the new Projections are primarily introduced because you need some way to encode the currently used set of columns now that the recursion happens outside of the optimizer. Is that true?

It would be great to reduce the newly added ProjectionExec somehow. For example, can we introduce a pass that removes any that don't actually reduce the schema meaningfully (e.g. the input to an aggregate)?

Basically I think this is a very nice change, and a great end to a long epic of work. Thank you so much for your contribution

cc @ygf11 and @liukun4515 who I think have been working on joins recently and may have some more thoughts on this

datafusion/core/tests/sql/avro.rs (outdated, resolved)
datafusion/core/tests/sql/parquet.rs (outdated, resolved)
datafusion/core/tests/sql/window.rs (outdated, resolved)
datafusion/optimizer/tests/integration-test.rs (outdated, resolved)
datafusion/optimizer/src/push_down_projection.rs (outdated, resolved)
@alamb
Contributor

alamb commented Feb 23, 2023

This PR now sadly appears to have a bunch of conflicts. @jackwener is your idea that with #5366 the extra projections introduced by this pass will be removed in the final plans?

@jackwener
Member Author

jackwener commented Feb 23, 2023

This PR now sadly appears to have a bunch of conflicts. @jackwener is your idea that with #5366 the extra projections introduced by this pass will be removed in the final plans?

Yes, that is one of the purposes. As you said above, PR #5366 can eliminate the redundant projections that appear in this PR.

Another purpose is to split one part of this PR (eliminating projections) into a new rule. This keeps each rule as simple as possible instead of doing too many things at once in one rule, and it makes this PR simpler than it is now.

@jackwener jackwener marked this pull request as draft February 23, 2023 12:53
@alamb
Contributor

alamb commented Feb 23, 2023

Yes, that is one of the purposes. As you said above, PR #5366 can eliminate the redundant projections that appear in this PR.

Awesome -- I think as long as the redundant projections are removed by the final plan it is fine to introduce them in an earlier pass 👍

Another purpose is to split one part of this PR (eliminating projections) into a new rule. This keeps each rule as simple as possible instead of doing too many things at once in one rule, and it makes this PR simpler than it is now.

❤️

@jackwener jackwener force-pushed the push_down_projection branch from 2d6ada4 to 66b0557 Compare February 25, 2023 21:55
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 25, 2023
@jackwener jackwener marked this pull request as ready for review February 26, 2023 06:17
@jackwener
Member Author

jackwener commented Feb 26, 2023

@ygf11 @alamb @liukun4515 @mingmwang PTAL.

I have resolved all the problems in this PR. Now this PR just adds some extra projections, which are used for column pruning.

We can look at Spark plans to see the expected effect: for example, in Spark's TPC-H q2 plan there are projections above joins and filters.

Because these projections ensure we only keep the columns we use, column pruning will make DataFusion use less memory!


Regarding these newly added projections, I originally intended to eliminate them in EliminateProjection. However, after careful consideration, I think we should not remove these projections because they are helpful for reducing memory overhead.

For example

Agg (just use table.id)
  Filter table.age > 10
    TableScan [id, age]

If we don't add a projection to prune columns, the Agg input will have two columns [table.id, table.age],
which costs more.

After column pruning, the plan will be the following:

Agg (just use table.id)
  Projection table.id
    Filter table.age > 10
      TableScan [id, age]

This projection prunes the unused column, so the Agg input has just one column.
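
A rough way to see the memory argument is to run the same filter on a tiny columnar batch with and without the projection. The sketch below is only an illustration: the `Batch` alias and the functions `filter_gt`, `project`, and `total_values` are hypothetical and unrelated to Arrow's or DataFusion's real batch types.

```rust
use std::collections::BTreeMap;

// Toy columnar batch: column name -> values (hypothetical, not Arrow).
type Batch = BTreeMap<String, Vec<i64>>;

// Keep only the rows where `column > threshold`; every column is carried along.
fn filter_gt(batch: &Batch, column: &str, threshold: i64) -> Batch {
    let mask: Vec<bool> = batch[column].iter().map(|v| *v > threshold).collect();
    batch
        .iter()
        .map(|(name, values)| {
            let kept: Vec<i64> = values
                .iter()
                .zip(&mask)
                .filter(|(_, keep)| **keep)
                .map(|(value, _)| *value)
                .collect();
            (name.clone(), kept)
        })
        .collect()
}

// Keep only the named columns.
fn project(batch: &Batch, columns: &[&str]) -> Batch {
    columns
        .iter()
        .map(|c| (c.to_string(), batch[*c].clone()))
        .collect()
}

// Number of values held across all columns, a crude proxy for memory use.
fn total_values(batch: &Batch) -> usize {
    batch.values().map(Vec::len).sum()
}
```

In this toy model, the filter output still carries the `age` column, while projecting to `id` before the aggregate halves the number of values handed upward.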

TableScan: supplier projection=[s_suppkey, s_nationkey]
Projection: nation.n_nationkey
Member Author

Here the Filter outputs two columns, but we only need one column, nation.n_nationkey, so a projection is added to prune the other.

@jackwener jackwener force-pushed the push_down_projection branch 2 times, most recently from 4ce085c to c0aec60 Compare February 26, 2023 07:47
Contributor

@alamb alamb left a comment

Looks good to me @jackwener -- thank you. An epic sequence of pull requests

I think the question of where the COUNT aggregate went is probably important to answer before merging.

Also, I think we can make the plans even better (maybe as a follow on PR) by avoiding even more redundant Projections

@@ -1515,13 +1515,14 @@ async fn test_remove_unnecessary_sort_in_sub_query() -> Result<()> {
 " CoalescePartitionsExec",
 " AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))]",
 " RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=8",
-" AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(UInt8(1))]",
+" AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[]",
Contributor

I don't understand this change -- why is there no aggregate anymore?

Member Author

I have resolved it, thanks @alamb

assert_optimized_plan_eq(&plan, expected);

Ok(())
\n Projection: test.c, test.a, test.b\
Contributor

These projections above a Filter that just pass the input to the output are also unnecessary, right? Maybe we can add a rule to remove unnecessary projections like these as well (if the schema of the projection's input is the same as the schema of its output)

Member Author

@jackwener jackwener Feb 26, 2023

Thank you @alamb. I agree.
But in this example, the schema of the projection != the schema of its child, because the column order has changed.
In the future, we can indeed remove these projections (if one only changes the order and isn't at the top of the plan tree, which means there must be a plan node above it that determines the output schema, such as an agg or another projection). We can enhance EliminateProject Rule
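
The distinction discussed here (identity projection, reorder-only projection, projection that really changes the columns) can be sketched as a small classifier. It assumes the projection contains only plain, non-duplicated column references; `ProjectionKind`, `classify`, and `can_remove` are hypothetical names and not part of any existing rule.

```rust
// How a projection's output relates to its input schema (hypothetical enum).
#[derive(Debug, PartialEq)]
enum ProjectionKind {
    Identity,       // same columns, same order
    ReorderOnly,    // same columns, different order
    ChangesColumns, // drops or adds columns
}

fn classify(input_schema: &[&str], projected: &[&str]) -> ProjectionKind {
    if input_schema == projected {
        return ProjectionKind::Identity;
    }
    // Compare as sorted lists to ignore ordering.
    let mut input_sorted = input_schema.to_vec();
    let mut projected_sorted = projected.to_vec();
    input_sorted.sort_unstable();
    projected_sorted.sort_unstable();
    if input_sorted == projected_sorted {
        ProjectionKind::ReorderOnly
    } else {
        ProjectionKind::ChangesColumns
    }
}

// A reorder-only projection is removable only when some node above it
// (an aggregate, another projection, ...) fixes the output schema anyway.
fn can_remove(kind: &ProjectionKind, parent_fixes_schema: bool) -> bool {
    match kind {
        ProjectionKind::Identity => true,
        ProjectionKind::ReorderOnly => parent_fixes_schema,
        ProjectionKind::ChangesColumns => false,
    }
}
```

For the plan line above, an input of `test.a, test.b, test.c` projected as `test.c, test.a, test.b` classifies as reorder-only, so under this sketch it would be removable only when it is not the top of the plan tree.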

Contributor

because the column order has changed.
In the future, we can indeed remove these projections (if one only changes the order and isn't at the top of the plan tree, which means there must be a plan node above it that

That is an excellent point

We can enhance EliminateProject Rule

Good idea 👍

@jackwener jackwener force-pushed the push_down_projection branch from c0aec60 to 636e5fe Compare February 26, 2023 13:37
Contributor

@alamb alamb left a comment

Let's do it. Thanks @jackwener

@alamb alamb merged commit 0000d4f into apache:main Feb 28, 2023
@ursabot

ursabot commented Feb 28, 2023

Benchmark runs are scheduled for baseline = 25b4f67 and contender = 0000d4f. 0000d4f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jackwener jackwener deleted the push_down_projection branch March 1, 2023 05:50
Labels
core Core DataFusion crate optimizer Optimizer rules substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[EPIC]: reimplement all rules which contains global-state
Reimplement projection_push_down
5 participants