reimplement `push_down_projection` and `prune_column`. #4465

jackwener · 2022-12-01T17:33:06Z

Which issue does this PR close?

Closes #4265
Closes #4267.

Rationale for this change

The original push_down_projection uses a HashSet<Column> to store the required columns and pushes it through the plan from top to bottom.

This PR removes that set and instead uses the Projection plan node itself to tell its child which columns it must output.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

benchmarks/expected-plans/q19.txt

jackwener · 2022-12-09T14:30:07Z

benchmarks/expected-plans/q10.txt

Some new projections now appear above joins, because I prune columns for Join.

For example:

Tables: a [id], b [id], c [id]

select c.id from (select * from a join b on a.id = b.id) join c on a.id = c.id

We need to add a projection above the nested join (a, b), because b.id is used only in that join's condition. The projection prunes `b.id`.

original:

project c.id
  Join
    Join
      a(id)
      b(id)
    c(id)

with the added projection:

project c.id
  Join
    project(a.id)
      Join
        a(id)
        b(id)
    c(id)

So the idea is that the Projection nodes above the join are added to make it clear what columns that come out of the join are actually needed above it?

Yes, the Projection nodes above the join are there to keep only the columns that we actually need.

Maybe it is related to build_join_schema, which some optimizer rules call.
I think calling it before projection pushdown is fine, but I suspect it is not correct after projection pushdown.

For the query:

select a.id from a join b on a.id = b.id

if we call it after projection pushdown:

schema(a): a.id

schema(b): b.id

build_join_schema merges the left and right schemas, so the result is a.id + b.id, but the expected result is only a.id.

Maybe we can fix it first, and then we will not need the projection any more.
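The schema concern above can be illustrated with a toy model. This is only a sketch, not DataFusion's build_join_schema: `merge_join_schema` and `project` are hypothetical helpers over plain column-name strings, and it assumes the join output is simply the left fields followed by the right fields.

```rust
/// Toy stand-in for a join schema: left fields followed by right fields.
/// Hypothetical helper, not DataFusion's `build_join_schema`.
fn merge_join_schema(left: &[&str], right: &[&str]) -> Vec<String> {
    left.iter().chain(right.iter()).map(|c| c.to_string()).collect()
}

/// Keep only the columns named in `required` (in the order of `required`).
/// This models a Projection placed above the join.
fn project(schema: &[String], required: &[&str]) -> Vec<String> {
    required
        .iter()
        .copied()
        // Ignore required names that the input does not actually provide.
        .filter(|r| schema.iter().any(|c| c.as_str() == *r))
        .map(String::from)
        .collect()
}
```

In this model, merging [a.id] and [b.id] always yields both columns, so narrowing the output to a.id has to happen in a separate step above the join, which is the role the added Projection plays.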

Sorry, this comment was easy to misunderstand. I have corrected it.

alamb · 2022-12-12T12:08:58Z

Marking as draft to signify that this PR is not quite ready for re-review -- please mark it ready for review when it is ready

alamb · 2023-01-06T18:58:21Z

I plan to help review this over the next day or two

datafusion/optimizer/tests/integration-test.rs

datafusion/optimizer/src/optimizer.rs

liukun4515 · 2023-02-14T13:21:31Z

I will take a look at this tomorrow.

alamb

This looks epic @jackwener -- I have this on my list to review tomorrow

alamb

Thank you @jackwener -- I reviewed the code and plan changes carefully

I found the addition of so many Projections into the plans confusing as they obscure the key operations somewhat. It seems like the new Projections are primarily introduced because you need some way to encode the currently used set of columns now that the recursion happens outside of the optimizer. Is that true?

It would be great to reduce the newly added ProjectionExec somehow. For example, can we introduce a pass that removes any that don't actually reduce the schema meaningfully (e.g. the input to an aggregate)?

Basically I think this is a very nice change, and a great end to a long epic of work. Thank you so much for your contribution

cc @ygf11 and @liukun4515 who I think have been working on joins recently and may have some more thoughts on this

alamb · 2023-02-17T19:53:56Z

benchmarks/expected-plans/q10.txt

So the idea is that the Projection nodes above the join are added to make it clear what columns that come out of the join are actually needed above it?

datafusion/core/tests/sql/avro.rs

datafusion/core/tests/sql/parquet.rs

datafusion/core/tests/sql/window.rs

datafusion/optimizer/tests/integration-test.rs

datafusion/optimizer/src/push_down_projection.rs

alamb · 2023-02-23T12:33:12Z

This PR now sadly appears to have a bunch of conflicts. @jackwener is your idea that with #5366 the extra projections introduced by this pass will be removed in the final plans?

jackwener · 2023-02-23T12:38:21Z

This PR now sadly appears to have a bunch of conflicts. @jackwener is your idea that with #5366 the extra projections introduced by this pass will be removed in the final plans?

Yes, this is one of the purposes. As you said above, we can use PR #5366 to eliminate the redundant projections that appear in this PR.

Another purpose is to split one part of this PR (eliminating projections) into a new rule. This keeps each rule as simple as possible, without doing too many things at once, and makes this PR simpler than it is now.

alamb · 2023-02-23T13:27:27Z

Yes, this is one of the purposes. As you said above, we can use PR #5366 to eliminate the redundant projections that appear in this PR.

Awesome -- I think as long as the redundant projections are removed by the final plan it is fine to introduce them in an earlier pass 👍

Another purpose is to split one part of this PR (eliminating projections) into a new rule. This keeps each rule as simple as possible, without doing too many things at once, and makes this PR simpler than it is now.

❤️

jackwener · 2023-02-26T06:29:12Z

@ygf11 @alamb @liukun4515 @mingmwang PTAL.

I have resolved all the problems in this PR. It now only adds some extra projections, which are used for column pruning.

We can look at Spark plans to see the expected effect: in the Spark TPC-H q2 plan, for example, there are projections above joins and filters.

Because these projections keep only the columns that we use, column pruning will make DataFusion use less memory!

Regarding these newly added projections, I originally intended to eliminate them in EliminateProjection. However, after careful consideration, I think we should not remove these projections because they are helpful for reducing memory overhead.

For example

Agg (just use table.id)
  Filter table.age > 10
    TableScan [id, age]

If we don't add a projection to prune columns, the Agg input will have two columns, [table.id, table.age], which costs more.

After column pruning, the plan is as follows:

Agg (just use table.id)
  Projection table.id
    Filter table.age > 10
      TableScan [id, age]

This projection prunes table.age, so the Agg input is just one column.
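The pruning step in this example can be sketched on a toy plan type. The `Plan` enum and `prune_input` below are hypothetical and use only the standard library; they are not DataFusion's LogicalPlan or the rule in this PR, and they assume columns are identified by plain names.

```rust
/// Toy logical plan, only detailed enough for the Agg / Filter / TableScan example.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan { columns: Vec<String> },
    Filter { predicate_columns: Vec<String>, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
    Aggregate { group_by: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// Columns this node hands to its parent.
    fn output_columns(&self) -> Vec<String> {
        match self {
            Plan::TableScan { columns } | Plan::Projection { columns, .. } => columns.clone(),
            // A Filter passes its input columns through unchanged.
            Plan::Filter { input, .. } => input.output_columns(),
            Plan::Aggregate { group_by, .. } => group_by.clone(),
        }
    }
}

/// Wrap `input` in a Projection when it outputs columns the parent does not need.
fn prune_input(input: Plan, required: &[String]) -> Plan {
    let output = input.output_columns();
    if output.iter().all(|c| required.contains(c)) {
        // Nothing to prune: leave the plan unchanged.
        input
    } else {
        let columns: Vec<String> = output
            .into_iter()
            .filter(|c| required.contains(c))
            .collect();
        Plan::Projection { columns, input: Box::new(input) }
    }
}
```

Calling `prune_input` on the Filter with `required = ["id"]` wraps it in a Projection over id, mirroring the second plan above; when every output column is required, the input is returned as-is.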

jackwener · 2023-02-26T06:31:51Z

benchmarks/expected-plans/q11.txt

                  TableScan: supplier projection=[s_suppkey, s_nationkey]
+              Projection: nation.n_nationkey


Here the Filter outputs two columns, but we need only nation.n_nationkey, so a projection is added to prune the other column.

alamb

Looks good to me @jackwener -- thank you. An epic sequence of pull requests

I think the question of where the COUNT aggregate went is probably important to answer before merging.

Also, I think we can make the plans even better (maybe as a follow on PR) by avoiding even more redundant Projections

alamb · 2023-02-26T12:06:35Z

datafusion/core/tests/sql/window.rs

@@ -1515,13 +1515,14 @@ async fn test_remove_unnecessary_sort_in_sub_query() -> Result<()> {
            "    CoalescePartitionsExec",
            "      AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))]",
            "        RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=8",
-            "          AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[COUNT(UInt8(1))]",
+            "          AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[]",


I don't understand this change -- why is there no aggregate anymore?

I have resolved it, thanks @alamb.

alamb · 2023-02-26T12:14:38Z

datafusion/optimizer/src/push_down_projection.rs

-        assert_optimized_plan_eq(&plan, expected);
-
-        Ok(())
+        \n      Projection: test.c, test.a, test.b\


These projections above a Filter that just pass the input to the output are also unnecessary, right? Maybe we can add a rule to remove unnecessary projections for these as well (if the schema of the projection's input is the same as the schema of its output)

Thank you @alamb. I agree.
But in this example, the schema of the projection != the schema of its child, because the column order has changed.
In the future, we can indeed remove these projections (if one only changes the order and is not at the top of the plan tree, which means there must be a plan node above it that can determine the output schema, such as an Agg or another Projection). We can enhance the EliminateProject rule.
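The distinction discussed here (identical schema, reordered schema, or a genuinely narrower schema) can be sketched as a small classifier. This is a hypothetical illustration, not the EliminateProject rule: `classify_projection` and `can_remove` are made-up names, columns are compared by name only, and "not at the top of the plan tree" is simplified to a boolean flag.

```rust
/// How a projection's output relates to its input schema (toy model).
#[derive(Debug, PartialEq)]
enum ProjectionKind {
    /// Same columns in the same order.
    Identity,
    /// Same columns, different order.
    ReorderOnly,
    /// Drops or otherwise changes columns.
    Pruning,
}

fn classify_projection(input: &[&str], output: &[&str]) -> ProjectionKind {
    if input == output {
        return ProjectionKind::Identity;
    }
    // Compare as sorted lists so that only the order is ignored.
    let mut a = input.to_vec();
    let mut b = output.to_vec();
    a.sort_unstable();
    b.sort_unstable();
    if a == b {
        ProjectionKind::ReorderOnly
    } else {
        ProjectionKind::Pruning
    }
}

/// Assumed removal policy: an identity projection is always removable; a
/// reorder-only projection is removable only when some node above it
/// determines the final output schema, i.e. it is not the plan root.
fn can_remove(kind: &ProjectionKind, is_plan_root: bool) -> bool {
    match kind {
        ProjectionKind::Identity => true,
        ProjectionKind::ReorderOnly => !is_plan_root,
        ProjectionKind::Pruning => false,
    }
}
```

Under these assumptions, `Projection: test.c, test.a, test.b` over an input of test.a, test.b, test.c is classified as `ReorderOnly`, so it would be kept at the plan root but could be dropped below an aggregate or another projection.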

because the column order has changed.
In the future, we can indeed remove these projections (if one only changes the order and is not at the top of the plan tree, which means there must be a plan node above it that can

That is an excellent point

We can enhance the EliminateProject rule.

Good idea 👍

alamb

Let's do it. Thanks @jackwener

ursabot · 2023-02-28T21:53:51Z

Benchmark runs are scheduled for baseline = 25b4f67 and contender = 0000d4f. 0000d4f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the core, logical-expr, and optimizer labels Dec 1, 2022

jackwener force-pushed the push_down_projection branch from d8e5d92 to 2965ce9 Compare December 9, 2022 12:19

jackwener changed the title ~~Draft: reimplement Prune column~~ reimplement push_down_projection and prune_column. Dec 9, 2022

jackwener force-pushed the push_down_projection branch from 2965ce9 to a4d49cb Compare December 9, 2022 12:26

github-actions bot removed the logical-expr label Dec 9, 2022

jackwener force-pushed the push_down_projection branch from a4d49cb to b78231e Compare December 9, 2022 12:30

jackwener marked this pull request as ready for review December 9, 2022 12:30

jackwener marked this pull request as draft December 9, 2022 12:49

jackwener force-pushed the push_down_projection branch from e1ca364 to 1ff6168 Compare December 9, 2022 14:23

jackwener commented Dec 9, 2022

benchmarks/expected-plans/q19.txt (outdated, resolved)

jackwener marked this pull request as ready for review December 9, 2022 14:26

jackwener commented Dec 9, 2022

alamb marked this pull request as draft December 12, 2022 12:08

alamb mentioned this pull request Jan 1, 2023

Stack overflows when planning tpcds 22 in debug mode #4786

Closed

jackwener mentioned this pull request Jan 5, 2023

bugfix: remove cnf_rewrite in push_down_filter #4825

Merged

jackwener force-pushed the push_down_projection branch from 1ff6168 to 54b2be0 Compare January 6, 2023 14:16

jackwener marked this pull request as ready for review January 6, 2023 14:16

jackwener marked this pull request as draft January 7, 2023 16:09

jackwener force-pushed the push_down_projection branch from 54b2be0 to 8c57270 Compare February 13, 2023 03:40

github-actions bot added the substrait label Feb 13, 2023

jackwener force-pushed the push_down_projection branch 2 times, most recently from fb9cccb to f2cd1d8 Compare February 13, 2023 04:20

github-actions bot added the sqllogictest label Feb 13, 2023

jackwener force-pushed the push_down_projection branch from f2cd1d8 to 66ad3a9 Compare February 13, 2023 05:05

jackwener commented Feb 13, 2023

datafusion/optimizer/tests/integration-test.rs (outdated, resolved)

jackwener commented Feb 13, 2023

datafusion/optimizer/tests/integration-test.rs (outdated, resolved)

jackwener force-pushed the push_down_projection branch 2 times, most recently from 89b3ef8 to f172e60 Compare February 13, 2023 06:32

liukun4515 reviewed Feb 13, 2023

datafusion/optimizer/src/optimizer.rs (outdated, resolved)

jackwener force-pushed the push_down_projection branch 2 times, most recently from ef86e1e to ffdad2b Compare February 13, 2023 07:53

alamb reviewed Feb 16, 2023

alamb reviewed Feb 17, 2023

jackwener force-pushed the push_down_projection branch from ffdad2b to 883c951 Compare February 19, 2023 06:05

jackwener mentioned this pull request Feb 19, 2023

refactor push_down_filter to fix dead-loop and use optimizer_recurse. #5337

Merged

jackwener marked this pull request as draft February 23, 2023 12:53

jackwener force-pushed the push_down_projection branch from 2d6ada4 to 66b0557 Compare February 25, 2023 21:55

github-actions bot removed the sqllogictest label Feb 25, 2023

jackwener marked this pull request as ready for review February 26, 2023 06:17

jackwener commented Feb 26, 2023

jackwener force-pushed the push_down_projection branch 2 times, most recently from 4ce085c to c0aec60 Compare February 26, 2023 07:47

alamb approved these changes Feb 26, 2023

jackwener added 2 commits February 26, 2023 21:37

reimplement push_down_projection and prune_column

c3319da

fix COUNT(UInt8(1))

636e5fe

jackwener force-pushed the push_down_projection branch from c0aec60 to 636e5fe Compare February 26, 2023 13:37

alamb approved these changes Feb 27, 2023

alamb merged commit 0000d4f into apache:main Feb 28, 2023

jackwener deleted the push_down_projection branch March 1, 2023 05:50

rgwood mentioned this pull request May 4, 2023

push_down_projection optimization fails when using variables #6237

Closed
