Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop copying LogicalPlan and Exprs in OptimizeProjections (2% faster planning) #10405

Merged
merged 2 commits into from
May 10, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 7, 2024

Note this also has the changes from #10410 in it

Which issue does this PR close?

Closes #10209

Rationale for this change

Make planning faster by not copying as much

What changes are included in this PR?

  1. Rewrite OptimizeProjections to use treenode APIs

Are these changes tested?

Existing tests

Are there any user-facing changes?

  1. More types of projections can be combined (e.g. CASE expressions)
  2. Faster planning

Benchmark Results show a moderate improvement (2% overall in tpch, but some queries like Q3 are like 10% faster)

Details

++ critcmp main projection_pushdown
group                                         main                                   projection_pushdown
-----                                         ----                                   -------------------
logical_aggregate_with_join                   1.01  1217.9±11.93µs        ? ?/sec    1.00  1208.1±12.12µs        ? ?/sec
logical_plan_tpcds_all                        1.00    160.7±1.72ms        ? ?/sec    1.00    160.6±1.78ms        ? ?/sec
logical_plan_tpch_all                         1.02     17.1±0.20ms        ? ?/sec    1.00     16.8±0.21ms        ? ?/sec
logical_select_all_from_1000                  1.00     18.7±0.14ms        ? ?/sec    1.01     18.9±0.11ms        ? ?/sec
logical_select_one_from_700                   1.01    820.9±9.73µs        ? ?/sec    1.00   816.1±20.75µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   765.5±18.71µs        ? ?/sec    1.00   764.8±64.26µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.01    749.8±8.67µs        ? ?/sec    1.00    742.3±7.49µs        ? ?/sec
physical_plan_tpcds_all                       1.03  1364.1±11.01ms        ? ?/sec    1.00  1330.1±16.93ms        ? ?/sec
physical_plan_tpch_all                        1.03     94.8±1.43ms        ? ?/sec    1.00     92.1±1.28ms        ? ?/sec
physical_plan_tpch_q1                         1.06      5.2±0.05ms        ? ?/sec    1.00      4.9±0.07ms        ? ?/sec
physical_plan_tpch_q10                        1.03      4.4±0.06ms        ? ?/sec    1.00      4.3±0.08ms        ? ?/sec
physical_plan_tpch_q11                        1.02      3.9±0.06ms        ? ?/sec    1.00      3.8±0.08ms        ? ?/sec
physical_plan_tpch_q12                        1.00      3.0±0.04ms        ? ?/sec    1.02      3.1±0.06ms        ? ?/sec
physical_plan_tpch_q13                        1.03      2.1±0.04ms        ? ?/sec    1.00      2.1±0.03ms        ? ?/sec
physical_plan_tpch_q14                        1.08      2.9±0.07ms        ? ?/sec    1.00      2.7±0.04ms        ? ?/sec
physical_plan_tpch_q16                        1.06      3.8±0.06ms        ? ?/sec    1.00      3.6±0.05ms        ? ?/sec
physical_plan_tpch_q17                        1.04      3.6±0.07ms        ? ?/sec    1.00      3.5±0.07ms        ? ?/sec
physical_plan_tpch_q18                        1.07      4.1±0.05ms        ? ?/sec    1.00      3.9±0.05ms        ? ?/sec
physical_plan_tpch_q19                        1.10      6.5±0.07ms        ? ?/sec    1.00      5.9±0.07ms        ? ?/sec
physical_plan_tpch_q2                         1.04      8.0±0.04ms        ? ?/sec    1.00      7.7±0.09ms        ? ?/sec
physical_plan_tpch_q20                        1.04      4.8±0.06ms        ? ?/sec    1.00      4.6±0.06ms        ? ?/sec
physical_plan_tpch_q21                        1.02      6.4±0.08ms        ? ?/sec    1.00      6.2±0.06ms        ? ?/sec
physical_plan_tpch_q22                        1.00      3.4±0.06ms        ? ?/sec    1.00      3.4±0.08ms        ? ?/sec
physical_plan_tpch_q3                         1.11      3.4±0.07ms        ? ?/sec    1.00      3.1±0.06ms        ? ?/sec
physical_plan_tpch_q4                         1.12      2.5±0.04ms        ? ?/sec    1.00      2.2±0.02ms        ? ?/sec
physical_plan_tpch_q5                         1.09      4.8±0.05ms        ? ?/sec    1.00      4.4±0.06ms        ? ?/sec
physical_plan_tpch_q6                         1.05  1615.2±30.14µs        ? ?/sec    1.00  1543.2±24.24µs        ? ?/sec
physical_plan_tpch_q7                         1.03      5.9±0.06ms        ? ?/sec    1.00      5.7±0.09ms        ? ?/sec
physical_plan_tpch_q8                         1.02      7.5±0.15ms        ? ?/sec    1.00      7.4±0.08ms        ? ?/sec
physical_plan_tpch_q9                         1.01      5.6±0.07ms        ? ?/sec    1.00      5.5±0.11ms        ? ?/sec
physical_select_all_from_1000                 1.01     61.5±0.34ms        ? ?/sec    1.00     61.2±0.52ms        ? ?/sec
physical_select_one_from_700                  1.02      3.7±0.05ms        ? ?/sec    1.00      3.6±0.04ms        ? ?/sec

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels May 7, 2024
@alamb alamb force-pushed the alamb/projection_pushdown branch from 3e8264c to 6bbc24d Compare May 7, 2024 11:20
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label May 7, 2024
// Test outer projection isn't discarded despite the same schema as inner
// https://github.com/apache/datafusion/issues/8942
#[test]
fn test_derived_column() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.project(vec![col("a"), lit(0).alias("d")])?
.project(vec![col("a").add(lit(1)).alias("a"), lit(0).alias("d")])?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change the projection is actually merged

This is due to using the general purpose transform_up rather than special casing certain Expr types in the projection rewrite and now CASE is handled

In order to prevent merging the projections, I needed to make the lower projection non trivial so changed it from a to a + 1 as a

Note that the tests for correctness added for #8942 in
#8960 such as consecutive_projection_same_schema are also still passing

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels May 7, 2024
02)--SubqueryAlias: NUMBERS
03)----Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c
04)------EmptyRelation
01)SubqueryAlias: NUMBERS
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these plans are better -- they have one less unecessary Projection (I believe the changes are due to better tracking of child rewrites during optimization)

@alamb alamb force-pushed the alamb/projection_pushdown branch from 4a7aa33 to f80b800 Compare May 7, 2024 14:22
pub fn unalias(self) -> Expr {
match self {
Expr::Alias(alias) => *alias.expr,
_ => self,
}
}

/// Recursively potentially multiple aliases from an expression.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is trim_expr renamed and with examples

) -> Result<Transformed<LogicalPlan>> {
// Recursively rewrite any nodes that may be able to avoid computation given
// their parents' required indices.
match plan {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of handling plan nodes changed so the ones that rewrite the plan directly do so in the first match statement and those that compute required indices do so in a second match

fn optimize_projections(
plan: &LogicalPlan,
plan: LogicalPlan,
Copy link
Contributor Author

@alamb alamb May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact this function no takes plan rather than plan is the key change that permits this PR to avoid copying as much

@alamb alamb force-pushed the alamb/projection_pushdown branch 2 times, most recently from ac5c77e to 43d8f0e Compare May 7, 2024 15:18
@alamb alamb marked this pull request as ready for review May 7, 2024 17:01
@alamb alamb marked this pull request as draft May 7, 2024 17:01
@alamb alamb force-pushed the alamb/projection_pushdown branch from 43d8f0e to b15d1f4 Compare May 7, 2024 17:03
@alamb alamb changed the title Stop copying LogicalPlan and Exprs in OptimizeProjections Stop copying LogicalPlan and Exprs in OptimizeProjections (2% faster planning) May 7, 2024
@alamb alamb marked this pull request as ready for review May 7, 2024 17:25
06)------Projection: nodes.id + Int64(1) AS id
07)--------Filter: nodes.id < Int64(10)
08)----------TableScan: nodes
01)SubqueryAlias: nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in fact it removes another projection layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is correct

}
LogicalPlan::Window(window) => {
let input_schema = window.input.schema();
let input_schema = window.input.schema().clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this clone?

let input_schema = inputs[0].schema();
// If inputs are not pruned do not change schema
// TODO this seems wrong (shouldn't we always use the schema of the input?)
let schema = if schema.fields().len() == input_schema.fields().len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that shouldn't happen.... we can follow up to run tests with only 1 schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it doesn't make sense -- however, it is the same logic as in with_new_exprs:

LogicalPlan::Union(Union { schema, .. }) => {
let input_schema = inputs[0].schema();
// If inputs are not pruned do not change schema.
let schema = if schema.fields().len() == input_schema.fields().len() {
schema.clone()
} else {
input_schema.clone()
};
Ok(LogicalPlan::Union(Union {
inputs: inputs.into_iter().map(Arc::new).collect(),
schema,
}))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10442

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb for making tha planner better

@comphead comphead merged commit cf0cba7 into apache:main May 10, 2024
24 checks passed
@alamb alamb deleted the alamb/projection_pushdown branch May 10, 2024 15:10
@alamb
Copy link
Contributor Author

alamb commented May 10, 2024

Thanks for the review @comphead

fyi @mustafasrepo

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…r planning) (apache#10405)

* Add `LogicalPlan::recompute_schema` for handling rewrite passes

* Stop copying LogicalPlan and Exprs in  `OptimizeProjections`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop copying LogicalPlan and Exprs in OptimizeProjections
2 participants