Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEMO: Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 10% faster planning time #9780

Closed
wants to merge 2 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Mar 24, 2024

Note this PR looks large, but it is mostly comments and changes of &plan to plan in tests

Which issue does this PR close?

Part of #9637 (based on some ideas from #9708). 🙏 @jayzhan211

closes #9768

Rationale for this change

See #9637

TLDR is that the optimizer currently copies LogicalPlans many times unnecessarily (I think it is quadratic if not worse in the number of tree nodes). This is both slow and requires many memory allocations.

Note we can do even better, but I want to keep the PR sizes as small as possible.

What changes are included in this PR?

  1. Add in place mutate / Mutator APIs to TreeNode
  2. Implement mutate for LogicalPlan (aka &mut LogicalPlan)
  3. Rewrite Optimizer to use TreeNode API

I am trying to keep this PR relatively small, and I have other improvements in mind for follow on PRs (see below). This PR is now finally feasible thanks to the really nice TreeNode cleanup from @peter-toth and @berkaysynnada in #8891

Are these changes tested?

Functionally covered by existing tests, and I measured performance using the benchmarks -- I see an mprovement of about 10% across TPCH queries.

Details

++ critcmp main optimizer_tree_node
++ critcmp main optimizer_tree_node
group                                         main                                    optimizer_tree_node
-----                                         ----                                    -------------------
logical_aggregate_with_join                   1.01  1287.8±105.61µs        ? ?/sec    1.00   1274.7±7.85µs        ? ?/sec
logical_plan_tpch_all                         1.00     17.3±0.16ms        ? ?/sec     1.00     17.3±0.14ms        ? ?/sec
logical_select_all_from_1000                  1.00     93.4±1.40ms        ? ?/sec     1.02     95.3±0.34ms        ? ?/sec
logical_select_one_from_700                   1.00    740.6±8.75µs        ? ?/sec     1.00   743.8±17.11µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   789.6±22.14µs        ? ?/sec     1.00   792.5±19.06µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   755.3±13.71µs        ? ?/sec     1.00   759.0±18.47µs        ? ?/sec
physical_plan_tpch_all                        1.11    133.8±0.87ms        ? ?/sec     1.00    120.7±0.70ms        ? ?/sec
physical_plan_tpch_q1                         1.15      7.7±0.15ms        ? ?/sec     1.00      6.7±0.05ms        ? ?/sec
physical_plan_tpch_q10                        1.10      6.3±0.05ms        ? ?/sec     1.00      5.7±0.04ms        ? ?/sec
physical_plan_tpch_q11                        1.09      5.0±0.03ms        ? ?/sec     1.00      4.6±0.03ms        ? ?/sec
physical_plan_tpch_q12                        1.08      4.1±0.03ms        ? ?/sec     1.00      3.8±0.03ms        ? ?/sec
physical_plan_tpch_q13                        1.09      2.7±0.03ms        ? ?/sec     1.00      2.5±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.05      3.4±0.02ms        ? ?/sec     1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpch_q16                        1.12      5.3±0.04ms        ? ?/sec     1.00      4.7±0.04ms        ? ?/sec
physical_plan_tpch_q17                        1.11      4.9±0.04ms        ? ?/sec     1.00      4.4±0.03ms        ? ?/sec
physical_plan_tpch_q18                        1.11      5.4±0.04ms        ? ?/sec     1.00      4.9±0.03ms        ? ?/sec
physical_plan_tpch_q19                        1.06     10.1±0.09ms        ? ?/sec     1.00      9.5±0.07ms        ? ?/sec
physical_plan_tpch_q2                         1.14     12.2±0.09ms        ? ?/sec     1.00     10.7±0.10ms        ? ?/sec
physical_plan_tpch_q20                        1.14      6.4±0.05ms        ? ?/sec     1.00      5.6±0.04ms        ? ?/sec
physical_plan_tpch_q21                        1.11      9.5±0.08ms        ? ?/sec     1.00      8.6±0.18ms        ? ?/sec
physical_plan_tpch_q22                        1.10      4.7±0.04ms        ? ?/sec     1.00      4.2±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.08      4.2±0.04ms        ? ?/sec     1.00      3.9±0.03ms        ? ?/sec
physical_plan_tpch_q4                         1.09      3.3±0.02ms        ? ?/sec     1.00      3.0±0.03ms        ? ?/sec
physical_plan_tpch_q5                         1.08      6.1±0.04ms        ? ?/sec     1.00      5.6±0.04ms        ? ?/sec
physical_plan_tpch_q6                         1.05      2.0±0.02ms        ? ?/sec     1.00  1944.8±12.65µs        ? ?/sec
physical_plan_tpch_q7                         1.11      8.6±0.09ms        ? ?/sec     1.00      7.8±0.06ms        ? ?/sec
physical_plan_tpch_q8                         1.12     12.1±0.10ms        ? ?/sec     1.00     10.8±0.09ms        ? ?/sec
physical_plan_tpch_q9                         1.13      9.2±0.07ms        ? ?/sec     1.00      8.1±0.11ms        ? ?/sec
physical_select_all_from_1000                 1.13    684.2±1.45ms        ? ?/sec     1.00    606.5±1.42ms        ? ?/sec
physical_select_one_from_700                  1.00      4.2±0.03ms        ? ?/sec     1.01      4.2±0.03ms        ? ?/sec

Are there any user-facing changes?

  1. Faster planning
  2. Ability to rewrite LogicalPlan in place using TreeNode API

Planned Follow Ons:

Here are follow on tickets that will improve things even more

  • Add API to OptimizerRule to rewrite in place (and avoid yet another copy)
  • Add an example of how to use the "in place tree write" API
  • Supporting rewriting Exprs in place in addition to LogicalPlan (implement TreeNodeMutator -- specifically ExprRewriter should be a big win)
  • API to support in place rewrites of UserDefinedLogicalPlan
  • Change other uses of TreeNodeRewriter to TreeNodeMutator (e.g. ExprRewriter, etc)

@alamb alamb marked this pull request as draft March 24, 2024 13:50
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Mar 24, 2024
@github-actions github-actions bot added logical-expr Logical plan and expressions and removed logical-expr Logical plan and expressions labels Mar 25, 2024
@alamb alamb force-pushed the alamb/optimizer_tree_node branch from da634a2 to 233d8e9 Compare March 29, 2024 01:58
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 29, 2024
@alamb
Copy link
Contributor Author

alamb commented Mar 29, 2024

Update here is that I think this PR now has the basic code to do in-place updates of TreeNodes in general and LogicalPlans in particular 🎉

TODOs:

  • Run performance benchmarks to show this helps
  • Complete edge case implementations (DDL, user defined nodes), polish PR description

Follow ons:
(moved up)

@alamb alamb changed the title WIP: Rewrite optimizer to use TreeNode API, implement in-place rewrites for LogicalPlan Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place Mar 29, 2024
------Inner Join: Filter: t1.t1_id = t2.t2_id AND t1.t1_name = outer_ref(t0.t0_name)
--------TableScan: t1
--------TableScan: t2
LeftSemi Join: t0.t0_name = __correlated_sq_2.t1_name
Copy link
Contributor Author

@alamb alamb Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually seems like an improvement to me -- as the subquery expressions are now properly visited by the rewriter

@alamb
Copy link
Contributor Author

alamb commented Mar 30, 2024

Updated description with performance results (about a 10% improvement for this PR alone). I think we can squeeze a bunch more out of this approach with the additional changes to rewrite in place (will file follow on tickets).

Proceeding to get this PR ready for review

@@ -88,7 +88,7 @@ mod tests {

use crate::test::*;

fn assert_optimized_plan_equal(plan: &LogicalPlan, expected: &str) -> Result<()> {
fn assert_optimized_plan_equal(plan: LogicalPlan, expected: &str) -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large number of test changes below are due to taking an owned LogicalPlan (to avoid having to copy it)

@github-actions github-actions bot added the core Core DataFusion crate label Mar 31, 2024

let formatted_plan = format!("{optimized_plan:?}");
assert_eq!(formatted_plan, expected);
assert_eq!(plan.schema(), optimized_plan.schema());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test calls Optimizer::optimze now, which already checks the schemas are the same after each rule, so there is no need to check that again.

@@ -651,8 +651,8 @@ explain SELECT t1_id, t1_name FROM t1 WHERE t1_id in (SELECT t2_id FROM t2 where
logical_plan
Filter: t1.t1_id IN (<subquery>)
--Subquery:
----Limit: skip=0, fetch=10
------Projection: t2.t2_id
----Projection: t2.t2_id
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit is now pushed under the subquery

@alamb alamb changed the title Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 15% faster planning time Apr 1, 2024
@alamb alamb changed the title Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 15% faster planning time Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 10% faster planning time Apr 1, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label Apr 1, 2024
@@ -59,7 +59,7 @@ pub fn main() -> Result<()> {

// then run the optimizer with our custom rule
let optimizer = Optimizer::with_rules(vec![Arc::new(MyOptimizerRule {})]);
let optimized_plan = optimizer.optimize(&analyzed_plan, &config, observe)?;
let optimized_plan = optimizer.optimize(analyzed_plan, &config, observe)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only API change exposed to users -- Optimizer::optimize takes an owned LogicalPlan rather than a &LogicalPlan. This is to avoid the need to copy over and over again

@@ -174,6 +182,70 @@ pub trait TreeNode: Sized {
})
}

/// Implements the [visitor pattern](https://en.wikipedia.org/wiki/Visitor_pattern) for
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the new TreeNode API to rewrite the nodes in place

I also tried Transformed<&mut Node>, but this got caught up in the borrow checker. Returning Transformed<()> seemed to work and made using this API feasible

@peter-toth and @berkaysynnada I would love any feedback on this API that you may have

fn apply_children<F: FnMut(&Self) -> Result<TreeNodeRecursion>>(
&self,
f: &mut F,
) -> Result<TreeNodeRecursion>;

/// Apply transform `F` to the node's children. Note that the transform `F`
/// might have a direction (pre-order or post-order).
/// Rewrite the node's children in place using `F`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to leave mutate_children unimplemented for Expr and the other TreeNodes initially to reduce the size of this PR (and thus the review burden). I will file a follow on ticket to track doing so

///
/// * If `f` returns Ok, sets `self.transformed` to `true` if either self or
/// the result of `f` were transformed.
pub fn and_then<F>(self, f: F) -> Result<Transformed<()>>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing this function allowed the implementation of mutate and mutate children to be more concise. I couldn't figure out how to implement it in general for two Transformed as there are two different payloads

use std::sync::{Arc, OnceLock};

impl LogicalPlan {
/// applies `f` to each expression of this node, potentially rewriting it in
Copy link
Contributor Author

@alamb alamb Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file contains the implementation to allow rewriting the LogicalPlan in place. rewrite_exprs is needed initially to rewrite subquery references. I also expect we will be able to use this to implement more performant optimizer passes in the future (e.g. ExprSimplifier)

.collect();

let exprs = plan.expressions();
plan.with_new_exprs(exprs, new_inputs).map(Some)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this used to copy all expressions and the entire LogicalPlan at least once

let optimize_self_opt = self.optimize_node(rule, plan, config)?;
let optimize_inputs_opt = match &optimize_self_opt {
Some(optimized_plan) => {
self.optimize_inputs(rule, optimized_plan, config)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each of these calls also copies the plan

assert_eq!(
"Optimizer rule 'get table_scan rule' failed\ncaused by\nget table_scan rule\ncaused by\n\
Internal error: Failed due to a difference in schemas, \
original schema: DFSchema { fields: [\
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test actually shows the output was reversed -- the original schema was [] not the table schema (which is what is changed by the optimizer)

&OptimizerContext::new(),
)?
.unwrap_or_else(|| plan.clone());
let optimized_plan =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now uses the Optimizer directly rather than calling optimize_recursively (which the Optimizer does internally)

fn observe(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {}

// in tests we are applying only one rule once
let opt_context = OptimizerContext::new().with_max_passes(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns out if you apply this rule multiple times, it changes the output. I explicitly made the test only run it a single time to match the previous behavior

Use TreeNode API in Optimizer
@alamb alamb force-pushed the alamb/optimizer_tree_node branch from 1011559 to 1bedfec Compare April 1, 2024 13:23
@alamb alamb marked this pull request as ready for review April 1, 2024 13:25
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

.clone();

// take the old value out of the Arc
std::mem::swap(node, &mut new_node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use std::mem::take here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no default impl for Arc<LogicalPlan> so I am not sure how to use the take API here 🤔

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have a default implementation with empty relation and empty schema, like the placeholder here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be nice -- maybe like adding a Default impl for LogicalPlan (that is empty relation?)

/// Applies `f` to rewrite the existing node, while avoiding `clone`'ing as much
/// as possiblw.
///
/// TODO eventually remove `Arc<LogicalPlan>` from `LogicalPlan` and have it own
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we can have LogicalPlan eventually? Not even with Box<LogicalPlan>? I think we may need Box for recursive plan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry I meant use Box eventually

// we make a temporary Expr to rewrite and then put it back

let mut swap_column = Column::new_unqualified("TEMP_unnest_column");
std::mem::swap(column, &mut swap_column);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have mem::take here too with a default implementation for Column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using take and it seems like there is no default impl for Column 🤔

error[E0277]: the trait bound `datafusion_common::Column: std::default::Default` is not satisfied
   --> datafusion/expr/src/logical_plan/mutate.rs:288:41
    |
288 |     let mut swap_column= std::mem::take(column);
    |                          -------------- ^^^^^^ the trait `std::default::Default` is not implemented for `datafusion_common::Column`
    |                          |
    |                          required by a bound introduced by this call

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I would prefer "" as default column if it makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- the reason I left a TEMP... name here was so that on error it was obvious that the Column was partially rewritten.

Given the comments below, however, it seems like we may not even need this API

// rewrite children recursively with mutator
self.mutate_children(|c| c.mutate(mutator))?
.try_transform_node_with(
|_: ()| mutator.f_up(self),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems |_| is also fine. what is () for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The () is the type of the Node in Transform<()> which is somewhat confusing

@alamb alamb self-assigned this Apr 1, 2024
@peter-toth
Copy link
Contributor

I wonder if it would make sense to split this PR into 2 parts as it seem to mix 2 very different ideas.

  • It introduces an in place mutation TreeNode API. First of all, I really like that API and I too think it can be faster than the current consume and procude style ones (TreeNode::rewrite(), TreeNode::transform_down_up()). But I think the only reason why it is faster is because only smaller references (e.g. &mut LogicalPlans) are stored on the stack during a transformation and not the original structures themselves (LogicalPlans). Otherwise the 2 APIs are very similar, there should be nothing that you can do with a &mut LogicalPlans but you can't do with an owned LogicalPlan.
  • It uses a trick in rewrite_arc to avoid copying Arcs. But this trick shouldn't require introducing the new API, it should simply work with changing the existing LogicalPlan::map_children().

So I don't think we need a new API for #9768. Don't get me wrong, I favour the new API but I would rather move that to a completely separate PR and slowly transform the code to using that in the future.

As for the new LogicalPlan methods (rewrite_inputs(), rewrite_subqueries(), rewrite_exprs(), ...) in this PR, the problem is that they use the new Transformed::and_then() method to iterate over sibling nodes in the tree (e.g. https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR188-R189 or https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR229-R230). But that's not correct as the new and_then() doesn't respect TreeNodeRecursion::Stop behaviour. We already have TransformedIterator::map_until_stop_and_collect() and TreeNode::try_transform_node() to chain transformations in such cases (see examples in Expr::map_children()). BTW, I'm trying to simplyfy those try_transform_node chained calls in #9876.

@alamb
Copy link
Contributor Author

alamb commented Apr 2, 2024

Thank you for the feedback @peter-toth

I wonder if it would make sense to split this PR into 2 parts as it seem to mix 2 very different ideas.

I can definitely split it into two PRs. My rationale for one PR was that a PR for TreeNode::mutate API would be too abstract.

  • It introduces an in place mutation TreeNode API. First of all, I really like that API and I too think it can be faster than the current consume and procude style ones (TreeNode::rewrite(), TreeNode::transform_down_up()). But I think the only reason why it is faster is because only smaller references (e.g. &mut LogicalPlans) are stored on the stack during a transformation and not the original structures themselves (LogicalPlans).

I agree

Otherwise the 2 APIs are very similar, there should be nothing that you can do with a &mut LogicalPlans but you can't do with an owned LogicalPlan.

The only difference I see in the APIs is that using &mut LogicalPlan leaves a partially rewritten node on error, whereas an owned LogicalPlan consumes the node on error.

  • It uses a trick in rewrite_arc to avoid copying Arcs. But this trick shouldn't require introducing the new API, it should simply work with changing the existing LogicalPlan::map_children().

Changing map_children to rewrite the Arc'd inputs is a very interesting idea -- I think we could do that in a separate PR. I'll give it a try

So I don't think we need a new API for #9768. Don't get me wrong, I favour the new API but I would rather move that to a completely separate PR and slowly transform the code to using that in the future.

As for the new LogicalPlan methods (rewrite_inputs(), rewrite_subqueries(), rewrite_exprs(), ...) in this PR, the problem is that they use the new Transformed::and_then() method to iterate over sibling nodes in the tree (e.g. https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR188-R189 or https://github.com/apache/arrow-datafusion/pull/9780/files#diff-b2431d53b487c80c797aecd88859229fb902f74bf888a598878386055b717bbcR229-R230). But that's not correct as the new and_then() doesn't respect TreeNodeRecursion::Stop behaviour. We already have TransformedIterator::map_until_stop_and_collect() and TreeNode::try_transform_node() to chain transformations in such cases (see examples in Expr::map_children()). BTW, I'm trying to simplyfy those try_transform_node chained calls in #9876.

This is also an excellent call -- I actually tried initially to use TransformedIterator::map_until_stop_and_collect() but I couldn't make it work with the generics (it needed a function that took (Self) as I recall.

So how about this for a plan:

  1. I'll make a PR that changes LogicalPlan::map_children to avoid copies if possible (which should be beneficial in many places in DataFusion)
  2. I'll create a PR that changes the Optimizer to use a TreeNodeRewriter (rather than TreeNodeMutator) which will avoid even more copies after 1 is complete.

We can then compare the performance of the TreeNodeRewriter to this PR with TreeNodeMutator and decide and see if the additional TreeNodeMutator API is worth the extra API surface / migration.

@peter-toth
Copy link
Contributor

So how about this for a plan:

  1. I'll make a PR that changes LogicalPlan::map_children to avoid copies if possible (which should be beneficial in many places in DataFusion)
  2. I'll create a PR that changes the Optimizer to use a TreeNodeRewriter (rather than TreeNodeMutator) which will avoid even more copies after 1 is complete.

We can then compare the performance of the TreeNodeRewriter to this PR with TreeNodeMutator and decide and see if the additional TreeNodeMutator API is worth the extra API surface / migration.

Sounds very good to me, I'm really curious how much improvement the changes can bring.

/// applies `f` to LogicalPlans in any subquery expressions
///
/// If Err is returned, the plan may be left in a partially modified state
fn rewrite_subqueries<F>(&mut self, mut f: F) -> Result<Transformed<()>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about this part. I mean the TreeNode APIs are currently inconsistent about handling "inner" trees. I'm refferring to LogicalPlan for example. It overrides TreeNode::apply() and TreeNode::visit() to apply f on subquery's logical plans too.

  • But first of all, none of the other trees do that,
  • secodly, LogicalPlan doesn't override rewrite() or transform...() so transformations are applied only the "current" tree and not on "whole" logical plan.

I'm not sure that the current state is right or wrong as I can imagine both usescases in which

  • a transformations needs to be applied on the current tree only
  • a transformations need to change the whole tree (subplans trees included)

Now, this code seems to apply f on subquery logical plans too. If we keep it that way mutate() will be inconsistent with other APIs. But I'm also not sure that why it applies f on the root node only? Shouldn't it call subquery.subquery.mutate(f) to start a traversal on the subquery plan tree?

Is it needed because of the Optimizer? Does it optimize the whole logical plan at once? If so, I think we should introduce new APIs on LogicalPlan that is different to the existing TreeNode APIs. In Spark for example we have TreeNode.transform() and QueryPlan.transformWithSubqueries() for the different purposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I open a PR that adds LogicalPlan::rewriteWithSubqueries() and LogicalPlan::transform...WithSubqueries()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, this code seems to apply f on subquery logical plans too. If we keep it that way mutate() will be inconsistent with other APIs

I actually found I needed to handle recursing into subqueries to get some optimizer tests tests passing. Interestingly, recursing into subqueries also seems to have improved planning in some cases like #9780 (comment)

Is it needed because of the Optimizer? Does it optimize the whole logical plan at once? If so, I think we should introduce new APIs on LogicalPlan that is different to the existing TreeNode APIs. In Spark for example we have TreeNode.transform() and QueryPlan.transformWithSubqueries() for the different purposes.

OptimizerRules can currently specify if they want to handle recursion themselves or not: https://docs.rs/datafusion/latest/datafusion/optimizer/trait.OptimizerRule.html#method.apply_order

I'm not sure that the current state is right or wrong as I can imagine both usescases in which

  • a transformations needs to be applied on the current tree only
  • a transformations need to change the whole tree (subplans trees included)

I agree I can see usecases for both. I think the optimizer rules that handle subqueries specially currently do their own walk of the tree. For example https://github.com/apache/arrow-datafusion/blob/cc1db8a2043c73bda7adec309b42c08d88defab8/datafusion/optimizer/src/decorrelate_predicate_subquery.rs doesn't specify apply_order and walks down the tree

Shall I open a PR that adds LogicalPlan::rewriteWithSubqueries() and LogicalPlan::transform...WithSubqueries()?

It seems like a good idea to me. If you you are going to do this, however, I would recommend finding some manual recursion / optimizer pass (maybe decorrelate_predicate_subquery.rs?) that you can use for a usecase of the new API.

Thank you again for all your help

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a PR here: #9913.
I think the new LogicalPlan methods like LogicalPlan::transform_expressions() can be useful for 2. in #9780 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update here I have created two PRs as discussed

Unfortunately, currently individually they either don't improve performance or actually slow things down.

I am going to test what happens when I run them both together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I combined both #9948 and #9946 tpch planning time improves by 12% (basically the same as this PR)

Thus I conclude we do not need the tree mutator API quite yet. Let's continue the discussions on those other PRs

Details

group                                         both_tests                             main                                   optimizer_tree_node2
-----                                         ----------                             ----                                   --------------------
logical_aggregate_with_join                   1.01  1203.3±30.57µs        ? ?/sec    1.00  1197.3±17.89µs        ? ?/sec    1.00  1194.4±22.14µs        ? ?/sec
logical_plan_tpcds_all                        1.00    157.6±2.32ms        ? ?/sec    1.00    157.3±3.68ms        ? ?/sec    1.00    157.6±3.67ms        ? ?/sec
logical_plan_tpch_all                         1.01     16.9±0.34ms        ? ?/sec    1.00     16.7±0.22ms        ? ?/sec    1.04     17.5±0.37ms        ? ?/sec
logical_select_all_from_1000                  1.01     19.6±0.17ms        ? ?/sec    1.00     19.4±0.17ms        ? ?/sec    1.00     19.4±0.17ms        ? ?/sec
logical_select_one_from_700                   1.01   790.8±12.36µs        ? ?/sec    1.00   784.0±12.23µs        ? ?/sec    1.00   786.1±12.87µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   740.8±12.72µs        ? ?/sec    1.00   737.2±28.09µs        ? ?/sec    1.00   739.5±12.25µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   719.9±14.19µs        ? ?/sec    1.02   731.2±11.31µs        ? ?/sec    1.00   721.4±19.28µs        ? ?/sec
physical_plan_tpcds_all                       1.00  1666.1±16.04ms        ? ?/sec    1.12  1872.6±16.17ms        ? ?/sec    1.20       2.0±0.02s        ? ?/sec
physical_plan_tpch_all                        1.00    110.2±1.86ms        ? ?/sec    1.10    121.3±1.90ms        ? ?/sec    1.21    133.3±2.21ms        ? ?/sec
physical_plan_tpch_q1                         1.00      6.2±0.11ms        ? ?/sec    1.19      7.4±0.12ms        ? ?/sec    1.23      7.6±0.07ms        ? ?/sec
physical_plan_tpch_q10                        1.00      5.1±0.07ms        ? ?/sec    1.11      5.6±0.11ms        ? ?/sec    1.20      6.1±0.10ms        ? ?/sec
physical_plan_tpch_q11                        1.00      4.5±0.09ms        ? ?/sec    1.10      5.0±0.07ms        ? ?/sec    1.20      5.4±0.10ms        ? ?/sec
physical_plan_tpch_q12                        1.00      3.6±0.06ms        ? ?/sec    1.11      4.0±0.07ms        ? ?/sec    1.21      4.4±0.11ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.4±0.04ms        ? ?/sec    1.10      2.7±0.06ms        ? ?/sec    1.18      2.9±0.08ms        ? ?/sec
physical_plan_tpch_q14                        1.00      3.1±0.05ms        ? ?/sec    1.09      3.4±0.06ms        ? ?/sec    1.19      3.7±0.09ms        ? ?/sec
physical_plan_tpch_q16                        1.00      4.5±0.06ms        ? ?/sec    1.10      4.9±0.08ms        ? ?/sec    1.24      5.6±0.10ms        ? ?/sec
physical_plan_tpch_q17                        1.00      4.3±0.07ms        ? ?/sec    1.09      4.7±0.05ms        ? ?/sec    1.21      5.2±0.10ms        ? ?/sec
physical_plan_tpch_q18                        1.00      4.6±0.09ms        ? ?/sec    1.10      5.1±0.09ms        ? ?/sec    1.22      5.6±0.10ms        ? ?/sec
physical_plan_tpch_q19                        1.00      9.0±0.11ms        ? ?/sec    1.06      9.5±0.11ms        ? ?/sec    1.23     11.1±0.14ms        ? ?/sec
physical_plan_tpch_q2                         1.00      9.6±0.12ms        ? ?/sec    1.11     10.6±0.18ms        ? ?/sec    1.23     11.7±0.14ms        ? ?/sec
physical_plan_tpch_q20                        1.00      5.5±0.08ms        ? ?/sec    1.11      6.1±0.08ms        ? ?/sec    1.24      6.9±0.14ms        ? ?/sec
physical_plan_tpch_q21                        1.00      7.6±0.11ms        ? ?/sec    1.12      8.5±0.14ms        ? ?/sec    1.22      9.3±0.13ms        ? ?/sec
physical_plan_tpch_q22                        1.00      4.1±0.10ms        ? ?/sec    1.10      4.5±0.10ms        ? ?/sec    1.23      5.0±0.12ms        ? ?/sec
physical_plan_tpch_q3                         1.00      3.6±0.07ms        ? ?/sec    1.08      3.9±0.06ms        ? ?/sec    1.19      4.3±0.11ms        ? ?/sec
physical_plan_tpch_q4                         1.00      2.7±0.06ms        ? ?/sec    1.11      3.0±0.05ms        ? ?/sec    1.21      3.2±0.11ms        ? ?/sec
physical_plan_tpch_q5                         1.00      5.3±0.08ms        ? ?/sec    1.08      5.7±0.12ms        ? ?/sec    1.17      6.2±0.08ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1905.1±30.83µs        ? ?/sec    1.06      2.0±0.03ms        ? ?/sec    1.15      2.2±0.05ms        ? ?/sec
physical_plan_tpch_q7                         1.00      7.0±0.11ms        ? ?/sec    1.08      7.6±0.12ms        ? ?/sec    1.19      8.4±0.14ms        ? ?/sec
physical_plan_tpch_q8                         1.00      8.9±0.14ms        ? ?/sec    1.10      9.7±0.18ms        ? ?/sec    1.19     10.6±0.16ms        ? ?/sec
physical_plan_tpch_q9                         1.00      6.7±0.11ms        ? ?/sec    1.10      7.3±0.10ms        ? ?/sec    1.21      8.0±0.10ms        ? ?/sec
physical_select_all_from_1000                 1.00    114.1±0.77ms        ? ?/sec    1.13    129.2±1.14ms        ? ?/sec    1.12    128.3±0.79ms        ? ?/sec
physical_select_one_from_700                  1.00      3.9±0.08ms        ? ?/sec    1.03      4.1±0.05ms        ? ?/sec    1.04      4.1±0.04ms        ? ?/sec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice improvement! Thanks @alamb!

@alamb alamb marked this pull request as draft April 2, 2024 13:13
@alamb
Copy link
Contributor Author

alamb commented Apr 2, 2024

Converting to draft as we work on this API so someone doesn't merge it accidentally

@berkaysynnada
Copy link
Contributor

I will review this PR and the conversation here tomorrow. Thanks @alamb, the results are so promising 🚀

@alamb
Copy link
Contributor Author

alamb commented Apr 4, 2024

Per #9780 (comment) I think we can accomplish the same thing without a mutator at this time, so closing this PR

@alamb alamb closed this Apr 4, 2024
@alamb alamb changed the title Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 10% faster planning time DEMO: Introduce TreeNodeMutator for rewriting TreeNodes in place, change optimizer to rewrite LogicalPlan in place - 10% faster planning time Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants