Fix logical plan optimization will execute twice in SQL mode #1183
Conversation
```diff
@@ -56,8 +56,12 @@ impl DataFrameImpl {
     /// Create a physical plan
     async fn create_physical_plan(&self) -> Result<Arc<dyn ExecutionPlan>> {
         let state = self.ctx_state.lock().unwrap().clone();
+        let has_optimized = state.has_optimized;
```
I think it is ok to call DataFrame::create_physical_plan() twice on the same DataFrame -- with this implementation, won't the second call to create_physical_plan not be optimized?
I vaguely remember @Dandandan finding something similar in the past, but I can't find the reference.
I think redundant calls to logical plan optimization cost performance. With this implementation, after the first optimization in https://github.com/apache/arrow-datafusion/blob/ad059a688fd8da7b360423e0d911f2f1f33dbb9f/datafusion/src/execution/context.rs#L749, state.has_optimized will be true. So in create_physical_plan we can skip the redundant logical plan optimization and directly create a physical plan.
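To make the intended flow concrete, here is a small standalone sketch of the mechanism being described; the types and method bodies below are simplified stand-ins for illustration only, not DataFusion's actual code:

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-ins, for illustration only -- not DataFusion's real types.
#[derive(Clone, Debug)]
struct LogicalPlan(String);

#[derive(Default, Clone)]
struct ExecutionContextState {
    has_optimized: bool,
}

struct ExecutionContext {
    state: Arc<Mutex<ExecutionContextState>>,
}

struct DataFrame {
    ctx_state: Arc<Mutex<ExecutionContextState>>,
    plan: LogicalPlan,
}

// Stand-in for the logical optimizer: returns a new, optimized plan.
fn optimize(plan: &LogicalPlan) -> LogicalPlan {
    LogicalPlan(format!("optimized({})", plan.0))
}

impl ExecutionContext {
    // Analogue of ctx.sql(): optimize the plan once and record that it happened.
    fn sql(&self, plan: LogicalPlan) -> DataFrame {
        let optimized = optimize(&plan);
        self.state.lock().unwrap().has_optimized = true;
        DataFrame {
            ctx_state: Arc::clone(&self.state),
            plan: optimized,
        }
    }
}

impl DataFrame {
    // Analogue of create_physical_plan(): skip the logical optimizer
    // when ctx.sql() has already run it.
    fn create_physical_plan(&self) -> LogicalPlan {
        let state = self.ctx_state.lock().unwrap().clone();
        if state.has_optimized {
            self.plan.clone()
        } else {
            optimize(&self.plan)
        }
    }
}

fn main() {
    let ctx = ExecutionContext { state: Arc::default() };
    let df = ctx.sql(LogicalPlan("scan -> filter -> project".into()));
    // The optimizer is not run again here because has_optimized is already true.
    println!("{:?}", df.create_physical_plan());
}
```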
@xudong963 I think what @alamb meant was that the return value from ctx.optimize(&self.plan)? was not saved into self.plan. So the next time create_physical_plan is called, ctx.create_physical_plan(&plan).await will be invoked with an unoptimized plan.
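A tiny standalone illustration of that point (hypothetical names, not DataFusion's code): the optimizer returns a new plan value, and unless that value is written back into the struct, the stored plan stays unoptimized.

```rust
#[derive(Clone, Debug, PartialEq)]
struct LogicalPlan(String);

struct Frame {
    // This field is never updated after optimization.
    plan: LogicalPlan,
}

// Stand-in for the logical optimizer: produces a new plan value.
fn optimize(plan: &LogicalPlan) -> LogicalPlan {
    LogicalPlan(format!("optimized({})", plan.0))
}

fn main() {
    let frame = Frame {
        plan: LogicalPlan("scan -> filter".into()),
    };
    // The optimized plan only lives in this local binding ...
    let optimized = optimize(&frame.plan);
    // ... so the plan stored in the struct is still the unoptimized one.
    assert_ne!(optimized, frame.plan);
    println!("stored plan is still: {:?}", frame.plan);
}
```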
@houqp I don't mean create_physical_plan will be called twice. I mean https://github.com/apache/arrow-datafusion/blob/ad059a688fd8da7b360423e0d911f2f1f33dbb9f/datafusion/src/execution/context.rs#L613 will be called twice.
For example:
```rust
#[tokio::main]
async fn main() -> Result<()> {
    // create local execution context
    let mut ctx = ExecutionContext::new();

    let testdata = datafusion::arrow::util::test_util::parquet_test_data();

    // register parquet file with the execution context
    ctx.register_parquet(
        "alltypes_plain",
        &format!("{}/alltypes_plain.parquet", testdata),
    )
    .await?;

    // execute the query
    let df = ctx
        .sql(
            "SELECT int_col, double_col, CAST(date_string_col as VARCHAR) \
             FROM alltypes_plain \
             WHERE id > 1 AND tinyint_col < double_col",
        )
        .await?;

    // print the results
    df.show().await?;

    Ok(())
}
```
Both ctx.sql(..) and df.show() will call logical plan optimization.
Let me know where I misunderstand.
cc @alamb
```diff
@@ -176,6 +176,7 @@ impl ExecutionContext {
             config,
             execution_props: ExecutionProps::new(),
             object_store_registry: Arc::new(ObjectStoreRegistry::new()),
+            has_optimized: false,
```
This way - if we want to execute two queries in the same context, it seems only the first will be optimized?
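For concreteness, "two queries in the same context" would look roughly like the following, reusing the parquet registration from the example earlier in the thread (a sketch only; the queries are placeholders, and the comments describe the concern raised above about a context-wide has_optimized flag):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    let testdata = datafusion::arrow::util::test_util::parquet_test_data();
    ctx.register_parquet(
        "alltypes_plain",
        &format!("{}/alltypes_plain.parquet", testdata),
    )
    .await?;

    // First query: ctx.sql() optimizes this plan and, per the change being
    // discussed, would set the context-wide has_optimized flag.
    let df1 = ctx.sql("SELECT int_col FROM alltypes_plain WHERE id > 1").await?;
    df1.show().await?;

    // Second query on the *same* context: has_optimized would already be true
    // here, which is the situation the comment above is worried about.
    let df2 = ctx.sql("SELECT double_col FROM alltypes_plain WHERE id > 3").await?;
    df2.show().await?;

    Ok(())
}
```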
If we want to do this, I think it should be something on the LogicalPlan instead.
On the other hand, I don't think avoiding the double optimization across different query executions (e.g. collect, show) is very beneficial. The slower part of optimization is collecting statistics and using them for cost-based optimizations and pruning, which is not something we do in the logical optimizations.
> This way - if we want to execute two queries in the same context, it seems only the first will be optimized?

How do we execute two queries in the same context? Could you please give an example? Thanks very much @Dandandan
> If we want to do this, I think it should be something on the LogicalPlan instead.

I agree. ExecutionContext, as a global context, isn't suitable for this.
What could be done to avoid optimizing a logical plan twice is adding an is_optimized flag or something similar to LogicalPlan instead. After optimizing, we can set is_optimized to true on the logical plan of the dataframe.
It might also be good to have some numbers about how much time a typical full optimization pass costs (and/or to track some statistics) - I would expect in most cases it will be quite a bit less than, say, 1 ms.
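One rough way to get such numbers (a hypothetical sketch, not code from this PR; the CSV path and query are placeholders) is to time ctx.optimize() on the logical plan of a DataFrame:

```rust
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Placeholder table; any registered CSV/parquet source would do.
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new()).await?;

    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    let plan = df.to_logical_plan();

    // Time only the logical optimization pass. Note that, per the discussion
    // above, ctx.sql() may already have optimized this plan once, so this
    // measures an additional pass over an already-optimized plan.
    let start = Instant::now();
    let optimized = ctx.optimize(&plan)?;
    println!("logical optimization took {:?}", start.elapsed());
    println!("{:?}", optimized);

    Ok(())
}
```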
> It might also be good to have some numbers about how much time a typical full optimization pass costs (and/or to track some statistics) - I would expect in most cases it will be quite a bit less than, say, 1 ms.

Maybe we can do this in the next PR.
An argument against avoiding optimizing it twice across different query executions is this example:

```rust
let x = df.collect();
// change execution context, e.g. change enabled optimizations, parallelism, etc.
ctx.state...
// evaluate the dataframe with collect or a similar method
let y = df.collect();
```

Now, running the optimizer "again" for the new execution of the DataFrame will optimize it based on the provided configuration.
Thanks! @Dandandan. Happy weekend, happy coding! 🎉
A summary of what the PR will do.
About 1, I want to know whether we can directly add an is_optimized field to each variant, i.e. something like:

```rust
pub enum LogicalPlan {
    Projection {
        ...
        is_optimized: bool,
    },
    Filter {
        ...
        is_optimized: bool,
    },
    ...
}
```

@Dandandan @alamb @houqp Please help me check my thought, thanks!
I think it might be ok to optimize the plan twice (in other words, perhaps we can close the ticket as "working as expected"?). Do we have any examples of the double optimization causing problems (or taking overly long)? I think @Dandandan was also hinting at this point in his comment at #1183 (comment). Adding a lot of additional code to LogicalPlan (such as an is_optimized flag on every variant) doesn't seem worth it to me.
OK, no code is the best code 😄
Yes, I was hinting at that. If we are able to show otherwise (examples where logical plan optimization takes very long), then we can see if we can optimize for this case. Otherwise I agree "working as expected" should be the conclusion.
Thanks again for looking into this @xudong963.
Which issue does this PR close?
Closes #1182
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?