Fix logical plan optimization will execute twice in SQL mode #1183
@@ -56,8 +56,12 @@ impl DataFrameImpl {
     /// Create a physical plan
     async fn create_physical_plan(&self) -> Result<Arc<dyn ExecutionPlan>> {
         let state = self.ctx_state.lock().unwrap().clone();
+        let has_optimized = state.has_optimized;

I think it is ok to call I vaguely remember @Dandandan finding something similar in the past, but I can't find the reference.

I think redundant calls to logical plan optimization will cost performance. With this implementation, after the first optimization in https://github.com/apache/arrow-datafusion/blob/ad059a688fd8da7b360423e0d911f2f1f33dbb9f/datafusion/src/execution/context.rs#L749, the

@xudong963 I think what @alamb meant was the return value from

@houqp I don't mean
For example:

    // Imports needed to compile this example
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        // create local execution context
        let mut ctx = ExecutionContext::new();
        let testdata = datafusion::arrow::util::test_util::parquet_test_data();
        // register parquet file with the execution context
        ctx.register_parquet(
            "alltypes_plain",
            &format!("{}/alltypes_plain.parquet", testdata),
        )
        .await?;
        // execute the query
        let df = ctx
            .sql(
                "SELECT int_col, double_col, CAST(date_string_col as VARCHAR) \
                FROM alltypes_plain \
                WHERE id > 1 AND tinyint_col < double_col",
            )
            .await?;
        // print the results
        df.show().await?;
        Ok(())
    }

         let ctx = ExecutionContext::from(Arc::new(Mutex::new(state)));
-        let plan = ctx.optimize(&self.plan)?;
+        let mut plan: LogicalPlan = self.plan.clone();
+        if !has_optimized {
+            plan = ctx.optimize(&self.plan)?;
+        }
         ctx.create_physical_plan(&plan).await
     }
 }

This way, if we want to execute two queries in the same context, it seems only the first will be optimized?
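
To make the concern concrete, here is a minimal sketch using toy stand-in types (`Plan` and `Context` are hypothetical and are not the real DataFusion types). It assumes the flag lives on the shared context and is set by the first optimization pass, as `state.has_optimized` appears to be in the diff:

```rust
// Toy stand-ins for illustration only; these are NOT the real DataFusion types.
#[derive(Clone, Debug, PartialEq)]
struct Plan {
    query: String,
    optimized: bool,
}

#[derive(Default)]
struct Context {
    // Hypothetical context-wide flag, mirroring `state.has_optimized` in the diff.
    has_optimized: bool,
}

impl Context {
    /// Pretend optimizer pass: marks the plan and sets the context-wide flag.
    fn optimize(&mut self, plan: &Plan) -> Plan {
        self.has_optimized = true;
        Plan {
            query: plan.query.clone(),
            optimized: true,
        }
    }

    /// Mirrors the guarded call in the diff: skip optimization once the flag is set.
    fn create_physical_plan(&mut self, plan: &Plan) -> Plan {
        if self.has_optimized {
            plan.clone()
        } else {
            self.optimize(plan)
        }
    }
}

/// Runs two unrelated queries through one context and reports which were optimized.
fn run_two_queries() -> (bool, bool) {
    let mut ctx = Context::default();
    let first = ctx.create_physical_plan(&Plan {
        query: "SELECT 1".into(),
        optimized: false,
    });
    let second = ctx.create_physical_plan(&Plan {
        query: "SELECT 2".into(),
        optimized: false,
    });
    (first.optimized, second.optimized)
}
```

In this toy model `run_two_queries()` reports that only the first plan was optimized, because the flag describes the context rather than any particular plan.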

If we want to do this, I think it should be something on the LogicalPlan instead.

On the other hand, I don't think preventing a double optimize on different query executions (e.g. collect, show) is very beneficial. The slower part of optimization is collecting statistics and using them for cost-based optimizations and pruning, which is not something we do in the logical optimizations.

How would one execute two queries in the same context? Could you please give an example? Thanks very much, @Dandandan.

I agree. ExecutionContext as a global context isn't suitable for this.

What could be done to avoid optimizing a logical plan twice is adding an is_optimized flag, or something similar, to LogicalPlan instead. After optimizing we can set is_optimized to true on the logical plan of the dataframe.

It might also be good to have some numbers about how much time a typical full optimization pass costs (and/or to track some statistics). I would expect that in most cases it will be quite a bit less than, say, 1 ms.
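
A rough sketch of that idea, again with a toy type rather than the real DataFusion `LogicalPlan`: the `is_optimized` field, the `optimize` function, and the `timed_optimize` helper are all hypothetical names chosen for illustration. The flag travels with the plan, so a second pass over the same plan is skipped while other plans are unaffected, and `std::time::Instant` gives a simple way to collect the timing numbers mentioned above.

```rust
use std::time::{Duration, Instant};

// Toy stand-in for illustration only; NOT the real DataFusion `LogicalPlan`.
#[derive(Clone, Debug)]
struct LogicalPlan {
    description: String,
    // Hypothetical per-plan flag, as suggested in the comment above.
    is_optimized: bool,
}

/// Pretend optimizer. Returns the resulting plan and whether a pass actually ran.
/// A plan that is already marked as optimized is returned unchanged.
fn optimize(plan: &LogicalPlan) -> (LogicalPlan, bool) {
    if plan.is_optimized {
        return (plan.clone(), false);
    }
    let optimized = LogicalPlan {
        description: format!("optimized({})", plan.description),
        is_optimized: true,
    };
    (optimized, true)
}

/// Wraps `optimize` and also reports how long the call took.
fn timed_optimize(plan: &LogicalPlan) -> (LogicalPlan, bool, Duration) {
    let start = Instant::now();
    let (out, ran) = optimize(plan);
    (out, ran, start.elapsed())
}
```

Calling `timed_optimize` twice on the same plan runs the pass only the first time, and the returned `Duration` could be logged to check whether a full pass really stays well under 1 ms.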

Maybe we can do this in the next PR.