Enable DataFusion CBO and introduce DaskSqlOptimizer (#558)
* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral
* Updates for test_filter
* more of test_filter.py working with the exception of some date pytests
* Add workflow to keep datafusion dev branch up to date (#440)
* Include setuptools-rust in conda build recipe, in host and run
* Remove PyArrow dependency
* rebase with datafusion-sql-planner
* refactor changes that were inadvertent during rebase
* timestamp with local time zone
* Bump DataFusion version (#494)
* bump DataFusion version
* remove unnecessary downcasts and use separate structs for TableSource and TableProvider
* Include RelDataType work
* Include RelDataType work
* Introduced SqlTypeName Enum in Rust and mappings for Python
* impl PyExpr.getIndex()
* add getRowType() for logical.rs
* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes
* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict
* linter changes, why did that work on my local pre-commit??
* linter changes, why did that work on my local pre-commit??
* Convert final strs to SqlTypeName Enum
* removed a few print statements
* commit to share with colleague
* updates
* checkpoint
* Temporarily disable conda run_test.py script since it uses features not yet implemented
* formatting after upstream merge
* expose fromString method for SqlTypeName to use Enums instead of strings for type checking
* expanded SqlTypeName from_string() support
* accept INT as INTEGER
* tests update
* checkpoint
* checkpoint
* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls
* skip test that uses create statement for gpuci
* Basic DataFusion Select Functionality (#489)
* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral
* Updates for test_filter
* more of test_filter.py working with the exception of some date pytests
* Add workflow to keep datafusion dev branch up to date (#440)
* Include setuptools-rust in conda build recipe, in host and run
* Remove PyArrow dependency
* rebase with datafusion-sql-planner
* refactor changes that were inadvertent during rebase
* timestamp with local time zone
* Include RelDataType work
* Include RelDataType work
* Introduced SqlTypeName Enum in Rust and mappings for Python
* impl PyExpr.getIndex()
* add getRowType() for logical.rs
* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes
* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict
* linter changes, why did that work on my local pre-commit??
* linter changes, why did that work on my local pre-commit??
* Convert final strs to SqlTypeName Enum
* removed a few print statements
* Temporarily disable conda run_test.py script since it uses features not yet implemented
* expose fromString method for SqlTypeName to use Enums instead of strings for type checking
* expanded SqlTypeName from_string() support
* accept INT as INTEGER
* Remove print statements
* Default to UTC if tz is None
* Delegate timezone handling to the arrow library
* Updates from review

Co-authored-by: Charles Blackmon-Luca

* updates for expression
* uncommented pytests
* uncommented pytests
* code cleanup for review
* code cleanup for review
* Enabled more pytest that work now
* Enabled more pytest that work now
* Output Expression as String when BinaryExpr does not contain a named alias
* Output Expression as String when BinaryExpr does not contain a named alias
* Disable 2 pytest that are causing gpuCI issues. They will be addressed in a follow up PR
* Handle Between operation for case-when
* adjust timestamp casting
* Refactor projection _column_name() logic to the _column_name logic in expression.rs
* removed println! statements
* introduce join getCondition() logic for retrieving the combining Rex logic for joining
* Updates from review
* Add Offset and point to repo with offset in datafusion
* Introduce offset
* limit updates
* commit before upstream merge
* Code formatting
* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic
* Bump DataFusion version to get changes around variant_name()
* Use map partitions for determining the offset
* Merge with upstream
* Rename underlying DataContainer's DataFrame instance to match the column container names
* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset
* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes
* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions
* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)
* commit before merge
* Update function for gathering index of an expression
* Update for review check
* Adjust RelDataType to retrieve fully qualified column names
* Adjust base.py to get fully qualified column name
* Enable passing pytests in test_join.py
* Adjust keys provided by getting backend column mapping name
* Adjust output_col to not use the backend_column name for special reserved exprs
* uncomment cross join pytest which works now
* Uncomment passing pytests in test_select.py
* Review updates
* Add back complex join case condition, not just cross join but 'complex' joins
* Enable DataFusion CBO logic
* Disable EliminateFilter optimization rule
* updates
* Disable tests that hit CBO generated plan edge cases of yet to be implemented logic
* [REVIEW] - Modify sql.skip_optimize to use dask_config.get and remove used method parameter
* [REVIEW] - change name of configuration from skip_optimize to optimize
* [REVIEW] - Add OptimizeException catch and raise statements back
* Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, need to get the first agg df column since names are not valid
* Remove SQL from OptimizationException
* skip tests that CBO plan reorganization causes missing features to be present

Co-authored-by: Charles Blackmon-Luca
Co-authored-by: Andy Grove
1 parent 2a2b5d1, commit d233b9d
Showing 15 changed files with 176 additions and 25 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
use datafusion::error::DataFusionError; | ||
use datafusion::logical_expr::LogicalPlan; | ||
use datafusion::optimizer::eliminate_limit::EliminateLimit; | ||
use datafusion::optimizer::filter_push_down::FilterPushDown; | ||
use datafusion::optimizer::limit_push_down::LimitPushDown; | ||
use datafusion::optimizer::optimizer::OptimizerRule; | ||
use datafusion::optimizer::OptimizerConfig; | ||
|
||
use datafusion::optimizer::common_subexpr_eliminate::CommonSubexprEliminate; | ||
use datafusion::optimizer::projection_push_down::ProjectionPushDown; | ||
use datafusion::optimizer::single_distinct_to_groupby::SingleDistinctToGroupBy; | ||
use datafusion::optimizer::subquery_filter_to_join::SubqueryFilterToJoin; | ||
|
||
/// Houses the optimization logic for Dask-SQL. This struct controls which optimizer rules run | ||
/// and the order in which they are applied to the underlying `LogicalPlan` instance. | ||
pub struct DaskSqlOptimizer { | ||
optimizations: Vec<Box<dyn OptimizerRule + Send + Sync>>, | ||
} | ||
|
||
impl DaskSqlOptimizer { | ||
/// Creates a new `DaskSqlOptimizer` with the desired DataFusion optimizer rules, | ||
/// as well as any custom `OptimizerRule` trait impls that might be desired. | ||
pub fn new() -> Self { | ||
let mut rules: Vec<Box<dyn OptimizerRule + Send + Sync>> = Vec::new(); | ||
rules.push(Box::new(CommonSubexprEliminate::new())); | ||
rules.push(Box::new(EliminateLimit::new())); | ||
rules.push(Box::new(FilterPushDown::new())); | ||
rules.push(Box::new(LimitPushDown::new())); | ||
rules.push(Box::new(ProjectionPushDown::new())); | ||
rules.push(Box::new(SingleDistinctToGroupBy::new())); | ||
rules.push(Box::new(SubqueryFilterToJoin::new())); | ||
Self { | ||
optimizations: rules, | ||
} | ||
} | ||
|
||
/// Iterates through the configured `OptimizerRule`(s) to transform the input `LogicalPlan` | ||
/// to its final optimized form | ||
pub(crate) fn run_optimizations( | ||
&self, | ||
plan: LogicalPlan, | ||
) -> Result<LogicalPlan, DataFusionError> { | ||
let mut resulting_plan: LogicalPlan = plan; | ||
for optimization in &self.optimizations { | ||
resulting_plan = optimization.optimize(&resulting_plan, &OptimizerConfig::new())?; | ||
} | ||
Ok(resulting_plan) | ||
} | ||
} |
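The pipeline in `run_optimizations` applies each rule to the output of the previous one and stops at the first error, so the order in which rules are pushed can change the final plan. The following is a minimal, dependency-free sketch of that pattern. `ToyPlan`, `ToyError`, `ToyRule`, `ToyOptimizer`, and the three rules are hypothetical stand-ins invented for illustration; they are not DataFusion types and do not reproduce the behavior of any real DataFusion rule.

```rust
use std::fmt;

/// Toy stand-in for a logical plan: an ordered list of step names.
/// (Hypothetical; DataFusion's `LogicalPlan` is a much richer tree.)
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlan {
    pub steps: Vec<String>,
}

/// Toy stand-in for `DataFusionError` (hypothetical).
#[derive(Debug, PartialEq)]
pub struct ToyError(pub String);

impl fmt::Display for ToyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for ToyError {}

/// Hypothetical rule trait with the same shape as the loop above:
/// borrow a plan, return a new plan or an error.
pub trait ToyRule {
    fn name(&self) -> &str;
    fn optimize(&self, plan: &ToyPlan) -> Result<ToyPlan, ToyError>;
}

/// Drops every step named "noop".
pub struct RemoveNoops;

impl ToyRule for RemoveNoops {
    fn name(&self) -> &str {
        "remove_noops"
    }
    fn optimize(&self, plan: &ToyPlan) -> Result<ToyPlan, ToyError> {
        let steps: Vec<String> = plan
            .steps
            .iter()
            .filter(|s| s.as_str() != "noop")
            .cloned()
            .collect();
        Ok(ToyPlan { steps })
    }
}

/// Collapses runs of identical adjacent steps into one step.
pub struct DedupAdjacent;

impl ToyRule for DedupAdjacent {
    fn name(&self) -> &str {
        "dedup_adjacent"
    }
    fn optimize(&self, plan: &ToyPlan) -> Result<ToyPlan, ToyError> {
        let mut steps = plan.steps.clone();
        steps.dedup();
        Ok(ToyPlan { steps })
    }
}

/// Always fails, to show that the pipeline stops at the first error.
pub struct AlwaysFails;

impl ToyRule for AlwaysFails {
    fn name(&self) -> &str {
        "always_fails"
    }
    fn optimize(&self, _plan: &ToyPlan) -> Result<ToyPlan, ToyError> {
        Err(ToyError(format!("{}: unsupported plan", self.name())))
    }
}

/// Ordered collection of rules, analogous in shape to `DaskSqlOptimizer`.
pub struct ToyOptimizer {
    rules: Vec<Box<dyn ToyRule + Send + Sync>>,
}

impl ToyOptimizer {
    pub fn new(rules: Vec<Box<dyn ToyRule + Send + Sync>>) -> Self {
        Self { rules }
    }

    /// Feeds each rule the previous rule's output; `?` returns the first error.
    pub fn run(&self, plan: ToyPlan) -> Result<ToyPlan, ToyError> {
        let mut current = plan;
        for rule in &self.rules {
            current = rule.optimize(&current)?;
        }
        Ok(current)
    }
}
```

In this toy model, running `RemoveNoops` before `DedupAdjacent` on the steps `scan, noop, scan, filter` yields `scan, filter`, while the reverse order yields `scan, scan, filter`, because the duplicate steps only become adjacent once the `noop` is gone. That is the kind of ordering sensitivity the `DaskSqlOptimizer` doc comment refers to.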
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,3 +7,5 @@ sql: | |
case_sensitive: True | ||
|
||
predicate_pushdown: True | ||
|
||
optimize: True |
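The new `optimize: True` key makes the optimizer pass opt-out. According to the commit message, the setting is read on the Python side via `dask_config.get`; the sketch below only illustrates, in Rust, the general shape of gating an optimization pass behind such a boolean. `SqlSettings` and `maybe_optimize` are hypothetical names introduced for illustration and are not part of this commit.

```rust
/// Hypothetical settings struct mirroring the `sql.optimize` key above.
pub struct SqlSettings {
    pub optimize: bool,
}

impl Default for SqlSettings {
    /// Matches the default in the config diff: optimization is on.
    fn default() -> Self {
        Self { optimize: true }
    }
}

/// Runs `optimizer` on `plan` only when the flag is set;
/// otherwise hands the plan back unchanged.
pub fn maybe_optimize<P, E>(
    plan: P,
    settings: &SqlSettings,
    optimizer: impl FnOnce(P) -> Result<P, E>,
) -> Result<P, E> {
    if settings.optimize {
        optimizer(plan)
    } else {
        Ok(plan)
    }
}
```

Keeping the unoptimized path available is useful when an optimizer rule rewrites a plan into a shape that downstream code does not yet support, which the commit message notes for several disabled tests.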