remove duplicate the logic b/w DataFrame API and SQL planning #5686

jiangzhx · 2023-03-22T11:09:23Z

Which issue does this PR close?

The count wildcard rule has already been moved to the Analyzer in #5671,
so this PR removes the duplicated logic in SQL planning.

Closes #.

Rationale for this change

related issues: #5473 (comment)
related PR: #5671

What changes are included in this PR?

Are there any user-facing changes?

alamb

I am sorry for the delay in review. I will try to find more time to review this carefully tomorrow, but initially I am surprised that a PR that says it removes duplicated logic adds more code than it removes 🤔

alamb · 2023-03-27T21:03:15Z

datafusion/common/src/dfschema.rs

@@ -630,9 +630,9 @@ impl ExprSchema for DFSchema {
 #[derive(Debug, Clone, PartialEq, Eq, Hash)]
 pub struct DFField {
    /// Optional qualifier (usually a table or relation name)
-    qualifier: Option<OwnedTableReference>,
+    pub qualifier: Option<OwnedTableReference>,


Can you please explain the rationale for this change?

alamb · 2023-03-27T21:03:28Z

datafusion/core/tests/sql/mod.rs

@@ -1161,15 +1161,6 @@ async fn try_execute_to_batches(
 /// Execute query and return results as a Vec of RecordBatches
 async fn execute_to_batches(ctx: &SessionContext, sql: &str) -> Vec<RecordBatch> {
    let df = ctx.sql(sql).await.unwrap();
-
-    // We are not really interested in the direct output of optimized_logical_plan


Why was this removed?

jiangzhx · 2023-03-28T10:38:23Z

> I am sorry for the delay in review. I will try to find more time to review this carefully tomorrow, but initially I am surprised that a PR that says it removes duplicated logic adds more code than it removes 🤔

When I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.
The existing test cases are all based on SQL, where the SQL planner handled the count wildcard itself, so these gaps were not discovered when running cargo test.
Covering more scenarios in count_wildcard_rule is why this PR adds more code than it removes.

For example, before this PR:

// Worked (SQL):
ctx.sql("select COUNT(*) OVER(ORDER BY timestamp_col DESC RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING) from alltypes_tiny_pages")

// Failed (DataFrame API):
let df_results = ctx
    .table("alltypes_tiny_pages")
    .await?
    .select(vec![Expr::WindowFunction(expr::WindowFunction::new(
        WindowFunction::AggregateFunction(AggregateFunction::Count),
        vec![Expr::Wildcard],
        vec![],
        vec![Expr::Sort(Sort::new(
            Box::new(col("timestamp_col")),
            false,
            true,
        ))],
        WindowFrame {
            units: WindowFrameUnits::Range,
            start_bound: WindowFrameBound::Preceding(ScalarValue::IntervalDayTime(
                Some(6),
            )),
            end_bound: WindowFrameBound::Following(ScalarValue::IntervalDayTime(
                Some(2),
            )),
        },
    ))])?
    .explain(false, false)?
    .collect()
    .await?;
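
The coverage gap described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Expr` and `Plan` types and the `rewrite_expr` / `rewrite_plan` functions below are hypothetical stand-ins, not DataFusion's real types or its actual count_wildcard_rule, and it assumes the rule's job is to replace the wildcard argument of COUNT with a literal. The point it shows is that the rewrite has to recurse into every kind of plan node (window, union, and so on) and into nested expressions, otherwise plans built through the DataFrame API can still contain an unexpanded wildcard.

```rust
// Toy model only: hypothetical stand-ins, NOT DataFusion's `Expr` / `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Wildcard,
    Literal(i64),
    Column(String),
    /// COUNT(<arg>)
    Count(Box<Expr>),
    /// A window function wrapping an aggregate call plus ORDER BY expressions.
    Window { func: Box<Expr>, order_by: Vec<Expr> },
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
    Window { exprs: Vec<Expr>, input: Box<Plan> },
    Union(Vec<Plan>),
}

/// Rewrite COUNT(*) to COUNT(1) anywhere inside an expression tree.
fn rewrite_expr(expr: Expr) -> Expr {
    match expr {
        Expr::Count(arg) => match *arg {
            Expr::Wildcard => Expr::Count(Box::new(Expr::Literal(1))),
            other => Expr::Count(Box::new(rewrite_expr(other))),
        },
        // Without this arm, COUNT(*) OVER (...) would be left untouched.
        Expr::Window { func, order_by } => Expr::Window {
            func: Box::new(rewrite_expr(*func)),
            order_by: order_by.into_iter().map(rewrite_expr).collect(),
        },
        other => other,
    }
}

/// Apply the expression rewrite to every plan node, recursing into all inputs.
fn rewrite_plan(plan: Plan) -> Plan {
    match plan {
        Plan::Scan(name) => Plan::Scan(name),
        Plan::Projection { exprs, input } => Plan::Projection {
            exprs: exprs.into_iter().map(rewrite_expr).collect(),
            input: Box::new(rewrite_plan(*input)),
        },
        Plan::Window { exprs, input } => Plan::Window {
            exprs: exprs.into_iter().map(rewrite_expr).collect(),
            input: Box::new(rewrite_plan(*input)),
        },
        // A union has several inputs; each one must be visited.
        Plan::Union(inputs) => Plan::Union(inputs.into_iter().map(rewrite_plan).collect()),
    }
}
```

In this model, dropping the `Plan::Window` or `Plan::Union` arm (or the `Expr::Window` arm) reproduces the kind of gap reported here: the SQL path would not notice if its planner expands the wildcard itself, while a plan built directly from expressions would keep the wildcard.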

alamb · 2023-03-28T20:26:02Z

> Because when I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.

This makes sense -- thank you for the explanation @jiangzhx

Can you please add DataFrame tests for the relevant behavior (mostly so that we don't break it in the future by accident)?

alamb · 2023-03-28T20:26:35Z

Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.

jiangzhx · 2023-03-29T08:37:06Z

> > Because when I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.
>
> This makes sense -- thank you for the explanation @jiangzhx
>
> Can you please add DataFrame tests for the relevant behavior (mostly so that we don't break it in the future by accident)?

I added some test cases in tests/dataframe.rs:

test_count_wildcard_on_sort
test_count_wildcard_on_where_exist
test_count_wildcard_on_where_in
test_count_wildcard_on_window
test_count_wildcard_on_aggregate
test_count_wildcard_on_where_scalar_subquery

jiangzhx · 2023-04-14T09:58:43Z

I split this PR into two parts:

update count_wildcard_rule for more scenario #6010
remove duplicate the logic b/w DataFrame API and SQL planning

alamb · 2024-04-08T21:05:41Z

Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it.

jiangzhx marked this pull request as draft March 22, 2023 11:09

github-actions bot added the core (Core DataFusion crate), optimizer (Optimizer rules), physical-expr (Physical Expressions), and sql (SQL Planner) labels Mar 22, 2023

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from 8c7d06a to 2318a80 Compare March 22, 2023 11:17

github-actions bot removed the physical-expr (Physical Expressions) label Mar 22, 2023

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 4 times, most recently from 09c0ae4 to c9e610d Compare March 24, 2023 08:02

jiangzhx changed the title ~~remove unnecessary logic on sql count wildcard~~ remove duplicate the logic b/w DataFrame API and SQL planning? Mar 24, 2023

jiangzhx changed the title ~~remove duplicate the logic b/w DataFrame API and SQL planning?~~ remove duplicate the logic b/w DataFrame API and SQL planning Mar 24, 2023

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 9 times, most recently from 88db38a to e917a33 Compare March 25, 2023 02:23

jiangzhx marked this pull request as ready for review March 25, 2023 03:14

alamb reviewed Mar 27, 2023

alamb marked this pull request as draft March 28, 2023 20:26

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from e917a33 to 423e604 Compare March 29, 2023 04:55

github-actions bot removed the optimizer (Optimizer rules) label Mar 29, 2023

github-actions bot added the optimizer (Optimizer rules) label Mar 29, 2023

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 2 times, most recently from 11c2ff8 to 3f22001 Compare March 29, 2023 06:44

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from 3f22001 to 9c845de Compare March 30, 2023 08:28

jiangzhx mentioned this pull request Mar 31, 2023

SQL case scalar_subquery logical_paln unexpected Aggregate: groupBy=[[col]] #5791

Closed

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 3 times, most recently from 4d94f9c to dc5e1c0 Compare April 11, 2023 06:07

This was referenced Apr 11, 2023

Regression in 22.0.0 with filter push-down #5949

Closed

add an example of using DataFrame to create a subquery #5961

Merged

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 2 times, most recently from fb3d0ec to 692bfac Compare April 14, 2023 04:43

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 5 times, most recently from 7f2a745 to f308261 Compare April 17, 2023 07:28

github-actions bot removed the optimizer (Optimizer rules) label Apr 17, 2023

remove duplicate the logic b/w DataFrame API and SQL planning

623c634

jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from f308261 to 623c634 Compare April 19, 2023 08:17

github-actions bot added the optimizer (Optimizer rules) label Apr 19, 2023

alamb closed this Apr 8, 2024
