Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove duplicate the logic b/w DataFrame API and SQL planning #5686

Conversation

jiangzhx
Copy link
Contributor

@jiangzhx jiangzhx commented Mar 22, 2023

Which issue does this PR close?

now the count wildcard rules already move to Analyzer #5671
so remove duplicate the logic in SQL planning.

Closes #.

Rationale for this change

related issues: #5473 (comment)
related PR: #5671

What changes are included in this PR?

Are there any user-facing changes?

@jiangzhx jiangzhx marked this pull request as draft March 22, 2023 11:09
@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Mar 22, 2023
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from 8c7d06a to 2318a80 Compare March 22, 2023 11:17
@github-actions github-actions bot removed the physical-expr Physical Expressions label Mar 22, 2023
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 4 times, most recently from 09c0ae4 to c9e610d Compare March 24, 2023 08:02
@jiangzhx jiangzhx changed the title remove unnecessary logic on sql count wildcard remove duplicate the logic b/w DataFrame API and SQL planning? Mar 24, 2023
@jiangzhx jiangzhx changed the title remove duplicate the logic b/w DataFrame API and SQL planning? remove duplicate the logic b/w DataFrame API and SQL planning Mar 24, 2023
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 9 times, most recently from 88db38a to e917a33 Compare March 25, 2023 02:23
@jiangzhx jiangzhx marked this pull request as ready for review March 25, 2023 03:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for the delay in review. I will try and find more time to review this carefully tomorrow but initially I am surprised that says it removes duplicated logic adds more code than it removes 🤔

@@ -630,9 +630,9 @@ impl ExprSchema for DFSchema {
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DFField {
/// Optional qualifier (usually a table or relation name)
qualifier: Option<OwnedTableReference>,
pub qualifier: Option<OwnedTableReference>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain the rationale for this change?

@@ -1161,15 +1161,6 @@ async fn try_execute_to_batches(
/// Execute query and return results as a Vec of RecordBatches
async fn execute_to_batches(ctx: &SessionContext, sql: &str) -> Vec<RecordBatch> {
let df = ctx.sql(sql).await.unwrap();

// We are not really interested in the direct output of optimized_logical_plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed?

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 28, 2023

I am sorry for the delay in review. I will try and find more time to review this carefully tomorrow but initially I am surprised that says it removes duplicated logic adds more code than it removes 🤔

Because when I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.
Currently, these test cases are all based on SQL, so they are not discovered when running cargo test.
After I covered more scenarios for count_wildcard_rule, it resulted in more code added than removed.

for example, before this pr.

#worked
ctx.sql("select COUNT(*) OVER(ORDER BY timestamp_col DESC RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING)  from alltypes_tiny_pages")
#failed
let df_results = ctx
    .table("alltypes_tiny_pages")
    .await?
    .select(vec![Expr::WindowFunction(expr::WindowFunction::new(
        WindowFunction::AggregateFunction(AggregateFunction::Count),
        vec![Expr::Wildcard],
        vec![],
        vec![Expr::Sort(Sort::new(
            Box::new(col("timestamp_col")),
            false,
            true,
        ))],
        WindowFrame {
            units: WindowFrameUnits::Range,
            start_bound: WindowFrameBound::Preceding(ScalarValue::IntervalDayTime(
                Some(6),
            )),
            end_bound: WindowFrameBound::Following(ScalarValue::IntervalDayTime(
                Some(2),
            )),
        },
    ))])?
    .explain(false, false)?
    .collect()
    .await?;

@alamb
Copy link
Contributor

alamb commented Mar 28, 2023

Because when I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.

This makes sense -- thank you for the explanation @jiangzhx

Can you please add DataFrame tests the relevant behavior (mostly so that we don't break it in the future by accident)

@alamb
Copy link
Contributor

alamb commented Mar 28, 2023

Marking as draft to signify this PR has feedback and is not waiting for another review at the moment.

@alamb alamb marked this pull request as draft March 28, 2023 20:26
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from e917a33 to 423e604 Compare March 29, 2023 04:55
@github-actions github-actions bot removed the optimizer Optimizer rules label Mar 29, 2023
@github-actions github-actions bot added the optimizer Optimizer rules label Mar 29, 2023
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 2 times, most recently from 11c2ff8 to 3f22001 Compare March 29, 2023 06:44
@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 29, 2023

Because when I started to remove the duplicate logic between the DataFrame API and SQL planning, I found that count_wildcard_rule did not cover all scenarios, such as union, window, etc.

This makes sense -- thank you for the explanation @jiangzhx

Can you please add DataFrame tests the relevant behavior (mostly so that we don't break it in the future by accident)

i added some testcase in tests/dataframe.rs

  • test_count_wildcard_on_sort
  • test_count_wildcard_on_where_exist
  • test_count_wildcard_on_where_in
  • test_count_wildcard_on_window
  • test_count_wildcard_on_aggregate
  • test_count_wildcard_on_where_scalar_subquery

@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from 3f22001 to 9c845de Compare March 30, 2023 08:28
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 3 times, most recently from 4d94f9c to dc5e1c0 Compare April 11, 2023 06:07
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 2 times, most recently from fb3d0ec to 692bfac Compare April 14, 2023 04:43
@jiangzhx
Copy link
Contributor Author

split this pr in two part.

  1. update count_wildcard_rule for more scenario #6010
  2. remove duplicate the logic b/w DataFrame API and SQL planning

@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch 5 times, most recently from 7f2a745 to f308261 Compare April 17, 2023 07:28
@github-actions github-actions bot removed the optimizer Optimizer rules label Apr 17, 2023
@jiangzhx jiangzhx force-pushed the remove_unnecessary_logic_sql_count_wildcard branch from f308261 to 623c634 Compare April 19, 2023 08:17
@github-actions github-actions bot added the optimizer Optimizer rules label Apr 19, 2023
@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it.

@alamb alamb closed this Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants