feat: support unparsing LogicalPlan::Window nodes #10767

devinjdangelo · 2024-06-02T14:17:30Z

Which issue does this PR close?

closes #10664

Rationale for this change

Queries involving window functions are common and should be supported for unparsing a plan into SQL.

What changes are included in this PR?

Implements logic for unprojecting window function projections, similar to existing logic for aggregate functions
No longer throw unimplemented error on LogicalPlan::Window nodes
Fixes a subtle error in unparsing Window expressions

Are these changes tested?

Yes, with a new round trip plan_to_sql test

Are there any user-facing changes?

Additional queries can be unparsed

devinjdangelo · 2024-06-02T14:19:46Z

datafusion/sql/src/unparser/expr.rs

-                    self.scalar_to_sql(val).map(Box::new).ok(),
-                )
+                Ok(ast::WindowFrameBound::Preceding({
+                    let val = self.scalar_to_sql(val)?;


There is a subtle difference between a window frame bound that is None and one that is ScalarValue::Null.

The former yields UNBOUNDED PRECEDING and the latter yields NULL PRECEDING.

DataFusion's planner accepts the former, but rejects the latter.
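
The distinction can be illustrated with a minimal, self-contained sketch. The `Scalar` and `Bound` types and the `bound_to_sql` function are hypothetical stand-ins, not DataFusion's or sqlparser's actual types. The sketch only assumes that a null offset should be rendered with the UNBOUNDED keyword, which is the behavior the fix is aiming for.

```rust
// Toy stand-ins for illustration only; these are NOT DataFusion's
// ScalarValue or sqlparser's WindowFrameBound.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    UInt64(u64),
}

#[derive(Debug, Clone, PartialEq)]
enum Bound {
    Preceding(Scalar),
    CurrentRow,
    Following(Scalar),
}

/// Render a frame bound as SQL text, treating a null offset as UNBOUNDED.
fn bound_to_sql(bound: &Bound) -> String {
    // A null offset is mapped to the UNBOUNDED keyword instead of the
    // literal NULL, which the planner would reject in this position.
    fn offset(s: &Scalar) -> String {
        match s {
            Scalar::Null => "UNBOUNDED".to_string(),
            Scalar::UInt64(n) => n.to_string(),
        }
    }
    match bound {
        Bound::Preceding(s) => format!("{} PRECEDING", offset(s)),
        Bound::CurrentRow => "CURRENT ROW".to_string(),
        Bound::Following(s) => format!("{} FOLLOWING", offset(s)),
    }
}
```

Calling `bound_to_sql(&Bound::Preceding(Scalar::Null))` in this sketch produces `UNBOUNDED PRECEDING` rather than `NULL PRECEDING`.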

devinjdangelo · 2024-06-02T14:20:24Z

datafusion/sql/src/unparser/expr.rs

@@ -1148,7 +1158,7 @@ mod tests {
                    window_frame: WindowFrame::new(None),
                    null_treatment: None,
                }),
-                r#"ROW_NUMBER(col) OVER (ROWS BETWEEN NULL PRECEDING AND NULL FOLLOWING)"#,
+                r#"ROW_NUMBER(col) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)"#,


See comment on L520 for explanation of this test change.

devinjdangelo · 2024-06-02T14:22:34Z

datafusion/sql/src/unparser/utils.rs


-/// Recursively searches children of [LogicalPlan] to find an Aggregate node if one exists
+/// One of the possible aggregation plans which can be found within a single select query.
+pub(crate) enum AggVariant<'a> {


This assumes a single SELECT query can have either a window function or an aggregate function, but not both. A LogicalPlan can certainly have both, but I could not find an example of a single SELECT query without any nesting/derived table factors that is allowed to have both.
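
As a rough illustration of that assumption, here is a self-contained sketch of an "either aggregate or window" search over a toy plan tree. The `Plan` and `Found` types and `find_agg_or_window` are hypothetical and much simpler than the real LogicalPlan, AggVariant, and find_agg_node_within_select. In particular, the meaning given to the boolean flag (stop at a second projection, since that would belong to a nested query) is an assumption of this sketch.

```rust
// Toy plan tree for illustration; NOT DataFusion's LogicalPlan.
enum Plan {
    Scan(&'static str),
    Projection(Box<Plan>),
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    Window(Box<Plan>),
}

// Which kind of node was found within the current SELECT.
#[derive(Debug, PartialEq)]
enum Found {
    Aggregate,
    Window,
}

/// Walk down from the top of a SELECT and report the first Aggregate or
/// Window node. A second Projection is treated as the start of a nested
/// query, so the search stops there (an assumption of this sketch).
fn find_agg_or_window(plan: &Plan, already_projected: bool) -> Option<Found> {
    match plan {
        Plan::Aggregate(_) => Some(Found::Aggregate),
        Plan::Window(_) => Some(Found::Window),
        Plan::Projection(input) => {
            if already_projected {
                None
            } else {
                find_agg_or_window(input, true)
            }
        }
        Plan::Filter(input) => find_agg_or_window(input, already_projected),
        Plan::Scan(_) => None,
    }
}
```

Because the sketch returns as soon as it sees either node kind, it encodes the same "one or the other" assumption described in the comment.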

devinjdangelo · 2024-06-02T14:26:50Z

datafusion/sql/tests/cases/plan_to_sql.rs

@@ -127,7 +127,10 @@ fn roundtrip_statement() -> Result<()> {
            UNION ALL
            SELECT j2_string as string FROM j2
            ORDER BY string DESC
-            LIMIT 10"#
+            LIMIT 10"#,
+            "SELECT id, count(*) over (PARTITION BY first_name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 


The roundtrip test will fail if ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is not explicitly included. E.g. the DataFusion planner generates non-identical plans for the following two SQL queries:

SELECT id, count(*) over (PARTITION BY first_name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), last_name, sum(id) over (PARTITION BY first_name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), first_name from person

vs

SELECT id, count(*) over (PARTITION BY first_name), last_name, sum(id) over (PARTITION BY first_name), first_name from person

While the two plans are not identical (p1 != p2), I believe the difference is trivial and will actually result in the same computations taking place.

That is my understanding as well

We could plausibly simplify the resulting plan if the window bounds are the default

Can we also add some tests that have aggregate and window functions? Something like

SELECT id, count(distinct id), sum(id) OVER (PARTITION BY first_name) from person

SELECT id, sum(id) OVER (PARTITION BY first_name ROWS 5 PRECEDING ROWS 2 FOLLOWING) from person

It appears that the DataFusion planner does not support mixing aggregate and window functions. It does allow mixing window functions with different WindowSpecs, including some over all rows (which is almost the same thing as an aggregate function). I think this behavior makes sense, as an aggregate function is strict in how many tuples it returns (one per group), while a window function can return multiple tuples per group as needed.

DataFusion CLI v38.0.0
> create table person (id int, name varchar);
0 row(s) fetched.
Elapsed 0.001 seconds.

> insert into person values (1, 'a'), (2, 'b'), (3, 'c');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.001 seconds.

> select count(distinct id), sum(id) over (partition by name) from person;
Error during planning: Projection references non-aggregate values: Expression person.id could not be resolved from available columns: COUNT(DISTINCT person.id)

> select count(distinct id) over (), sum(id) over (partition by name) from person;
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| COUNT(person.id) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING | SUM(person.id) PARTITION BY [person.name] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| 3                                                                         | 2                                                                                                  |
| 3                                                                         | 1                                                                                                  |
| 3                                                                         | 3                                                                                                  |
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.
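
To make the row-count difference concrete, here is a small standalone sketch (plain Rust, no DataFusion involved) using the same three rows as the session above. The function names are made up for illustration. The aggregate collapses the input to a single value, while the window computation produces one value per input row.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Aggregate-style: COUNT(DISTINCT id) collapses all rows into one value.
fn count_distinct_ids(rows: &[(i32, &str)]) -> usize {
    rows.iter().map(|(id, _)| *id).collect::<BTreeSet<_>>().len()
}

/// Window-style: SUM(id) OVER (PARTITION BY name) yields one value per
/// input row, namely the total of that row's partition.
fn sum_id_over_name(rows: &[(i32, &str)]) -> Vec<i32> {
    // First pass: total per partition key.
    let mut totals: BTreeMap<&str, i32> = BTreeMap::new();
    for (id, name) in rows {
        *totals.entry(*name).or_insert(0) += *id;
    }
    // Second pass: attach the partition total to every input row,
    // keeping the input order.
    rows.iter().map(|(_, name)| totals[*name]).collect()
}
```

With the rows `(1, "a"), (2, "b"), (3, "c")`, the first function returns a single count while the second returns three values, one per row. A projection that selects both side by side has no single consistent row count, which matches the planner's refusal to mix them.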

I added some tests and made a few changes to correctly support unparsing a SELECT query with multiple different WindowSpecs.

The tests look great now -- thanks @devinjdangelo

devinjdangelo · 2024-06-02T14:28:55Z

datafusion/sql/src/unparser/plan.rs

-                                .map(|expr| self.expr_to_sql(expr))
-                                .collect::<Result<Vec<_>>>()?,
-                        ));
+                    if let Some(aggvariant) = find_agg_node_within_select(plan, true) {


The select_to_sql_recursively method has grown deeply nested/complex and is due for a refactor or at least breaking up into more helper methods to improve readability.

I agree this would be a good follow-on PR.

alamb

Thank you @devinjdangelo -- this is looking pretty neat

alamb · 2024-06-02T15:42:33Z

datafusion/sql/src/unparser/plan.rs

I agree this would be a good follow-on PR.

alamb · 2024-06-02T15:46:14Z

datafusion/sql/tests/cases/plan_to_sql.rs

While the two plans are not identical (p1 != p2), I believe the difference is trivial and will actually result in the same computations taking place.

That is my understanding as well

We could plausibly simplify the resulting plan if the window bounds are the default

Can we also add some tests that have aggregate and window functions? Something like

SELECT id, count(distinct id), sum(id) OVER (PARTITION BY first_name) from person

SELECT id, sum(id) OVER (PARTITION BY first_name ROWS 5 PRECEDING ROWS 2 FOLLOWING) from person

yyy1000 · 2024-06-02T16:24:59Z

datafusion/sql/src/unparser/utils.rs

+///
+/// For example, if expr contains the column expr "COUNT(*) PARTITION BY id" it will be transformed
+/// into an actual window expression as identified in the window node.
+pub(crate) fn unproject_window_exprs(expr: &Expr, window: &Window) -> Result<Expr> {


I tried to implement unproject_window_exprs before using the same approach as unproject_agg_exprs, that is, finding the index_of_column, but it was not correct. Could you tell me why there's a difference? 😊

The difference I found is that the Aggregate structure maintains all of the relevant Exprs that a Projection node may be referencing in its group by and aggr expr vecs. So, you can look up any column in the schema of the aggregate plan to find the correct expression to sub into the Projection node.

Window, on the other hand, only keeps the list of window functions, whereas the schema has all of the fields. So it's no longer possible to rely on the index of a column within the schema to find the correct expression.

Instead, I am using the auto generated column name for the expression to identify the correct expression. This is definitely a more brittle way to do it and could break if there are unexpected changes to how names are generated or propagated during planning.

It turns out another difference is there can be multiple Window nodes representing different WindowSpecs within the same select query. I had to make a few additional tweaks to handle this case.
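
A minimal sketch of the name-based lookup described above, using a toy expression type. The `Expr` enum, its `display_name` field, and `unproject_by_name` are hypothetical simplifications and not the actual unproject_window_exprs implementation. The sketch assumes a projected column carries exactly the auto-generated name of the window expression it refers to.

```rust
// Toy expression type for illustration; NOT DataFusion's Expr.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // `display_name` stands in for the auto-generated column name.
    WindowFunction { display_name: String },
    Alias(Box<Expr>, String),
}

/// Replace column references whose name matches a window expression's
/// generated name with that window expression. Columns with no match
/// are returned unchanged.
fn unproject_by_name(expr: &Expr, window_exprs: &[Expr]) -> Expr {
    match expr {
        Expr::Column(name) => window_exprs
            .iter()
            .find(|w| {
                matches!(w, Expr::WindowFunction { display_name } if display_name == name)
            })
            .cloned()
            .unwrap_or_else(|| expr.clone()),
        // Recurse through aliases so `col AS x` is also rewritten.
        Expr::Alias(inner, alias) => Expr::Alias(
            Box::new(unproject_by_name(inner, window_exprs)),
            alias.clone(),
        ),
        other => other.clone(),
    }
}
```

The brittleness mentioned above is visible here: if the generated name and the column name ever diverge, the lookup silently falls through and the column is left as-is.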

alamb

This looks great -- thank you @devinjdangelo

alamb · 2024-06-03T18:59:46Z

datafusion/sql/tests/cases/plan_to_sql.rs

The tests look great now -- thanks @devinjdangelo

* unparse window plans
* new tests + fixes
* fmt

unparse window plans

db7e7de

github-actions bot added the sql SQL Planner label Jun 2, 2024

devinjdangelo commented Jun 2, 2024

devinjdangelo mentioned this pull request Jun 2, 2024

Suport unparsing LogicalPlan::Window to SQL #10664

Closed

alamb reviewed Jun 2, 2024

yyy1000 reviewed Jun 2, 2024

devinjdangelo added 2 commits June 3, 2024 07:22

new tests + fixes

220ac53

fmt

54f8c32

alamb approved these changes Jun 3, 2024

alamb merged commit e4f7b98 into apache:main Jun 3, 2024
23 checks passed

devinjdangelo mentioned this pull request Jun 4, 2024

minor: Refactor some unparser methods to improve readability #10788

Merged

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024

feat: support unparsing LogicalPlan::Window nodes (apache#10767)

cee965e

* unparse window plans
* new tests + fixes
* fmt

sgrebnov mentioned this pull request Oct 1, 2024

Support unparsing plans with both Aggregation and Window functions #12705

Merged
