Regression in 22.0.0 with filter push-down #5949

andygrove · 2023-04-10T21:56:36Z

Describe the bug

The following code is from a unit test in arrow-datafusion-python. It works correctly with DataFusion 21.1.0.

    df = df.select(
        column("a") + column("b"),
        column("a") - column("b"),
    ).filter(column("a") > literal(2))

    # execute and collect the first (and only) batch
    result = df.collect()[0]

When I upgrade to DF 22.0.0, it fails with:

Exception: Schema error: No field named ca29d730badd94c3d96481c7edf6255b0.a. Valid fields are "ca29d730badd94c3d96481c7edf6255b0.a + ca29d730badd94c3d96481c7edf6255b0.b", "ca29d730badd94c3d96481c7edf6255b0.a - ca29d730badd94c3d96481c7edf6255b0.b".

To Reproduce

See apache/datafusion-python#320

Expected behavior

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

andygrove · 2023-04-10T22:24:46Z

The issue seems to have been introduced in #5831. @jackwener do you know why this PR would have introduced this change in behavior?

jackwener · 2023-04-11T03:29:52Z

It seems that this plan is unreasonable.

select(
        column("a") + column("b"),
        column("a") - column("b"),
    )

will get two col (a + b) (a - b)

filter(column("a") > literal(2))  can't find col(a)

why this PR would have introduced this change in behavior?

In my opinion, origin type coercion is in optimizer and optimizer can skip failed rule. So the problem is covered up.
After move it into Analyzer which don't skip failed rule, so this problem expose.

how do you think about it, cc @alamb @mingmwang @liukun4515

jiangzhx · 2023-04-11T08:09:56Z

hi, @jackwener
about pr #5686

before rebase with main ,it's work fine,after rebase,got error.

Error: SchemaError(FieldNotFound { field: Column { relation: None, name: "COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING" }, valid_fields: [Column { relation: Some(Bare { table: "alltypes_tiny_pages" }), name: "id" }] })

before PushDownProjection::

Projection: COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING [COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
  WindowAggr: windowExpr=[[COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 475368975085586025561263702016 PRECEDING AND 158456325028528675187087900672 FOLLOWING AS COUNT(*) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING]] [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N, COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
    TableScan: alltypes_tiny_pages [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N]

after PushDownProjection::

Projection: COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING [COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
  TableScan: alltypes_tiny_pages [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N]

To Reproduce

gh pr checkout 5686

cargo test --color=always --test dataframe test_count_wildcard_on_window

jackwener · 2023-04-11T09:14:35Z

It indeed is covered by skip failed rule

| initial_logical_plan                                       | Filter: t.a > Int32(1)                                                      |
|                                                            |   Projection: t.a - t.b                                                     |
|                                                            |     TableScan: t                                                            |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                                                          |
| logical_plan after decorrelate_where_exists                | SAME TEXT AS ABOVE                                                          |
| logical_plan after decorrelate_where_in                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                                                          |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                                                          |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after merge_projection                        | SAME TEXT AS ABOVE                                                          |
| logical_plan after rewrite_disjunctive_predicate           | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_limit                         | SAME TEXT AS ABOVE                                                          |
| logical_plan after propagate_empty_relation                | SAME TEXT AS ABOVE                                                          |
| logical_plan after filter_null_join_keys                   | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_outer_join                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after push_down_limit                         | SAME TEXT AS ABOVE                                                          |
| logical_plan after push_down_filter                        | Projection: t.a - t.b                                                       |
|                                                            |   Filter: t.a > Int32(1)                                                    |
|                                                            |     TableScan: t

After I investigate it, I find the cause of this problem.

This plan is wrong, but because of we don't have analyzer, so the wrong plan was cover by push_down_filter.

After push_down_filter, the wrong plan will be right plan.

jackwener · 2023-04-11T09:24:12Z

So, I think it isn't a regression.

But, If we want to support it, I think we can add a analyzer rule to add missing column. I'm not sure if we need to do this.

Dandandan · 2023-04-11T12:04:35Z

I agree, the program as shown in the description should fail

andygrove · 2023-04-11T18:08:49Z

Thanks for looking into this. I updated the tests to perform the filter before the projection.

andygrove added the bug Something isn't working label Apr 10, 2023

andygrove mentioned this issue Apr 10, 2023

Prepare for 22.0.0 release apache/datafusion-python#320

Merged

andygrove closed this as completed Apr 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in 22.0.0 with filter push-down #5949

Regression in 22.0.0 with filter push-down #5949

andygrove commented Apr 10, 2023

andygrove commented Apr 10, 2023

jackwener commented Apr 11, 2023 •

edited

Loading

jiangzhx commented Apr 11, 2023 •

edited

Loading

jackwener commented Apr 11, 2023 •

edited

Loading

jackwener commented Apr 11, 2023

Dandandan commented Apr 11, 2023

andygrove commented Apr 11, 2023

Regression in 22.0.0 with filter push-down #5949

Regression in 22.0.0 with filter push-down #5949

Comments

andygrove commented Apr 10, 2023

Describe the bug

To Reproduce

Expected behavior

Additional context

andygrove commented Apr 10, 2023

jackwener commented Apr 11, 2023 • edited Loading

jiangzhx commented Apr 11, 2023 • edited Loading

jackwener commented Apr 11, 2023 • edited Loading

jackwener commented Apr 11, 2023

Dandandan commented Apr 11, 2023

andygrove commented Apr 11, 2023

jackwener commented Apr 11, 2023 •

edited

Loading

jiangzhx commented Apr 11, 2023 •

edited

Loading

jackwener commented Apr 11, 2023 •

edited

Loading