Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 22.0.0 with filter push-down #5949

Closed
andygrove opened this issue Apr 10, 2023 · 7 comments
Closed

Regression in 22.0.0 with filter push-down #5949

andygrove opened this issue Apr 10, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Member

Describe the bug

The following code is from a unit test in arrow-datafusion-python. It works correctly with DataFusion 21.1.0.

    df = df.select(
        column("a") + column("b"),
        column("a") - column("b"),
    ).filter(column("a") > literal(2))

    # execute and collect the first (and only) batch
    result = df.collect()[0]

When I upgrade to DF 22.0.0, it fails with:

Exception: Schema error: No field named ca29d730badd94c3d96481c7edf6255b0.a. Valid fields are "ca29d730badd94c3d96481c7edf6255b0.a + ca29d730badd94c3d96481c7edf6255b0.b", "ca29d730badd94c3d96481c7edf6255b0.a - ca29d730badd94c3d96481c7edf6255b0.b".

To Reproduce

See apache/datafusion-python#320

Expected behavior

No response

Additional context

No response

@andygrove
Copy link
Member Author

The issue seems to have been introduced in #5831. @jackwener do you know why this PR would have introduced this change in behavior?

@jackwener
Copy link
Member

jackwener commented Apr 11, 2023

It seems that this plan is unreasonable.

select(
        column("a") + column("b"),
        column("a") - column("b"),
    )

will get two col (a + b) (a - b)

filter(column("a") > literal(2))  can't find col(a)

why this PR would have introduced this change in behavior?

In my opinion, origin type coercion is in optimizer and optimizer can skip failed rule. So the problem is covered up.
After move it into Analyzer which don't skip failed rule, so this problem expose.

how do you think about it, cc @alamb @mingmwang @liukun4515

@jiangzhx
Copy link
Contributor

jiangzhx commented Apr 11, 2023

hi, @jackwener
about pr #5686

before rebase with main ,it's work fine,after rebase,got error.

Error: SchemaError(FieldNotFound { field: Column { relation: None, name: "COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING" }, valid_fields: [Column { relation: Some(Bare { table: "alltypes_tiny_pages" }), name: "id" }] })

before PushDownProjection::

Projection: COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING [COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
  WindowAggr: windowExpr=[[COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 475368975085586025561263702016 PRECEDING AND 158456325028528675187087900672 FOLLOWING AS COUNT(*) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING]] [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N, COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
    TableScan: alltypes_tiny_pages [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N]

after PushDownProjection::

Projection: COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING [COUNT(UInt8(1)) ORDER BY [alltypes_tiny_pages.timestamp_col DESC NULLS FIRST] RANGE BETWEEN 6 PRECEDING AND 2 FOLLOWING:Int64;N]
  TableScan: alltypes_tiny_pages [id:Int32;N, bool_col:Boolean;N, tinyint_col:Int8;N, smallint_col:Int16;N, int_col:Int32;N, bigint_col:Int64;N, float_col:Float32;N, double_col:Float64;N, date_string_col:Utf8;N, string_col:Utf8;N, timestamp_col:Timestamp(Nanosecond, None);N, year:Int32;N, month:Int32;N]

To Reproduce

gh pr checkout 5686

cargo test --color=always --test dataframe test_count_wildcard_on_window

@jackwener
Copy link
Member

jackwener commented Apr 11, 2023

It indeed is covered by skip failed rule

| initial_logical_plan                                       | Filter: t.a > Int32(1)                                                      |
|                                                            |   Projection: t.a - t.b                                                     |
|                                                            |     TableScan: t                                                            |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                                                          |
| logical_plan after decorrelate_where_exists                | SAME TEXT AS ABOVE                                                          |
| logical_plan after decorrelate_where_in                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                                                          |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                                                          |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after merge_projection                        | SAME TEXT AS ABOVE                                                          |
| logical_plan after rewrite_disjunctive_predicate           | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_limit                         | SAME TEXT AS ABOVE                                                          |
| logical_plan after propagate_empty_relation                | SAME TEXT AS ABOVE                                                          |
| logical_plan after filter_null_join_keys                   | SAME TEXT AS ABOVE                                                          |
| logical_plan after eliminate_outer_join                    | SAME TEXT AS ABOVE                                                          |
| logical_plan after push_down_limit                         | SAME TEXT AS ABOVE                                                          |
| logical_plan after push_down_filter                        | Projection: t.a - t.b                                                       |
|                                                            |   Filter: t.a > Int32(1)                                                    |
|                                                            |     TableScan: t      

After I investigate it, I find the cause of this problem.

This plan is wrong, but because of we don't have analyzer, so the wrong plan was cover by push_down_filter.

After push_down_filter, the wrong plan will be right plan.

@jackwener
Copy link
Member

So, I think it isn't a regression.

But, If we want to support it, I think we can add a analyzer rule to add missing column. I'm not sure if we need to do this.

@Dandandan
Copy link
Contributor

I agree, the program as shown in the description should fail

@andygrove
Copy link
Member Author

Thanks for looking into this. I updated the tests to perform the filter before the projection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants