Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datafusion 42 does not raise plan error when UNION branches have different number of columns #13092

Closed
emanueledomingo opened this issue Oct 24, 2024 · 9 comments · Fixed by #13117
Assignees
Labels
bug Something isn't working

Comments

@emanueledomingo
Copy link

Describe the bug

Since datafusion 41, i got Datafusion Error when some SQL queries contain errors.

SELECT "Product" FROM "my_table" UNION ALL SELECT "Product", "Id" FROM "my_table"

raises

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[4], line 4
      1 query = """
      2 SELECT "Product" FROM "my_table" UNION ALL SELECT "Product", "Id" FROM "my_table"
      3 """
----> 4 df = ctx.sql(query)

File ~/.../site-packages/datafusion/context.py:514, in SessionContext.sql(self, query, options)
    499 """Create a :py:class:`~datafusion.DataFrame` from SQL query text.
    500 
    501 Note: This API implements DDL statements such as ``CREATE TABLE`` and
   (...)
    511     DataFrame representation of the SQL query.
    512 """
    513 if options is None:
--> 514     return DataFrame(self.ctx.sql(query))
    515 return DataFrame(self.ctx.sql_with_options(query, options.options_internal))

Exception: DataFusion error: Plan("Union queries must have the same number of columns, (left is 1, right is 2)")

With new datafusion version, this doesn't happens anymore (it crashes later, when i run .collect() after the .sql() result)

To Reproduce

import pyarrow as pa
import datafusion as df
import pyarrow.dataset as pda

t = pa.Table.from_pydict(
    {
        "Id": [0, 0, 0, 0, 0, 1, 1, 1, 1],
        "Product": ["A", "A", "A", "B", "C", "B", "C", "B", "B"],
    },
)

ctx = df.SessionContext()
ctx.register_dataset(name="my_table", dataset=pda.dataset(t))

query = """
SELECT "Product" FROM "my_table" UNION ALL SELECT "Product", "Id" FROM "my_table"
"""
df = ctx.sql(query)  # with DF 41 this riases an error, with 42 doesn't

Expected behavior

DataFusion error exception correctly raised

Additional context

No response

@emanueledomingo emanueledomingo added the bug Something isn't working label Oct 24, 2024
@Omega359
Copy link
Contributor

I think this is related to #11961

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

I think SQL requires the inputs to have the same number of columns

Here is an example in postgres

postgres=# create table foo (x int, y int) ;
CREATE TABLE
postgres=# insert into foo values (1, 1), (1, 2);
INSERT 0 2
postgres=# select x from foo UNION ALL select y, y FROM foo;
ERROR:  each UNION query must have the same number of columns
LINE 1: select x from foo UNION ALL select y, y FROM foo;
                                           ^
postgres=# select x FROM foo UNION ALL select x, y FROM foo;

Perhaps you can get equivalent behavior in DataFusion 42 using a null constant

SELECT "Product" FROM "my_table" UNION ALL SELECT "Product", "Id" FROM "my_table"

to

SELECT "Product", NULL FROM "my_table" UNION ALL SELECT "Product", "Id" FROM "my_table"

@Omega359
Copy link
Contributor

I think the author's concern is not that the sql is invalid, it's that it's not being detected as invalid as early as it used to be. Is that a correct summation @emanueledomingo ?

@alamb
Copy link
Contributor

alamb commented Oct 24, 2024

Ah, sorry -- that is my mistake

@jonahgao
Copy link
Member

I think we can check if the number of columns is the same when building Unions, as it is a common scenario and does not depend on wildcard expansion and type coercion.

pub fn union(left_plan: LogicalPlan, right_plan: LogicalPlan) -> Result<LogicalPlan> {
    if left_plan.schema().fields().len() != right_plan.schema().fields().len() {
        return plan_err!(
            "UNION queries have different number of columns: \
            left has {} columns whereas right has {} columns",
            left_plan.schema().fields().len(),
            right_plan.schema().fields().len()
        );
    }

@emanueledomingo
Copy link
Author

I think the author's concern is not that the sql is invalid, it's that it's not being detected as invalid as early as it used to be. Is that a correct summation @emanueledomingo ?

Yes exactly. I realised it because i have a set of unit tests to test a function that handles errors and one of them failed. Not a real bug to me, i just want to report that this behaviour changed!

@Omega359
Copy link
Contributor

take

@findepi
Copy link
Member

findepi commented Oct 26, 2024

#11961 is probably related, at least "Union queries must have the same number of columns" error changed there.

@emanueledomingo if this is about UNIONs, can we change the issue title to "Datafusion 42 does not raise plan error when UNION branches have different number of columns"?, or is there more to it?

@emanueledomingo emanueledomingo changed the title Datafusion 42 does not raise plan error on some queries Datafusion 42 does not raise plan error when UNION branches have different number of columns Oct 28, 2024
@emanueledomingo
Copy link
Author

#11961 is probably related, at least "Union queries must have the same number of columns" error changed there.

@emanueledomingo if this is about UNIONs, can we change the issue title to "Datafusion 42 does not raise plan error when UNION branches have different number of columns"?, or is there more to it?

I experienced this bug only with this kind of query. i have 5 wrong queries in the unit test and only this gave this error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants