Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with type coercion with CREATE TABLE AS SELECT ... inserting VALUES #13124

Open
Tracked by #14008 ...
alamb opened this issue Oct 26, 2024 · 6 comments
Open
Tracked by #14008 ...
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@alamb
Copy link
Contributor

alamb commented Oct 26, 2024

Describe the bug

This was reported by @findepi on #12864 in #12864 (comment). I am pulling it into its own ticket so we can track it / find it more easily

I think as long as there is no regression we should move on and file the ticket for the remaining issue.

Sorry for not following earlier, was on a full-day event yesterday.

There are regressions.
I kind of felt it's obvious from the way it works, sorry for not providing good examples earlier.

Before the change

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.005 seconds.

+---+
| a |
+---+
| 4 |
+---+
> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.078 seconds.

+---+
| a |
+---+
| 4 |
+---+

on current main

Wrong result:

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('+123')) t(a); SELECT * FROM t;
0 row(s) fetched.
Elapsed 0.071 seconds.

+---+
| a |
+---+
| 3 |

failure

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); SELECT * FROM t;
Arrow error: Cast error: Cannot cast string 'abcd' to value of Int32 type

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@alamb alamb added the bug Something isn't working label Oct 26, 2024
@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2024

@jayzhan211 notes #12864 (comment)

It seems the query is accidentally correct in before this change, because we don't know the result of the function when we build up Values plan. After this change, incorrect values ('abcd') is cast to column a instead of the result of the function.

The ideally solution is to find the result type of the function and check whether it matches the column type.

btw, I wonder is this query valid in postgres or elsewhere?

Valid query in postgres CREATE TABLE t AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

Invalid query in postgres CREATE TABLE t (a int) AS SELECT length(a)::int AS a FROM (VALUES ('+123')) t(a);

ERROR:  syntax error at or near "AS"
LINE 1: CREATE TABLE t(a int) AS SELECT length(a)::int AS a FROM (VA...

It seems there is no way to insert value together with table if the column type is defined 🤔

@alamb alamb changed the title (POTENTIAL) Regression (POTENTIAL) Regression with type coercsion with CREATE TABLE AS SELECT ... inserting VALUES Oct 26, 2024
@alamb alamb added the regression Something that used to work no longer does label Oct 26, 2024
@alamb alamb changed the title (POTENTIAL) Regression with type coercsion with CREATE TABLE AS SELECT ... inserting VALUES Error with type coercsion with CREATE TABLE AS SELECT ... inserting VALUES Oct 26, 2024
@alamb alamb changed the title Error with type coercsion with CREATE TABLE AS SELECT ... inserting VALUES Error with type coercion with CREATE TABLE AS SELECT ... inserting VALUES Oct 26, 2024
@jayzhan211
Copy link
Contributor

Should we consider fixing this? I’m leaning towards not making changes, primarily to stay consistent with PostgreSQL’s syntax. Given the lack of test coverage, it seems this syntax might not be intentionally supported.

@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2024

Should we consider fixing this? I’m leaning towards not making changes, primarily to stay consistent with PostgreSQL’s syntax. Given the lack of test coverage, it seems this syntax might not be intentionally supported.

I personally think it would nice to fix but there is no reason you need to do it personally or we need to block the 43 release for it. My rationale is:

  1. As you say this isn't something postgres supports
  2. there is a workaround for anyone relying on the old behavior -- which is to use another temporary relation

Here is the workaround

This fails

CREATE OR REPLACE TABLE t(a int) AS SELECT length(a) FROM (VALUES ('abcd')) t(a); 
SELECT * FROM t;

You can rewrite it to:

> CREATE OR REPLACE TABLE temp as  VALUES ('abcd');
0 row(s) fetched.
Elapsed 0.006 seconds.

> CREATE OR REPLACE TABLE t(a int) AS SELECT length(column1) from temp;
0 row(s) fetched.
Elapsed 0.008 seconds.

@findepi
Copy link
Member

findepi commented Oct 26, 2024

Just curious -- if we knew about this prior to merging #12864, would we still merge it?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 27, 2024

I’d prefer not to block on this failing query, as it isn’t intended for support in DataFusion, lacks tests, has no related issue, and doesn’t align with expected PostgreSQL or DuckDB behavior.

The query worked previously only because we weren’t validating values against the schema—its success was coincidental, not by design.

What’s your view on handling this kind of query? Should we avoid merging it, and if so, why?

If the change breaks a previous PR or reverses expected behavior in DataFusion, we should address it before merging. However, this isn’t a strict rule and can be discussed on a case-by-case basis.

@alamb
Copy link
Contributor Author

alamb commented Oct 28, 2024

Just curious -- if we knew about this prior to merging #12864, would we still merge it?

I think we would have had the same discussion we are having now -- is the benefit gained from #12864 worth the potential regression in a feature that maybe worked by accident. I also think it is reasonable to have different opinions on the cost/benefit analysis, for what it is worth

The conversation would have been different if there had been tests showing someone was relying on the old behavior

In such cases I normally default to going with the person contributing the change / putting in the effort to maintain it. If someone else was relying on the feature they'll tell us with a PR / issue in the future

I do wish we had infinite capacity to make the software perfect, but given we are constrained with resources I think we need to be pragmatic and align the contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
Development

No branches or pull requests

3 participants