Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48966][SQL] Improve error message with invalid unresolved column reference in UDTF call #47447

Closed
wants to merge 3 commits into from

Conversation

dtenedor
Copy link
Contributor

@dtenedor dtenedor commented Jul 22, 2024

What changes were proposed in this pull request?

This bug covers improving an error message in the event of invalid UDTF calls. For example:

select * from udtf(
  observed => TABLE(select column from t),
  value_col => classic_dollars
)

Currently we get:

Unsupported subquery expression: Table arguments are used in a function where they are not supported:
'UnresolvedTableValuedFunction [udtf], [observed => table-argument#68918 [], value_col => 'classic_dollars, false
   +- Project ...
       +- SubqueryAlias ...
          +- Relation ...

But the real error is that the user passed column identifier classic_dollars rather than string "classic_dollars" into the string argument.

The core reason is that CheckAnalysis checks the analyzer output tree for unresolved TABLE arguments first before checking for remaining unresolved attribute references.

To fix it, we update CheckAnalysis to move the check for remaining TABLE arguments to later. Now that query returns something like:

{
  "errorClass" : "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION",
  "sqlState" : "42703",
  "messageParameters" : {
    "objectName" : "`classic_dollars`"
  },
  "queryContext" : [ {
    "objectType" : "",
    "objectName" : "",
    "startIndex" : 93,
    "stopIndex" : 109,
    "fragment" : "classic_dollars"
  } ]
}

Why are the changes needed?

This improves error messages for SQL queries that are invalid, but that a user might reasonably create accidentally while figuring out how the syntax works for calling table-valued functions.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This PR adds and updates golden file test coverage.

Was this patch authored or co-authored using generative AI tooling?

Just some casual, everyday GitHub CoPilot usage.

@github-actions github-actions bot added the SQL label Jul 22, 2024
@dtenedor
Copy link
Contributor Author

cc @allisonwang-db @ueshin


// Check for unresolved TABLE arguments after the main check above to allow other analysis
// errors to apply first, providing better error messages.
getAllExpressions(operator).foreach(_.foreachUp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check this together with getAllExpressions(operator).foreach(_.foreachUp { in line 290 so we don't need to iterate the plan again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because we need to check all the expressions for "unresolved attribute" or other errors first, before we do this check for FunctionTableSubqueryArgumentExpression instances.

That code on L290 starts with this:

        getAllExpressions(operator).foreach(_.foreachUp {
          case a: Attribute if !a.resolved =>
            failUnresolvedAttribute(operator, a, "UNRESOLVED_COLUMN")
        ...

We need to make sure we perform that "unresolved attribute" check for all expressions in the plan before we proceed further to check for FunctionTableSubqueryArgumentExpression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I get it now. This is a node that should have been rewritten during analysis. cc @cloud-fan do you know if there are better places to add this?

Comment on lines 1071 to +1072
case _: FunctionTableSubqueryArgumentExpression =>
expr.failAnalysis(
errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT",
messageParameters = Map("treeNode" -> planToString(plan)))
// Do nothing here, since we will check for this pattern later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just remove this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally tried this, but found that we encountered a MatchError here in the presence of the FunctionTableSubqueryArgumentExpression. Therefore I had to add this "do nothing" case back here, to allow the new code on L395-403 to check for this pattern separately later.

Copy link
Contributor Author

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @allisonwang-db thanks for your review! Responded to your comments, please take another look.

Comment on lines 1071 to +1072
case _: FunctionTableSubqueryArgumentExpression =>
expr.failAnalysis(
errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT",
messageParameters = Map("treeNode" -> planToString(plan)))
// Do nothing here, since we will check for this pattern later.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally tried this, but found that we encountered a MatchError here in the presence of the FunctionTableSubqueryArgumentExpression. Therefore I had to add this "do nothing" case back here, to allow the new code on L395-403 to check for this pattern separately later.


// Check for unresolved TABLE arguments after the main check above to allow other analysis
// errors to apply first, providing better error messages.
getAllExpressions(operator).foreach(_.foreachUp {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because we need to check all the expressions for "unresolved attribute" or other errors first, before we do this check for FunctionTableSubqueryArgumentExpression instances.

That code on L290 starts with this:

        getAllExpressions(operator).foreach(_.foreachUp {
          case a: Attribute if !a.resolved =>
            failUnresolvedAttribute(operator, a, "UNRESOLVED_COLUMN")
        ...

We need to make sure we perform that "unresolved attribute" check for all expressions in the plan before we proceed further to check for FunctionTableSubqueryArgumentExpression.

@HyukjinKwon
Copy link
Member

Merged to master.

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…mn reference in UDTF call

### What changes were proposed in this pull request?

This bug covers improving an error message in the event of invalid UDTF calls. For example:

```
select * from udtf(
  observed => TABLE(select column from t),
  value_col => classic_dollars
)
```
Currently we get:

```
Unsupported subquery expression: Table arguments are used in a function where they are not supported:
'UnresolvedTableValuedFunction [udtf], [observed => table-argument#68918 [], value_col => 'classic_dollars, false
   +- Project ...
       +- SubqueryAlias ...
          +- Relation ...
```

But the real error is that the user passed column identifier classic_dollars rather than string "classic_dollars" into the string argument.

The core reason is that `CheckAnalysis` checks the analyzer output tree for unresolved TABLE arguments first before checking for remaining unresolved attribute references.

To fix it, we update `CheckAnalysis` to move the check for remaining `TABLE` arguments to later. Now that query returns something like:

```
{
  "errorClass" : "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION",
  "sqlState" : "42703",
  "messageParameters" : {
    "objectName" : "`classic_dollars`"
  },
  "queryContext" : [ {
    "objectType" : "",
    "objectName" : "",
    "startIndex" : 93,
    "stopIndex" : 109,
    "fragment" : "classic_dollars"
  } ]
}
```

### Why are the changes needed?

This improves error messages for SQL queries that are invalid, but that a user might reasonably create accidentally while figuring out how the syntax works for calling table-valued functions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This PR adds and updates golden file test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Just some casual, everyday GitHub CoPilot usage.

Closes apache#47447 from dtenedor/improve-error-udtf.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…mn reference in UDTF call

### What changes were proposed in this pull request?

This bug covers improving an error message in the event of invalid UDTF calls. For example:

```
select * from udtf(
  observed => TABLE(select column from t),
  value_col => classic_dollars
)
```
Currently we get:

```
Unsupported subquery expression: Table arguments are used in a function where they are not supported:
'UnresolvedTableValuedFunction [udtf], [observed => table-argument#68918 [], value_col => 'classic_dollars, false
   +- Project ...
       +- SubqueryAlias ...
          +- Relation ...
```

But the real error is that the user passed column identifier classic_dollars rather than string "classic_dollars" into the string argument.

The core reason is that `CheckAnalysis` checks the analyzer output tree for unresolved TABLE arguments first before checking for remaining unresolved attribute references.

To fix it, we update `CheckAnalysis` to move the check for remaining `TABLE` arguments to later. Now that query returns something like:

```
{
  "errorClass" : "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION",
  "sqlState" : "42703",
  "messageParameters" : {
    "objectName" : "`classic_dollars`"
  },
  "queryContext" : [ {
    "objectType" : "",
    "objectName" : "",
    "startIndex" : 93,
    "stopIndex" : 109,
    "fragment" : "classic_dollars"
  } ]
}
```

### Why are the changes needed?

This improves error messages for SQL queries that are invalid, but that a user might reasonably create accidentally while figuring out how the syntax works for calling table-valued functions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This PR adds and updates golden file test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Just some casual, everyday GitHub CoPilot usage.

Closes apache#47447 from dtenedor/improve-error-udtf.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…mn reference in UDTF call

### What changes were proposed in this pull request?

This bug covers improving an error message in the event of invalid UDTF calls. For example:

```
select * from udtf(
  observed => TABLE(select column from t),
  value_col => classic_dollars
)
```
Currently we get:

```
Unsupported subquery expression: Table arguments are used in a function where they are not supported:
'UnresolvedTableValuedFunction [udtf], [observed => table-argument#68918 [], value_col => 'classic_dollars, false
   +- Project ...
       +- SubqueryAlias ...
          +- Relation ...
```

But the real error is that the user passed column identifier classic_dollars rather than string "classic_dollars" into the string argument.

The core reason is that `CheckAnalysis` checks the analyzer output tree for unresolved TABLE arguments first before checking for remaining unresolved attribute references.

To fix it, we update `CheckAnalysis` to move the check for remaining `TABLE` arguments to later. Now that query returns something like:

```
{
  "errorClass" : "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION",
  "sqlState" : "42703",
  "messageParameters" : {
    "objectName" : "`classic_dollars`"
  },
  "queryContext" : [ {
    "objectType" : "",
    "objectName" : "",
    "startIndex" : 93,
    "stopIndex" : 109,
    "fragment" : "classic_dollars"
  } ]
}
```

### Why are the changes needed?

This improves error messages for SQL queries that are invalid, but that a user might reasonably create accidentally while figuring out how the syntax works for calling table-valued functions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This PR adds and updates golden file test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Just some casual, everyday GitHub CoPilot usage.

Closes apache#47447 from dtenedor/improve-error-udtf.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants