Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51208][SQL] ColumnDefinition.toV1Column should preserve EXISTS_DEFAULT resolution #49942

Closed
wants to merge 1 commit into from

Conversation

szehon-ho
Copy link
Contributor

@szehon-ho szehon-ho commented Feb 13, 2025

What changes were proposed in this pull request?

Fix ColumnDefintiion.toV1Column to preserve EXISTS_DEFAULT if it is already on the original column

Why are the changes needed?

Some external catalogs use toV1Column and fromV1Column. However, we noticed that toV1Column will un-resolve some already-resolved EXISTS default values.

Long Description: When user creates a column with default value, Spark will resolve the user-provided CURRENT_DEFAULT sql, to make an EXISTS_DEFAULT value. That is done to save the values at that time, for example to resolve current_database(), current_user() functions. Existing data will get those values.

The typical workflow is:

val existsSQL = ResolveDefaultColumns.analyze(col, type, defaultSQL)
// save both existsSQL and defaultSQL to column metadata

But this code sets existsDefault metadata back to the original defaultSQL.

Consuming code will get corrupted un-resolved expressions in EXISTS_DEFAULT, like current_user(). This means that later when these columns are read, these may need to be resolved again, potentially leading to data that changes value.

While Spark code base currently does not have such catalogs, there are some extneral catalogs out there that are hitting this issue.

Does this PR introduce any user-facing change?

Behavior should be fixed for consumers of the API

How was this patch tested?

Add unit test

@github-actions github-actions bot added the SQL label Feb 13, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-51208][SQL] ColumnDefinition.toV1Column should preserve EXISTS…_DEFAULT resolution [SPARK-51208][SQL] ColumnDefinition.toV1Column should preserve EXISTS_DEFAULT resolution Feb 13, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs). Thank you, @szehon-ho .

@szehon-ho
Copy link
Contributor Author

Thank you @dongjoon-hyun !

dongjoon-hyun pushed a commit that referenced this pull request Feb 13, 2025
…STS_DEFAULT` resolution

### What changes were proposed in this pull request?
Fix ColumnDefintiion.toV1Column to preserve EXISTS_DEFAULT if it is already on the original column

### Why are the changes needed?
Some external catalogs use toV1Column and fromV1Column.  However, we noticed that toV1Column will un-resolve some already-resolved EXISTS default values.

Long Description: When user creates a column with default value, Spark will resolve the user-provided CURRENT_DEFAULT sql, to make an EXISTS_DEFAULT value.  That is done to save the values at that time, for example to resolve current_database(), current_user() functions.  Existing data will get those values.

The typical workflow is:

```
val existsSQL = ResolveDefaultColumns.analyze(col, type, defaultSQL)
// save both existsSQL and defaultSQL to column metadata
 ```

But this code sets existsDefault metadata back to the original defaultSQL.

Consuming code will get corrupted un-resolved expressions in EXISTS_DEFAULT, like current_user().  This means that later when these columns are read, these may need to be resolved again, potentially leading to data that changes value.

While Spark code base currently does not have such catalogs, there are some extneral catalogs out there that are hitting this issue.

### Does this PR introduce _any_ user-facing change?
Behavior should be fixed for consumers of the API

### How was this patch tested?
Add unit test

Closes #49942 from szehon-ho/SPARK-51208.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e397207)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/4.0.

@@ -75,7 +75,7 @@ case class ColumnDefinition(
// For v1 CREATE TABLE command, we will resolve and execute the default value expression later
// in the rule `DataSourceAnalysis`. We just need to put the default value SQL string here.
metadataBuilder.putString(CURRENT_DEFAULT_COLUMN_METADATA_KEY, default.originalSQL)
metadataBuilder.putString(EXISTS_DEFAULT_COLUMN_METADATA_KEY, default.originalSQL)
metadataBuilder.putString(EXISTS_DEFAULT_COLUMN_METADATA_KEY, default.child.sql)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that, Expression#sql is for display only, and it's not guaranteed to be able to parse back. I think we should only use child.sql here if child is a literal, which means this ColumnDefinition has been resolved and constant-folded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, to check my understanding, in all known cases this should be resolved and folded (because it was already output of ResolveColumnDefaults.analyze). But I guess for safety, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we convert v2 CREATE TABLE command to v1, ColumnDefinition is only resolved but not constant-folded (still in the analyzer phase).

Copy link
Contributor Author

@szehon-ho szehon-ho Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I will make a follow up, I think there's many cases, but it doesnt hurt to only do it if the value is already resolved + folded .

test("SPARK-51208: ColumnDefinition.toV1Column should preserve EXISTS_DEFAULT resolution") {

def validateConvertedDefaults(
colName: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: 4 spaces indentation

@cloud-fan
Copy link
Contributor

With this change, I think we can simplify CatalogV2Utils.v2ColumnToStructField to ColumnDefinition.fromV2Column(col).toV1Column

@szehon-ho
Copy link
Contributor Author

#49947 is follow up

cloud-fan pushed a commit that referenced this pull request Feb 14, 2025
…erve EXISTS_DEFAULT resolution

Small follow up for: #49942

### What changes were proposed in this pull request?
Restrict the logic of #49942 to only resolved and folded existsDefaults.

### Why are the changes needed?
Previous part of PR may have unexpected behavior if expression is not resolved/folded.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49947 from szehon-ho/SPARK-51119-follow.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 14, 2025
…erve EXISTS_DEFAULT resolution

Small follow up for: #49942

### What changes were proposed in this pull request?
Restrict the logic of #49942 to only resolved and folded existsDefaults.

### Why are the changes needed?
Previous part of PR may have unexpected behavior if expression is not resolved/folded.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49947 from szehon-ho/SPARK-51119-follow.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 83ee2a0)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 21, 2025
…il existenceDefaultValues

### What changes were proposed in this pull request?
 The original change in #49840 was too optimistic and assumes that all column EXISTS_DEFAULT are already resolved and column folded.  However, if there is bad EXISTS_DEFAULT metadata (an unresolved expression is persisted) it will break.  Add fallback to use the original logic in that case.

### Why are the changes needed?
There are some cases where bad EXISTS_DEFAULT metadata is persisted by external catalogs, due to some bugs such as #49942 or other problems.

### Does this PR introduce _any_ user-facing change?
No, it should handle bad metadata better.

### How was this patch tested?
Add unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49962 from szehon-ho/SPARK-51119-follow-2.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 21, 2025
…il existenceDefaultValues

### What changes were proposed in this pull request?
 The original change in #49840 was too optimistic and assumes that all column EXISTS_DEFAULT are already resolved and column folded.  However, if there is bad EXISTS_DEFAULT metadata (an unresolved expression is persisted) it will break.  Add fallback to use the original logic in that case.

### Why are the changes needed?
There are some cases where bad EXISTS_DEFAULT metadata is persisted by external catalogs, due to some bugs such as #49942 or other problems.

### Does this PR introduce _any_ user-facing change?
No, it should handle bad metadata better.

### How was this patch tested?
Add unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49962 from szehon-ho/SPARK-51119-follow-2.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4ffc398)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants