-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-51208][SQL] ColumnDefinition.toV1Column
should preserve EXISTS_DEFAULT
resolution
#49942
Conversation
…_DEFAULT resolution
ColumnDefinition.toV1Column
should preserve EXISTS_DEFAULT
resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM (Pending CIs). Thank you, @szehon-ho .
Thank you @dongjoon-hyun ! |
…STS_DEFAULT` resolution ### What changes were proposed in this pull request? Fix ColumnDefintiion.toV1Column to preserve EXISTS_DEFAULT if it is already on the original column ### Why are the changes needed? Some external catalogs use toV1Column and fromV1Column. However, we noticed that toV1Column will un-resolve some already-resolved EXISTS default values. Long Description: When user creates a column with default value, Spark will resolve the user-provided CURRENT_DEFAULT sql, to make an EXISTS_DEFAULT value. That is done to save the values at that time, for example to resolve current_database(), current_user() functions. Existing data will get those values. The typical workflow is: ``` val existsSQL = ResolveDefaultColumns.analyze(col, type, defaultSQL) // save both existsSQL and defaultSQL to column metadata ``` But this code sets existsDefault metadata back to the original defaultSQL. Consuming code will get corrupted un-resolved expressions in EXISTS_DEFAULT, like current_user(). This means that later when these columns are read, these may need to be resolved again, potentially leading to data that changes value. While Spark code base currently does not have such catalogs, there are some extneral catalogs out there that are hitting this issue. ### Does this PR introduce _any_ user-facing change? Behavior should be fixed for consumers of the API ### How was this patch tested? Add unit test Closes #49942 from szehon-ho/SPARK-51208. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit e397207) Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to master/4.0. |
@@ -75,7 +75,7 @@ case class ColumnDefinition( | |||
// For v1 CREATE TABLE command, we will resolve and execute the default value expression later | |||
// in the rule `DataSourceAnalysis`. We just need to put the default value SQL string here. | |||
metadataBuilder.putString(CURRENT_DEFAULT_COLUMN_METADATA_KEY, default.originalSQL) | |||
metadataBuilder.putString(EXISTS_DEFAULT_COLUMN_METADATA_KEY, default.originalSQL) | |||
metadataBuilder.putString(EXISTS_DEFAULT_COLUMN_METADATA_KEY, default.child.sql) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, Expression#sql
is for display only, and it's not guaranteed to be able to parse back. I think we should only use child.sql
here if child
is a literal, which means this ColumnDefinition
has been resolved and constant-folded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, to check my understanding, in all known cases this should be resolved and folded (because it was already output of ResolveColumnDefaults.analyze). But I guess for safety, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we convert v2 CREATE TABLE command to v1, ColumnDefinition
is only resolved but not constant-folded (still in the analyzer phase).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will make a follow up, I think there's many cases, but it doesnt hurt to only do it if the value is already resolved + folded .
test("SPARK-51208: ColumnDefinition.toV1Column should preserve EXISTS_DEFAULT resolution") { | ||
|
||
def validateConvertedDefaults( | ||
colName: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: 4 spaces indentation
With this change, I think we can simplify |
#49947 is follow up |
…erve EXISTS_DEFAULT resolution Small follow up for: #49942 ### What changes were proposed in this pull request? Restrict the logic of #49942 to only resolved and folded existsDefaults. ### Why are the changes needed? Previous part of PR may have unexpected behavior if expression is not resolved/folded. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #49947 from szehon-ho/SPARK-51119-follow. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…erve EXISTS_DEFAULT resolution Small follow up for: #49942 ### What changes were proposed in this pull request? Restrict the logic of #49942 to only resolved and folded existsDefaults. ### Why are the changes needed? Previous part of PR may have unexpected behavior if expression is not resolved/folded. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #49947 from szehon-ho/SPARK-51119-follow. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 83ee2a0) Signed-off-by: Wenchen Fan <[email protected]>
…il existenceDefaultValues ### What changes were proposed in this pull request? The original change in #49840 was too optimistic and assumes that all column EXISTS_DEFAULT are already resolved and column folded. However, if there is bad EXISTS_DEFAULT metadata (an unresolved expression is persisted) it will break. Add fallback to use the original logic in that case. ### Why are the changes needed? There are some cases where bad EXISTS_DEFAULT metadata is persisted by external catalogs, due to some bugs such as #49942 or other problems. ### Does this PR introduce _any_ user-facing change? No, it should handle bad metadata better. ### How was this patch tested? Add unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #49962 from szehon-ho/SPARK-51119-follow-2. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…il existenceDefaultValues ### What changes were proposed in this pull request? The original change in #49840 was too optimistic and assumes that all column EXISTS_DEFAULT are already resolved and column folded. However, if there is bad EXISTS_DEFAULT metadata (an unresolved expression is persisted) it will break. Add fallback to use the original logic in that case. ### Why are the changes needed? There are some cases where bad EXISTS_DEFAULT metadata is persisted by external catalogs, due to some bugs such as #49942 or other problems. ### Does this PR introduce _any_ user-facing change? No, it should handle bad metadata better. ### How was this patch tested? Add unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #49962 from szehon-ho/SPARK-51119-follow-2. Authored-by: Szehon Ho <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 4ffc398) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Fix ColumnDefintiion.toV1Column to preserve EXISTS_DEFAULT if it is already on the original column
Why are the changes needed?
Some external catalogs use toV1Column and fromV1Column. However, we noticed that toV1Column will un-resolve some already-resolved EXISTS default values.
Long Description: When user creates a column with default value, Spark will resolve the user-provided CURRENT_DEFAULT sql, to make an EXISTS_DEFAULT value. That is done to save the values at that time, for example to resolve current_database(), current_user() functions. Existing data will get those values.
The typical workflow is:
But this code sets existsDefault metadata back to the original defaultSQL.
Consuming code will get corrupted un-resolved expressions in EXISTS_DEFAULT, like current_user(). This means that later when these columns are read, these may need to be resolved again, potentially leading to data that changes value.
While Spark code base currently does not have such catalogs, there are some extneral catalogs out there that are hitting this issue.
Does this PR introduce any user-facing change?
Behavior should be fixed for consumers of the API
How was this patch tested?
Add unit test