[SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property #30809
Conversation
@@ -3075,7 +3085,9 @@ object SQLConf {
     RemovedConfig("spark.sql.optimizer.planChangeLog.rules", "3.1.0", "",
       s"Please use `${PLAN_CHANGE_LOG_RULES.key}` instead."),
     RemovedConfig("spark.sql.optimizer.planChangeLog.batches", "3.1.0", "",
-      s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead.")
+      s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead."),
+    RemovedConfig("spark.sql.sources.schemaStringLengthThreshold", "3.2.0", "4000",
I removed the old config because people are unlikely to set it (4000 is the actual Hive limitation), and we can't fall back to a static config from a dynamic config (I tried, and there are object initialization issues).
Is there any case where users would set spark.sql.hive.tablePropertyLengthThreshold to make this work, then? It looks like we don't need to expose a configuration at all if people are still unlikely to set it.
e.g. maybe users have a Hive-compatible metastore that doesn't have this limitation, or has a different limit (e.g. Glue).
It's also useful for testing :)
LGTM
retest this please
Merged to master. |
What changes were proposed in this pull request?
Hive metastore has a limitation on table property length. To work around it, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, since they can grow very large.
This PR refactors the table property splitting code so that it can be shared between the schema JSON string and histogram column stats.
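The splitting scheme described above can be sketched roughly as follows. This is a minimal standalone sketch, not Spark's actual implementation: the helper names and the exact property-key layout (`<key>.numParts` plus `<key>.part.N`) are illustrative assumptions here. A value that exceeds the metastore's length threshold is chunked across several table properties, and a count property lets the reader reassemble it:

```scala
// Hypothetical sketch of splitting a long value across table properties.
// Key layout assumed for illustration: "<key>.numParts" and "<key>.part.<i>".
object TablePropSplitSketch {
  def splitLargeTableProp(
      key: String,
      value: String,
      threshold: Int): Map[String, String] = {
    if (value.length <= threshold) {
      // Short enough: store as a single property.
      Map(key -> value)
    } else {
      // Chunk the string so each part fits under the metastore limit.
      val parts = value.grouped(threshold).toSeq
      val partProps = parts.zipWithIndex.map {
        case (part, i) => s"$key.part.$i" -> part
      }
      Map(s"$key.numParts" -> parts.length.toString) ++ partProps
    }
  }

  def readLargeTableProp(
      props: Map[String, String],
      key: String): Option[String] = {
    // Prefer the unsplit form; otherwise reassemble from the numbered parts.
    props.get(key).orElse {
      props.get(s"$key.numParts").map { numParts =>
        (0 until numParts.toInt).map(i => props(s"$key.part.$i")).mkString
      }
    }
  }
}
```

Reading first checks for the unsplit key so that properties written before the split (or small enough to avoid it) remain readable unchanged.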
Why are the changes needed?
To be able to analyze tables when the histogram data is big.
Does this PR introduce any user-facing change?
no
How was this patch tested?
Existing tests and new tests.