[SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property #30809
Conversation
@@ -3075,7 +3085,9 @@ object SQLConf {
     RemovedConfig("spark.sql.optimizer.planChangeLog.rules", "3.1.0", "",
       s"Please use `${PLAN_CHANGE_LOG_RULES.key}` instead."),
     RemovedConfig("spark.sql.optimizer.planChangeLog.batches", "3.1.0", "",
-      s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead.")
+      s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead."),
+    RemovedConfig("spark.sql.sources.schemaStringLengthThreshold", "3.2.0", "4000",
I removed the old config because people are unlikely to set it (4000 is the actual Hive limitation), and we can't fall back to a static config from a dynamic config (I tried, and there are object initialization issues).
Is there any case where users would set spark.sql.hive.tablePropertyLengthThreshold to make this work, then? It looks like we don't need to expose a configuration at all if people are still unlikely to set it.
e.g. maybe users have a Hive-compatible metastore that doesn't have this limitation, or has a different limit (e.g. Glue).
It's also useful for testing :)
LGTM
retest this please
Merged to master. |
What changes were proposed in this pull request?
Hive metastore has a limitation on table property length. To work around it, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, since they can grow very large.
This PR refactors the table property splitting code so that it can be shared between the schema JSON string and histogram column stats.
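The splitting scheme described above can be sketched roughly as follows. This is a minimal standalone sketch, not Spark's actual implementation: the helper names and the exact property-key layout (`<key>.numParts` plus `<key>.part.N`) are illustrative assumptions here. A value that exceeds the metastore's length threshold is chunked across several table properties, and a count property lets the reader reassemble it:

```scala
// Hypothetical sketch of splitting a long value across table properties.
// Key layout assumed for illustration: "<key>.numParts" and "<key>.part.<i>".
object TablePropSplitSketch {
  def splitLargeTableProp(
      key: String,
      value: String,
      threshold: Int): Map[String, String] = {
    if (value.length <= threshold) {
      // Short enough: store as a single property.
      Map(key -> value)
    } else {
      // Chunk the string so each part fits under the metastore limit.
      val parts = value.grouped(threshold).toSeq
      val partProps = parts.zipWithIndex.map {
        case (part, i) => s"$key.part.$i" -> part
      }
      Map(s"$key.numParts" -> parts.length.toString) ++ partProps
    }
  }

  def readLargeTableProp(
      props: Map[String, String],
      key: String): Option[String] = {
    // Prefer the unsplit form; otherwise reassemble from the numbered parts.
    props.get(key).orElse {
      props.get(s"$key.numParts").map { numParts =>
        (0 until numParts.toInt).map(i => props(s"$key.part.$i")).mkString
      }
    }
  }
}
```

Reading first checks for the unsplit key so that properties written before the split (or small enough to avoid it) remain readable unchanged.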
Why are the changes needed?
To be able to analyze tables when the histogram data is big.
Does this PR introduce any user-facing change?
no
How was this patch tested?
Existing tests and new tests.