Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property #30809

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

Hive metastore has a limitation for the table property length. To work around it, Spark split the schema json string into several parts when saving to hive metastore as table properties. We need to do the same for histogram column stats as it can go very big.

This PR refactors the table property splitting code, so that we can share it between the schema json string and histogram column stats.

Why are the changes needed?

To be able to analyze table when histogram data is big.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing test and new tests

@cloud-fan
Copy link
Contributor Author

cc @wzhfy @wangyum

@SparkQA
Copy link

SparkQA commented Dec 16, 2020

Test build #132898 has finished for PR 30809 at commit ed0dee1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Test build #132930 has finished for PR 30809 at commit e161298.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -3075,7 +3085,9 @@ object SQLConf {
RemovedConfig("spark.sql.optimizer.planChangeLog.rules", "3.1.0", "",
s"Please use `${PLAN_CHANGE_LOG_RULES.key}` instead."),
RemovedConfig("spark.sql.optimizer.planChangeLog.batches", "3.1.0", "",
s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead.")
s"Please use `${PLAN_CHANGE_LOG_BATCHES.key}` instead."),
RemovedConfig("spark.sql.sources.schemaStringLengthThreshold", "3.2.0", "4000",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the old config, because people are unlikely to set it (4000 is the actual hive limitation), and we can't fallback to static config in dynamic config (I tried and there are object initialization issues).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case when users set spark.sql.hive.tablePropertyLengthThreshold to make it working then? Looks like we don't have to expose a configuration at all if it's still unlikely for people to set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. maybe users have a hive-compatible metastore which doesn't have this limitation or have a different limitation (Glue).

It's also useful for testing :)

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37547/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37550/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37550/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37547/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Test build #132947 has finished for PR 30809 at commit 4e8e984.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Test build #132944 has finished for PR 30809 at commit 0722f60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37558/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37558/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Test build #132955 has finished for PR 30809 at commit ef4358d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132996 has finished for PR 30809 at commit 7ea6d8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37619/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37619/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #133020 has finished for PR 30809 at commit b1c9e78.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants