
[SPARK-2706][SQL] Enable Spark to support Hive 0.13 #2241

Closed · wants to merge 55 commits

Conversation

zhzhan
Contributor

@zhzhan zhzhan commented Sep 2, 2014

Given that many users are trying to use Hive 0.13 in Spark, and given the API-level incompatibility between hive-0.12 and hive-0.13, I want to propose the following approach. It has no or minimal impact on the existing hive-0.12 support, but makes it possible to jumpstart development of hive-0.13 and future version support.

Approach: Introduce a “hive.version” property, and manipulate the pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,

  1. For each Hive version there is a very light layer of shim code handling the API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1.
  2. Add a new profile hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.
  3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile is activated, which picks up the version-specific shim layer and configuration, mainly the Hive jars and the version-specific shim, e.g., v0.13.1.
  4. With this approach, nothing changes in the current hive-0.12 support.
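The shim pattern sketched in the steps above can be illustrated as follows. This is a hypothetical sketch, not the actual code in the PR: the trait, object, and method names are made up for the example; the idea is that each version directory provides an object implementing a common interface, and the build compiles exactly one of them in.

```scala
// Hypothetical sketch of the shim-layer idea: one common interface,
// one concrete object per Hive version directory. Names are illustrative.
trait HiveShim {
  def version: String
  // An example of an API whose behavior differs between Hive 0.12 and
  // 0.13 (decimal gained explicit precision/scale in 0.13).
  def decimalTypeName(precision: Int, scale: Int): String
}

// Would live under sql/hive/v0.12.0
object Shim12 extends HiveShim {
  val version = "0.12.0"
  def decimalTypeName(precision: Int, scale: Int): String = "decimal"
}

// Would live under sql/hive/v0.13.1
object Shim13 extends HiveShim {
  val version = "0.13.1"
  def decimalTypeName(precision: Int, scale: Int): String =
    s"decimal($precision,$scale)"
}

// The rest of the code programs against HiveShim; only one Shim object
// is on the compile path for a given build.
val shim: HiveShim = Shim13
println(shim.decimalTypeName(10, 2)) // prints "decimal(10,2)"
```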

No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
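A rough sketch of how the property-activated profile in step 3 might look in Maven (hypothetical fragment, not the exact pom.xml change in this PR; the shim-directory property name is made up):

```xml
<!-- Hypothetical pom.xml fragment: the profile activates whenever
     -Dhive.version=... is passed on the command line. -->
<profile>
  <id>hive-versions</id>
  <activation>
    <property>
      <name>hive.version</name>
    </property>
  </activation>
  <properties>
    <!-- picks up the version-specific shim source directory -->
    <hive.shim.dir>v0.13.1</hive.shim.dir>
  </properties>
</profile>
```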

Note that with hive-0.13, hive-thriftserver is not enabled; that should be fixed by another JIRA. Also, we don’t need -Phive together with -Dhive.version when building (we should probably switch to -Phive -Dhive.version=xxx once the thrift server is also supported on hive-0.13.1).

@AmplabJenkins

Can one of the admins verify this patch?

@liancheng
Contributor

Would you mind changing the PR title to [SPARK-2706][SQL] Enable Spark to support Hive 0.13?

@liancheng
Contributor

ok to test

@zhzhan zhzhan changed the title Spark 2706 [SPARK-2706][SQL] Enable Spark to support Hive 0.13 Sep 3, 2014
@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

done.

@marmbrus
Contributor

marmbrus commented Sep 3, 2014

ok to test

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2241 at commit 94b4fdc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2241 at commit 94b4fdc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class wrapperToPartition(p: Partition)
    • implicit class wrapperToHive(client: Hive)
    • class ShimContext(conf: Configuration) extends Context(conf)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • implicit class wrapperToPartition(p: Partition)
    • class ShimContext(conf: Configuration) extends Context(conf)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)

@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

Are the failed test cases in the thrift server and the Python script false positives? My local test without this change also has the thrift server cases failing. Please comment.

@liancheng
Contributor

retest this please

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2241 at commit 94b4fdc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2241 at commit 94b4fdc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

PySpark again :(

retest this please

@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

Not sure why PySpark fails. Can somebody help retest or provide some insight? By the way, I saw some other PRs with a similar failure around that time.

@liancheng
Contributor

PySpark test suites are somewhat flaky and sometimes fail; it's unrelated to your changes.

@liancheng
Contributor

retest this please

@pwendell
Contributor

@marmbrus from a build perspective this LGTM with the caveat that right now it's only testing Hive compatibility for 0.12 tests and may require further modification to actually pass 0.13 tests. Up to you in terms of whether that blocks merging.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2241 at commit cbb4691.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21718/

@zhzhan
Contributor Author

zhzhan commented Oct 15, 2014

@marmbrus I actually tested the code quite exhaustively, in both unit tests and system tests, in a sandbox and on a real cluster, and didn't see major issues. Regarding the compatibility tests, there are several test case failures due to Hive 0.13 internal behavior changes, e.g., Hive decimal. We can fix those in a follow-up. In my view, it would be good to take an incremental approach: the current patch does not impact the existing hive 12 support, but enables the community to actively improve hive0.13 support.

Some immediate benefits: 1st, native Parquet support; 2nd, some new UDFs in hive 0.13; and 3rd, better support for ORC as a source, e.g., compression, predicate pushdown, etc.

Please let me know if you have any other concerns.

results.map { r =>
  r match {
    case s: String => s
    case o => o.toString
  }
}
Contributor

Here r may be an Array type (https://github.com/scwf/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchFormatter.java#L53-L64); we should cover that case, otherwise the console result will be printed as follows:

   result
   [Object@5e41108b

On the other hand, I suggest we do some tests with this PR merged together with #2685 to check the basic functionality.
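A minimal sketch of the fix being suggested here, with a made-up results sequence (in the real code the values come from Hive's fetch formatter, not a hard-coded Seq):

```scala
// Hypothetical stand-in for the fetched results: Hive 0.13's
// FetchFormatter can hand back Array values, not just Strings.
val results: Seq[Any] = Seq("a", Array[Object]("b", "c"))

val printed = results.map {
  case s: String => s
  // Without this case, an Array prints as "[Ljava.lang.Object;@5e41108b"
  case a: Array[_] => a.mkString("\t")
  case o => o.toString
}

println(printed.mkString("\n"))
```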

Contributor Author

@scwf Thanks for the review. The reason I did this is that in the Hive testing code I didn't find a case where the result value is not Array[String], and there the results are even initialized as Array[String]. But to be safe, I will change the code to process Array[Array[Object]].

This patch is independent of the thrift server, but the thrift server patch should be verified after this one goes upstream, mainly due to the pom file changes.

@zhzhan
Contributor Author

zhzhan commented Oct 20, 2014

@scwf I did some basic functionality testing with your thrift patch, and it looks OK to me. By the way, because the 0.13.1 customized package is not available now, I reverted the pom for testing.

@scwf
Contributor

scwf commented Oct 20, 2014

Thanks, if you have any comments, let me know :)

@marmbrus
Contributor

I think this is looking pretty good, but I'm not okay with merging it before the tests are passing for Hive 13. Let me take a look and see how hard that will be.

@scwf
Contributor

scwf commented Oct 21, 2014

We can regenerate the golden answers for hive 0.13, as I did in my closed PR; how about that?

@marmbrus
Contributor

@scwf, which PR?

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf The golden answers differ between hive12 and hive13. We need some extra shim layer to handle that.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@marmbrus I think he refers to #2499

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf Did you also replace the query plans for hive0.13 in your other PR? I also saw some query plan changes in hive0.13.

@scwf
Contributor

scwf commented Oct 21, 2014

@marmbrus in #2499 I regenerated the golden answers and changed some *.ql files because of 0.13 changes; the tests passed on my local machine.
@zhzhan I don't get you, why replace the query plans?

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf I mean the "change some *.ql" you already did. The problem is that it needs another layer to take care of the compatibility test suite, and I have not found a good way to do it. I will think again about whether there is a simple way to make it work.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf I am wondering how you handle the decimal support, since hive-0.13 has new semantics for this type, and it seems it cannot be made compatible by regenerating golden answers; it has to be fixed inside Spark.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@marmbrus FYI: I ran the compatibility tests, and so far the major outstanding issues are, 1st: decimal support, and 2nd: udf7 and udf_round, which can be fixed, though I am not 100% sure it's the right way. Most of the other failures are false positives and should be solved by regenerating golden answers.

@marmbrus
Contributor

A few comments: I'm not talking only about getting the current tests to pass, but about upgrading the test set to include the new files. Also, I hope to update the whitelist to include any new tests that are now passing.

I'm not particularly concerned about matching every small detail (e.g., we don't need to match the empty comment field on metadata based on version). What is important is that we can connect to both Hive 12 & 13 metastores and that queries run and return the correct answers.

@zhzhan
Contributor Author

zhzhan commented Oct 22, 2014

@marmbrus Thanks for the comments.

Given that we have to support hive-0.12, there are two approaches I can think of to address the issue.

1st: we can temporarily make the compatibility tests hive12-only in the pom, and find a good way, as a follow-up, to add corresponding compatibility tests for hive0.13 in an elegant way. This approach also unblocks some other JIRAs and builds the foundation for further development of hive 0.13 feature support.

2nd: we can create a separate set of files for hive-0.13, e.g., compatibility suite, golden plans, golden answers, which may involve hundreds of files. In addition, we would need to change the current basic Hive test code to adapt to different Hive versions. I think this approach may be a little rushed, and it also makes the scope of this PR really big and hard to maintain.

I prefer the first approach, opening follow-up JIRAs to address the leftovers in a more granular way.

Please let me know what you think. If you have other options, please also let me know.

@marmbrus
Contributor

Okay, I've merged this to master. Will file a PR shortly to fix the tests.

@qiaohaijun

I will try it

asfgit pushed a commit that referenced this pull request Oct 31, 2014
 In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, referring to #2241.

 1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility

 2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version

 3 SBT cmd for different version as follows:
   hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
   hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

 4 Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver

Author: wangfei <[email protected]>
Author: scwf <[email protected]>

Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:

f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts