
[SPARK-2706][SQL] Enable Spark to support Hive 0.13 #2241

Closed · wants to merge 55 commits

Conversation

zhzhan
Contributor

@zhzhan zhzhan commented Sep 2, 2014

Given that many users are trying to use Hive 0.13 in Spark, and given the API-level incompatibility between hive-0.12 and hive-0.13, I want to propose the following approach. It has no or minimal impact on the existing hive-0.12 support, but makes it possible to jumpstart development of hive-0.13 and future version support.

Approach: Introduce a “hive.version” property, and manipulate the pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,

  1. For each Hive version there is a very light layer of shim code handling the API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1.
  2. Add a new profile hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.
  3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile is activated, which picks up the version-specific shim layer and configuration, mainly the Hive jars and the version-specific shim, e.g., v0.13.1.
  4. With this approach, nothing changes in the current hive-0.12 support.
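The shim pattern sketched in the steps above can be illustrated as follows. This is a hypothetical sketch, not the actual code in the PR: the trait, object, and method names are made up for the example; the idea is that each version directory provides an object implementing a common interface, and the build compiles exactly one of them in.

```scala
// Hypothetical sketch of the shim-layer idea: one common interface,
// one concrete object per Hive version directory. Names are illustrative.
trait HiveShim {
  def version: String
  // An example of an API whose behavior differs between Hive 0.12 and
  // 0.13 (decimal gained explicit precision/scale in 0.13).
  def decimalTypeName(precision: Int, scale: Int): String
}

// Would live under sql/hive/v0.12.0
object Shim12 extends HiveShim {
  val version = "0.12.0"
  def decimalTypeName(precision: Int, scale: Int): String = "decimal"
}

// Would live under sql/hive/v0.13.1
object Shim13 extends HiveShim {
  val version = "0.13.1"
  def decimalTypeName(precision: Int, scale: Int): String =
    s"decimal($precision,$scale)"
}

// The rest of the code programs against HiveShim; only one Shim object
// is on the compile path for a given build.
val shim: HiveShim = Shim13
println(shim.decimalTypeName(10, 2)) // prints "decimal(10,2)"
```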

No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
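A rough sketch of how the property-activated profile in step 3 might look in Maven (hypothetical fragment, not the exact pom.xml change in this PR; the shim-directory property name is made up):

```xml
<!-- Hypothetical pom.xml fragment: the profile activates whenever
     -Dhive.version=... is passed on the command line. -->
<profile>
  <id>hive-versions</id>
  <activation>
    <property>
      <name>hive.version</name>
    </property>
  </activation>
  <properties>
    <!-- picks up the version-specific shim source directory -->
    <hive.shim.dir>v0.13.1</hive.shim.dir>
  </properties>
</profile>
```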

Note that with hive-0.13, hive-thriftserver is not enabled; that should be fixed by another JIRA. Also, we don’t need -Phive together with -Dhive.version when building (we should probably switch to -Phive -Dhive.version=xxx once the thrift server is also supported on hive-0.13.1).

@AmplabJenkins

Can one of the admins verify this patch?

@liancheng
Contributor

Would you mind changing the PR title to [SPARK-2706][SQL] Enable Spark to support Hive 0.13?

@liancheng
Contributor

ok to test

@zhzhan zhzhan changed the title Spark 2706 [SPARK-2706][SQL] Enable Spark to support Hive 0.13 Sep 3, 2014
@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

done.

@marmbrus
Contributor

marmbrus commented Sep 3, 2014

ok to test

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2241 at commit 94b4fdc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2241 at commit 94b4fdc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class wrapperToPartition(p: Partition)
    • implicit class wrapperToHive(client: Hive)
    • class ShimContext(conf: Configuration) extends Context(conf)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • implicit class wrapperToPartition(p: Partition)
    • class ShimContext(conf: Configuration) extends Context(conf)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)

@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

Are the failed test cases in the thrift server and the Python script false positives? My local test without this change also has the thrift server cases failing. Please comment.

@liancheng
Contributor

retest this please

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2241 at commit 94b4fdc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2241 at commit 94b4fdc.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

PySpark again :(

retest this please

@zhzhan
Contributor Author

zhzhan commented Sep 3, 2014

Not sure why PySpark fails. Can somebody help retest or provide some insight? By the way, I saw some other PRs with a similar failure around that time.

@liancheng
Contributor

PySpark test suites are somewhat flaky and sometimes fail; it's unrelated to your changes.

@liancheng
Contributor

retest this please

@pwendell
Contributor

@marmbrus from a build perspective this LGTM with the caveat that right now it's only testing Hive compatibility for 0.12 tests and may require further modification to actually pass 0.13 tests. Up to you in terms of whether that blocks merging.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2241 at commit cbb4691.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21718/

@zhzhan
Contributor Author

zhzhan commented Oct 15, 2014

@marmbrus I actually tested the code quite exhaustively, in both unit tests and system tests, in a sandbox and on a real cluster, and didn't see major issues. Regarding the compatibility tests, there are several test case failures due to Hive 0.13 internal behavior changes, e.g., Hive decimal. We can fix those in a follow-up. In my view, it would be good to take an incremental approach: the current patch does not impact the existing hive 12 support, but enables the community to actively improve hive0.13 support.

Some immediate benefits: 1st, native Parquet support; 2nd, some new UDFs in hive 0.13; and 3rd, better support for ORC as a source, e.g., compression, predicate pushdown, etc.

Please let me know if you have any other concerns.

results.map { r =>
  r match {
    case s: String => s
    case o => o.toString
  }
}
Contributor

Here r may be an Array type (https://github.com/scwf/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchFormatter.java#L53-L64); we should cover that case, otherwise the console result will be printed as follows:

   result
   [Object@5e41108b

On the other hand, I suggest we do some tests with this PR merged together with #2685 to check the basic functionality.
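A minimal sketch of the fix being suggested here, with a made-up results sequence (in the real code the values come from Hive's fetch formatter, not a hard-coded Seq):

```scala
// Hypothetical stand-in for the fetched results: Hive 0.13's
// FetchFormatter can hand back Array values, not just Strings.
val results: Seq[Any] = Seq("a", Array[Object]("b", "c"))

val printed = results.map {
  case s: String => s
  // Without this case, an Array prints as "[Ljava.lang.Object;@5e41108b"
  case a: Array[_] => a.mkString("\t")
  case o => o.toString
}

println(printed.mkString("\n"))
```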

Contributor Author

@scwf Thanks for the review. The reason I did this is that in the Hive testing code I didn't find a case where the result value is not Array[String], and there the results are even initialized as Array[String]. But to be safe, I will change the code to process Array[Array[Object]].

This patch is independent of the thrift server, but the thrift server patch should be verified after this one goes upstream, mainly due to the pom file changes.

@zhzhan
Contributor Author

zhzhan commented Oct 20, 2014

@scwf I did some basic functionality testing with your thrift patch, and it looks OK to me. By the way, because the 0.13.1 customized package is not available now, I reverted the pom for testing.

@scwf
Contributor

scwf commented Oct 20, 2014

Thanks, if you have any comments, let me know :)

@marmbrus
Contributor

I think this is looking pretty good, but I'm not okay with merging it before the tests are passing for Hive 13. Let me take a look and see how hard that will be.

@scwf
Contributor

scwf commented Oct 21, 2014

We can regenerate the golden answers for hive 0.13, as I did in my closed PR; how about that?

@marmbrus
Contributor

@scwf, which PR?

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf The golden answers differ between hive12 and hive13. We need some extra shim layer to handle that.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@marmbrus I think he refers to #2499

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf Did you also replace the query plans for hive0.13 in your other PR? I also saw some query plan changes in hive0.13.

@scwf
Contributor

scwf commented Oct 21, 2014

@marmbrus in #2499 I regenerated the golden answers and changed some *.ql files because of 0.13 changes; the tests passed on my local machine.
@zhzhan I don't get you, why replace the query plans?

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf I mean the "change some *.ql" you already did. The problem is that it needs another layer to take care of the compatibility test suite, and I have not found a good way to do it. I will think again about whether there is a simple way to make it work.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@scwf I am wondering how you handle the decimal support, since hive-0.13 has new semantics for this type, and it seems it cannot be made compatible by regenerating golden answers; it has to be fixed inside Spark.

@zhzhan
Contributor Author

zhzhan commented Oct 21, 2014

@marmbrus FYI: I ran the compatibility tests, and so far the major outstanding issues are, 1st: decimal support, and 2nd: udf7 and udf_round, which can be fixed, though I am not 100% sure it's the right way. Most of the other failures are false positives and should be solved by regenerating golden answers.

@marmbrus
Contributor

A few comments: I'm not talking only about getting the current tests to pass, but about upgrading the test set to include the new files. Also, I hope to update the whitelist to include any new tests that are now passing.

I'm not particularly concerned about matching every small detail (e.g., we don't need to match the empty comment field on metadata based on version). What is important is that we can connect to both Hive 12 & 13 metastores and that queries run and return the correct answers.

@zhzhan
Contributor Author

zhzhan commented Oct 22, 2014

@marmbrus Thanks for the comments.

Given that we have to support hive-0.12, there are two approaches I can think of to address the issue.

1st: we can temporarily make the compatibility tests hive12-only in the pom, and find a good way, as a follow-up, to add corresponding compatibility tests for hive0.13 in an elegant way. This approach also unblocks some other JIRAs and builds the foundation for further development of hive 0.13 feature support.

2nd: we can create a separate set of files for hive-0.13, e.g., compatibility suite, golden plans, golden answers, which may involve hundreds of files. In addition, we would need to change the current basic Hive test code to adapt to different Hive versions. I think this approach may be a little rushed, and it also makes the scope of this PR really big and hard to maintain.

I prefer the first approach, opening follow-up JIRAs to address the leftovers in a more granular way.

Please let me know what you think. If you have other options, please also let me know.

@marmbrus
Contributor

Okay, I've merged this to master. Will file a PR shortly to fix the tests.

@qiaohaijun

I will try it

asfgit pushed a commit that referenced this pull request Oct 31, 2014
 In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, referring to #2241.

 1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility

 2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version

 3 SBT cmd for different version as follows:
   hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
   hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

 4 Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver

Author: wangfei <[email protected]>
Author: scwf <[email protected]>

Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:

f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts