[SPARK-2706][SQL] Enable Spark to support Hive 0.13 #2241
Conversation
Can one of the admins verify this patch?
Would you mind changing the PR title to
ok to test
done.
ok to test
QA tests have started for PR 2241 at commit
QA tests have finished for PR 2241 at commit
Are the failed test cases in the Thrift server and the Python script false positives? My local test without this change also has the Thrift server cases failing. Please comment.
retest this please
QA tests have started for PR 2241 at commit
QA tests have finished for PR 2241 at commit
PySpark again :( retest this please
Not sure why it fails PySpark. Can somebody help retest or provide insight? By the way, I saw some other PRs with a similar failure around that time.
PySpark test suites are somewhat flaky and sometimes fail; it's unrelated to your changes.
retest this please
@marmbrus from a build perspective this LGTM, with the caveat that right now it's only testing Hive compatibility for 0.12 tests and may require further modification to actually pass 0.13 tests. Up to you in terms of whether that blocks merging.
QA tests have finished for PR 2241 at commit
Test PASSed.
@marmbrus I actually tested the code quite exhaustively, in both unit tests and system tests, in a sandbox and on a real cluster, and didn't see major issues. Regarding the compatibility test, there are several test case failures due to Hive 0.13 internal behavior changes, e.g., Hive decimal. We can fix those in a follow-up. In my view, it would be good to take an incremental approach: the current patch has no impact on existing Hive 0.12 support, but enables the community to actively improve Hive 0.13 support. Some immediate benefits: 1st, native Parquet support; 2nd, some new UDFs in Hive 0.13; and 3rd, better support for ORC as a source, e.g., compression, predicate pushdown, etc. Please let me know if you have any other concerns.
results.map { r =>
  r match {
    case s: String => s
    case o => o.toString
Here r may be an Array type (https://github.com/scwf/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchFormatter.java#L53-L64); we should cover that case, otherwise the console result will be printed as follows:

result
[Object@5e41108b

On the other hand, I suggest we do some tests with this PR merged with #2685 to check the basic functionality.
@scwf Thanks for the review. The reason I did this is that in the Hive testing code I actually didn't find a case where the result value is not Array[String], and the results there are even initialized as Array[String]. But for safety, I will change the code to process Array[Array[Object]].

This patch is independent of the Thrift server, but the Thrift server patch should be verified after this one goes upstream, mainly due to the pom file change.
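For illustration, a minimal sketch of what that extended match could look like, assuming rows may arrive either as plain strings or as arrays of column values; the tab separator and "NULL" placeholder are assumptions, not taken from the patch:

  results.map {
    case s: String => s
    // Assumed case: Hive 0.13's FetchFormatter can return each row as an
    // array of column values; join them so the console does not print the
    // default Object@<hash> representation.
    case a: Array[_] => a.map(v => if (v == null) "NULL" else v.toString).mkString("\t")
    case o => o.toString
  }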
@scwf I did some basic functionality testing with your Thrift patch, and it looks OK to me. By the way, because the customized 0.13.1 package is not available yet, I reverted the pom for testing.
Thanks; if you have any comments, let me know :)
I think this is looking pretty good, but I'm not okay with merging it before the tests are passing for Hive 0.13. Let me take a look and see how hard that will be.
We can regenerate the golden answers for Hive 0.13 as I did in my closed PR; how about that?
@scwf, which PR?
@scwf The golden answers differ between Hive 0.12 and Hive 0.13. We need an extra shim layer to handle that.
@scwf Did you also replace the query plans for Hive 0.13 in your other PR? I also saw some query plan changes in Hive 0.13.
@scwf I mean the "change some *.ql" you already did. The problem is that it needs another layer to take care of the compatibility test suite. I have not found a good way to do it. I will think again about whether there is a simple way to make it work.
@scwf I am wondering how you handle decimal support, since Hive 0.13 has new semantics for this type, and it seems it cannot be made compatible just by regenerating golden answers and has to be fixed inside Spark.
@marmbrus FYI: I ran the compatibility tests, and so far the major outstanding issues are 1st: decimal support, and 2nd: udf7 and udf_round, which can be fixed, though I am not 100% sure it is the right way. Most of the other failures are false positives and should be resolved by regenerating the golden answers.
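For context on the decimal issue: Hive 0.13 changed how HiveDecimal values are constructed and bounded, so the version-specific behavior has to sit behind a shim. A hedged sketch of that idea follows; the trait and object names are illustrative, not the actual patch code, and the 0.12 counterpart (wrapping the older public constructor) would live in its own version-specific source directory:

  trait HiveDecimalShim {
    // Build a Hive decimal from a JVM BigDecimal in a version-specific way.
    def createDecimal(bd: java.math.BigDecimal): AnyRef
  }

  // Hypothetical Hive 0.13 implementation: 0.13 moved to a factory method
  // that also applies the new precision/scale rules, which is part of what
  // breaks the old golden answers.
  object Shim13 extends HiveDecimalShim {
    def createDecimal(bd: java.math.BigDecimal): AnyRef =
      org.apache.hadoop.hive.common.`type`.HiveDecimal.create(bd)
  }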
A few comments: I'm not talking only about getting the current tests to pass, but also about upgrading the test set to include the new files. Also, I hope to update the whitelist to include any new tests that are now passing. I'm not particularly concerned about matching every small detail (e.g., whether we match the empty comment field in metadata for each version). What is important is that we can connect to both Hive 12 and 13 metastores, and that queries run and return the correct answers.
@marmbrus Thanks for the comments. Given that we have to keep supporting Hive 0.12, there are two approaches I can think of to address the issue. 1st: we can temporarily scope the compatibility test to Hive 0.12 only in the pom, and find a good way in a follow-up to add a corresponding compatibility test for Hive 0.13 in an elegant way. This approach also unblocks some other JIRAs and builds the foundation for further development of Hive 0.13 feature support. 2nd: we can create a separate set of files for Hive 0.13, e.g., compatibility suite, golden plans, and golden answers, which may involve hundreds of files. In addition, we would need to change the current base Hive test code to adapt to different Hive versions. I think this approach may be a little rushed, and it also makes the scope of this PR really big and hard to maintain. I prefer the first approach, opening follow-up JIRAs to address the leftovers in a more granular way. Please let me know what you think. If you have other options, please let me know as well.
Okay, I've merged this to master. Will file a PR shortly to fix the tests.
I will try it
In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, referring to #2241:

1. A light shim layer (code in sql/hive-thriftserver/hive-version) for each Hive version to handle API compatibility
2. New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate different Hive versions
3. SBT commands for each version:
   hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
   hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly
4. Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver

Author: wangfei <[email protected]>
Author: scwf <[email protected]>

Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:

f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts
Given that a lot of users are trying to use Hive 0.13 with Spark, and given the API-level incompatibility between Hive 0.12 and Hive 0.13, I want to propose the following approach, which has no or minimal impact on existing Hive 0.12 support but jumpstarts the development of Hive 0.13 and future-version support.

Approach: introduce a "hive-version" property and manipulate the pom.xml files to support different Hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1 (see the build sketch after the commands below). More specifically:
No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
Note that with hive-0.13, hive-thriftserver is not enabled; that should be fixed in a separate JIRA. Also, we don't need -Phive together with -Dhive.version when building (we should probably switch to -Phive -Dhive.version=xxx once the Thrift server is also supported with hive-0.13.1).
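To make the compile-time selection concrete, here is a hedged sbt-style sketch (the build definition is Scala) of how a hive.version property could switch in a version-specific shim source directory; the property lookup and directory names are assumptions for illustration, not the exact build code in this patch:

  // Read the requested Hive version, defaulting to 0.12.0 as proposed above.
  val hiveVersion = sys.props.getOrElse("hive.version", "0.12.0")

  // Add the matching shim sources (e.g., v0.12.0 or v0.13.1 under the hive
  // module) to the compile path so only one version's shim is ever built.
  unmanagedSourceDirectories in Compile +=
    baseDirectory.value / (if (hiveVersion.startsWith("0.13")) "v0.13.1" else "v0.12.0")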