[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue #4031
Comments
It looks like we are picking up the wrong ORC version somehow. Is the version we are using not shaded at this point, so that it mismatches the Spark version?
Note: Spark 3.3 upgraded to ORC 1.7 - https://issues.apache.org/jira/browse/SPARK-34112
This seems related to #3932. Since we're pulling in the aggregator classes (and thus sql-plugin classes and their dependencies) on the classpath, we end up with ORC 1.5.8 there as well.
Spark301 depends on orc-core 1.5.10.
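When two ORC jars can end up on the classpath, a quick way to confirm which one actually wins is to ask the JVM where it loaded the class from. A minimal diagnostic sketch (on the failing test classpath you would pass `org.apache.orc.TypeDescription`; the demo below falls back to a JDK class so it runs anywhere):

```java
import java.security.CodeSource;

// Diagnostic sketch: report which jar (or the bootstrap loader) a class came from.
public class WhereLoaded {
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Bootstrap-loaded classes (the JDK itself) have no code source.
            return src == null ? "bootstrap/JDK" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // e.g. pass "org.apache.orc.TypeDescription" on the test classpath
        String target = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(target + " -> " + locate(target));
    }
}
```

Running this inside the failing test JVM would show whether the `TypeDescription` that resolved came from the shaded plugin jar or from Spark's ORC 1.7 jar.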
This problem can be summarized as follows: I think #3932 can't fix this. @jlowe @tgravescs
#3932 would fix this, as the tests would only run against the dist jar, which has ORC shaded, so there would be no conflict. However, the tests would then be unable to access interior classes directly, as those would be set up in the parallel world of the dist jar. There are two ways to address the changing ORC version across the different Spark versions we support:
I'm fine if we want to try going the latter route, where we stop bundling ORC and use the provided one directly via shimmed classes. It would make it easier to handle the changing ORC versions in the tests (unless the tests also access ORC directly, in which case the tests themselves would need to use shims).
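The shim approach mentioned above can be sketched roughly as follows. This is illustrative only, not the plugin's actual API: one interface wraps the ORC call that changed, and a per-Spark-version implementation is selected so only the matching code path ever touches the real ORC class.

```java
// Hedged sketch of the shim pattern: wrap the ORC API that differs between
// versions behind an interface. The names here are hypothetical.
interface OrcShims {
    // Stands in for TypeDescription.getAttributeValue, which exists in
    // ORC 1.7 (Spark 3.3) but not in ORC 1.5 (Spark 3.0/3.1).
    String getAttributeValue(String key);
}

class Orc15Shims implements OrcShims {
    public String getAttributeValue(String key) {
        return null; // ORC 1.5 has no type attributes; degrade gracefully
    }
}

class Orc17Shims implements OrcShims {
    public String getAttributeValue(String key) {
        return "value-for-" + key; // a real shim would delegate to ORC 1.7
    }
}

public class ShimDemo {
    // A real plugin would pick the shim via ShimLoader; a lexical version
    // compare is good enough for this sketch.
    static OrcShims forSparkVersion(String v) {
        return v.compareTo("3.3.0") >= 0 ? new Orc17Shims() : new Orc15Shims();
    }

    public static void main(String[] args) {
        System.out.println(forSparkVersion("3.3.0").getAttributeValue("k"));
        System.out.println(forSparkVersion("3.0.1").getAttributeValue("k"));
    }
}
```

The key property is that `Orc17Shims` is the only class that references the 1.7-only method, so a `NoSuchMethodError` cannot occur on older Spark versions where that class is never loaded.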
@jlowe @tgravescs The details are as follows:
I think it will be better to use the ORC version that corresponds to the Spark version. We've had other issues with the shading of Hive in the plugin jar interfering with GPU support of Hive UDFs, and I think things get simpler if we shade as little as possible. So I'm +1 for stopping the shading of ORC/Hive classes and using shims when ORC APIs change between Spark versions, if we can get it to work.

We need to double-check that we're OK with Spark installations that don't have Hive support compiled in. cc: @tgravescs in case he can think of any issues there, as I vaguely recall a problem we hit in the past where the Spark artifacts don't have a classifier for which ORC they are using (i.e., ORC with or without Hive support), and compiling against one could lead to class-not-found issues when running against the other ORC at runtime.
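A hedged sketch of what "use the provided ORC directly" could look like in the build, assuming Maven coordinates like these (the exact version and classifier per shim would need to match each supported Spark distribution, including the with-Hive vs. nohive concern raised above):

```xml
<!-- Illustrative only: mark orc-core as provided so the plugin compiles
     against it but does not bundle or shade it; Spark supplies ORC at
     runtime. spark.orc.version is a hypothetical per-shim property. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>${spark.orc.version}</version>
  <classifier>nohive</classifier>
  <scope>provided</scope>
</dependency>
```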
Yes, we had issues with ORC when we built against (I believe) the standard version and the user then had ORC with nohive. I don't remember the exact details. I think the minimum Hive version Spark supports in 3.0 is Hive 2.3 or newer; they support Hive 3.0 as well. Those have ORC versions 1.3.3 and 1.6.9. The ORC nohive profile shades Hive inside of ORC. Find more info here:

Are we only using ORC APIs that Spark also uses, including any changes between Spark versions? It seems like if we did that we would be relatively safe, unless of course you get a CSP-modified Spark version, in which case I guess they could modify the plugin too. I think we need to be very careful about this, both now and if someone modifies it in the future. There are a lot of different ways ORC and Hive can be picked up, and I'm not sure how good they are about API compatibility.
Requesting we move the target version to 22.04, so we can get the fix for rapidsai/cudf#9964 into RAPIDS first. cc: @GaryShen2008 |
The nightly build fails on the Spark 3.3.0 shim layer tests:

```
09:03:31 OrcScanSuite:
09:03:32 *** RUN ABORTED ***
09:03:32 java.lang.NoSuchMethodError: org.apache.orc.TypeDescription.getAttributeValue(Ljava/lang/String;)Ljava/lang/String;
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.toCatalystType$1(OrcUtils.scala:103)
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$toCatalystSchema$1(OrcUtils.scala:118)
09:03:32 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
09:03:32 at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
09:03:32 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.toStructType$1(OrcUtils.scala:116)
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.toCatalystSchema(OrcUtils.scala:138)
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$$anonfun$readSchema$5.applyOrElse(OrcUtils.scala:148)
09:03:32 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$$anonfun$readSchema$5.applyOrElse(OrcUtils.scala:145)
09:03:32 at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:148)
09:03:32 ...
```