Use the ORC version that corresponds to the Spark version [databricks] #4408
build |
I'm not clear on the end goal of this PR. It looks like it is going to compile against different versions of ORC yet still pull ORC into the dist jar. How is that going to work in practice -- won't the different ORC versions from different aggregator jars conflict when we try to pull it all together into the dist jar?
Additionally, how are the concerns about varying ORC classifiers pulled in by different Spark builds, as detailed in #4031 (comment), being addressed?
"binary-dedupe.sh" compares the class files at the binary level, so multiple ORC versions do not conflict. Of course, there will be more class files.
The Spark core jar has provided scope, and the Maven Shade plugin only shades compile-scope jars, so the ORC jar is explicitly declared with compile scope.
I have no good idea for addressing this concern; it's better to keep the current strategy of shading ORC together with the Hive code.
.../spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala
build |
sql-plugin/src/main/301until320-all/scala/com/nvidia/spark/rapids/shims/v2/OrcShims.scala
sql-plugin/src/main/320+/scala/com/nvidia/spark/rapids/shims/v2/OrcShims.scala
sql-plugin/src/main/320+/scala/com/nvidia/spark/rapids/shims/v2/OrcShims.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/VersionUtils.scala
tests/src/test/scala/com/nvidia/spark/rapids/OrcScanSuite.scala
tests/src/test/scala/com/nvidia/spark/rapids/SparkQueryCompareTestSuite.scala
Can we retarget this to 22.04? The reason being that we ought to wait for the fix for rapidsai/cudf#9964 to go in before we merge this.
Updated, but this is still a draft; I need to find a way to compute the ORC version by examining the Spark jar dependencies.
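One way to derive the version from the Spark jars is to read it off the orc-core jar file name. The sketch below is only an illustration, not what this PR implements: the function name orc_core_version is hypothetical, and it assumes the jar follows the usual orc-core-<version>[-<classifier>].jar naming (Databricks runtime jar names, like the ones in jenkins/databricks/build.sh, use a different layout and would need their own pattern).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version of the first orc-core jar found
# in the given directory. Returns non-zero if no such jar exists.
orc_core_version() {
  local jars_dir="$1" jar name
  for jar in "$jars_dir"/orc-core-*.jar; do
    [ -e "$jar" ] || continue        # the glob matched nothing
    name="$(basename "$jar" .jar)"   # e.g. orc-core-1.5.10-nohive
    name="${name#orc-core-}"         # strip the artifact prefix
    echo "${name%%-*}"               # drop an optional classifier suffix
    return 0
  done
  return 1
}
```

For example, orc_core_version "$SPARK_HOME/jars" would print the ORC version shipped with that Spark install. Under this scheme a qualifier such as -SNAPSHOT is dropped together with any classifier, which may or may not be what the build wants.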
build |
build |
Changed the approach for aligning ORC versions in this PR after investigating the history of shading ORC.
build |
Since 11/24/2019, starting with ORC 1.5.7, Spark no longer uses the "nohive" classifier ORC uber jar for Hive 2.0+. Details:
ORC 1.6.11+ fails to prune when an ORC file written in the hybrid calendar is read in the proleptic calendar.
build |
Signed-off-by: Chong Gao <[email protected]>
I think this is pretty close, just a small nit in the test.
@tgravescs it would be good to have you take another look.
common/src/test/scala/com/nvidia/spark/rapids/ThreadFactoryBuilderTest.scala
build |
build |
Just minor nits remain, so this looks good to me. Would like to hear from @tgravescs before merging.
common/src/test/scala/com/nvidia/spark/rapids/ThreadFactoryBuilderTest.scala
common/src/test/scala/com/nvidia/spark/rapids/ThreadFactoryBuilderTest.scala
build |
build |
Thanks for all the hard work, @res-life! This looks good to me. @tgravescs can you take a look?
Overall this looks good, with a couple of nits. It would also be nice to update the description to talk about the new common module and why it was added.
jenkins/databricks/build.sh
ORC_CORE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-core--org.apache.orc__orc-core__1.5.12.jar
ORC_SHIM_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-shims--org.apache.orc__orc-shims__1.5.12.jar
ORC_MAPREDUCE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-mapreduce--org.apache.orc__orc-mapreduce__1.5.12.jar
PROTOBUF_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--com.google.protobuf--protobuf-java--com.google.protobuf__protobuf-java__2.6.1.jar
This jar looks the same as the one in the else statement; if so, move it out of the conditional.
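The suggestion can be sketched roughly as follows, assuming the comment refers to PROTOBUF_JAR. The function name, the branch condition, and the else-branch ORC version (0.0.0) are placeholders for illustration and are not taken from the real build.sh; only the jar name patterns come from the snippet above.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_jar_names, its argument, and the 0.0.0 version
# are hypothetical stand-ins for whatever build.sh actually branches on.
resolve_jar_names() {
  local first_branch="$1"
  local prefix="----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7"
  local orc_version
  if [ "$first_branch" = "true" ]; then
    orc_version=1.5.12
  else
    orc_version=0.0.0   # placeholder for the other branch's ORC version
  fi
  # Only the ORC version differs between the branches.
  ORC_CORE_JAR="${prefix}--org.apache.orc--orc-core--org.apache.orc__orc-core__${orc_version}.jar"
  # Same value in both branches, so it is assigned once after the conditional.
  PROTOBUF_JAR="${prefix}--com.google.protobuf--protobuf-java--com.google.protobuf__protobuf-java__2.6.1.jar"
}
```

Keeping the shared assignment outside the if/else means a future protobuf version bump touches one line instead of two.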
import org.apache.orc.impl.{DataReaderProperties, OutStream, SchemaEvolution}
import org.apache.orc.impl.RecordReaderImpl.SargApplier

// [301, 320) ORC shims
nit - not sure if the ) is supposed to be until, I think we can just remove this comment as the name should be pretty clear
build |
[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue #4031
Root cause
The RAPIDS plugin pins ORC to a constant version, 1.5.10, while Spark 3.3.0 starts using ORC 1.7.x, which is not compatible with 1.5.10.
The unit tests run against ORC 1.5.10, because the directly specified dependency overrides the ORC 1.7.x dependency. When Spark invokes "getAttributeValue", which exists in ORC 1.7.x, the call fails.
The integration tests use the jars in $SPARK_HOME/jars, which include ORC 1.7.x, so they do not fail.
As a result, the integration tests and the unit tests behave differently.
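A mismatch like this could in principle be caught early by comparing the ORC version the build pins with the one Spark ships. The sketch below is a minimal illustration with hypothetical function names; treating two versions as interchangeable when they share major.minor is an assumption made for the example, not a compatibility guarantee from ORC.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the major.minor line of a version string,
# e.g. 1.7.4 becomes 1.7.
orc_line() {
  local v="$1"
  local major="${v%%.*}"   # text before the first dot
  local rest="${v#*.}"     # text after the first dot
  echo "${major}.${rest%%.*}"
}

# Succeeds only if both versions are on the same major.minor line.
same_orc_line() {
  [ "$(orc_line "$1")" = "$(orc_line "$2")" ]
}
```

With the versions from this issue, same_orc_line 1.5.10 1.7.4 fails, which is the situation the unit tests ran into.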
Solution
This makes the behavior of the integration tests (IT) and unit tests (UT) consistent.
It makes the aggregator jar smaller, because ORC is no longer shaded.
It upgrades ORC in step with Spark.
This would leave us confused about why we can't put this case into UT.
Here we use option 1.
Another change: added a common module.
In this PR it contains a utility class, ThreadFactoryBuilder, which replaces the Guava dependency. The Guava dependency is removed because it is a messy jar in practice.
ThreadFactoryBuilder is used in both the sql-plugin and tools modules, so it is placed in the new common module.
This fixes #4031