Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the ORC version that corresponds to the Spark version [databricks] #4408

Merged
merged 26 commits into from
Feb 8, 2022

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Dec 21, 2021

[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue #4031

Root cause
Rapids plugin uses a constant ORC version 1.5.10 and Spark 3.3.0 begins to use ORC 1.7.x which is not compatible with 1.5.10.
For the unit test cases, it uses ORC 1.5.10, because of directly specified the dependency which overides the ORC 1.7.x dependency, Spark invoking the "getAttributeValue" which resides in ORC 1.7.x will fail.
But for the integration test cases, because of it uses the jars resides in the $SPARK_HOME/jars which includes ORC 1.7.x, it will not fail.
There are different behaviors between integration tests and unit tests.

Solution

  1. Use the ORC version that corresponds to the Spark version and stop shading ORC.
    This will make the behaviors of IT and UT consistent.
    Will make the aggregator jar small because of stopping shading ORC.
    Will upgrade ORC accordingly with Spark.
  2. Move the failed test case into IT and keep shading ORC 1.5.10
    This will make us confusing why we can't put this case into UT.

Here we use option 1.

Another change, added a common module:
In this PR it contains a util class ThreadFactoryBuilder that is used to replace the Guava dependency. The guava dependency is removed because of it's a messy jar in practice.
The ThreadFactoryBuilder is used in both sql-plugin and tools modules, so put it into the new common module.

This fixes #4031

@res-life
Copy link
Collaborator Author

res-life commented Dec 21, 2021

Use the ORC version that corresponds to the Spark version.
Make the test module depend on the aggregator jar instead of the dist jar.
Temp comment a test case in OrcScanSuite, will check it later.
Currently keep shading the guava, ORC, etc, will update the shade scope in the issue #1398 .
Move the dependency management of ORC to sql-plugin and omit the version in order to dynamically set the ORC version.
Will add some comments later.

@res-life res-life changed the title Use the ORC version that corresponds to the Spark version Use the ORC version that corresponds to the Spark version [databricks] Dec 21, 2021
@res-life
Copy link
Collaborator Author

build

2 similar comments
@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on the end goal of this PR. It looks like it is going to compile against different versions of ORC yet still pull ORC into the dist jar. How is that going to work in practice -- won't the different ORC versions from different aggregator jars conflict when we try to pull it all together into the dist jar?

sql-plugin/pom.xml Outdated Show resolved Hide resolved
@jlowe
Copy link
Member

jlowe commented Dec 21, 2021

Additionally, how are the concerns about varying ORC classifiers pulled in by different Spark builds, as detailed in #4031 (comment), being addressed?

@res-life
Copy link
Collaborator Author

res-life commented Dec 22, 2021

I'm not clear on the end goal of this PR. It looks like it is going to compile against different versions of ORC yet still pull ORC into the dist jar. How is that going to work in practice -- won't the different ORC versions from different aggregator jars conflict when we try to pull it all together into the dist jar?

"binary-dedupe.sh" is used to compare the class binary, so there is no conflict for multiple ORC versions. Of cause, there will be more class files.

If the intent is to use the same version of ORC and Hive that Spark is using, why are we pulling in these dependencies explicitly rather than getting them transitively from the Spark artifacts? The risk of putting them explicitly here is that they change versions for Spark snapshot artifacts and we fail to notice.

The scope of Spark core jar is provided, the maven shade plugin only shade compile jars, so explicitly specify orc jar as compile scope.
Sometimes the transitive jar is not the same as the Spark runtime, so explicitly specify.
You are right, for the Spark snapshot, it's better to use the transitive jar.

Additionally, how are the concerns about varying ORC classifiers pulled in by different Spark builds, as detailed in #4031 (comment), being addressed?

To address this concern, I have no good idea, it's better to use the current strategy of shading the ORC with the hive code.
The PR of #1398 will create a new module to collect the necessary ORC and hive classes to minimum the shading scope.
Seems the rapids plugin only used the orc-core, hive-storage-api and protobuf-java, I'll shade the ORC input/output class.

@res-life res-life mentioned this pull request Dec 28, 2021
@sameerz sameerz added the audit_3.3.0 Audit related tasks for 3.3.0 label Dec 29, 2021
shims/pom.xml Outdated Show resolved Hide resolved
sql-plugin/pom.xml Outdated Show resolved Hide resolved
sql-plugin/pom.xml Outdated Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

build

1 similar comment
@res-life
Copy link
Collaborator Author

build

@sameerz
Copy link
Collaborator

sameerz commented Jan 11, 2022

Can we retarget this to 22.04? Reason being we ought to wait for the fix for rapidsai/cudf#9964 to go in before we merge this.

@res-life
Copy link
Collaborator Author

Updated, but still a draft, need to find a way to compute the version by examining the Spark jar dependencies.
Sent a mail to discuss If shade is needed.

@res-life
Copy link
Collaborator Author

build

@tgravescs
Copy link
Collaborator

tgravescs commented Jan 12, 2022

@res-life Please put the issue description for the PR in the first post. ie (where you put This fixes #4031)

@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

res-life commented Jan 15, 2022

Changed the solution of aligning Orc versions PR after investigated the history of shading ORC.
Now do not shade ORC again, updated for all the Spark versions: 3.0.1 to 3.3.0, 301db, 312db and 311cdh.
After this change, the aggregator jar reduced size from 13M to 6.8M.
And do not need to handle the refine Orc shade issue.

@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

res-life commented Jan 17, 2022

After 11/24/2019 and from ORC-1.5.7 Spark no longer use "nohive" classifier ORC uber jar again for Hive 2.0+.
If our product aims Spark v3.0.0+ and Hive 2.0+ without Hive 1.x, then it's safe to remove the shade configuration of ORC.
So, let's trying to remove it and do some regression tests before merging the code.

Details:
cd spark-source-path
git checkout v3.0.0
git log -p ./dev/deps | grep -20 orc-core-1.5

@res-life
Copy link
Collaborator Author

ORC 1.6.11+ failed to prune when reading ORC file in Proleptic calendar which was written in Hybrid calendar.
The ORC file in the issue was created by Spark CPU version and ORC 1.5.10 behavior is correct.
Filed a ORC bug: https://issues.apache.org/jira/browse/ORC-1083

@res-life
Copy link
Collaborator Author

build

pom.xml Outdated Show resolved Hide resolved
sql-plugin/pom.xml Outdated Show resolved Hide resolved
sql-plugin/pom.xml Outdated Show resolved Hide resolved
shuffle-plugin/pom.xml Outdated Show resolved Hide resolved
andygrove and others added 2 commits January 19, 2022 16:15
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty close, just a small nit in the test.

@tgravescs it would be good to have you take another look.

@res-life
Copy link
Collaborator Author

build

tools/pom.xml Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

build

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor nits remain, so this looks good to me. Would like to hear from @tgravescs before merging.

@res-life
Copy link
Collaborator Author

build

common/pom.xml Outdated Show resolved Hide resolved
common/pom.xml Outdated Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

build

jlowe
jlowe previously approved these changes Jan 28, 2022
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the hard work, @res-life! This looks good to me. @tgravescs can you take a look?

Copy link
Collaborator

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, a couple of nits. It would also be nice to update the description to talk about the new common module and why it was added.

ORC_CORE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-core--org.apache.orc__orc-core__1.5.12.jar
ORC_SHIM_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-shims--org.apache.orc__orc-shims__1.5.12.jar
ORC_MAPREDUCE_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--org.apache.orc--orc-mapreduce--org.apache.orc__orc-mapreduce__1.5.12.jar
PROTOBUF_JAR=----workspace_${SPARK_MAJOR_VERSION_STRING}--maven-trees--hive-2.3__hadoop-2.7--com.google.protobuf--protobuf-java--com.google.protobuf__protobuf-java__2.6.1.jar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this jar looks the same the one in the else statement, if so move out

import org.apache.orc.impl.{DataReaderProperties, OutStream, SchemaEvolution}
import org.apache.orc.impl.RecordReaderImpl.SargApplier

// [301, 320) ORC shims
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - not sure if the ) is supposed to be until, I think we can just remove this comment as the name should be pretty clear

@res-life
Copy link
Collaborator Author

res-life commented Feb 7, 2022

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit_3.3.0 Audit related tasks for 3.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue
5 participants