[BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment #3704

Closed
andygrove opened this issue Sep 29, 2021 · 6 comments · Fixed by #3763
Labels: bug (Something isn't working), P0 (Must have for release)

andygrove commented Sep 29, 2021

Describe the bug

Benchmark queries are failing in SJC4 k8s cluster:

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
        at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2411)
...
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
        at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
...
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Steps/Code to reproduce bug

See benchmark runs in sjc4 cluster.

Expected behavior

Should work.

Additional context
N/A

andygrove added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Sep 29, 2021
andygrove added this to the Sep 27 - Oct 1 milestone Sep 29, 2021
andygrove self-assigned this Sep 29, 2021
@andygrove
Contributor Author

I manually built the base Docker image containing Spark 3.2.0-rc6 and added the plugin jar, compiled against 3.2.0-rc6. I also built the plugin with Scala 2.12.15 to match Spark (not that it should matter, but wanted to rule this out). I did not add any Hadoop or AWS jars.

I then ran the CI job as usual, with cuDF and benchmark jars being pulled from artifactory, but not the plugin jar.

The ClassCastException still occurs in the executors. Here is the initial logging from executor startup, showing path information. Note that although Hadoop jars are referenced on the classpath, none exist in the file system; I also enabled verbose class-loading output (-verbose:class) and confirmed that no Hadoop jars were loaded.

++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n /opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)
+ exec /usr/bin/tini -s -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -verbose:class -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms92160m -Xmx92160m -cp '/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@benchmark-runner-35312b7c3ca95272-driver-svc.default.svc:7078 --executor-id 1 --cores 4 --app-id spark-1b3a2327b62841369e786e6937988bdf --hostname 10.233.65.5 --resourceProfileId 0
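
For reference, a minimal sketch (not part of the Spark entrypoint or the plugin) of how one could list classpath entries, like the Hadoop ones above, that are named on -cp but absent from disk. The names `ClasspathCheck` and `missingEntries` are hypothetical. Wildcard entries such as /opt/spark/jars/* are skipped rather than expanded, and relative entries are resolved against the JVM's working directory.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ClasspathCheck {
    /** Returns the non-wildcard classpath entries that do not exist on disk. */
    static List<String> missingEntries(String classpath) {
        List<String> missing = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            // Skip empty entries (produced by "::") and wildcard entries ("dir/*").
            if (entry.isEmpty() || entry.endsWith("*")) {
                continue;
            }
            if (!Files.exists(Paths.get(entry))) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default to the classpath of the running JVM when no argument is given.
        String cp = args.length > 0 ? args[0] : System.getProperty("java.class.path");
        for (String entry : missingEntries(cp)) {
            System.out.println("missing: " + entry);
        }
    }
}
```

Run inside the executor image with the SPARK_CLASSPATH value as the single argument, this would print one line per entry that does not resolve to a file or directory.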

I have only tested with q5 and q24a so far and both fail at stage 24 (the previous stages may be mostly data setup though).

@andygrove
Contributor Author

If we use the aggregator jar then the issue goes away so it does seem classloader-related.
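
One way to probe a classloader-related failure like this is to log which loader defined a given class and where its bytes came from. The following is a minimal sketch using only standard JDK calls; `LoaderProbe`, `loaderChain`, and `origin` are hypothetical names, not something the plugin or Spark provides.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, for illustration only.
final class LoaderProbe {
    /** Defining loader of the class followed by its parents, ending with "bootstrap". */
    static List<String> loaderChain(Class<?> cls) {
        List<String> chain = new ArrayList<>();
        ClassLoader loader = cls.getClassLoader(); // null means the bootstrap loader
        while (loader != null) {
            chain.add(loader.toString());
            loader = loader.getParent();
        }
        chain.add("bootstrap");
        return chain;
    }

    /** Location the class was loaded from, or "unknown" if no code source is recorded. */
    static String origin(Class<?> cls) {
        ProtectionDomain domain = cls.getProtectionDomain();
        CodeSource source = domain == null ? null : domain.getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "unknown";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass a fully qualified class name, or default to this class.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : LoaderProbe.class;
        System.out.println(cls.getName() + " from " + origin(cls));
        for (String loader : loaderChain(cls)) {
            System.out.println("  loader: " + loader);
        }
    }
}
```

If the same class name reports different defining loaders in different parts of the executor, instances of it are not assignable to each other, which is the usual shape of a ClassCastException like the one above.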

@gerashegalov
Collaborator

@andygrove please double-check that we don't have any overlap between the initial classpath as provided via spark.*.extraClassPath and the jars passed via --jars
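
A rough sketch of such an overlap check, comparing jars by file name only. It assumes the extraClassPath value is ':'-separated (Linux) and the --jars value is ','-separated; `JarOverlap` and its methods are hypothetical names, and wildcard entries such as /opt/spark/jars/* are not expanded.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class JarOverlap {
    /** File names (last path segment) of the entries in a separated list. */
    static Set<String> jarNames(String entries, String separator) {
        Set<String> names = new LinkedHashSet<>();
        for (String entry : entries.split(Pattern.quote(separator))) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed.substring(trimmed.lastIndexOf('/') + 1));
            }
        }
        return names;
    }

    /** Jar file names present in both the extraClassPath value and the --jars value. */
    static Set<String> overlap(String extraClassPath, String jars) {
        Set<String> common = new TreeSet<>(jarNames(extraClassPath, ":"));
        common.retainAll(jarNames(jars, ","));
        return common;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: JarOverlap <extraClassPath> <jars>");
            return;
        }
        // args[0]: value of spark.*.extraClassPath, args[1]: value passed to --jars
        System.out.println("overlapping jars: " + overlap(args[0], args[1]));
    }
}
```

An empty result only rules out same-named jars; the same classes packaged under different file names would need a comparison of jar contents.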

sameerz modified the milestones: Sep 27 - Oct 1, Oct 4 - Oct 15 Oct 4, 2021
sameerz assigned gerashegalov and unassigned andygrove Oct 5, 2021
sameerz added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label Oct 5, 2021
gerashegalov changed the title from "[BUG] ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment" to "[BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment" Oct 5, 2021
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue Oct 7, 2021
- making this option default because it's equivalent to the old flat jar
- making it optional because we are still debugging how we miss the non-default
  classloader in NVIDIA#3704 and it's not the right behavior for addJar with
  userClassPathFirst. However, I think we should generally stop
  documenting --jars as the plugin deploy option

Fixes NVIDIA#3704

Signed-off-by: Gera Shegalov <[email protected]>
@nvauto nvauto closed this as completed in 16fc3aa Oct 7, 2021
@gerashegalov
Collaborator

The issue is potentially related to our failure to intercept the resource profile classloader (apache/spark#27410).

@tgravescs
Collaborator

I'm not seeing how that could be affecting it, so if you have more details, please provide them and likely file a separate issue.

@gerashegalov
Collaborator

In the ShimLoader we assume that the executor class loader is a singleton, which ends up being wrong.

We instrumented Spark's MutableURLClassLoader creation and see two instances on the executor.
First, parseOrFindResources creates the class loader that we end up erroneously manipulating and tracking as the executor class loader:

java.lang.Throwable: CL_DEBUG Created a mutable URL classloader: org.apache.spark.util.MutableURLClassLoader@2222a13a with parent sun.misc.Launcher$AppClassLoader@4d7e1886
        at org.apache.spark.util.MutableURLClassLoader.<init>(MutableURLClassLoader.java:34)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.createClassLoader(CoarseGrainedExecutorBackend.scala:131)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.parseOrFindResources(CoarseGrainedExecutorBackend.scala:139)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:102)

via this call path:

ShimLoader.newInternalExclusiveModeGpuDiscoveryPlugin()

Then the actual executor class loader is being created:

java.lang.Throwable: CL_DEBUG Created a mutable URL classloader: org.apache.spark.util.MutableURLClassLoader@c07c75a with parent sun.misc.Launcher$AppClassLoader@4d7e1886
        at org.apache.spark.util.MutableURLClassLoader.<init>(MutableURLClassLoader.java:34)
        at org.apache.spark.executor.Executor.createClassLoader(Executor.scala:891)
        at org.apache.spark.executor.Executor.<init>(Executor.scala:159)

which we fail to manipulate because we memoized the wrong one in the ShimLoader already.
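
To illustrate the failure mode in isolation, here is a simplified sketch. It is not the ShimLoader code and not the fix that was merged; `LoaderTracking`, `MemoizedTarget`, and `PerLoaderPatcher` are hypothetical names, and plain URLClassLoader instances stand in for Spark's MutableURLClassLoader. The first class remembers whichever loader it sees first, the second tracks each loader instance separately.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Hypothetical model of the problem, for illustration only.
final class LoaderTracking {
    /** Singleton assumption: the first loader seen is treated as "the" executor loader. */
    static final class MemoizedTarget {
        private ClassLoader first;

        synchronized ClassLoader resolve(ClassLoader current) {
            if (first == null) {
                first = current;
            }
            return first; // a later, different loader is never returned
        }
    }

    /** Alternative: remember every loader instance that has been handled. */
    static final class PerLoaderPatcher {
        // ClassLoader does not override equals/hashCode, so entries are per instance.
        private final Set<ClassLoader> patched =
            Collections.newSetFromMap(new WeakHashMap<ClassLoader, Boolean>());

        /** Returns true the first time a given loader instance is seen. */
        synchronized boolean markPatched(ClassLoader current) {
            return patched.add(current);
        }
    }

    public static void main(String[] args) {
        ClassLoader parent = LoaderTracking.class.getClassLoader();
        ClassLoader resourceDiscovery = new URLClassLoader(new URL[0], parent);
        ClassLoader executor = new URLClassLoader(new URL[0], parent);

        MemoizedTarget memo = new MemoizedTarget();
        memo.resolve(resourceDiscovery);
        System.out.println("memoized target is the executor loader: "
            + (memo.resolve(executor) == executor));

        PerLoaderPatcher perLoader = new PerLoaderPatcher();
        perLoader.markPatched(resourceDiscovery);
        System.out.println("executor loader handled separately: "
            + perLoader.markPatched(executor));
    }
}
```

With the memoizing variant the second loader is silently mapped back to the first, which matches the behavior described above; whether per-loader tracking is the right remedy for the plugin is a separate question.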

@pxLi pxLi changed the title [BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment [BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment Oct 15, 2021