[BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment #3704
I manually built the base Docker image containing Spark 3.2.0-rc6 and added the plugin jar, compiled against 3.2.0-rc6. I also built the plugin with Scala 2.12.15 to match Spark (not that it should matter, but I wanted to rule this out). I did not add any Hadoop or AWS jars. I then ran the CI job as usual, with the cuDF and benchmark jars being pulled from Artifactory, but not the plugin jar. The executor entrypoint trace:

```
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n /opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)
+ exec /usr/bin/tini -s -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -verbose:class -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms92160m -Xmx92160m -cp '/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@benchmark-runner-35312b7c3ca95272-driver-svc.default.svc:7078 --executor-id 1 --cores 4 --app-id spark-1b3a2327b62841369e786e6937988bdf --hostname 10.233.65.5 --resourceProfileId 0
```

I have only tested with q5 and q24a so far, and both fail at stage 24 (the previous stages may be mostly data setup, though).
If we use the aggregator jar, the issue goes away, so it does seem classloader-related.
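For anyone puzzled by the ClassCastException itself: the JVM keys class identity on (binary name, defining classloader), so the same class loaded through two different loaders yields two incompatible types. A minimal, self-contained sketch of that failure mode (the jar path and class name are hypothetical; any jar that is not on the application classpath works):

```scala
import java.net.{URL, URLClassLoader}

object TwoLoaderDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar containing com.example.Widget.
    val jar = Array(new URL("file:/tmp/widget.jar"))

    // Two sibling loaders with a null (bootstrap) parent, so neither can
    // delegate to a common loader that already defines the class.
    val loaderA = new URLClassLoader(jar, null)
    val loaderB = new URLClassLoader(jar, null)

    val clsA = loaderA.loadClass("com.example.Widget")
    val clsB = loaderB.loadClass("com.example.Widget")

    // Same binary name, distinct Class objects.
    println(clsA == clsB) // false

    // Casting an instance defined by loaderB to the type from loaderA
    // fails, which is exactly the executor-side symptom reported here.
    val instB = clsB.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    clsA.cast(instB) // throws java.lang.ClassCastException
  }
}
```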
@andygrove please double-check that we don't have any overlap between the initial classpath and what is provided via …
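One quick way to do that double-check, assuming colon-separated classpath strings like the ones in the entrypoint trace above (a throwaway sketch, not project code):

```scala
object ClasspathOverlap {
  // Report entries that appear in both classpath strings.
  def overlap(cp1: String, cp2: String): Set[String] =
    cp1.split(':').toSet.intersect(cp2.split(':').toSet).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // SPARK_EXTRA_CLASSPATH is what the k8s entrypoint appends to
    // SPARK_CLASSPATH in the trace above.
    val base  = sys.env.getOrElse("SPARK_CLASSPATH", "")
    val extra = sys.env.getOrElse("SPARK_EXTRA_CLASSPATH", "")
    overlap(base, extra).foreach(p => println(s"duplicate: $p"))
  }
}
```

Note this only catches literal duplicates: a wildcard entry like /opt/spark/jars/* can still shadow an explicitly listed jar in the same directory, which is the suspicious pattern in the trace (the plugin jar lives under /opt/spark/jars/ and is also listed explicitly).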
- making this option the default because it's equivalent to the old flat jar
- making it optional because we are still debugging how we miss the non-default classloader in NVIDIA#3704, and it's not the right behavior for addJar with userClassPathFirst

However, I think we should generally stop documenting --jars as the plugin deploy option.

Fixes NVIDIA#3704

Signed-off-by: Gera Shegalov <[email protected]>
The issue is potentially related to our failure to intercept the resource profile classloader (apache/spark#27410).
I'm not seeing how that could be affecting it, so if you have more details, please provide them and likely file a separate issue.
In the ShimLoader we assume that the executor classloader is a singleton, which turns out to be wrong. We instrumented Spark's MutableURLClassLoader creation and see two instances on the executor.

The first is created via this call path: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/ExclusiveModeGpuDiscoveryPlugin.scala, line 45 (commit f4465cb).

Then the actual executor classloader is created, which we fail to manipulate because we have already memoized the wrong one in the ShimLoader.
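To make the failure mode concrete, here is a minimal sketch of the pitfall (hypothetical names, not the actual ShimLoader code): memoizing the first classloader observed means a MutableURLClassLoader created later is never seen, whereas resolving the loader at each use would still find it.

```scala
import java.net.URLClassLoader

object LoaderResolution {
  // Pitfall: whichever loader is current when this lazy val is first forced
  // gets cached forever -- e.g. the one active during GPU discovery, not the
  // executor's real MutableURLClassLoader created afterwards.
  lazy val memoized: ClassLoader = Thread.currentThread().getContextClassLoader

  // Alternative: walk the current loader chain on every call and pick the
  // first URLClassLoader, so a loader created after startup is still found
  // (Spark's MutableURLClassLoader extends URLClassLoader).
  def current(): Option[URLClassLoader] =
    Iterator
      .iterate(Thread.currentThread().getContextClassLoader)(_.getParent)
      .takeWhile(_ != null)
      .collectFirst { case u: URLClassLoader => u }
}
```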
Describe the bug
Benchmark queries are failing in the SJC4 k8s cluster with an executor-side ClassCastException.
Steps/Code to reproduce bug
See benchmark runs in the SJC4 cluster.
Expected behavior
Benchmark queries should run to completion without executor-side ClassCastExceptions.
Additional context
N/A