[BUG] Executor-side ClassCastException when testing with Spark 3.2.1-SNAPSHOT in k8s environment #3704
I manually built the base Docker image containing Spark 3.2.0-rc6 and added the plugin jar, compiled against 3.2.0-rc6. I also built the plugin with Scala 2.12.15 to match Spark (not that it should matter, but I wanted to rule this out). I did not add any Hadoop or AWS jars. I then ran the CI job as usual, with the cuDF and benchmark jars being pulled from Artifactory, but not the plugin jar. The executor entrypoint trace:

```
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n /opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)
+ exec /usr/bin/tini -s -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -verbose:class -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms92160m -Xmx92160m -cp '/opt/spark/conf::/opt/spark/jars/*:/opt/spark/jars/rapids-4-spark_2.12-21.10.0-SNAPSHOT.jar:cudf-21.10.0-SNAPSHOT-cuda11.jar:/opt/spark/jars/rapids-4-spark-benchmarks_2.12-0.4.0-SNAPSHOT.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar:/opt/hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@benchmark-runner-35312b7c3ca95272-driver-svc.default.svc:7078 --executor-id 1 --cores 4 --app-id spark-1b3a2327b62841369e786e6937988bdf --hostname 10.233.65.5 --resourceProfileId 0
```

I have only tested with q5 and q24a so far, and both fail at stage 24 (the previous stages may be mostly data setup, though).
If we use the aggregator jar, the issue goes away, so it does seem classloader-related.
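For anyone puzzled by the ClassCastException itself: the JVM keys class identity on (binary name, defining classloader), so the same class loaded through two different loaders yields two incompatible types. A minimal, self-contained sketch of that failure mode (the jar path and class name are hypothetical; any jar that is not on the application classpath works):

```scala
import java.net.{URL, URLClassLoader}

object TwoLoaderDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar containing com.example.Widget.
    val jar = Array(new URL("file:/tmp/widget.jar"))

    // Two sibling loaders with a null (bootstrap) parent, so neither can
    // delegate to a common loader that already defines the class.
    val loaderA = new URLClassLoader(jar, null)
    val loaderB = new URLClassLoader(jar, null)

    val clsA = loaderA.loadClass("com.example.Widget")
    val clsB = loaderB.loadClass("com.example.Widget")

    // Same binary name, distinct Class objects.
    println(clsA == clsB) // false

    // Casting an instance defined by loaderB to the type from loaderA
    // fails, which is exactly the executor-side symptom reported here.
    val instB = clsB.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    clsA.cast(instB) // throws java.lang.ClassCastException
  }
}
```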
@andygrove please double-check that we don't have any overlap between the initial classpath and what is provided via …
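One quick way to do that double-check, assuming colon-separated classpath strings like the ones in the entrypoint trace above (a throwaway sketch, not project code):

```scala
object ClasspathOverlap {
  // Report entries that appear in both classpath strings.
  def overlap(cp1: String, cp2: String): Set[String] =
    cp1.split(':').toSet.intersect(cp2.split(':').toSet).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // SPARK_EXTRA_CLASSPATH is what the k8s entrypoint appends to
    // SPARK_CLASSPATH in the trace above.
    val base  = sys.env.getOrElse("SPARK_CLASSPATH", "")
    val extra = sys.env.getOrElse("SPARK_EXTRA_CLASSPATH", "")
    overlap(base, extra).foreach(p => println(s"duplicate: $p"))
  }
}
```

Note this only catches literal duplicates: a wildcard entry like /opt/spark/jars/* can still shadow an explicitly listed jar in the same directory, which is the suspicious pattern in the trace (the plugin jar lives under /opt/spark/jars/ and is also listed explicitly).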
- making this option the default because it's equivalent to the old flat jar
- making it optional because we are still debugging how we miss the non-default classloader in NVIDIA#3704, and it's not the right behavior for addJar with userClassPathFirst

However, I think we should generally stop documenting --jars as the plugin deploy option.

Fixes NVIDIA#3704

Signed-off-by: Gera Shegalov <[email protected]>
The issue is potentially related to our failure to intercept the resource profile classloader (apache/spark#27410).
I'm not seeing how that could be affecting it, so if you have more details, please provide them and likely file a separate issue.
In the ShimLoader we assume that the executor classloader is a singleton, which turns out to be wrong. We instrumented Spark's MutableURLClassLoader creation and see two instances on the executor.

The first is created via this call path: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/ExclusiveModeGpuDiscoveryPlugin.scala, line 45 (commit f4465cb).

Then the actual executor classloader is created, which we fail to manipulate because we have already memoized the wrong one in the ShimLoader.
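To make the failure mode concrete, here is a minimal sketch of the pitfall (hypothetical names, not the actual ShimLoader code): memoizing the first classloader observed means a MutableURLClassLoader created later is never seen, whereas resolving the loader at each use would still find it.

```scala
import java.net.URLClassLoader

object LoaderResolution {
  // Pitfall: whichever loader is current when this lazy val is first forced
  // gets cached forever -- e.g. the one active during GPU discovery, not the
  // executor's real MutableURLClassLoader created afterwards.
  lazy val memoized: ClassLoader = Thread.currentThread().getContextClassLoader

  // Alternative: walk the current loader chain on every call and pick the
  // first URLClassLoader, so a loader created after startup is still found
  // (Spark's MutableURLClassLoader extends URLClassLoader).
  def current(): Option[URLClassLoader] =
    Iterator
      .iterate(Thread.currentThread().getContextClassLoader)(_.getParent)
      .takeWhile(_ != null)
      .collectFirst { case u: URLClassLoader => u }
}
```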
Describe the bug
Benchmark queries are failing in the SJC4 k8s cluster with an executor-side ClassCastException.
Steps/Code to reproduce bug
See benchmark runs in the SJC4 cluster.
Expected behavior
Benchmark queries should run to completion without executor-side ClassCastExceptions.
Additional context
N/A