java.io.NotSerializableException: scala.xml.NodeSeq$$anon$1 #201

Open
dyf102 opened this issue Mar 20, 2018 · 4 comments

dyf102 commented Mar 20, 2018

I am writing a map function in Spark to parse XML embedded in log lines, but I get a NotSerializableException and cannot figure out the reason. The stack trace follows. How can I work around it? Does anyone have a suggestion?

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.map(RDD.scala:370)
  at parse(<console>:33)
  ... 49 elided
Caused by: java.io.NotSerializableException: scala.xml.NodeSeq$$anon$1
Serialization stack:
	- object not serializable (class: scala.xml.NodeSeq$$anon$1, value: <ns18:userID>4536000170315902</ns18:userID>)
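
A quick way to narrow down a "Task not serializable" error is to test suspect values directly. The sketch below assumes Spark serializes closures with plain Java serialization; the object and method names (SerializationCheck, firstNonSerializable) are made up for illustration and are not part of Spark or scala-xml.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper, not a Spark or scala-xml API.
object SerializationCheck {
  // Tries to Java-serialize `value` into an in-memory buffer.
  // Returns None on success, or Some(message) where the message is
  // normally the name of the first class that could not be serialized.
  def firstNonSerializable(value: AnyRef): Option[String] = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      None
    } catch {
      case e: NotSerializableException => Some(e.getMessage)
    } finally {
      out.close()
    }
  }
}
```

In the shell, passing each value that the function handed to map might capture (for example a previously parsed NodeSeq held in a val) should show which one triggers the exception.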

This is how I am using it:

rows.mapPartitions(partition => {
      val XMLParser = scala.xml.XML
      partition.map(row => {
        // Strip the log prefix so that only the XML payload remains
        val xmlContent = sliceLogHeader(row)
        val xmlDom = XMLParser.loadString(xmlContent)
        val headerDOM = xmlDom \ "header"
        // Extract plain Strings so that no XML nodes leave the closure
        val userID = (headerDOM \ "userID").text
        val clientSessionID = (headerDOM \ "clientSessionID").text
        Account(userID, clientSessionID)
      })
})

ashawley (Member) commented

What version of scala-xml are you using?

We had a release recently which contained a related fix on serialization, #154.

dyf102 (Author) commented Mar 21, 2018

@ashawley
I get the same issue when I start spark-shell with:
spark-shell --packages org.scala-lang.modules:scala-xml_2.11:1.1.0

ashawley (Member) commented

I don't know enough about the Spark shell, but there's a good chance it is using the scala-xml version that scala-compiler brings in rather than the one you specified with --packages.
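
One way to check this is to ask the JVM where a class was loaded from. This is a minimal sketch using only standard JDK reflection calls; the names ClassOrigin and locationOf are hypothetical.

```scala
// Hypothetical helper, not part of Spark or scala-xml.
object ClassOrigin {
  // Returns the location (usually a jar URL) a class was loaded from.
  // Classes from the bootstrap loader report no code source, so the
  // result is None for them.
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString
}
```

In spark-shell, evaluating ClassOrigin.locationOf(classOf[scala.xml.NodeSeq]) should show whether the class comes from Spark's own jars directory or from the jar fetched by --packages.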

Going down the rabbit hole, it appears this is the Spark shell script that finally runs Java:

https://github.com/apache/spark/blob/73f28530/bin/spark-class

java -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"

And LAUNCH_CLASSPATH is the wildcard match of anything in the SPARK_JARS_DIR:

LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
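
With a wildcard classpath, several jars can contain the same class, and the JVM generally loads it from whichever entry it finds first. The sketch below lists every copy of a resource that a class loader can see. It relies only on the standard ClassLoader.getResources call; the names ClasspathCopies and copiesOf are made up for illustration.

```scala
// Hypothetical helper, not part of Spark or scala-xml.
object ClasspathCopies {
  // Lists every location where `resource` (for example
  // "scala/xml/NodeSeq.class") is visible to the given class loader.
  // More than one entry suggests competing copies on the classpath.
  def copiesOf(
      resource: String,
      loader: ClassLoader = Thread.currentThread.getContextClassLoader
  ): List[String] = {
    val urls = loader.getResources(resource)
    val found = List.newBuilder[String]
    while (urls.hasMoreElements) found += urls.nextElement().toString
    found.result()
  }
}
```

Calling ClasspathCopies.copiesOf("scala/xml/NodeSeq.class") in the shell would list each jar that ships the class, assuming the context class loader is the one the shell uses to load user code.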

The jars directory is populated by this shell script:

https://github.com/apache/spark/blob/7013eea/dev/make-distribution.sh

Those jars come from assembly/target:

# Copy jars
cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"

And there are multiple Maven build files that reference scala-compiler:

https://github.com/apache/spark/blob/73f2853/pom.xml
https://github.com/apache/spark/blob/73f2853/repl/pom.xml
https://github.com/apache/spark/blob/73f2853/tools/pom.xml

@ashawley
Copy link
Member

It seems the Spark maintainers could either:

  1. Modify their Maven configuration to depend explicitly on a specific version of scala-xml, or
  2. Modify their Maven configuration to exclude the transitive scala-xml dependency of scala-compiler and let users add it themselves
