Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't deserialize scala.List[M] if scala-library is not in the same classloader as M #9237

Closed
scabug opened this issue Mar 19, 2015 · 18 comments

Comments

@scabug
Copy link

scabug commented Mar 19, 2015

List has some serialization problems.

Take a look at the minimized version here.

https://github.com/hamnis/minimized-list-serialization

This seems to be triggered by classloaders, not that I understand why that is the case.
Comment in the fork in Test in build.sbt, then it works.

@scabug
Copy link
Author

scabug commented Mar 19, 2015

Imported From: https://issues.scala-lang.org/browse/SI-9237?orig=1
Reporter: @hamnis
Affected Versions: 2.10.5, 2.11.5

@scabug
Copy link
Author

scabug commented Mar 19, 2015

@retronym said:
Thanks for the small reproduction. I'm digging into it.

Here's a related SBT ticket: sbt/sbt#89

@scabug
Copy link
Author

scabug commented Mar 19, 2015

@retronym said:
The core problem is that SBT, by deafault, has scala-library.jar in a parent classloader of the application classes.

When we deserialize a List, we go through custom deserizalization code in object:

    // Java serialization calls this before readResolve during de-serialization.
    // Read the whole list and store it in `orig`.
    private def readObject(in: ObjectInputStream) {
      val builder = List.newBuilder[A]
      while (true) in.readObject match {
        case ListSerializeEnd =>
          orig = builder.result()
          return
        case a =>
          builder += a.asInstanceOf[A]
      }
    }

The ObjectInputStream provided by the JVM here internally resolves classes with the classloader the owner of closest enclosing method in the call stack. This is handled in ObjectInputStream#resolveClass.

protected Class<?> resolveClass(ObjectStreamClass desc)
	throws IOException, ClassNotFoundException
    {
	String name = desc.getName();
	try {
	    return Class.forName(name, false, latestUserDefinedLoader());
	} catch (ClassNotFoundException ex) {
	    Class cl = (Class) primClasses.get(name);
	    if (cl != null) {
		return cl;
	    } else {
		throw ex;
	    }
	}
    }

This ticket against OpenJDK suggests that this behaviour is buggy and instead the current thread's context class loader should be used. But the response is that the current behaviour is as specified.
https://bugs.openjdk.java.net/browse/JDK-4340158

Here's a further minimization:

// src/test/scala/issue/Meh.scala
class Meh
// src/test/scala/issue/Main.scala
package issue

import java.io.{ObjectOutputStream, ObjectInputStream, ByteArrayOutputStream, ByteArrayInputStream}

object Test {
  def main(args: Array[String]): Unit = {
    val obj = List(new Meh)
    val arr = serialize(obj)
    val obj2 = deserialize[List[Meh]](arr)
    assert(obj == obj2)
  }

  def serialize[A](obj: A): Array[Byte] = {
    val o = new ByteArrayOutputStream()
    val os = new ObjectOutputStream(o)
    os.writeObject(obj)
    o.toByteArray()
  }

  def deserialize[A](bytes: Array[Byte]): A = {
    val s = new ByteArrayInputStream(bytes)
    val is = new ObjectInputStream(s)
    is.readObject().asInstanceOf[A]
  }
}

I'm not sure if we can do anything to fix this in our custom serialization code. I would appreciate ideas from serialization experts.

@scabug
Copy link
Author

scabug commented Mar 19, 2015

@hamnis said:
What is interesting is that this works fine for Vector. I have only found this problem with List

@scabug
Copy link
Author

scabug commented Mar 19, 2015

@retronym said:
For Vector, we just rely on vanilla serialization. List, on the other hand, needs to use the Serialization Proxy pattern (http://www.javacodegeeks.com/2015/01/the-serialization-proxy-pattern.html). This forces us to use the ObjectInputStream provided by the platform which has these classloader assumptions baked in.

@scabug
Copy link
Author

scabug commented Apr 8, 2015

@retronym said:
/cc @scottcarey for any insights.

@scabug
Copy link
Author

scabug commented Apr 8, 2015

@scottcarey said:
List is not the only thing that uses this pattern. List got its inspiration from immutable.HashMap and others.

This may happen in a lot more places than List.

https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/immutable/HashMap.scala#L569

what happens if we change

val is = new ObjectInputStream(s)

to use a custom extended ObjectInputStream that either overrides that method or changes 'latestUserDefinedLoader'?

If that works, it would be a work-around on the user side -- not much we can do here if something beyond our control is trying to load classes it can't see.

@scabug
Copy link
Author

scabug commented Apr 8, 2015

@scottcarey said:
This appears to be a common problem.

Others have implemented things like a "ClassloaderAwareObjectInputStream". Essentially, this appears to be a flaw with ObjectInputStream's design.

Also see the results when searching google for 'latestUserDefinedLoader'.

The common solution is for users to extend ObjectInputStream and override resolveClass. This turns out to be non-trivial as well, see bugs in other libraries such as: prevayler/prevayler#10

@scabug
Copy link
Author

scabug commented Apr 9, 2015

@retronym said (edited on Apr 9, 2015 5:09:57 AM UTC):
I wonder why this pattern works for java.util.TreeMap (which also uses the Serialization Proxy pattern) but not for List.

Here's a modified version of Erlend's test project that contrasts them. Output in the commit comment:

retronym/minimized-list-serialization@cbe0d9d

@scabug
Copy link
Author

scabug commented Apr 9, 2015

@retronym said (edited on Apr 9, 2015 5:50:34 AM UTC):
Oh, I suppose that latestUserDefinedLoader skips the classloader that provides TreeMap because it is provided by the bootstrap classloader.

    /**
     * Returns the first non-null class loader (not counting class loaders of
     * generated reflection implementation classes) up the execution stack, or
     * null if only code from the null class loader is on the stack.  This
     * method is also called via reflection by the following RMI-IIOP class:
     * 
     *     com.sun.corba.se.internal.util.JDKClassLoader
     *     
     * This method should not be removed or its signature changed without
     * corresponding modifications to the above class.
     */
    // REMIND: change name to something more accurate?
    private static native ClassLoader latestUserDefinedLoader();
scala> classOf[java.util.TreeMap[_, _]].getClassLoader
res1: ClassLoader = null

So you can have scala-library.jar in the boot class loader, or in one that has access to the classes you want to put in your collection, but you can't have it in between.

Delightful stuff!

@scabug
Copy link
Author

scabug commented Apr 25, 2015

@Ichoran said:
[~retronym] The best workaround I can come up with is to override only {{readResolve}} to fix things afterwards, which doubles the amount of garbage. But I'm not sure that we are actually using the serialization proxy pattern for the reason stated in the source code. The reader code uses {{List.newBuffer}} which doesn't avoid mutation. If the whole point of the proxy pattern for {{List}} is to avoid mutation, and we mutate anyway, shouldn't we just drop the serialization proxy pattern and let Java use normal serialization? (Or if not, shouldn't we write the list in reverse order so we can build it immutably when we build it?) If it's just {{Nil}} that we care about keeping straight, that can be {{readResolve}}d.

@scabug
Copy link
Author

scabug commented Apr 27, 2015

@scottcarey said:
@Ichoran
The serialization was changed to this pattern due to changing hd from var to val:

scala/scala@a5fc6e6

The old serialization code mutated both list and hd.

I had plans to move tail to val as well, and stop using ListBuffer, but there was resistance since it requires changing the compiler in a couple places to not mutate it, and requires performance validation that the techniques used to become fully immutable do not hurt performance in various cases. I still feel that there will be about as many wins as losses if done right, as there are performance opportunities in places like map and fold.

Without the serialization proxy pattern, Java serialization will create an instance of the object without executing its constructor, and then readObject can mutate its fields. If you have a val member, you can't use readObject because Scala won't let you mutate a val.

Reversing the list on serialization will be slow on the other end when writing.

@scabug
Copy link
Author

scabug commented Apr 27, 2015

@Ichoran said:
@scottcarey Given that the change has caused a bug, what is the reason we shouldn't just revert the change? Were any bugs fixed by making that change? (I grant that in principle it's safer to have at least hd be a val, but if we think we might change the logic that much, we can achieve a similar level of safety by changing the var to a val and making sure that the class still compiles.)

@scabug
Copy link
Author

scabug commented Apr 29, 2015

@scottcarey said (edited on Apr 29, 2015 12:34:37 AM UTC):
Does this bug not exist for immutable.HashMap as well? Or the other places in the library that use this pattern?

To some extent, this is not a bug at all, its just how Java Serialization works.

@Ichoran I can't quite parse 'we can achieve a similar level of safety by changing the var to a val and making sure that the class still compiles'. The old code would not compile after only changing var to val, and so the pattern used in immutable.HashMap (and elsewhere) was applied to List.

This all started here:
https://groups.google.com/forum/?fromgroups=#!topic/scala-internals/P7WNVkJQdHk

The last few messages on that thread by me cover a few other options -- reflection and/or Unsafe can modify a val (final field), but will break if there is a security context that does not allow it. Those options aren't so great. Rex, you have a comment in that thread that not using ListBuffer (and using Array instead) is much faster and that is part of my inspiration to make the tail a val as well, but replacing the use of ListBuffer will require a lot of convincing others with performance numbers. (then the two other instances in the compiler / reflection that modify the val need to change).

If we simply revert the change, we are either admitting that List will always be mutable and https://gist.github.com/jrudolph/6552186 will never be fixed, or deferring this 'bug' to the future when it becomes immutable.

I had a local branch where the above was fixed and list was immutable. The library worked fine, but needed performance work since ListBuffer was purposely broken (performance wise) to test it. I did not continue any work because there was resistance to changing anything else at the time. The discussion here: scala/scala#3252 Is the best place to see where things are at.

I would be interested in picking this back up some time, but worry that the compiler and library still do not have enough performance testing tools to convince people that I'm not breaking things, and I don't want to embark on a big chunk of work without feeling like there is a chance it will get anywhere.

@dknoester
Copy link

Hi all,

This seems related to a bug I've recently reported to Apache Spark: https://issues.apache.org/jira/browse/SPARK-20525

Was hoping someone here might be able to help - Thanks in advance!

@SethTisue
Copy link
Member

putting this on the 2.13.0-RC1 milestone because in the 2.13 collections, most/all collections are affected, instead of only List

however, I also think we should just close this as "not a bug". as @scottcarey wrote, "this is not a bug at all, its just how Java Serialization works"

but we do need to make sure we've documented the issue in e.g. the 2.13 collections migration guide

@SethTisue
Copy link
Member

scala/scala#7624

@SethTisue
Copy link
Member

SethTisue commented Feb 20, 2019

that PR is marked with the release-notes label, which should serve as sufficient reminder to include it in migration doc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants