[Scala] Scala Serialization optimization #682

Open
4 of 6 tasks
Tracked by #681
chaokunyang opened this issue Jul 18, 2023 · 13 comments
Labels
enhancement New feature or request good first issue Good for newcomers java

Comments

@chaokunyang
Collaborator

chaokunyang commented Jul 18, 2023

Is your feature request related to a problem? Please describe.

Fury already supports ordinary Scala classes and case classes well.

But Scala still has some special serialization behaviour that needs optimization:

@chaokunyang chaokunyang changed the title Add specialized serializer for Scala Add specialized serializers for Scala Jul 18, 2023
@chaokunyang chaokunyang added java enhancement New feature or request labels Jul 18, 2023
@pjfanning
Contributor

I would recommend that specialized Scala support be put in a separate jar from the Java support. The Scala Library jar is big so you don't want to have a dependency on it unless you need to. The fury-scala jar would need to be published for multiple Scala versions. Many Spark and Flink users still use Scala 2.11 so you would need to support 2.11, 2.12, 2.13 and 3. If you are not familiar with Scala, publishing the Scala version specific jars is pretty straightforward because build tools like Maven have specialized plugins for Scala.

@chaokunyang
Collaborator Author

chaokunyang commented Oct 20, 2023

Maybe we can use java.lang.invoke.MethodHandle to load Scala classes dynamically and avoid a compile-time dependency on the Scala library. That way we would also avoid releasing a different jar for each Scala version.
Do you think this is feasible?
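
To make the idea concrete, here is a minimal sketch of calling a method through a MethodHandle on a class that is looked up by name only. The class name `DynamicCallSketch` and the helper `callSize` are made up for illustration. A JDK interface stands in for the Scala type so the example runs without the Scala library; for Scala, the name would be something like `scala.collection.Iterable`, which is an assumption here and not tested.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.ArrayList;
import java.util.List;

final class DynamicCallSketch {
  // Looks up `int size()` on a class known only by name, so this code has no
  // compile-time dependency on that class. A real implementation would cache
  // the handle instead of resolving it on every call.
  static int callSize(String className, Object target) {
    try {
      MethodHandle size = MethodHandles.publicLookup().findVirtual(
          Class.forName(className), "size", MethodType.methodType(int.class));
      // invoke() casts the receiver to the handle's declared type at call time.
      return (int) size.invoke(target);
    } catch (Throwable t) {
      throw new IllegalStateException("dynamic call failed", t);
    }
  }

  public static void main(String[] args) {
    List<String> list = new ArrayList<>(List.of("a", "b", "c"));
    // java.util.Collection is a stand-in for a Scala collection type.
    System.out.println(callSize("java.util.Collection", list)); // prints 3
  }
}
```

This only shows that the call itself is possible without a source dependency. It says nothing about recovering generic type information, which is a separate problem.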

@pjfanning
Contributor

pjfanning commented Oct 20, 2023

I think it will be hard to support reasonably complicated Scala classes using only Java reflection.

@pjfanning
Contributor

pjfanning commented Oct 20, 2023

Supporting Serialization/Deserialization of Collection<T> is very hard with Java Reflection only. The T is erased when the T is a primitive type. Scala has its own separate reflection libraries but this is also messy because Scala 3 abandoned the scala-reflect lib available in Scala 2. New Scala 3 libs are being developed instead.

This FAQ shows some of the problems that Jackson-Module-Scala has in this area.

https://github.com/FasterXML/jackson-module-scala/wiki/FAQ
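
A small sketch of what this looks like from the Java reflection side. The two holder classes are hand-written stand-ins: `JavaStyle` is an ordinary Java field, and `ScalaStyle` imitates the erased signature described above for a Scala field whose value type is a primitive such as scala.Long. That second class is an assumption for illustration, not actual Scala compiler output.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;

final class ErasureSketch {
  // A Java field with a boxed value type: the signature keeps java.lang.Long.
  static class JavaStyle {
    Map<String, Long> values;
  }

  // Stand-in (assumption) for a Scala field with a primitive value type:
  // the type argument is visible to Java reflection only as Object.
  static class ScalaStyle {
    Map<String, Object> values;
  }

  // Returns the map's value type as recorded in the field's generic signature.
  static Type valueType(Class<?> owner) {
    try {
      Field f = owner.getDeclaredField("values");
      ParameterizedType pt = (ParameterizedType) f.getGenericType();
      return pt.getActualTypeArguments()[1];
    } catch (NoSuchFieldException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(valueType(JavaStyle.class));  // class java.lang.Long
    System.out.println(valueType(ScalaStyle.class)); // class java.lang.Object
  }
}
```

With only `Object` available, a reflection-based serializer cannot pick a specialized element serializer ahead of time.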

@chaokunyang
Collaborator Author

chaokunyang commented Oct 20, 2023

> Supporting Serialization/Deserialization of Collection<T> is very hard with Java Reflection only. The T is erased when the T is a primitive type. Scala has its own separate reflection libraries but this is also messy because Scala 3 abandoned the scala-reflect lib available in Scala 2. New Scala 3 libs are being developed instead.
>
> This FAQ shows some of the problems that Jackson-Module-Scala has in this area.
>
> https://github.com/FasterXML/jackson-module-scala/wiki/FAQ

I see, thanks. I didn't realize Scala erases primitive types in nested generic types. I haven't used Scala for several years. Thanks very much for sharing this information.

This breaks our assumption about Java generics. For Java collection types, we know the value is a Long and can write it as a long.

// generated pseudo-code
for (Map.Entry<String, Long> e : map.entrySet()) {
  // null flag write
  stringSerializer.writeString(e.getKey());
  buffer.writeSliLong(e.getValue());
}

But for Scala, since reflection cannot tell us the actual type when we infer the generic type of the Foo field, we must write the map value type for every value, and then write the value:

// generated pseudo-code
for (Map.Entry<String, Object> e : map.entrySet()) {
  // null flag write
  stringSerializer.writeString(e.getKey());
  Object v = e.getValue();
  writeType(v.getClass());
  Serializer s = getSerializer(v); // looked up from a map
  s.write(buffer, v); // virtual method call
}

It will be slower and introduce more space overhead.
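
A runnable sketch of the space difference, using java.nio.ByteBuffer in place of Fury's buffer. The one-byte `LONG_TAG` and the length-prefixed string layout are invented for this illustration and are not Fury's wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class MapWriteSketch {
  static final byte LONG_TAG = 1; // hypothetical type id, not Fury's encoding

  static void writeString(ByteBuffer buf, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    buf.putInt(bytes.length).put(bytes);
  }

  // Value type known ahead of time: each value is written as a raw long.
  static int writeTyped(ByteBuffer buf, Map<String, Long> map) {
    int start = buf.position();
    for (Map.Entry<String, Long> e : map.entrySet()) {
      writeString(buf, e.getKey());
      buf.putLong(e.getValue());
    }
    return buf.position() - start;
  }

  // Value type only known as Object: a type tag precedes every value and the
  // writer has to branch on the runtime class.
  static int writeUntyped(ByteBuffer buf, Map<String, Object> map) {
    int start = buf.position();
    for (Map.Entry<String, Object> e : map.entrySet()) {
      writeString(buf, e.getKey());
      Object v = e.getValue();
      if (v instanceof Long) {
        buf.put(LONG_TAG).putLong((Long) v);
      } else {
        throw new IllegalArgumentException("unsupported value in this sketch: " + v);
      }
    }
    return buf.position() - start;
  }

  public static void main(String[] args) {
    Map<String, Long> typed = new LinkedHashMap<>();
    typed.put("a", 1L);
    typed.put("b", 2L);
    Map<String, Object> untyped = new LinkedHashMap<>(typed);
    System.out.println(writeTyped(ByteBuffer.allocate(64), typed));     // 26
    System.out.println(writeUntyped(ByteBuffer.allocate(64), untyped)); // 28
  }
}
```

In this toy layout the untyped path costs one extra byte per entry, on top of the per-value type check.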

@pjfanning
Contributor

pjfanning commented Oct 20, 2023

It may only be necessary to write the inner type info for Collection<T> when the T is a primitive (long, int, etc.).

It may be possible to use Scala libs to find out what the erased types are, but it is pretty complicated and you would likely need a separate solution for Scala 2.x and Scala 3.

@pjfanning
Contributor

You could experiment with List[java.lang.Long]. You should find that the java.lang.Long is not erased. The erasure issue occurs for List[scala.Long] though. scala.Long is effectively an alias for the Java primitive type long.

Unfortunately, scala.Long is commonly used - so it is difficult to ignore the effect of this erasure.

@chaokunyang
Collaborator Author

> It may only be necessary to write the inner type info for Collection<T> when the T is a primitive (long, int, etc.).
>
> It may be possible to use Scala libs to find out what the erased types are, but it is pretty complicated and you would likely need a separate solution for Scala 2.x and Scala 3.

Our current protocol writes it only once (see #923), but without this information available ahead of time, serializing the elements requires a virtual method call per element, which is slower. JIT optimization is where Fury gets its speedup, so it would be better if we could support this for Scala too, considering how widely it is used in Spark/Flink/Akka.
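
As a rough illustration of "write it only once", the sketch below writes the element type a single time when all elements share a class, and falls back to a tag per element otherwise. The flags, tags and layout are invented for this example and do not describe the protocol in #923.

```java
import java.nio.ByteBuffer;
import java.util.List;

final class TypeOnceSketch {
  // Hypothetical header flags and type tags, for illustration only.
  static final byte MIXED = 0;
  static final byte SAME_TYPE = 1;
  static final byte LONG_TAG = 1;
  static final byte INT_TAG = 2;

  static byte tagOf(Object v) {
    if (v instanceof Long) return LONG_TAG;
    if (v instanceof Integer) return INT_TAG;
    throw new IllegalArgumentException("unsupported element in this sketch: " + v);
  }

  // Still dispatches on the runtime class of every element.
  static void writeValue(ByteBuffer buf, Object v) {
    if (v instanceof Long) {
      buf.putLong((Long) v);
    } else {
      buf.putInt((Integer) v);
    }
  }

  // Returns the number of bytes written.
  static int write(ByteBuffer buf, List<?> elems) {
    int start = buf.position();
    buf.putInt(elems.size());
    boolean sameType = !elems.isEmpty();
    for (Object e : elems) {
      if (e.getClass() != elems.get(0).getClass()) {
        sameType = false;
        break;
      }
    }
    if (sameType) {
      buf.put(SAME_TYPE).put(tagOf(elems.get(0))); // type written once
      for (Object e : elems) writeValue(buf, e);
    } else {
      buf.put(MIXED);
      for (Object e : elems) {
        buf.put(tagOf(e)); // type written per element
        writeValue(buf, e);
      }
    }
    return buf.position() - start;
  }

  public static void main(String[] args) {
    System.out.println(write(ByteBuffer.allocate(64), List.of(1L, 2L, 3L))); // 30
    System.out.println(write(ByteBuffer.allocate(64), List.of(1L, 2)));      // 19
  }
}
```

Writing the type once removes most of the space overhead, but `writeValue` is still chosen at runtime for every element, which is the cost that knowing the element type ahead of time would avoid.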

@chaokunyang
Collaborator Author

> You could experiment with List[java.lang.Long]. You should find that the java.lang.Long is not erased. The erasure issue occurs for List[scala.Long] though. scala.Long is effectively an alias for the Java primitive type long.
>
> Unfortunately, scala.Long is commonly used - so it is difficult to ignore the effect of this erasure.

Yes, we shouldn't ignore this type information, otherwise the performance won't be the best.

@chaokunyang
Collaborator Author

Another thing I found is that Scala collections don't implement the Java collection interfaces Iterable/Collection/Set/Map, which makes integration with Fury's collection JIT hard to implement, since there is no Java Collection base interface through which the generated code can call into a Scala collection.

One approach I can see is to convert the Scala collection to a Java collection in the generated serializer (which comes with object creation overhead), or to implement separate JIT support for such collections.

@pjfanning
Contributor

Converting Scala collections to Java collections and vice versa will not be cheap.

@chaokunyang
Collaborator Author

chaokunyang commented Oct 20, 2023

Now with your new input, I totally agree that we should add Scala optimization support in a new library. This is complicated work, and Scala collections are much more complex than the Java collections framework. We must write the implementation using the Scala collection API, otherwise it will be too much work.

@chaokunyang
Collaborator Author

Currently, if a Scala class has no fields of Scala collection types, performance in Fury is good. Case classes are supported by Fury natively. But if collection types are included, serialization performance will be poor, since we can't use any collection generic information.

Better Scala serialization support should be done in a new jar. Perhaps we also need to add a new adapter interface in Java to let Scala collections hook into Fury's Java codegen. Currently we do Java codegen in io.fury.builder.BaseObjectCodecBuilder#serializeForCollection. Our codegen generates Java code only. Maybe we need to adjust the JIT and let it invoke the new adapter interface to read or populate a Scala collection. In this way, we can avoid generating Scala code in our JIT framework.
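
One possible shape for such an adapter, as a sketch only. `CollectionAdapter`, `IntBag` and `IntBagAdapter` are hypothetical names and not existing Fury types. `IntBag` stands in for a collection that does not implement java.util.Collection, the way a Scala collection would not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class CollectionAdapterSketch {
  // Hypothetical adapter the generated Java code could call instead of
  // java.util.Collection methods.
  interface CollectionAdapter<C> {
    int size(C collection);
    Iterator<?> iterator(C collection);
    C fromElements(List<Object> elements); // rebuilds the collection on read
  }

  // Stand-in for a collection type outside the Java collection hierarchy.
  static final class IntBag {
    final int[] items;
    IntBag(int... items) { this.items = items; }
  }

  static final class IntBagAdapter implements CollectionAdapter<IntBag> {
    public int size(IntBag bag) { return bag.items.length; }
    public Iterator<?> iterator(IntBag bag) {
      return Arrays.stream(bag.items).boxed().iterator();
    }
    public IntBag fromElements(List<Object> elements) {
      int[] out = new int[elements.size()];
      for (int i = 0; i < out.length; i++) out[i] = (Integer) elements.get(i);
      return new IntBag(out);
    }
  }

  // What generated code might do on the write path: it only sees the adapter.
  static <C> List<Object> drain(CollectionAdapter<C> adapter, C collection) {
    List<Object> out = new ArrayList<>(adapter.size(collection));
    Iterator<?> it = adapter.iterator(collection);
    while (it.hasNext()) out.add(it.next());
    return out;
  }

  public static void main(String[] args) {
    IntBagAdapter adapter = new IntBagAdapter();
    List<Object> elements = drain(adapter, new IntBag(1, 2, 3));
    IntBag copy = adapter.fromElements(elements);
    System.out.println(Arrays.toString(copy.items)); // [1, 2, 3]
  }
}
```

A Scala-side jar could then supply adapter implementations written against the Scala collection API, while the JIT keeps emitting Java code only. The boxing in `iterator` shows that an interface this simple would still give up the primitive fast path.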

@chaokunyang chaokunyang added the good first issue Good for newcomers label Nov 2, 2023
@chaokunyang chaokunyang changed the title Add specialized serializers for Scala [Scala] Scala Serialization optimization Nov 4, 2023