[Scala] Scala Serialization optimization #682

chaokunyang · 2023-07-18T03:18:53Z

Is your feature request related to a problem? Please describe.

Fury already supports regular Scala classes and case classes well.

But Scala still has some special serialization behaviour that needs optimization:


pjfanning · 2023-10-20T09:55:29Z

I would recommend that specialized Scala support be put in a separate jar from the Java support. The Scala Library jar is big so you don't want to have a dependency on it unless you need to. The fury-scala jar would need to be published for multiple Scala versions. Many Spark and Flink users still use Scala 2.11 so you would need to support 2.11, 2.12, 2.13 and 3. If you are not familiar with Scala, publishing the Scala version specific jars is pretty straightforward because build tools like Maven have specialized plugins for Scala.

chaokunyang · 2023-10-20T11:35:10Z

Maybe we can use java.lang.invoke.MethodHandle to load Scala classes dynamically and avoid a source dependency on the Scala library. That way, we would also avoid releasing a different jar for each Scala version.
Do you think this is feasible?
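
A minimal sketch of what such a dynamic lookup could look like, under stated assumptions: `OptionalClassAccess` and its methods are hypothetical names used only for illustration and are not part of Fury. The class is resolved by name at runtime, so nothing here needs the Scala library at compile time, and the lookup simply reports "absent" when the class is not on the classpath.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Optional;

// Hypothetical helper, for illustration only; not a Fury class.
final class OptionalClassAccess {
  /**
   * Looks up a public no-arg instance method returning int on a class that
   * may or may not be on the classpath. Returns empty when the class or the
   * method cannot be found or accessed.
   */
  static Optional<MethodHandle> findIntGetter(String className, String methodName) {
    try {
      Class<?> cls = Class.forName(className);
      MethodHandle handle = MethodHandles.publicLookup()
          .findVirtual(cls, methodName, MethodType.methodType(int.class));
      return Optional.of(handle);
    } catch (ReflectiveOperationException e) {
      // Covers ClassNotFoundException, NoSuchMethodException, IllegalAccessException.
      return Optional.empty();
    }
  }

  /** Invokes "size" on the target through a handle resolved from the given class name. */
  static int sizeOf(Object target, String className) {
    MethodHandle size = findIntGetter(className, "size")
        .orElseThrow(() -> new IllegalStateException(className + " is not available"));
    try {
      return (int) size.invoke(target);
    } catch (Throwable t) {
      throw new RuntimeException(t);
    }
  }
}
```

The same call shape would apply to a name such as `scala.collection.Iterable` when the Scala library happens to be present. This only removes the compile-time dependency; it does not by itself recover any erased type information.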

pjfanning · 2023-10-20T11:50:05Z

I think it will be hard to support reasonably complicated Scala classes using only Java reflection.

pjfanning · 2023-10-20T12:12:14Z

Supporting Serialization/Deserialization of Collection<T> is very hard with Java Reflection only. The T is erased when the T is a primitive type. Scala has its own separate reflection libraries but this is also messy because Scala 3 abandoned the scala-reflect lib available in Scala 2. New Scala 3 libs are being developed instead.

This FAQ shows some of the problems that Jackson-Module-Scala has in this area.

https://github.com/FasterXML/jackson-module-scala/wiki/FAQ

chaokunyang · 2023-10-20T13:44:30Z

> Supporting Serialization/Deserialization of Collection<T> is very hard with Java Reflection only. The T is erased when the T is a primitive type. Scala has its own separate reflection libraries but this is also messy because Scala 3 abandoned the scala-reflect lib available in Scala 2. New Scala 3 libs are being developed instead.

> This FAQ shows some of the problems that Jackson-Module-Scala has in this area.

> https://github.com/FasterXML/jackson-module-scala/wiki/FAQ

I see, thanks. I didn't realize Scala erases primitive types in nested generic types. I haven't used Scala for several years. Thanks very much for sharing this information.

This breaks our assumption about Java generics. For Java collection types, we can know that the value is a Long and write it directly as a long:

// generated pseudo-code: the value type is known to be Long ahead of time
for (Map.Entry<String, Long> e : map.entrySet()) {
  // null flag write
  stringSerializer.writeString(e.getKey());
  buffer.writeSliLong(e.getValue());
}
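
For context, here is a small sketch of how the element types of a Java field can be read through standard reflection. `GenericFieldInspector` and `Foo` are hypothetical names introduced only for this example; the reflection calls themselves (`Field.getGenericType`, `ParameterizedType.getActualTypeArguments`) are standard JDK APIs.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class GenericFieldInspector {
  // Hypothetical bean with a generic field.
  static final class Foo {
    Map<String, Long> counts;
  }

  /** Returns the declared type arguments of a field, or an empty array if it is not parameterized. */
  static Type[] typeArguments(Class<?> owner, String fieldName) {
    try {
      Field field = owner.getDeclaredField(fieldName);
      Type type = field.getGenericType();
      if (type instanceof ParameterizedType) {
        return ((ParameterizedType) type).getActualTypeArguments();
      }
      return new Type[0];
    } catch (NoSuchFieldException e) {
      throw new IllegalArgumentException(e);
    }
  }
}
```

For the Java field above, the two type arguments come back as `String` and `Long`, which is what lets a generated serializer pick the long write path statically. Per the discussion in this thread, the same lookup on a Scala field whose element type is `scala.Long` does not yield this information.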

But for Scala, since we can't know the actual type using reflection when inferring the generic type of a field of Foo, we must write the map value type for every value, then write the value:

// generated pseudo-code: the value type is unknown, so it is resolved per entry
for (Map.Entry<String, Object> e : map.entrySet()) {
  // null flag write
  stringSerializer.writeString(e.getKey());
  Object v = e.getValue();
  writeType(v.getClass());
  Serializer s = getSerializer(v); // query from map
  s.write(buffer, v); // virtual method call
}

It will be slower and introduce more space overhead.
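
The space overhead can be made concrete with a toy encoding. This is only a sketch built on `java.io.DataOutputStream`, not Fury's wire format: `TypeTagOverhead`, the one-byte tags, and both methods are assumptions made for illustration, and null flags are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

// Toy encoding for illustration only; this is not Fury's format.
final class TypeTagOverhead {
  private static final int TAG_LONG = 1;
  private static final int TAG_STRING = 2;

  /** Value type known ahead of time: each entry is a key plus 8 bytes. */
  static byte[] writeKnown(Map<String, Long> map) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Map.Entry<String, Long> e : map.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeLong(e.getValue());
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return bytes.toByteArray();
  }

  /** Value type unknown: each value is preceded by a one-byte type tag. */
  static byte[] writeTagged(Map<String, Object> map) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (Map.Entry<String, Object> e : map.entrySet()) {
        out.writeUTF(e.getKey());
        Object v = e.getValue();
        if (v instanceof Long) {
          out.writeByte(TAG_LONG);
          out.writeLong((Long) v);
        } else if (v instanceof String) {
          out.writeByte(TAG_STRING);
          out.writeUTF((String) v);
        } else {
          throw new IllegalArgumentException("unsupported type: " + v.getClass());
        }
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return bytes.toByteArray();
  }
}
```

In this toy format the tagged variant costs one extra byte per value, plus a runtime type check on every element. A real type descriptor is usually larger than one byte, so the gap would tend to be wider.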

pjfanning · 2023-10-20T13:49:00Z

It may only be necessary to write the inner type info for Collection<T> when the T is a primitive (long, int, etc.).

It may be possible to use Scala libs to find out what the erased types are, but it is pretty complicated and you would likely need a separate solution for Scala 2.x and Scala 3.

pjfanning · 2023-10-20T13:54:49Z

You could experiment with List[java.lang.Long]. You should find that the java.lang.Long is not erased. The erasure issue occurs for List[scala.Long] though. scala.Long is effectively an alias for the Java primitive type long.

Unfortunately, scala.Long is commonly used - so it is difficult to ignore the effect of this erasure.

chaokunyang · 2023-10-20T13:58:22Z

> It may only be necessary to write the inner type info for Collection<T> when the T is a primitive (long, int, etc.).

> It may be possible to use Scala libs to find out what the erased types are, but it is pretty complicated and you would likely need a separate solution for Scala 2.x and Scala 3.

Our current protocol will write it only once, see #923, but without this information ahead of time, serializing the elements will introduce a virtual method call, which is slower. JIT optimization is how Fury gets its performance boost, so it would be better if we could support this for Scala, considering how widely it is used in Spark/Flink/Akka.
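
As a rough illustration of the "write it only once" idea, the check below finds a single runtime class shared by all elements, which a writer could then emit once as a header. This is a sketch under assumptions, not the protocol from #923: `HomogeneousCheck` and `commonElementClass` are hypothetical names.

```java
import java.util.List;

// Hypothetical helper, for illustration only; not the protocol from #923.
final class HomogeneousCheck {
  /**
   * Returns the runtime class shared by every element, or null when the list
   * is empty, contains a null element, or mixes classes.
   */
  static Class<?> commonElementClass(List<?> elements) {
    Class<?> common = null;
    for (Object e : elements) {
      if (e == null) {
        return null;
      }
      if (common == null) {
        common = e.getClass();
      } else if (common != e.getClass()) {
        return null;
      }
    }
    return common;
  }
}
```

Even when the class is written only once, the serializer for it is still resolved at runtime and invoked through a virtual call, whereas a statically known element type lets generated code call the concrete write method directly.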

chaokunyang · 2023-10-20T13:59:27Z

> You could experiment with List[java.lang.Long]. You should find that the java.lang.Long is not erased. The erasure issue occurs for List[scala.Long] though. scala.Long is effectively an alias for the Java primitive type long.

> Unfortunately, scala.Long is commonly used - so it is difficult to ignore the effect of this erasure.

Yes, we shouldn't ignore this type information, otherwise the performance won't be the best it can be.

chaokunyang · 2023-10-20T14:00:25Z

Another thing I found is that Scala collections don't implement the Java collection interfaces Iterable/Collection/Set/Map, which makes integration with Fury's collection JIT hard to implement, since we have no Java Collection base interface through which to call into a Scala collection.

One approach I can see is to convert Scala collections to Java collections in the generated serializer (which comes with object creation overhead), or to implement separate JIT support for such collections.

pjfanning · 2023-10-20T14:01:36Z

Converting Scala collections to Java collections and vice versa will not be cheap.

chaokunyang · 2023-10-20T14:09:16Z

With your new input, I totally agree that we should add Scala optimization support in a new library. This is complicated work, and Scala collections are much more complex than the Java collection framework. We must write the implementation using the Scala collection API, otherwise this work will be too much.

chaokunyang · 2023-10-20T14:19:50Z

Currently, if a Scala class has no fields of Scala collection types, the performance in Fury is good. Case classes are supported by Fury natively. But if collection types are included, serialization performance will not be good, since we can't use any collection generic information.

Better Scala serialization support should be done in a new jar. Perhaps we also need to add a new adapter interface in Java to let Scala collections hook into Fury's Java codegen. Currently we do Java codegen in io.fury.builder.BaseObjectCodecBuilder#serializeForCollection. Our codegen generates Java code only. Maybe we need to adjust the JIT and let it invoke the new adapter interface to read or populate a Scala collection. In this way, we can avoid generating Scala code in our JIT framework.
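
One possible shape for such an adapter, as a sketch only: every name below (`CollectionAdapter`, `Builder`, `JavaListAdapter`, `AdapterDemo`) is hypothetical and none of it is an existing Fury interface. The point is that generated Java code would talk only to the adapter, while a Scala-side jar would supply implementations backed by Scala collection builders.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical adapter that hides the concrete collection type from generated code. */
interface CollectionAdapter<C> {
  int size(C collection);

  Iterator<?> iterator(C collection);

  Builder<C> newBuilder(int expectedSize);

  /** Accumulates elements during deserialization and yields the final collection. */
  interface Builder<C> {
    void add(Object element);

    C result();
  }
}

/** Stand-in implementation over java.util.List; a Scala module would provide its own. */
final class JavaListAdapter implements CollectionAdapter<List<Object>> {
  @Override
  public int size(List<Object> collection) {
    return collection.size();
  }

  @Override
  public Iterator<?> iterator(List<Object> collection) {
    return collection.iterator();
  }

  @Override
  public Builder<List<Object>> newBuilder(int expectedSize) {
    final List<Object> target = new ArrayList<>(expectedSize);
    return new Builder<List<Object>>() {
      @Override
      public void add(Object element) {
        target.add(element);
      }

      @Override
      public List<Object> result() {
        return target;
      }
    };
  }
}

final class AdapterDemo {
  /** Reads through the adapter and rebuilds through it, as generated code might do. */
  static <C> C copy(CollectionAdapter<C> adapter, C source) {
    CollectionAdapter.Builder<C> builder = adapter.newBuilder(adapter.size(source));
    for (Iterator<?> it = adapter.iterator(source); it.hasNext(); ) {
      builder.add(it.next());
    }
    return builder.result();
  }
}
```

A builder-style interface is used here because many Scala collections are immutable and cannot be populated in place; whether this fits the actual codegen is an open design question.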

chaokunyang mentioned this issue Jul 18, 2023

[Java] JVM languages serialization optimization #681

Open

4 tasks

chaokunyang changed the title ~~Add specialized serializer for Scala~~ Add specialized serializers for Scala Jul 18, 2023

chaokunyang added java enhancement New feature or request labels Jul 18, 2023

chaokunyang mentioned this issue Oct 31, 2023

[Scala] Setup scala project #1055

Closed

chaokunyang added the good first issue Good for newcomers label Nov 2, 2023

chaokunyang changed the title ~~Add specialized serializers for Scala~~ [Scala] Scala Serialization optimization Nov 4, 2023

chaokunyang mentioned this issue Nov 4, 2023

[Scala] support scala collection jit serialization #1076

Closed
