
question: slinc is about 3 times slower than jni (when using OpenJDK 17). Is this expected performance? #81

Open
i10416 opened this issue Feb 26, 2023 · 13 comments

Comments

@i10416
Contributor

i10416 commented Feb 26, 2023

Hello.
I ran a small comparative benchmark between slinc and JNI, and the result shows slinc is about 3 times slower than JNI.
Is this expected performance? I guess the slinc (or Panama) abstraction is not free, and I have heard there is some performance overhead for struct allocation in Panama, so I assume this overhead is expected, but I would like to hear the author's opinion to confirm.

context:

  • Scala 3.2.2
  • JVM: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
  • slinc: 0.1.1-110-7863cb
  • Apple clang version 13.1.6 (clang-1316.0.21.2.5)

src:

Benchmark               Mode  Cnt      Score      Error  Units
NativeBenchmarks.jni    avgt    5   5064.292 ±  593.829  ns/op
NativeBenchmarks.slinc  avgt    5  16882.792 ± 1172.054  ns/op
@markehammons
Collaborator

I haven't had a good comparison with JNI, so I can't say for sure. However, one thing I note is that your code in the JNI implementation doesn't seem to handle deallocation at all, while the Slinc code does on account of the confined Scope. Scope.global would give a similar effect to what's going on in the JNI version.
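
Roughly speaking, the difference looks like this (illustrative only; Ptr.blank, `useNative`, and the exact method names here are assumptions, not verified against the current API):

    // Illustrative sketch only: Ptr.blank and useNative are assumed/hypothetical
    // names, not verified Slinc 0.1.x signatures.

    // Confined scope: native allocations made inside the block are freed when
    // the block exits, so every benchmark iteration pays allocation *and*
    // deallocation.
    Scope.confined {
      val p = Ptr.blank[Int]   // hypothetical native allocation
      useNative(p)
    }

    // Global scope: nothing is ever freed, which is effectively what a JNI
    // benchmark that skips deallocation is measuring.
    Scope.global {
      val p = Ptr.blank[Int]
      useNative(p)
    }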

That being said, it's possible there are more effective ways to implement the Slinc code to get closer to JNI performance. If you'd like to contribute some JNI benchmarks to the project, I'd appreciate it!

@i10416
Contributor Author

i10416 commented Feb 26, 2023

Thank you for the feedback.

your code in the JNI implementation doesn't seem to handle deallocation at all

Ah, that's a good point. I was lazy about deallocation 😰 I will look into it.

If you'd like to contribute some JNI benchmarks to the project, I'd appreciate it!

I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy since JNI requires building a native library. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into mill and make a PR.

@i10416
Contributor Author

i10416 commented Feb 26, 2023

Panama competes with JNI, or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (and hope) it is possible to improve performance.

@markehammons
Collaborator

I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy since JNI requires building a native library. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into mill and make a PR.

I'm already doing this in some capacity for my tests, so it's not a huge issue. I'm not too worried about it overcomplicating things. If you want, we can meet on google meet and I can show you how we can extend mill to do the build of the C++ part.
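
Roughly, the kind of mill extension I have in mind looks like this (a sketch only, not this project's actual build; module names, paths, and flags are illustrative assumptions):

    // build.sc — teach mill to compile a C source into a shared library
    // before the benchmarks run.
    import mill._, scalalib._

    object bench extends ScalaModule {
      def scalaVersion = "3.2.2"

      // Compile native/bench.c into a shared library under this target's dest
      // directory and expose it as a cacheable build artifact.
      def nativeLibrary = T {
        val out = T.dest / "libbench.dylib"   // ".so" on Linux
        os.proc(
          "clang", "-shared", "-O2", "-fPIC",
          "-o", out.toString,
          (millSourcePath / "native" / "bench.c").toString
        ).call()
        PathRef(out)
      }

      // Point the forked benchmark JVM at the freshly built library.
      def forkArgs = T {
        super.forkArgs() ++ Seq(
          s"-Djava.library.path=${(nativeLibrary().path / os.up).toString}"
        )
      }
    }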

@markehammons
Collaborator

Panama competes with JNI, or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (and hope) it is possible to improve performance.

It should be possible, and one way will be to drop the usage of MethodHandleFacade, a shim I put in place while Scala 3 didn't officially support MethodHandle.invoke. Now that Scala 3 does support these methods, I should be able to get better performance by using them directly.
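
For illustration, direct invocation with the standard java.lang.invoke API looks something like this from Scala 3 (not Slinc's internal code; the ascribed result type is what determines the call descriptor for the signature-polymorphic method):

    import java.lang.invoke.{MethodHandles, MethodType}

    // Plain java.lang.invoke usage from Scala 3. Scala 3 now compiles calls to
    // the signature-polymorphic MethodHandle methods directly.
    val lookup = MethodHandles.lookup()

    val lengthHandle = lookup.findVirtual(
      classOf[String],
      "length",
      MethodType.methodType(classOf[Int])
    )

    val n = (lengthHandle.invokeExact("hello"): Int) // no boxing, no facade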

There are other things to do too, but right now the current version of Slinc is probably going to be slower. I'm currently reworking it to be better designed, less complex, and more suitable for building libraries that can be loaded by users on Java 17, 18, 19, or whatever comes next. Part of that process is giving up on trying to do compile-time optimization. Where I'm hoping to gain performance back is JIT-style code generation powered by runtime multi-stage compilation.
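
As a rough idea of the mechanism, runtime multi-stage compilation in Scala 3 looks like this (a minimal sketch using the scala3-staging module, not Slinc code):

    import scala.quoted.*
    import scala.quoted.staging.*

    // Assemble code as a quoted expression at runtime, splice in values known
    // only at runtime, and compile it with `run`.
    object StagingDemo:
      given Compiler = Compiler.make(getClass.getClassLoader)

      // Returns a function specialized to `n`, compiled at runtime.
      def adder(n: Int): Int => Int =
        run { '{ (x: Int) => x + ${ Expr(n) } } }

    @main def stagingDemo(): Unit =
      println(StagingDemo.adder(3)(4)) // prints 7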

@i10416
Contributor Author

i10416 commented Feb 28, 2023

we can meet on google meet and I can show you how we can extend mill to do the build of the C++ part.

That's great. I live in Japan now, but I plan to travel to the EU next week, so meeting next week or later would be convenient in terms of time zones. (I guess you are in the EU from your GitHub profile and .fr domain.) Thanks a lot.

By the way, https://github.com/scala-cli/libsodiumjni seems like a good example of using JNI with mill, so I'll take a look at it to learn the mill setup.

@i10416
Contributor Author

i10416 commented Mar 5, 2023

With Java 19, SlinC is nearly as fast as JNI 😉!

  • JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark               Mode  Cnt     Score    Error  Units
NativeBenchmarks.jni    avgt    5  4872.056 ±  57.582  ns/op
NativeBenchmarks.slinc  avgt    5  5607.126 ± 115.210  ns/op

i10416 changed the title from "question: slinc is about 3 times slower than jni. Is this expected performance?" to "question: slinc is about 3 times slower than jni (when using OpenJDK 17). Is this expected performance?" on Mar 5, 2023
@i10416
Contributor Author

i10416 commented Mar 7, 2023

I added a simpler benchmark, sorting 1,000,000 elements with qsort, which upcalls a JVM method from native code. It seems the upcall has a large overhead even when we use JNI.
I couldn't figure out why SlinC (or the foreign API) takes 5 times longer than JNI.

JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark                                             Description                                                        Mode  Cnt        Score        Error  Units
SimpleNativeCallBenchmarks.jniNativeQSort             native comparator                                                  avgt    5     4113.280 ±    184.594  ns/op
SimpleNativeCallBenchmarks.jniQSort                   upcall comparator, destructively mutates original array           avgt    5   281968.369 ±   4070.398  ns/op
SimpleNativeCallBenchmarks.slincQSortWithCopyBack     upcall comparator, copies and transfers array                      avgt    5  1609949.152 ± 429499.499  ns/op
SimpleNativeCallBenchmarks.slincQSortWithoutCopyBack  upcall comparator, copies and transfers array, discarding result  avgt    5  1574451.526 ± 378398.468  ns/op

https://github.com/i10416/bench#qsort-benchmark

@markehammons
Collaborator

One thing we can try, which I don't have available at the moment, is creating the upcall from a method rather than a lambda. The way the foreign API suggests creating an upcall is by targeting a method, but I used lambdas instead for ease of use.

@markehammons
Collaborator

markehammons commented Mar 7, 2023

Another thing is that I think your bench is doing a lot of extra work in Slinc. I notice that for each call you recreate the upcall, use it, then toss it away. Upcall creation is expensive, and I don't think the JNI version is recreating its upcall binding for each iteration.

Can you try allocating the upcall in a static location (not in the benchmark loop) using Scope.global?
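
Roughly the shape I have in mind, in JMH terms (placeholder helpers, not your actual bench):

    import org.openjdk.jmh.annotations.*

    // Create the upcall once per trial in @Setup instead of inside the
    // measured method, so each iteration only times the downcall itself.
    @State(Scope.Benchmark)
    class SlincQSortBench:
      var comparator: AnyRef = null

      // Stand-ins so the sketch compiles on its own; in the real benchmark
      // these would be the globally scoped upcall and the slinc qsort binding.
      def makeComparatorUpcall(): AnyRef = new AnyRef
      def nativeQSortWith(cmp: AnyRef): Unit = ()

      @Setup(Level.Trial)
      def allocateUpcall(): Unit =
        comparator = makeComparatorUpcall()   // expensive, done once per trial

      @Benchmark
      def qsortPreallocatedUpcall(): Unit =
        nativeQSortWith(comparator)           // only the call itself is measured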

@markehammons
Collaborator

markehammons commented Mar 7, 2023

Having cloned your bench and allocated the callback once (rather than once per benchmark iteration), I see an improvement in performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower. I think there may be more performance improvements to be found, but first I should make it possible to generate an upcall from a method rather than a lambda and see what the performance from that looks like.

@i10416
Contributor Author

i10416 commented Mar 7, 2023

Thank you for the feedback!

I see an improvement in performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower.

Oh, that's significant!

@i10416
Contributor Author

i10416 commented Mar 11, 2023

i10416/bench@22323c9

JFYI:

Hi, I can reproduce your performance improvement by pre-allocating the upcall on my local machine! Thanks.
