Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate possible AOT performance issues #31307

Closed
bclozel opened this issue Sep 25, 2023 · 8 comments
Closed

Investigate possible AOT performance issues #31307

bclozel opened this issue Sep 25, 2023 · 8 comments
Assignees
Labels
in: core Issues in core modules (aop, beans, core, context, expression) theme: aot An issue related to Ahead-of-time processing type: task A general task

Comments

@bclozel
Copy link
Member

bclozel commented Sep 25, 2023

As reported by @shodo in #30431

I've finally created a fake/dummy service that (at least on my PC) shows result similar to our production service.
The repo is available here: https://github.com/shodo/fake-service

There is a readme that explain a little bit how is structured and how to run it, the result I'm obtaining are the following:

  • standard mode: c.g.s.f.webapp.FakeServiceApplicationKt : Started FakeServiceApplicationKt in 13.056 seconds (process running for 14.914)
  • indexer mode: c.g.s.f.webapp.FakeServiceApplicationKt : Started FakeServiceApplicationKt in 8.805 seconds (process running for 9.753)
  • AOT mode: c.g.s.f.webapp.FakeServiceApplicationKt : Started FakeServiceApplicationKt in 13.984 seconds (process running for 14.813)

I've done many runs, the AOT one seems the most variable, some time 11seconds, some time 16seconds, while the standard and the indexer are more stable.

We will investigate this sample application to find out why in this case AOT is not faster than the context-indexer.

@bclozel bclozel added status: waiting-for-triage An issue we've not yet triaged or decided on type: task A general task labels Sep 25, 2023
@shodo
Copy link

shodo commented Sep 25, 2023

Thanks @bclozel and @snicoll! I really appreciate your help!
I know that the sample application is quite complex: I started from a production one and wrote fake services but preserved a similar structure to have some degree of "fidelity" in the startup behavior.

I can try in a new branch to reduce the package nesting in the core module, eliminate some unuseful classes, and check if I have similar results so that the sample is a bit easier to investigate.

@kobaeugenea
Copy link
Contributor

kobaeugenea commented Sep 26, 2023

Hey everyone.
Don't know if it helps but I profiled @shodo code a little and noticed next.
Significant diffrence in time execution is placed here in AbstractAutowireCapableBeanFactory:

		Supplier<?> instanceSupplier = mbd.getInstanceSupplier();
		if (instanceSupplier != null) {
			return obtainFromSupplier(instanceSupplier, beanName, mbd);
		}

		if (mbd.getFactoryMethodName() != null) {
			return instantiateUsingFactoryMethod(beanName, mbd, args);
		}

If we use AOT then instance supplier is used for creating bean instances. In case of indexer instantiateUsingFactoryMethod is used. And by some reason instantiateUsingFactoryMethod works twice faster at least for me.

And a little bit more info:
For me app with indexer starts for ~5 secs and app with AOT for ~7 sec.
obtainFromSupplier takes ~1236ms, where instantiateUsingFactoryMethod takes ~540 ms.
0.6 secs of this difference is taken by
238 private AutowiredArguments resolveArguments(RegisteredBean registeredBean, Executable executable) of BeanInstanceSupplier.

@snicoll snicoll removed the status: waiting-for-triage An issue we've not yet triaged or decided on label Oct 25, 2023
@snicoll snicoll added this to the 6.2.x milestone Oct 25, 2023
@snicoll snicoll added in: core Issues in core modules (aop, beans, core, context, expression) theme: aot An issue related to Ahead-of-time processing labels Oct 25, 2023
@dreis2211
Copy link
Contributor

dreis2211 commented Nov 20, 2023

Hello,

for transparency reasons I should add that @snicoll approached me and asked if I could (look into/help with) that and there has been some email communication around it, where I shared some findings already. I will try to summarize these findings now here...I should also note that I'm only a community member, not a part of the Spring team. So take all of the following with a grain of salt.

Moreover, I don't want to argue against the timing difference between obtainFromSupplier and instantiateUsingFactoryMethod that was found earlier. I haven't found it to be the real root cause, but there might be indeed an issue. Probably not the most relevant one as you will notice by the end of this comment.

First investigation step: Async-Profiler

I've profiled the AOT example vs. the indexed example with the async-profiler with several modes and settings. E.g.:

  • CPU, Wall & alloc mode
  • With and without disabled tier compilation via -XX:TieredStopAtLevel=1 to "slim" down the profiles a bit more and focus on the application code rather than on JVM internals.

All of the profilings unfortunately didn't show up any major difference inside the html flamegraphs. (Of course slight differences, but nothing that would have explained a second difference - I thought at least).

What it showed was that with AOT the stack is a bit larger with obtainFromSupplier in play. Via the ThrowingSupplier for example a few more methods are called in the process. So some thought of mine went into the question if we have an inlining problem or something along the new approach, where equally callers are too large or hot methods not being inlined. So I did the same again and executed the jar command with the following options -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining - of course with full tiered compilation again. But no difference really popped up. No hot methods being too big to be inlined for example.

Second investigation step: Type pollution

What I noticed during some code checks was that obtainFromSupplier contained an instanceof check. This year quite often my mind jumps directly to https://bugs.openjdk.org/browse/JDK-8180450 when seeing instanceof. So I fired up the https://github.com/RedHatPerf/type-pollution-agent from @franz1981 to quickly check if we have a severe problem here in the area of the new obtainFromSupplier. While it didn't run out-of-the-box, I managed to exclude some bytebuddy package instrumentation for certain packages and indeed something popped up

656:	com.github.shodo.fakeservice.webapp.FakeServiceApplicationKt__ApplicationContextInitializer$$Lambda$244/0x0000000800e78498
Count:	6
Types:
	org.springframework.beans.factory.support.InstanceSupplier
	org.springframework.util.function.ThrowingSupplier
Traces:
	org.springframework.beans.factory.support.RootBeanDefinition.setInstanceSupplier(RootBeanDefinition.java:444)
		class: org.springframework.beans.factory.support.InstanceSupplier
		count: 4
	org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.obtainInstanceFromSupplier(AbstractAutowireCapableBeanFactory.java:1251)
		class: org.springframework.util.function.ThrowingSupplier
		count: 1
	org.springframework.beans.factory.support.DefaultListableBeanFactory.obtainInstanceFromSupplier(DefaultListableBeanFactory.java:946)
		class: org.springframework.beans.factory.support.InstanceSupplier
		count: 1

BUT as you can see the count (6) was fairly low and the rank of it was also not relevant (656). While it pops up several times for each lambda or bean that's going through obtainInstanceFromSupplier, it doesn't add up to make enough of a difference.

Third investigation step: Syscalls

The async-profiler - while being a great tool - is still a sampling profiler, after all. While one can tune the sampling rate of course, it's still sampling, so my thought was that we might miss out on something. Wall profiles usually also show a lot of irrelevant (idle) threads either. And while you can exclude these as well to a certain degree, I'm a friend of using dtrace in these cases on my Mac to get a different perspective of what's going on. The bread and butter of this is the following command, that shows the count of syscalls matching a given PID (the PID from the java -jar ... call in this case):

dtrace -n 'syscall:::entry /pid == <pid>/ {@[probefunc] = count();}'

This showed the following difference of TOP 5 syscalls in AOT vs. indexed:

  AOT
  ==========
  read                                                          61976
  lseek                                                         61881
  psynch_mutexwait                                              34749
  psynch_cvwait                                                 22709
  psynch_mutexdrop                                              21019

  INDEXED
  ==========
  read                                                          52465
  lseek                                                         52381
  psynch_mutexwait                                              25358
  psynch_mutexdrop                                              16720
  psynch_cvwait                                                 16519

Not just the general count was up with AOT, but more substantially read + lseek was way up. That indicated I/O and my thoughts were immediately targeted towards classloading. Additionally, depending on how the files are opened (blocking or unblocking) this might have hid the information from the CPU profiles, because it's not on-CPU, but off-CPU. And while wall profilings can uncover this off-CPU stuff, it's yet again a matter of having enough samples then, which we in this case didn't have. This was another confirmation for me to follow this path outside of using the async-profiler for now.

SIDE-NOTE: While diving through the JDK code to seek confirmation that read & lseek are used during classloading - which they are when classes are read from a JAR - I wondered why pread is not used on modern systems. And shortly before implementing it directly, someone was faster than me earlier this year and filed https://bugs.openjdk.org/browse/JDK-8301621 . Unfortunately, the corresponding PR was closed due to inactivity, but I believe it could be benefitial for fat-jar workloads that load many classes. My plan is to verify the patch under openjdk/jdk#12417 somewhen in the future and at best provide some benchmark numbers.

Fourth investigation step: Classloading

Classloading is a bit simpler to "profile" and should have been one of the earlier steps I suppose. At least before tracking the syscalls, but hey - that's how my brain worked during the investigation. First I wanted to get a basic understanding of how many classes were loaded in the two different approaches, so I executed the following:

jcmd <pid> PerfCounter.print | grep -i "class"

The JVM has a few interesting perf-counters that can be printed with the above command. Grepping for class related stuff to focus only on the important counters gave me the following:

AOT
==========
java.cls.loadedClasses=14577
...
sun.classloader.findClassTime=5743309175

INDEXED
==========
java.cls.loadedClasses=11870
...
sun.classloader.findClassTime=4749238342

Great. So we indeed not only confirmed an increased amount of classes being loaded (~2-3k), but also (unsurprisingly) a whole second difference being spent there. Being onto something, I added -verbose:class (or even better-Xlog:class+load:file=classloading.log) to the java command that prints out some additional information like the following:

[3,672s][info][class,load] org.hibernate.annotations.AttributeBinderType source: jar:file:/Users/user/projects/aot-fake-service/webapp/target/webapp-0.0.1-SNAPSHOT.jar!/BOOT-INF/lib/hibernate-core-6.2.2.Final.jar!/
[3,672s][info][class,load] org.hibernate.internal.util.collections.JoinedList source: jar:file:/Users/user/projects/aot-fake-service/webapp/target/webapp-0.0.1-SNAPSHOT.jar!/BOOT-INF/lib/hibernate-core-6.2.2.Final.jar!/
[3,672s][info][class,load] org.hibernate.internal.util.collections.JoinedList$$Lambda$1023/0x000000080129f5b0 source: org.hibernate.internal.util.collections.JoinedList
[3,673s][info][class,load] org.hibernate.internal.util.collections.JoinedList$$Lambda$1024/0x000000080129f7f0 source: org.hibernate.internal.util.collections.JoinedList

So I did this again for both AOT vs. indexed. I sanitized the lines a bit (e.g. stripped away the prefix [.*][info][class,load] and the source: information) and sorted it naturally to craft an easier diff of the two approaches.

While looking at the diff, I noticed lots of bytebuddy classes, but also some javax.persistence related classes.

image

Since the example isn't that large only Hibernate could have been an explanation for the additional classes being loaded.

Fifth investigation step: Finding the "root" cause

With Hibernate in my mind, I looked again at the profilings that I did in step 1 and noticed indeed something that sparked my interest: AnnotationMetadataSourceProcessorImpl::processEntityHierarchies.
image

Let's speed this investigation up. I basically looked through the code of the mentioned method and came up with the following explanation.

In the example project provided here there are two modules: core & webapp. Inside core there are entities, in webapp there are none. The context-indexer only is inside core, so only in core a spring.components file is generated that contains the following (omitted the other entities & repositories for brevity reasons).

com.github.shodo.fakeservice.core.adapter.repository.JpaEntity1Repository=org.springframework.stereotype.Component,org.springframework.data.repository.Repository
…
com.github.shodo.fakeservice.core.domain.entity.Entity1=jakarta.persistence.Entity,jakarta.persistence.Table
…

However, the PersistenceManagedTypes bean inside JpaBaseConfiguration only scans for the webapp package – not for core! Which ultimately leads to the persistence info unit not containing any managed classes. Why this doesn’t break is beyond my understanding, because I haven’t used the indexer ever, but I assume it isn’t a problem because repositories are also part of the index and the managed classes would be likely only relevant if the repositories are scanned normally as well. Maybe?

Anyhow. On the AOT example however, the managed types inside the AOT generated JpaBaseConfiguration__BeanDefinitions contains the entities

    private static PersistenceManagedTypes getPersistenceManagedTypesInstance() {
      List<String> managedClassNames = List.of("com.github.shodo.fakeservice.core.domain.entity.Entity6", "com.github.shodo.fakeservice.core.domain.entity.Entity4", "com.github.shodo.fakeservice.core.domain.entity.Entity2", "com.github.shodo.fakeservice.core.domain.entity.Entity5", "com.github.shodo.fakeservice.core.domain.entity.Entity7", "com.github.shodo.fakeservice.core.domain.entity.Entity3", "com.github.shodo.fakeservice.core.domain.entity.Entity1", "com.github.shodo.fakeservice.core.domain.entity.Entity9", "com.github.shodo.fakeservice.core.domain.entity.Entity10", "com.github.shodo.fakeservice.core.domain.entity.Entity8");
      List<String> managedPackages = List.of();
      return PersistenceManagedTypes.of(managedClassNames, managedPackages);
    }

Therefore the metadata building processes the managed classes inside e.g. AnnotationMetadataSourceProcessorImpl::processEntityHierarchies. And while I’m sure there are more places inside the whole flow that deal with managed types, all of them combined and having properly registered managed entities ultimately cause lots of additional classes being loaded in the startup phase.

In order to test my theory I reworked the example project to have only one module to eliminate the difference in the indexer and correctly use managed types and AOT was even slightly faster in that example. At least no drastic performance degression in AOT really. I sort of regressed the indexed example with that setup because managed types were properly registered and handled there now as well. I guess the question which you need to solve is if the managed classes should be registered in the indexed example. And why the indexer doesn’t fail without the entities being registered as managed types. (I'm sure @snicoll can say a word here)

Summary & Learnings

With all of the above there is no real performance regression in AOT, in my opinion. It's rather a misbehaviour of the indexed example of not registering managed types inside the persistent info unit.

Nonetheless it revealed a potential scaling problem in the future for large projects caused by (https://bugs.openjdk.org/browse/JDK-8180450) and revealed an optimization opportunity when looking up classes from JARs (https://bugs.openjdk.org/browse/JDK-8301621) that is worth being evaluated further, because it likely helps in all normal Spring workloads where the app is started as a fat-jar.

This was quite a bit of fun to be honest. During the investigation it once again showed that profiling is important. But quickly jumping to conclusions based on just one aspect is often too easy... I noticed that I'm for example "biased" these days to look into the most recent issue(s) I had on other projects. If you have a hammer, everything looks like a nail after all ;-) . E.g. it could have saved me some time if I hadn't had a look into inlining or type-pollution. While I didn't believe they were the real issues in the first place, they often were issues in recent similar investigations where the picture was not obvious enough from the start...Ruling out ideas was as important and providing contextual information via different tools and then combining it to a bigger picture was once again the key to success. At least success in the sense of finding a performance difference. If that's the result you wanted is not for me to decide. But I'm sharing this a bit more in-depth in the hope of sharing some knowledge on some tools and techniques when investigating things like that and hope I could therefore make up for the maybe unsatisfying results ;-)

Cheers,
Christoph

@shodo
Copy link

shodo commented Nov 21, 2023

@dreis2211 Kudos!!
Many many thanks for the detailed explanation!! I'm relatively new to the JVM, and your investigation pointed out many (free) tools I didn't know before. (And if you have some interesting book to suggest about JVM or SpringBoot internals I'll really appreciate it).

You're explanation is quite satisfying to me. I'm curious to understand while the indexer is not breaking and if this misbehavior can somehow "suggest" some improvement also on the AOT side, but if there is no gain in startup while acting correctly I think we are done on the "Indexer vs AOT" question.

PS: I don't know if you had the chance to observe some improvement on the "AOT vs Standard" mode while using a single module application, since in my multi-module example there is no much gain using AOT, or better, there is this oscillation between being a a few second faster or few second slower. Honestly it can depend also on a variable load on my machine while testing it, so just know if you experienced the same or something else.

Thanks again for your time,
Mattias

@snicoll
Copy link
Member

snicoll commented Nov 21, 2023

I'm curious to understand while the indexer is not breaking

It's just starting the app with an empty persistence unit and no Spring Data repositories. I haven't had the time to investigate but the way the index is created in your sample looks wrong to me. If you use the app to do something with the persistence unit, you'll see it breaking with the indexer.

I don't know if you had the chance to observe some improvement on the "AOT vs Standard" mode while using a single module application, since in my multi-module example there is no much gain using AOT

There's no gain because the setup is wrong as we've just described. Chris already answered that:

In order to test my theory I reworked the example project to have only one module to eliminate the difference in the indexer and correctly use managed types and AOT was even slightly faster in that example.

The index "only" replaces classpath scanning, which is a tiny bit of what the ApplicationContext does on startup. AOT does replace classpath scanning, configuration class parsing, condition evaluation and a bunch of others decisions. So it being faster, or at least as fast, is definitely the plan. There might be some differences due to what the generated code actually exercises (instance supplier vs. factory method) and we'll have a chance to revisit those topics soon.

@shodo
Copy link

shodo commented Nov 21, 2023

@snicoll With "standard" I mean "no use of AOT, nor indexer", just starting the app the standard way.

It's clear to me that the setup was wrong in the indexer case, and that explains why the Indexer appears to be faster since it's skipping scans of entities and persistence stuff.

Still, it didn't explain why I had small gains or worse results vs the "standard" "not compiled nor indexed" mode, in which the setup should work correctly (and if not in the example app, I had equal results in a production application that is fully working since months).

I'll maybe try to play with the tools pointed out by Chris and do some profiling on my own, thanks again for your patience and the effort put into this.

@snicoll
Copy link
Member

snicoll commented Nov 21, 2023

Still, it didn't explain why I had small gains or worse results vs the "standard" "not compiled nor indexed" mode,

Then it means that Spring is not the limiting factor and something else is taking time in startup time. We've already seen the effect of starting Hibernate with 0 entities or with the expected entities. I am going to close this as the issue is no longer actionable.

@snicoll snicoll closed this as not planned Won't fix, can't repro, duplicate, stale Nov 21, 2023
@snicoll snicoll modified the milestones: 6.2.x, 6.1.1 Nov 21, 2023
@dreis2211
Copy link
Contributor

In order to test my theory I reworked the example project to have only one module to eliminate the difference in the indexer and correctly use managed types and AOT was even slightly faster in that example.

To clarify the above point. I basically moved everything into one module.

These are the very rough timings after restructuring the project on my machine.

AOT
Roughly ~4.3 seconds

"Normal" (no index, no AOT)
Roughly ~4.8 seconds

Indexed
Roughly ~4.7 seconds

As the index is also fairly small, there is little gain over the "normal" example (if at all). AOT is indeed faster than all the other variants after the project does the same for all 3 comparisons (a.k.a correctly processing Hibernate metadata and registering managed types).

Keep in mind that there are still lots of classes being loaded on startup (10k or more) and read+lseek still being executed. And it connects to a database. This is (network) I/O. Depending on your machine this can be slower or faster.

As @snicoll outlined I think there's no real actionable item left here. I do think that there's still value in keeping an eye on possible performance improvements, but the given example unfortunately doesn't suffice for that. For this particular example AOT is indeed the fastest option.

@bclozel bclozel removed this from the 6.1.1 milestone Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: core Issues in core modules (aop, beans, core, context, expression) theme: aot An issue related to Ahead-of-time processing type: task A general task
Projects
None yet
Development

No branches or pull requests

5 participants