Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WrongMethodTypeException after upgrade to 3.1.7 #1111

Closed
carterkozak opened this issue Aug 7, 2023 · 30 comments · Fixed by #1114
Closed

WrongMethodTypeException after upgrade to 3.1.7 #1111

carterkozak opened this issue Aug 7, 2023 · 30 comments · Fixed by #1114

Comments

@carterkozak
Copy link

I don't have a reproducer for this yet, but I've seen it pop up in a variety of contexts when building a caffeine cache. This has occurred on a few configurations using both build and buildAsync. #905 seems like the most likely culprit.

java.lang.invoke.WrongMethodTypeException: expected ()NodeFactory but found ()NodeFactory
	at java.base/java.lang.invoke.Invokers.newWrongMethodTypeException(Invokers.java:523)
	at java.base/java.lang.invoke.Invokers.checkExactType(Invokers.java:532)
	at com.github.benmanes.caffeine.cache.NodeFactory.newFactory(NodeFactory.java:152)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
	at com.github.benmanes.caffeine.cache.NodeFactory.loadFactory(NodeFactory.java:141)
	at com.github.benmanes.caffeine.cache.NodeFactory.newFactory(NodeFactory.java:89)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.<init>(BoundedLocalCache.java:269)
	at com.github.benmanes.caffeine.cache.SS.<init>(Unknown Source)
	at com.github.benmanes.caffeine.cache.SSL.<init>(Unknown Source)
	at com.github.benmanes.caffeine.cache.SSLMW.<init>(Unknown Source)
	at com.github.benmanes.caffeine.cache.SSLMWA.<init>(Unknown Source)
	at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:47)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.<init>(BoundedLocalCache.java:3952)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.<init>(BoundedLocalCache.java:3948)
	at com.github.benmanes.caffeine.cache.Caffeine.build(Caffeine.java:1048)
@carterkozak
Copy link
Author

I don't have as much information as I would like quite yet, posted early in case others have observed this issue. When I uncover additional details or manage a minimal reproducer, I will update this issue.

@ben-manes
Copy link
Owner

I think any details you have, like jdk version, would help. We primarily test on jdk11 with a few configurations evaluated on later jdks (e.g. benchmarks) that are hotspot/graal based.
@yuzawa-san @uschindler

@carterkozak
Copy link
Author

So far I've seen this on jdk-17 using Amazon Corretto 17.0.8.7.1 on linux hosts. I'm not confident that it's necessarily specific to jdk-17, because the vast majority of the services we run use jdk-17. The failure is not consistently reproducible in the environments where it has occurred, so I suspect there may be a cache interaction ordering component (or the cache key doesn't take into account some piece of class-loading/lookup state), as I wasn't able to force the failure by executing the same interactions with caffeine in isolation.

@ben-manes
Copy link
Owner

ben-manes commented Aug 7, 2023

do you use multiple classloaders where threads might cross that boundary? For example, iirc Tomcat hosts each application in its own classloader but ForkJoinPool threads cut across these, so app.ClassA != app.ClassA due to different classloaders and there can be contamination by the shared threads. Those can be tricky bugs and we did not try to protect against that, so I could imagine something quirky there.

@carterkozak
Copy link
Author

That was my first thought as well, but these applications are roughly dropwizard-shaped microservices which use a single standard classloader. The rate of reproduction is very low, I've only seen ~10 occurrences across thousands of jvm launches.

@ben-manes
Copy link
Owner

It's weird too because, if I understand correctly, Lookup.findClass uses the caller's classloader (NodeFactory.getClassLoader()). Since the static map is held by that class, it should always use the same classloader during the invoke and not become contaminated. There have been bugs in prior JDKs with MethodHandles so it may be worth digging into the JDK issue tracker (e.g. a memory leak meant we reverted our usage pre-jdk11).

@uschindler
Copy link
Contributor

uschindler commented Aug 7, 2023

To me this looks like a Hotspot bug...

I haven't seen this anywhere. In addition the error message makes no sense. The signature matches exactly: expected ()NodeFactory but found ()NodeFactory

@uschindler
Copy link
Contributor

The crazy error message comes from the fact that the method ypes are compared with identity not equals: https://github.com/openjdk/jdk17/blob/4afbcaf55383ec2f5da53282a1547bac3d099e9d/src/java.base/share/classes/java/lang/invoke/Invokers.java#L531

No idea why this is done like that. There are many places where invokeExact is used, so this bug is really strange.

As workaround you can go back to invoke, but this will slowdown calls because exact type information is known before.

@carterkozak
Copy link
Author

It does seem as though MethodType.makeImpl perhaps isn't canonicalizing returned MethodType as we would expect -- I haven't had a chance to review the ConcurrentWeakInternSet in depth.

@ben-manes
Copy link
Owner

I imagine our workaround will be a fallback path once we settle on the cause, e.g. use invokeExact but fallback to invoke if this exception is encountered.

@uschindler
Copy link
Contributor

uschindler commented Aug 7, 2023

Maybe it is a multithreading problem and some bug in openjdk does not correctly intern/canonicalize the method type instance. We have to figure out if theres some caching involved. If there's really an issue, we may need to open openjdk bug report.

@uschindler
Copy link
Contributor

I imagine our workaround will be a fallback path once we settle on the cause, e.g. use invokeExact but fallback to invoke if this exception is encountered.

Haha, add a try...catch around and call invoke on exception 😜

I have never seen this in our code. Invoke exact always works as expected. I really think this is a muktithreading bug in the cache of jdk. 😕

@ben-manes
Copy link
Owner

ben-manes commented Aug 7, 2023

I wonder if it is expunging a temporary WeakEntry when it fails the race, which causes it to remove a live entry and create duplicates. Following the jdk sources.

When it optimistically creates a WeakEntry<>(elem, stale), that instance is registered on the reference queue for tracking. When a race causes another entry to be cached, then the loser's is discarded. That makes both the entry and its elem eligible for collection. One might expect elem to live longer than the entry since it was a method parameter, but since its unused that might be elided by escape analysis and inlining. So possibly some reordering and lifetimes might cause the GC to think that the WeakReference is still alive and should be enqueued. Since the map.remove(e) is based on object equality, it could discard a different instance than the collected one as we no longer can assume a strict state ordering. A simple fix would be to use map.computeIfPresent(reference, (k, v) -> (reference == v) ? null : v).

I am kind of reaching, but it seems just plausible enough to happen, not be obvious to the author, and pretty hard to orchestrate that race. does this seem believable?

@ben-manes
Copy link
Owner

The more that I look at the code, the more that I convince myself their concurrency is correct (assuming no gc/hotspot visibility race). Searching around all could find was FasterXML/jackson-modules-base#142, which was closed as not reproducible. Since you did not have this problem in the prior release, which used invoke, that seems like a safe fallback.

The major difference with the previous release is that we now cache and reuse the instantiated NodeFactory, which is a stateless factory. That means we should use the method handle only perbuild() for a unique policy configuration, rather than on every cache construction. The actual cost between invoke versus invokeExact is small, so the benefit was doing it only once rather than many times.

Ideally you could dig in far enough to open a hotspot bug, or if the issue goes away we could close as not reproducible like the Jackson team did and hope the cosmic rays stay in our favor. Let me know what you'd like to do and a release timeline.

@uschindler
Copy link
Contributor

Hi,
I checked the actual code: using ibvokeExa t makes no sense here:

return (NodeFactory<Object, Object>) constructor

Actually this creates a new MethodType on each call so it does the work of invoke().

I would remove the method type adaption and just call invoke. In short! Lookup constructor and call it with invoke (like before).

When we implemented this first time, the code was different (we cached the constructor handles and called them. But here we do the whole reflective lookup and call it directly a single time after adapting the type. That is too complex and brings nothing.

You should also check the other one that uses invoke exact. If pattern is same, remove adaption, too.

@uschindler
Copy link
Contributor

I checked the other ones: https://github.com/search?q=repo%3Aben-manes%2Fcaffeine%20invokeExact&type=code

The one here is the only one that adapts on each call. The other 2 instances use adaption of method type a single time and cache result.

@ben-manes
Copy link
Owner

I think the other one does it to, it just does it earlier than the search snippet shows

final class MethodHandleBasedFactory implements LocalCacheFactory {
final MethodHandle methodHandle;
MethodHandleBasedFactory(Class<?> clazz) throws NoSuchMethodException, IllegalAccessException {
var constructor = LOOKUP.findConstructor(clazz, FACTORY);
this.methodHandle = constructor.asType(
constructor.type().changeReturnType(BoundedLocalCache.class));
}
@SuppressWarnings("unchecked")
@Override public <K, V> BoundedLocalCache<K, V> newInstance(Caffeine<K, V> builder,
@Nullable AsyncCacheLoader<? super K, V> cacheLoader, boolean async) throws Throwable {
return (BoundedLocalCache<K, V>) methodHandle.invokeExact(builder, cacheLoader, async);
}

@ben-manes
Copy link
Owner

If you'd like to send a PR or provide a snippet then I'd happily merge that. I think I know what you mean but you're much more of an expert here 🙂

@uschindler
Copy link
Contributor

The alternative is to define the method type once as static final. Adapting and creating a new method type on each call looks wrong. 😜
But I would not do that here. Just use invoke.

@uschindler
Copy link
Contributor

Will do.
For the other ones I will use a static final method type for safety.

@ben-manes
Copy link
Owner

Oh okay, so we should revert back to invoke like before? That seemed simple and effective. The main benefit from our prior discussion ended up being that we cached the factories so the method handle is used only once to create the factory instance. Then the factory is cached in a map, where using a factory uses direct calls. So ideally we only use the method handle once per unique policy combination (size, size + expire, etc). Then even old school reflection would have been equivalent performance-wise as the cost is paid once, rather than each time a new cache instance is built.

@uschindler
Copy link
Contributor

Then even old school reflection would have been equivalent performance-wise as the cost is paid once, rather than each time a new cache instance is built.

For the case here, it is not made for performance, reflection would be fine. The main difference is: Because we use a lookup, we can handle package private classes without setAccessible().
In Lucene, to support Java 21 vector API, we also do it in the same way in Lucene: We just use MethodHandles to call package private methods (only visible and without setAccessible): https://github.com/apache/lucene/blob/a65cf8960a1057d98126256d1610292ad5c8f1b3/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L97-L104

The LocalCacheFactory is a bit different: Here we should use invokeExact (as it looks to be called more often), but here we are caching the MethodHandles. When creating them, we should not do it on each lookup, instead define it once as static final.

I have a PR almost ready, running tests.

@ben-manes
Copy link
Owner

I have a PR almost ready, running tests.

haha, we have an exhaustive suite so you might prefer sending a draft pr and letting the CI burn it’s cpu cycles instead of yours.

uschindler added a commit to uschindler/caffeine that referenced this issue Aug 8, 2023
…n (to work around a possible concurrency bug in JDK)
@uschindler
Copy link
Contributor

Here you are: #1114

@uschindler
Copy link
Contributor

I have not changed the benchmark, but I think you know wthat I did - maybe adapt the benchmark classes, too. Actually the LocalCacheFactory is the way how one should normally use MethodType instances: Do not create them over an over and keep them as static finals. If you look at the code inside the JDK how they use it for bootstrapping.

I am quite sure that there is some strange problem inside the JDK that sometimes creates non-interned MethodTypes under high load.

@ben-manes
Copy link
Owner

beautiful, I’ll merge this in after the CI completes. I’ll try to cut a release in a day or two when I find the time.

@uschindler
Copy link
Contributor

The LambdaMetaFactory benchmark looks fine (...and its not productive code). I would remove that as this is no longer used (but that's up to you).

ben-manes pushed a commit that referenced this issue Aug 8, 2023
…k around a possible concurrency bug in JDK) (#1114)
@carterkozak
Copy link
Author

Thank you both!

@uschindler
Copy link
Contributor

Thank you both!

Of course it would be great if you could start up your services with your jdk using a preview build of the library and report if you see any problems.

@ben-manes
Copy link
Owner

Released in 3.1.8

Thanks again for all the help fixing this @uschindler, @carterkozak

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants