Deadlock in NonReentrantLock #203
I’m sorry to hear this.
I didn't think this could happen either, at least with the ForkJoinPool. If you supply a custom executor (even just commonPool()::execute), the cache uses a normal ReentrantLock. I think you could switch to that until we have a fix ready. See Caffeine.executor.
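For example, a minimal sketch of that workaround (the key/value types and the size bound are arbitrary; assuming the standard Caffeine 2.x builder API):
```java
import java.util.concurrent.ForkJoinPool;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class ReentrantLockWorkaround {
  // Passing a lambda rather than the ForkJoinPool instance itself means the
  // executor is not `instanceof ForkJoinPool`, so the cache falls back to a
  // plain ReentrantLock internally.
  static final Cache<String, String> CACHE = Caffeine.newBuilder()
      .executor(ForkJoinPool.commonPool()::execute)
      .maximumSize(10_000)
      .build();
}
```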
The difference is negligible so we could switch over, as I was perhaps
overly aggressive when experimenting. I would like to figure out a unit
test to assert the deadlock first, though.
…On Tue, Nov 14, 2017 at 1:52 AM Agoston Horvath ***@***.***> wrote:
Under heavy JVM load, we are experiencing deadlocks in Caffeine version 2.5.6.
Here is the threaddump of the two threads in deadlock:
"ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
and
"ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
Looking at the code, I can't imagine a scenario in which this could occur, but it
*does* occur under heavy load on our JVMs.
Any ideas what we could do? For now, as a workaround, I've added a synchronized
block around the cache.put() calls to make sure only one thread is adding new
values to the cache at a time.
|
Looking at the stack, if these are the only threads blocked then it looks like a lost unlock signal. But of course I can't reproduce this with my stress tests. This code path should only occur when the write rate exceeds the drain rate, or when the drain task was lost, such as due to the ForkJoinPool bug on older JDKs. The write buffer typically allows writes to make progress by batching the work onto the penalized thread, but if it fills up, writers will block on the lock. That improves performance by descheduling the threads, increasing overall throughput by reducing context switches. A few questions to help investigate:
|
I've checked ReentrantLock vs. NonReentrantLock, but in the end they both end up in AbstractQueuedSynchronizer.acquire(), which is where the problem occurs. All the stack traces are in this code path; there is no difference. I do get a deadlock warning from the JVM (this is the full output; the other two threads are not relevant here, as they are waiting on locks held by the two deadlocked threads above):
JVM version is 1.8.0_131. Cache configuration:
We cannot reproduce this bug; it only happens in production, very sporadically (a few times a week, then nothing for months, then back to a few times a week). I was thinking of replacing the evictionLock construct with a simple synchronized(this). I'm not sure what the upside of using a ReentrantLock is, but it is surely more error-prone. Make sure you check out this document: https://bugs.openjdk.java.net/browse/JDK-8156765 which, in turn, is caused by a serious design flaw in java.util.concurrent that is only fixed by an epic hack in Java 9: |
Yes, but perhaps my
Unfortunately
Do your logs include any such warning? So I'd still recommend trying the trick to force a ReentrantLock. |
No, I see nothing of the sort. Forcing the ReentrantLock is not so simple, as the code in BoundedLocalCache goes like this:
evictionLock = (builder.getExecutor() instanceof ForkJoinPool)
    ? new NonReentrantLock()
    : new ReentrantLock();
So the recommended FJP::commonPool() would still use a NonReentrantLock. |
Nope, it would be an anonymous executor and use a ReentrantLock.
$ javarepl
Welcome to JavaREPL version 425 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type expression to evaluate, :help for more options or press tab to auto-complete.
Connected to local instance at http://localhost:53166
java> import java.util.concurrent.*
Imported java.util.concurrent.*
java> Executor e = ForkJoinPool.commonPool()::execute;
java.util.concurrent.Executor e = Evaluation$7adjq3cgmtr50ps9z6b4$$Lambda$211/634150796@6a757ee
java> e instanceof ForkJoinPool
java.lang.Boolean res1 = false |
👏, neat :) I will push a test to production. Note that since this issue is not reproducible manually, I won't be able to say if it's fixed or not. |
Sounds good. I'll switch over regardless and close this when released, assuming this doesn't occur again. If it does occur afterwards, we should revisit with the new stacktrace. I won't get a chance to look at this until after the holidays, so that should give us some time to see if it reproduces. |
Commenting here, as we are also affected. It seems to only happen in production and at odd hours. Today (the third event) it happened on a server with little to no load at those hours. The stack trace is the same as the one above, as far as I can tell. This was with version
We use an async weigher, which at first uses a pessimistic weight (the largest object of that class seen so far); then a thread pool re-inserts the object with a thread-local flag set, so that it is actually weighed within that thread pool. So most objects are inserted twice under the same key.
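A rough sketch of that two-phase weighing scheme, for readers following along (the Entity type, its weight methods, and the maximum weight are hypothetical placeholders, not the commenter's actual code):
```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class TwoPhaseWeighing {
  // Hypothetical value type: a cheap pessimistic estimate plus an exact weight.
  static class Entity {
    static int pessimisticWeight() { return 1_000; } // largest instance seen so far
    int exactWeight() { return 100; }                // computed off the hot path
  }

  // Thread-local flag: set only inside the weighing thread pool.
  static final ThreadLocal<Boolean> EXACT = ThreadLocal.withInitial(() -> false);

  static final Cache<String, Entity> CACHE = Caffeine.newBuilder()
      .maximumWeight(10_000_000)
      .weigher((String key, Entity value) ->
          EXACT.get() ? value.exactWeight() : Entity.pessimisticWeight())
      .build();

  // The first insert uses the pessimistic estimate; the weighing pool later
  // re-inserts the same key with EXACT set so the real weight is recorded.
  static void reweigh(String key, Entity value) {
    EXACT.set(true);
    try {
      CACHE.put(key, value);
    } finally {
      EXACT.set(false);
    }
  }
}
```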
|
Can you also try changing to use Caffeine.newBuilder().executor(ForkJoinPool.commonPool()::execute)...? Given the inability to reproduce and from eyeballing the code, I haven't stumbled upon the likely cause, so I don't have much advice yet except to bypass the most likely culprit. Do you see |
Hello @ben-manes As for We currently use the following functions (duplicates for specific types that we use, if it makes a difference):
|
Yep, due to the holidays I don't plan on fixing it this week either =) Thanks. |
So I wouldn't focus on that. |
Right, I'm trying to think of lock interactions that could be stress tested in an attempt to reproduce the problem. Maybe I should write a dedicated stress test on the lock class itself to see if I can break it, so we are sure about where the bug lies. |
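For what it's worth, a minimal sketch of that kind of lock stress test (this is not the project's Stresser; the thread count and timing are arbitrary): worker threads hammer the lock with tiny critical sections while a watchdog asserts that progress continues, which is exactly the property a lost unlock signal would break.
```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

public class LockStressTest {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock lock = new ReentrantLock(); // swap in the lock implementation under test
    LongAdder acquisitions = new LongAdder();

    // Many daemon threads contend on the lock with tiny critical sections.
    for (int i = 0; i < 16; i++) {
      Thread worker = new Thread(() -> {
        while (true) {
          lock.lock();
          try {
            acquisitions.increment();
          } finally {
            lock.unlock();
          }
        }
      });
      worker.setDaemon(true);
      worker.start();
    }

    // Watchdog: if the counter ever stops advancing, a wakeup was lost.
    long previous = -1;
    for (int i = 0; i < 30; i++) {
      Thread.sleep(1_000);
      long current = acquisitions.sum();
      if (current == previous) {
        throw new AssertionError("No progress in the last second; possible lost unlock signal");
      }
      previous = current;
    }
  }
}
```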
While preparing a censored jstack to share (if you wish to see it, reach out to [email protected]), here are the counts I saw.
There was one thread in a ForkJoinPool that was made to do a cleanup, but it didn't get anywhere
|
Interestingly, as I was censoring the jstack and getting those numbers, the same server had a deadlock with Caffeine again. |
I'm sorry for the inconvenience this is causing. I'm surprised it's happened all of a sudden, since that code wasn't changed recently and it has had heavy usage and stress testing. I wonder if it's a JDK change around memory fences, perhaps a newer optimization triggering this bug. |
It's really odd, because we have not updated in a while but it is only happening this last week and now. These servers are running with about 10GB of cache (
|
@levischuckeats Am I right to assume that this Java version you're running is some sort of build of OpenJDK? :) |
I don't think so, the folder our java bins are in is named |
When I run the Stresser in write mode on 8_144, I see the lock holder change over time.
There are stacks of blocked threads, but that is because the write buffer was filled. Rather than being unbounded, it adds back pressure by blocking threads and having them help perform a full cleanup. In the common case writes are processed asynchronously to minimize user-facing latencies, and in a write storm it throttles when overwhelmed. The throttling improves throughput by reducing context switching.
I would have expected the lock holder to be stuck and, potentially, the throughput to drop if all threads get blocked. So this is frustrating, since I cannot reproduce it... |
I understand the back pressure, though when the CPU for all cores is near 0, I don't think work is being done 😃 For our use case (an intermediate cache for entities, which are backed by the database), we'd prefer things getting dropped, rather than being blocked. |
@agoston By any chance are you running this on Tomcat? In a private stacktrace from @levischuckeats, he mentioned the threads are from Apache Catalina (Tomcat's servlet pool). I know the Tomcat class loader can cause problems, especially when shared classes like FJP are in the mix. I wonder if it is an interaction because of the container... |
Yep, if the system isn't generating writes then we'd all expect the cache to catch up and quickly go idle. A write cannot be dropped, since those are add/update/remove operations already performed on the hash table. Think of it as a database write-ahead log: the writes are applied in memory, appended to a log, and replayed on the persisted form. In this way the expensive operations (random fsyncs) are batched and asynchronous, helping to scale throughput. The cache is similar, where the locking around the eviction policy is expensive because every read/write is a policy mutation. We can and do drop reads, which is the common case and non-critical state. But writes may have effects, such as requiring an eviction. |
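As a rough conceptual illustration only (this is not Caffeine's actual implementation), the read/write buffering idea described above might be sketched like this: reads are best-effort samples that can be dropped, while writes must be recorded, so a full write buffer makes the caller help drain.
```java
import java.util.concurrent.ArrayBlockingQueue;

class PolicyBuffers {
  private final ArrayBlockingQueue<Object> readBuffer = new ArrayBlockingQueue<>(128);
  private final ArrayBlockingQueue<Runnable> writeBuffer = new ArrayBlockingQueue<>(128);

  void recordRead(Object key) {
    readBuffer.offer(key); // best effort: dropping a read only loses recency information
  }

  void recordWrite(Runnable policyUpdate) {
    while (!writeBuffer.offer(policyUpdate)) {
      drain(); // back pressure: the writer helps drain instead of losing the event
    }
  }

  private void drain() {
    Runnable task;
    while ((task = writeBuffer.poll()) != null) {
      task.run(); // replay pending policy mutations, like applying a write-ahead log
    }
    readBuffer.clear(); // recency samples are advisory and can be discarded
  }
}
```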
@ben-manes |
Any chance you have clearReferencesStopThreads enabled? That uses Thread.stop() and I’m wondering if a redeployment causes the lock to be left in a bad state due to halted threads. Also what version of Tomcat and did you recently upgrade? |
On our end, we actually kill java when we redeploy, so every time Tomcat is running for the first time in the JVM. |
I wonder if we work at the same company @levischuckeats :) we do the same thing, we do a rolling restart only, never redeploy. |
I'm stuck then, because I can't work out the failure conditions.
The only similarity so far is Tomcat, which can be invasive, but not so much in your usages. I haven't used Tomcat in many years (preferring embedded Jetty), so I'm not good at debugging it anymore. But it also feels like a cop-out to presume it's a weird interaction bug. I'm back to having to wait until sometime after the holidays when you both can test using |
It would also be helpful if you track the cache stats. Then we could know for certain whether eviction halted. Otherwise my testing only shows threads backed up, but progress continues. If I add a shutdown hook, then they all quickly complete normally. I cannot reproduce any hanging. |
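A small sketch of what tracking stats could look like (assuming the standard Caffeine API; the size bound is arbitrary). A frozen evictionCount over time, while writes continue, would indicate that eviction has stalled.
```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;

class StatsProbe {
  static final Cache<String, String> CACHE = Caffeine.newBuilder()
      .maximumSize(10_000)
      .recordStats()
      .build();

  // Log this periodically; if evictionCount stops advancing while writes
  // continue, eviction has stalled.
  static void logStats() {
    CacheStats stats = CACHE.stats();
    System.out.println("evictions=" + stats.evictionCount() + ", hitRate=" + stats.hitRate());
  }
}
```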
I’m not sure how to help, unfortunately. What kernel and jdk? |
Nah, I know there's not much we can do, this is more to keep track for other unfortunate souls. For us, java version is 1.8.0_161 and the kernel version is 2.6.32-696.23.1.el6.x86_64. |
@agoston what JVM options? Are you monitoring your heap? Could an OOM have occurred? |
JVM options are really super complex (a CMS GC tuned to the teeth), but I'm assured there was no OOM situation. |
Can you dump the JVM options here so we can run the Stresser with them? |
I still have this problem. Caffeine version 2.8.6, Java version 1.8.0_201, and the kernel version is Linux 3.10.0-862.14.4.el7.x86_64 x86_64.
|
Sorry, that isn't enough detail. You probably need to send all of the stack traces so we can see the other blocked threads. |
Do you have a thread performing a compute? Maybe you have a live-lock on eviction due to recursive computations, so it cannot remove the entry. |
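For reference, a contrived sketch of that recursive-computation hazard (the keys are hypothetical, not taken from the reporter's code): computing one entry from inside another entry's computation on the same cache can stall or fail, because the underlying map bin stays locked during the compute.
```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class RecursiveComputeHazard {
  static final Cache<Integer, Integer> CACHE = Caffeine.newBuilder()
      .maximumSize(100)
      .build();

  // Hazardous pattern: the outer computation re-enters the same cache for
  // another key. Depending on the JDK this can live-lock, deadlock, or throw,
  // and it can block eviction from removing the entry being computed.
  static int nestedLookup() {
    return CACHE.get(1, k1 -> CACHE.get(2, k2 -> k2 + 1) + 1);
  }
}
```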
Thank you for your reply. Indeed, my application performs a lot of write operations, but they were blocked and I could not find the owner of this lock. I used jstack to export the thread info. |
Are you using OpenJDK or a custom distribution? If it is a custom distribution, then you might ask them, since it might be a bug there. |
Thanks. I'm using the Oracle JDK. Here is my JDK version:
|
Hmm, then I really don't know. I'm sorry, but there is not much to go on... You could try using |
OK. Thank you for your advice; I will give it a try. |
Is that Linux kernel version old enough to be subject to the lost wake-up bug mentioned in this thread? It looks like it, and if so, you need to upgrade the OS. |
Kernel 3.10 is from June 2013, and the bug is confirmed in 3.19, which is from February 2015. I don't know much more, but you should try to use a modern distribution. |
Yes, I'm using CentOS 7, and its kernel version is Linux 3.10.0, so maybe it is a bug in the OS. I think there must be some case in my code that triggers this bug; in general, using a ReentrantLock cannot trigger it on its own. I will try to see if I can reproduce this case. I guess that using synchronized(string.intern()) and calling Caffeine.put inside it might trigger this. If I cannot reproduce it, I will try upgrading the OS. |
When I switched to the Guava cache, the problem still existed. So it should not be Caffeine's problem.
All threads are waiting for the lock, but I cannot find any thread holding it. |
Yes, this is what Gerrit also experienced with either library. I don't know, but it has come up multiple times with different implementations and with nothing to show for it in the code. There is some hardware / OS / JVM bug causing havoc. |
In my case, I think this problem was caused by a thread's StackOverflowError. When the problem happened, I dumped the thread stack info and the heap info. In the heap I found the following phenomenon (I'm now using Guava, but it is the same as with Caffeine): I searched the logs, and this thread had thrown a StackOverflowError because of recursive operations:
The thread was a thread in a thread pool. So I guess this was caused by the thread's StackOverflowError: because of this error, the thread exited without unlocking, so the state of the ReentrantLock was stuck at 1, with no chance of being reset to 0. Then other threads operated on the cache, called lock(), and blocked. |
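As a contrived illustration of the general mechanism described above (this is not Caffeine's or Guava's code, just the failure pattern): if an error such as StackOverflowError escapes a critical section whose unlock is not protected by try/finally, the lock's state stays at 1 and every subsequent lock() call parks forever.
```java
import java.util.concurrent.locks.ReentrantLock;

class LostUnlockDemo {
  private final ReentrantLock lock = new ReentrantLock();

  // Unsafe: a StackOverflowError from the deep recursion skips the unlock,
  // leaving the lock held by a thread that has already unwound or died.
  void unsafePut(int depth) {
    lock.lock();
    recurse(depth);
    lock.unlock(); // never reached if recurse() throws
  }

  // Safe: the unlock runs even when the computation blows the stack.
  void safePut(int depth) {
    lock.lock();
    try {
      recurse(depth);
    } finally {
      lock.unlock();
    }
  }

  private void recurse(int depth) {
    if (depth > 0) {
      recurse(depth + 1); // intentionally unbounded to model runaway recursion
    }
  }
}
```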
Good find! Maybe this will help @kentwang929 who recently had a similar problem. |
Interesting, we also observe StackOverflowError in our logs. Good direction to debug, thanks! |
FYI, I believe JEP 270 should have resolved that, but it is Java 9 onward. |
We are on Java 14. This might be an issue in our code when the query is too large. |