
Deadlock in NonReentrantLock #203

Closed
agoston opened this issue Nov 14, 2017 · 104 comments

@agoston

agoston commented Nov 14, 2017

Under heavy JVM load, we are experiencing deadlocks with Caffeine version 2.5.6.

Here is the thread dump of the two deadlocked threads:

"ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

and

"ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

Looking at the code, I can't imagine a scenario in which this could occur, but it does happen under heavy load on our JVMs.

Any ideas what we could do? For now, as a workaround, I've added a synchronized block around the cache.put() calls to make sure only one thread adds new values to the cache at a time.

@ben-manes
Owner

ben-manes commented Nov 14, 2017 via email

@ben-manes
Owner

Looking at the stack, if these are the only blocked threads then it looks like a lost unlock signal. But of course I can't reproduce this using my Stresser utility. If this works fine with ReentrantLock, at least we'll know it's a bug somewhere around this logic.

This code path should only occur when the write rate exceeds the drain rate, or when the drain task was lost, such as due to the ForkJoinPool bug on older JDKs. The write buffer typically allows writes to make progress by batching the work onto the penalized thread, but if it fills up, writers block on the lock. That improves performance by descheduling the threads, increasing overall throughput by reducing context switches.
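
To make that concrete, here is a heavily simplified sketch of the write-buffer behavior described above (not the actual BoundedLocalCache code; the buffer size and names are illustrative): writers enqueue their policy update and schedule an asynchronous drain, and only when the buffer is full do they block on the eviction lock and help drain it themselves.

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.locks.ReentrantLock;

    final class WriteBufferSketch {
      private final Queue<Runnable> writeBuffer = new ArrayBlockingQueue<>(1024);
      private final ReentrantLock evictionLock = new ReentrantLock();
      private final Executor executor;

      WriteBufferSketch(Executor executor) {
        this.executor = executor;
      }

      void afterWrite(Runnable policyUpdate) {
        if (writeBuffer.offer(policyUpdate)) {
          executor.execute(this::drainIfIdle);   // common case: asynchronous drain
        } else {
          performCleanUp(policyUpdate);          // buffer full: block and help out
        }
      }

      private void drainIfIdle() {
        if (evictionLock.tryLock()) {            // skip if another thread is draining
          try { drain(); } finally { evictionLock.unlock(); }
        }
      }

      private void performCleanUp(Runnable pendingUpdate) {
        evictionLock.lock();                     // the lock() the stack traces park on
        try {
          drain();
          pendingUpdate.run();
        } finally {
          evictionLock.unlock();
        }
      }

      private void drain() {
        for (Runnable task; (task = writeBuffer.poll()) != null; ) {
          task.run();                            // replay the buffered policy updates
        }
      }
    }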

A few questions to help investigate:

  • Does using Caffeine.executor(ForkJoinPool.commonPool()::execute) solve the problem? (This tricks it into using a ReentrantLock by not trusting the FJP.)
  • Are there any other stacktraces using the cache not shown?
  • What is the cache configuration? (e.g. Is a CacheWriter doing heavy work?)
  • What version of the JDK are you running?
  • Any other suggestions for how to reproduce?

@agoston
Author

agoston commented Nov 15, 2017

I've checked ReentrantLock vs. NonReentrantLock, but in the end they both end up in AbstractQueuedSynchronizer.acquire(), which is where the problem occurs.

All the stack traces are on this code path; there is no difference. I do get a deadlock warning from the JVM (this is the full output; the other two threads are not relevant here, as they are waiting on locks held by the two deadlocked threads above):

Found one Java-level deadlock:
=============================
"ajp-0.0.0.0-8109-128 (-jIkyghhBePEsLll9ic8a19e79f354)":
  waiting to lock monitor 0x00007f73f881a308 (object 0x0000000508b1d908, a com.bol.order.CustomOrder),
  which is held by "ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)"
"ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)":
  waiting for ownable synchronizer 0x000000032a600bb8, (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync),
  which is held by "ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)"
"ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)":
  waiting to lock monitor 0x00007f73d40846a8 (object 0x000000048855c020, a com.bol.order.CustomOrder),
  which is held by "ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)"
"ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)":
  waiting for ownable synchronizer 0x000000032a600bb8, (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync),
  which is held by "ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)"

JVM version is 1.8.0u131.

Cache configuration:

    @Bean
    public Cache myFluffyCache() {
        return Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(1, TimeUnit.DAYS)
                .recordStats()
                .build();
    }

We cannot reproduce this bug; it only happens in production, very sporadically (a few times a week, then nothing for months, then back to a few times a week).

I was thinking of replacing the evictionLock construct with a simple synchronized(this). I'm not sure what the upside of using a ReentrantLock is, but it is surely more error-prone. Make sure you check out this issue:

https://bugs.openjdk.java.net/browse/JDK-8156765

which, in turn, is caused by a serious design flaw in java.util.concurrent that is only fixed by an epic hack in Java 9:

https://bugs.openjdk.java.net/browse/JDK-7011862

@ben-manes
Owner

I've checked ReentrantLock vs. NonReentrantLock, but in the end they both end up in AbstractQueuedSynchronizer.acquire(), which is where the problem occurs.

Yes, but perhaps my NonReentrantLock has a bug that is hard to trigger. It is a slimmed-down ReentrantLock, since I don't expect recursive locking, and I tried to build everything out upfront before I got demotivated by the weight of the project's complexity. While it is faster in a dedicated microbenchmark, it isn't a hot path and probably should be removed as unnecessary.

I was thinking of replacing the evictionLock construct with a simple synchronized(this). I'm not sure what the upside of using a ReentrantLock is, but it is surely more error-prone.

Unfortunately, synchronized does not support try-lock semantics. When scheduling is performed (scheduleDrainBuffers), a tryLock is used to skip past the work if another thread is already taking care of it. That avoids unnecessary blocking and is simpler than trying to emulate it using synchronized and a CAS'd boolean, which might work but feels a little error-prone.
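
For comparison, a minimal sketch of that emulation (illustrative only; the drain work is a placeholder): the AtomicBoolean plays the role of tryLock, and any path that forgets to reset it silently disables maintenance, which is the error-prone part.

    import java.util.concurrent.atomic.AtomicBoolean;

    final class SynchronizedTryLockSketch {
      private final AtomicBoolean draining = new AtomicBoolean();

      void scheduleDrainIfNeeded() {
        if (draining.compareAndSet(false, true)) {  // emulated tryLock
          try {
            synchronized (this) {
              drainBuffers();
            }
          } finally {
            draining.set(false);                    // emulated unlock
          }
        }
        // else: another thread is already draining; skip without blocking
      }

      private void drainBuffers() {
        // placeholder for the maintenance work
      }
    }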

All the stack traces are on this code path; there is no difference.

Do your logs include any warning like Exception thrown when submitting maintenance task? That would appear if the executor threw an exception unexpectedly, such as a RejectedExecutionException. That shouldn't occur on a FJP unless the system is in a bad state, as it allows for up to 64M tasks. A problem like OutOfMemoryError might occur, though the tests should cover that case. But I don't see how the lock could be left in a locked state, resulting in a pile-up, assuming the bug is within BoundedLocalCache.

So I'd still recommend trying the trick to force a ReentrantLock and seeing if this recurs. But other than that guess, there isn't enough information for me to debug with.

@agoston
Author

agoston commented Nov 20, 2017

Do your logs include any warning like Exception thrown when submitting maintenance task?

No, I see nothing of the sort.

Forcing the ReentrantLock is not so simple, as the code in BoundedLocalCache goes like this:

    evictionLock = (builder.getExecutor() instanceof ForkJoinPool)
        ? new NonReentrantLock()
        : new ReentrantLock();

So the recommended FJP::commonPool() would still use a NonReentrantLock.

@ben-manes
Owner

Nope, it would be an anonymous executor and use a ReentrantLock.

$ javarepl
Welcome to JavaREPL version 425 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type expression to evaluate, :help for more options or press tab to auto-complete.
Connected to local instance at http://localhost:53166
java> import java.util.concurrent.*
Imported java.util.concurrent.*
java> Executor e = ForkJoinPool.commonPool()::execute;
java.util.concurrent.Executor e = Evaluation$7adjq3cgmtr50ps9z6b4$$Lambda$211/634150796@6a757ee
java> e instanceof ForkJoinPool
java.lang.Boolean res1 = false
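
For completeness, a minimal sketch of wiring that into the builder (the size and expiry settings here are placeholders):

    import java.util.concurrent.Executor;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class ReentrantLockWorkaround {
      public static void main(String[] args) {
        // The method reference hides the ForkJoinPool type from the builder,
        // so the cache falls back to a plain ReentrantLock internally.
        Executor executor = ForkJoinPool.commonPool()::execute;

        Cache<String, String> cache = Caffeine.newBuilder()
            .executor(executor)
            .maximumSize(10_000)
            .expireAfterWrite(1, TimeUnit.DAYS)
            .build();

        cache.put("key", "value");
      }
    }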

@agoston
Author

agoston commented Nov 20, 2017

👏, neat :)

I will push a test to production. Note that since this issue is not reproducible manually, I won't be able to say if it's fixed or not.

@ben-manes
Owner

Sounds good. I'll switch over regardless and close this when released, assuming this doesn't occur again. If it does occur afterwards, we should revisit with the new stacktrace. I won't get a chance to look at this until after the holidays, so that should give us some time to see if it reproduces.

@levischuckeats

Commenting here, as we are also affected. It seems to only happen in production and at odd hours. Today's occurrence (the third event) happened on a server with little to no load at those hours.

The stack trace is the same as the one above, as far as I can tell. This was with version 2.5.5.

  - parking to wait for  <0x00000002cdaff260> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

We use an async weigher that uses a pessimistic estimate (the largest object of that class seen so far) at first; then, in a thread pool, it re-inserts the object with a thread-local set so that it is actually weighed within that thread pool. So most objects are inserted twice under the same key.

            AsyncWeigher<Object, Object> weigher = asyncReference.get();
            cache = Caffeine.newBuilder()
                    .maximumWeight(1024L * 1024L * 1024L * GIGS_OF_CACHE)
                    .weigher(weigher)
                    .build();
            weigher.setCache(cache);
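
Not our real code, but roughly what that pattern looks like (the names and the fallback estimate below are made up for illustration):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Weigher;

    final class AsyncWeigherSketch implements Weigher<Object, Object> {
      /** Set by the re-weighing thread pool just before it re-puts the entry. */
      static final ThreadLocal<Integer> measuredWeight = new ThreadLocal<>();

      private volatile Cache<Object, Object> cache;

      void setCache(Cache<Object, Object> cache) {
        this.cache = cache;
      }

      @Override
      public int weigh(Object key, Object value) {
        Integer measured = measuredWeight.get();
        return (measured != null)
            ? measured                       // accurate weight supplied by the pool
            : pessimisticEstimate(value);    // largest-seen-so-far guess on first insert
      }

      private int pessimisticEstimate(Object value) {
        return 1024; // placeholder: real code would track the largest instance per class
      }
    }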

@ben-manes
Owner

ben-manes commented Nov 21, 2017

Can you also try changing to use a ReentrantLock so we can see if it reoccurs?

Caffeine.newBuilder().executor(ForkJoinPool.commonPool()::execute)...

Between being unable to reproduce it and eyeballing the code, I haven't stumbled upon the likely cause. So I don't have much advice yet except to bypass the most likely culprit.

Do you see IllegalMonitorStateException in your logs? Also, can you list which methods you use on the cache? Perhaps the exclusiveOwnerThread is not being properly set/unset, causing it to block. There are explicit lock() calls in a few non-critical paths, so perhaps it's a race between lock and tryLock.

@levischuckeats

Hello @ben-manes
Due to the holidays, we won't be deploying significant code changes this week, so I won't be able to try this out.

As for IllegalMonitorStateException, there are no error logs that include that in the stacks from any of our servers.

We currently use the following methods (duplicated for the specific types we use, if it makes a difference):

  • invalidate(Long)
  • invalidate(Object)
  • put(String, Object)
  • put(Object, Object)
  • invalidateAll()
  • getIfPresent(Long)
  • getIfPresent(String)
  • getIfPresent(Object)
  • getAllPresent(Collection)
  • .asMap().remove(Long)

@ben-manes
Owner

Yep, due to the holidays I don't plan on fixing it this week either =)

Thanks. invalidateAll() does use lock(), since clearing is an expensive operation and the alternative path would be slower. So the next step would be to see if the Stresser test can reproduce it with a mix of calls. However, that mix is exercised by MultiThreadedTest, so I'd have expected it to fail during a CI run in the past.

@levischuckeats

invalidateAll in our code is rarely used and really only comes into play in a single-threaded situation, where the server has just finished running migrations before becoming available to the outside world.

So I wouldn't focus on that.

@ben-manes
Owner

Right, I'm trying to think of lock interactions that could be stress-tested in an attempt to reproduce the problem. Maybe I should write a dedicated stress test for the lock class itself to see if I can break it, so we are sure about where the bug lies.

@levischuckeats

While preparing a censored jstack to share (if you wish to see it, reach out to [email protected]), here are the counts I saw:

  • 378 threads were stuck on com.github.benmanes.caffeine.cache.LocalManualCache.put
  • 136 threads were stuck on com.github.benmanes.caffeine.cache.BoundedLocalCache.remove
  • 4 threads were stuck on com.github.benmanes.caffeine.cache.LocalManualCache.invalidate
  • 9 threads were stuck on com.github.benmanes.caffeine.cache.BoundedLocalCache.replace

There was one thread in a ForkJoinPool that was tasked with doing a cleanup, but it didn't get anywhere:

	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
	at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2963)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:2952)

@levischuckeats

Interestingly, as I was censoring the jstack and gathering those numbers, the same server deadlocked with Caffeine again.

@ben-manes
Owner

I'm sorry for the inconvenience this is causing. I'm surprised it's happening all of a sudden, since that code wasn't changed recently and it has had heavy usage and stress testing. I wonder if it's a JDK change around memory fences, perhaps a newer optimization triggering this bug.

@levischuckeats

It's really odd, because we have not updated in a while, but it only started happening this past week.

These servers are running with about 10 GB of cache (GIGS_OF_CACHE is set to 10 in the builder above), and java -version reports:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

@jbduncan

@levischuckeats Am I right to assume that this Java version you're running is some sort of build of OpenJDK? :)

@levischuckeats

@jbduncan

I don't think so. The folder our Java binaries are in is named java-8-oracle, the COPYRIGHT file mentions Oracle by name, and there's a release file that lists commercial under build type. The README just points to java.com.

@ben-manes
Owner

When I run the Stresser in write mode on 8_144, I see the lock holder change over time.

---------- 00:01:34 ----------
Pending reads: 0; writes: 1,024
Drain status = Processing -> Required (3)
Evictions = 325,998,568
Size = 5,136 (max: 4,096)
Lock = [Locked by thread pool-2-thread-1]
Pending tasks = 2
Max Memory = 3,817,865,216 bytes
Free Memory = 158,083,696 bytes
Allocated Memory = 481,820,672 bytes

---------- 00:01:39 ----------
Pending reads: 16; writes: 153
Drain status = Processing -> Required (3)
Evictions = 343,609,572
Size = 5,494 (max: 4,096)
Lock = [Locked by thread pool-2-thread-13]
Pending tasks = 0
Max Memory = 3,817,865,216 bytes
Free Memory = 155,875,704 bytes
Allocated Memory = 318,767,104 bytes

There are stacks of blocked threads, but that is because the write buffer was filled. Rather than being unbounded, it adds back pressure by blocking threads and having them help perform a full cleanup. In the common case writes are processed asynchronously to minimize user-facing latency, and in a write storm it throttles when overwhelmed. The throttling improves throughput by reducing context switching.

"pool-2-thread-7" #24 daemon prio=1 os_prio=31 tid=0x00007fcc0584a000 nid=0x7003 waiting on condition [0x0000700011e01000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006c0ca3650> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
	at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1100)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1021)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1658)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1606)
	at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
	at com.github.benmanes.caffeine.cache.Stresser.lambda$run$2(Stresser.java:103)
	at com.github.benmanes.caffeine.cache.Stresser$$Lambda$8/517380410.call(Unknown Source)
	at com.github.benmanes.caffeine.testing.ConcurrentTestHarness.lambda$timeTasks$0(ConcurrentTestHarness.java:83)
	at com.github.benmanes.caffeine.testing.ConcurrentTestHarness$$Lambda$9/1681433494.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I would have expected the lock holder to be stuck and, potentially, the throughput to drop as all threads become blocked. So this is frustrating, since I cannot reproduce it...

@levischuckeats

I understand the back pressure, though when CPU usage on all cores is near 0, I don't think any work is being done 😃

For our use case (an intermediate cache for entities, which are backed by the database), we'd prefer things getting dropped, rather than being blocked.

@ben-manes
Owner

@agoston By any chance, are you running this on Tomcat? In a private stack trace from @levischuckeats, the threads are from Apache Catalina (Tomcat's servlet pool). I know the Tomcat class loader can cause problems, especially when shared classes like the FJP are in the mix. I wonder if it is an interaction caused by the container...

@ben-manes
Owner

Yep, if the system isn't generating writes then we'd all expect the cache to catch up and quickly become idle.

A write cannot be dropped, since those are add/update/remove operations already performed on the hash table. Think of it as a database write-ahead log: the writes are applied in memory, appended to a log, and replayed on the persisted form. In this way the expensive operations (random fsyncs) are batched and asynchronous, helping to scale throughput. The cache is similar, where the locking around the eviction policy is expensive because every read and write is a policy mutation. We can and do drop reads, which is the common case and non-critical state. But writes may affect things, such as requiring an eviction.

@agoston
Author

agoston commented Nov 22, 2017

@ben-manes
Correct, we are running under Tomcat (JBoss, to be even more precise). The thread names already give it away (ajp-* are Tomcat's AJP worker threads).

@ben-manes
Owner

Any chance you have clearReferencesStopThreads enabled? That uses Thread.stop() and I’m wondering if a redeployment causes the lock to be left in a bad state due to halted threads.

Also what version of Tomcat and did you recently upgrade?

@levischuckeats

On our end, we actually kill the Java process when we redeploy, so each time Tomcat starts it is in a fresh JVM.

@agoston
Author

agoston commented Nov 22, 2017

I wonder if we work at the same company, @levischuckeats :) We do the same thing: rolling restarts only, never a redeploy.

@ben-manes
Owner

ben-manes commented Nov 22, 2017

I'm stuck then, because I can't work out the failure conditions.

  • All usages of the lock use the correct lock-then-try/finally-unlock idiom
  • No calls make recursive acquisitions, excluding foreign code
  • The configurations shown do not have foreign code that recurses under the lock
  • Tests against the cache (Stresser, MultiThreadedTest) don't show problems
  • An isolated stress test against the lock shows progress
  • No weird behavior like Thread.stop() to explain it away
  • The failures are independent, so it is not a user problem (e.g. an infinite loop)

The only similarity so far is Tomcat, which can be invasive, but not so much in your usages. I haven't used Tomcat in many years (preferring embedded Jetty), so I'm not good at debugging it anymore. But it also feels like a cop-out to presume it's a weird interaction bug.

I'm back to having to wait until sometime after the holidays when you both can test using ReentrantLock to see if the problem disappears or a similar stack emerges. If you have any better suggestions please let me know.

@ben-manes
Owner

It would also be helpful if you tracked the cache stats. Then we could know for certain that eviction halted. Otherwise, my testing only shows threads backing up while progress continues. If I add a shutdown hook, they all quickly complete normally. I cannot reproduce any hanging.
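
A minimal sketch of what that tracking could look like, assuming recordStats() is enabled; the scheduling and output format are illustrative only. A flat evictionCount() over time, while the cache sits above its maximum size, would confirm that eviction halted.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.CacheStats;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class CacheStatsReporter {
      public static void main(String[] args) {
        Cache<String, Object> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .recordStats()
            .build();

        // Periodically log the eviction progress and hit rate
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
          CacheStats stats = cache.stats();
          System.out.printf("size=%d evictions=%d hitRate=%.2f%n",
              cache.estimatedSize(), stats.evictionCount(), stats.hitRate());
        }, 0, 1, TimeUnit.MINUTES);
      }
    }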

@ben-manes
Owner

I’m not sure how to help, unfortunately. What kernel and JDK versions are you on?

@agoston
Author

agoston commented Jul 16, 2018

Nah, I know there's not much we can do; this is more to keep track of it for other unfortunate souls.
I still believe this is a JVM/kernel bug, not a Caffeine bug.

For us, java version is 1.8.0_161 and the kernel version is 2.6.32-696.23.1.el6.x86_64.

@johnou

johnou commented Jul 16, 2018

@agoston What JVM options are you using? Are you monitoring your heap? Could an OOM have occurred?

@agoston
Author

agoston commented Jul 23, 2018

The JVM options are really quite complex (a CMS GC tuned to the teeth), but I'm assured there was no OOM situation.

@johnou

johnou commented Jul 23, 2018

Can you dump the JVM options here so we can run the Stresser with them?

@BunnyNichalos

BunnyNichalos commented Jan 8, 2021

I still have this problem: Caffeine version 2.8.6, Java version 1.8.0_201, and kernel version Linux 3.10.0-862.14.4.el7.x86_64 x86_64.

"XNIO-1 task-11" #1460 prio=5 os_prio=0 tid=0x00007fd2fc001800 nid=0x5ab waiting on condition [0x00007fd25e53a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c361ca80> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1445)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1366)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2007)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1955)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:130)

@ben-manes
Owner

Sorry, that isn't enough detail. You probably need to send all of the stack traces so we can see the other blocked threads.

@ben-manes
Owner

The afterWrite is a hint, though. It looks like the write buffer exceeded its capacity (128 * NCPUs), which is quite generous. This means a lot of write operations are queued up and not being run. The cache then blocks writers on the eviction lock in order to deprioritize writes and have them help clean up. This indicates some starvation, where the owner of this lock never progresses and everything stays blocked.

Do you have a thread performing a compute? Maybe you have a live-lock on eviction due to recursive computations, so the entry cannot be removed.

@BunnyNichalos

Thank you for your reply. Indeed, my application has a lot of write operations, but they were blocked, and I could not find the owner of this lock. I used jstack to export the thread info:
thread stack.
@ben-manes
Owner

The ForkJoinWorkerThread is blocked waiting on the lock, as are all of the other threads. I don't see any ConcurrentHashMap operations, anything performing work under the lock, etc. My guess is that it is a JVM / Linux / hardware bug like this one, where the unlock signal is never communicated. That is too low-level for me to debug...

Are you using OpenJDK or a custom distribution? If it is a custom one, then you might ask them, since it might be a bug there.

@BunnyNichalos

Thanks. I'm using the Oracle JDK. Here is my JDK version:

java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

@ben-manes
Owner

hmm.. then I really don't know. I'm sorry, but there is not much to go on...

You could try using Caffeine.executor(Runnable::run) to see if that helps. It will perform the maintenance work on the caller threads.
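
A minimal sketch of that configuration (the size and expiry settings are placeholders):

    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class SameThreadMaintenance {
      public static void main(String[] args) {
        // Runnable::run is a same-thread executor, so maintenance (eviction,
        // expiration) runs inline on the calling thread instead of being
        // handed off to the ForkJoinPool.
        Cache<String, String> cache = Caffeine.newBuilder()
            .executor(Runnable::run)
            .maximumSize(10_000)
            .expireAfterWrite(1, TimeUnit.DAYS)
            .build();

        cache.put("key", "value");  // any needed cleanup happens here, inline
      }
    }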

@BunnyNichalos

OK. Thank you for your advice, I will give it a try.

@ben-manes
Owner

Is that Linux kernel version old enough to be subject to the lost wake-up bug mentioned in this thread? It looks like it, and if so, you need to upgrade the OS.

@ben-manes
Owner

Kernel 3.10 is from June 2013, and the bug is confirmed in 3.19, which is from February 2015. I don't know much more, but you should try to use a modern distribution.

@BunnyNichalos

BunnyNichalos commented Jan 8, 2021

Yes, I'm using CentOS 7, and its kernel version is Linux 3.10.0, so maybe it is a bug in the OS. I think there must be some case in my code that triggers this bug; in general, ordinary use of ReentrantLock should not trigger it. I will try to reproduce the case. My guess is that synchronizing on a String.intern() result and calling Caffeine's put inside that block triggers this. If I cannot reproduce it, I will try upgrading the OS.

@BunnyNichalos

When I switched to the Guava cache, the problem still existed. It should not be Caffeine's problem.

"DubboServerHandler-10.198.65.150:20880-thread-957" #1243 daemon prio=5 os_prio=0 tid=0x000000000195f800 nid=0x4d0 waiting on condition [0x00007f2e53f6a000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000c2abca68> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2771)
	at com.google.common.cache.LocalCache.put(LocalCache.java:4192)
	at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4876)

All threads are waiting for the lock, but I cannot find any thread holding the lock.

@ben-manes
Owner

Yes, this is what Gerrit also experienced, with either library. I don't know, but it's come up multiple times with different implementations and with nothing in the code to explain it. There is some hardware / OS / JVM bug causing havoc.

@BunnyNichalos

Yes, this is what Gerrit also experienced, with either library. I don't know, but it's come up multiple times with different implementations and with nothing in the code to explain it. There is some hardware / OS / JVM bug causing havoc.

In my case, I think this problem was caused by a StackOverflowError in a thread. When the problem happened, I dumped the thread stack info and the heap info. In the heap I found the following (I'm now using Guava, but the same applies to Caffeine):

(screenshot: heap_dump_lock)

I searched the logs, and this thread had thrown a StackOverflowError because of recursive operations:

2021-04-22 10:02:02.823 ERROR 1 --- [XNIO-1 task-2] com.li.dm.web.aop.LogAspect :

java.lang.StackOverflowError: null
	at com.google.common.cache.LocalCache$Segment.runLockedCleanup(LocalCache.java:3455)
	at com.google.common.cache.LocalCache$Segment.cleanUp(LocalCache.java:3444)
	at com.google.common.cache.LocalCache$Segment.postReadCleanup(LocalCache.java:3422)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2078)
	at com.google.common.cache.LocalCache.getIfPresent(LocalCache.java:3967)
	at com.google.common.cache.LocalCache$LocalManualCache.getIfPresent(LocalCache.java:4853)
	at com.li.dm.service.service.impl.DialogCacheHandler.getKeyword(DialogCacheHandler.java:114)
	at com.li.dm.service.service.impl.KeywordMatcher.findKeywords(KeywordMatcher.java:95)
	at com.li.dm.service.service.impl.KeywordMatcher.matchKeyword(KeywordMatcher.java:61)
	at com.li.dm.service.service.impl.KeywordMatcher.matchKeyword(KeywordMatcher.java:56)

The thread was a worker in a thread pool. So I guess this was caused by the StackOverflowError: because of the error, the thread bailed out without unlocking, so the state of the ReentrantLock stayed at 1 with no chance of being reset to 0. Then other threads that operated on the cache called lock() and blocked.
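
A minimal sketch (illustrative only, not Guava or Caffeine code) of the hazard: unlike a synchronized monitor, a java.util.concurrent lock is not released when the owning thread fails or dies before reaching its unlock, so the lock stays held forever and every later lock() call parks.

    import java.util.concurrent.locks.ReentrantLock;

    public class LostUnlockSketch {
      private static final ReentrantLock lock = new ReentrantLock();

      public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
          lock.lock();
          // If an Error (e.g. StackOverflowError) escapes here, before the
          // try/finally that would have unlocked, unlock() is never called.
          throw new StackOverflowError("simulated failure while holding the lock");
        });
        worker.start();
        worker.join();  // the worker is dead, but the lock state is still 1

        System.out.println("still locked by the dead thread? " + lock.isLocked()); // true
        lock.lock();    // any other thread now parks here forever
      }
    }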

@ben-manes
Owner

Good find! Maybe this will help @kentwang929 who recently had a similar problem.

@kentwang929

Interesting, we also observe StackOverflowErrors in our logs. Good direction to debug, thanks!

@ben-manes
Owner

FYI, I believe JEP 270 should have resolved that, but it is Java 9 onward.

@kentwang929

We are on Java 14. This might be an issue in our own code when the query is too large.
