
Deadlock in NonReentrantLock #203

Closed
agoston opened this issue Nov 14, 2017 · 104 comments

@agoston

agoston commented Nov 14, 2017

Under heavy JVM load, we are experiencing deadlocks with Caffeine version 2.5.6.

Here is the thread dump of the two deadlocked threads:

"ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

and

"ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000032a600bb8> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

Looking at the code, I can't imagine a scenario in which this could occur, but it does happen under heavy load on our JVMs.

Any ideas what we could do? For now, as a workaround, I've added a synchronized block around the cache.put() calls to make sure only one thread adds new values to the cache at a time.

@ben-manes
Owner

ben-manes commented Nov 14, 2017 via email

@ben-manes
Owner

Looking at the stack, if these are the only blocked threads then it looks like a lost unlock signal. But of course I can't reproduce this using my Stresser utility. If this works fine with ReentrantLock, at least we'll know it's a bug somewhere around this logic.

This code path should only occur when the write rate exceeds the drain rate, or when the drain task was lost, such as due to the ForkJoinPool bug on older JDKs. The write buffer typically allows writes to make progress by batching the work onto the penalized thread, but if it fills up, writers block on the lock. That improves performance by descheduling the threads, increasing overall throughput by reducing context switches.
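
To make that concrete, here is a heavily simplified sketch of the write-buffer behavior described above (not the actual BoundedLocalCache code; the buffer size and names are illustrative): writers enqueue their policy update and schedule an asynchronous drain, and only when the buffer is full do they block on the eviction lock and help drain it themselves.

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.locks.ReentrantLock;

    final class WriteBufferSketch {
      private final Queue<Runnable> writeBuffer = new ArrayBlockingQueue<>(1024);
      private final ReentrantLock evictionLock = new ReentrantLock();
      private final Executor executor;

      WriteBufferSketch(Executor executor) {
        this.executor = executor;
      }

      void afterWrite(Runnable policyUpdate) {
        if (writeBuffer.offer(policyUpdate)) {
          executor.execute(this::drainIfIdle);   // common case: asynchronous drain
        } else {
          performCleanUp(policyUpdate);          // buffer full: block and help out
        }
      }

      private void drainIfIdle() {
        if (evictionLock.tryLock()) {            // skip if another thread is draining
          try { drain(); } finally { evictionLock.unlock(); }
        }
      }

      private void performCleanUp(Runnable pendingUpdate) {
        evictionLock.lock();                     // the lock() the stack traces park on
        try {
          drain();
          pendingUpdate.run();
        } finally {
          evictionLock.unlock();
        }
      }

      private void drain() {
        for (Runnable task; (task = writeBuffer.poll()) != null; ) {
          task.run();                            // replay the buffered policy updates
        }
      }
    }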

A few questions to help investigate:

  • Does using Caffeine.executor(ForkJoinPool.commonPool()::execute) solve the problem? (This tricks it into using a ReentrantLock by not trusting the FJP.)
  • Are there any other stacktraces using the cache not shown?
  • What is the cache configuration? (e.g. Is a CacheWriter doing heavy work?)
  • What version of the JDK are you running?
  • Any other suggestions for how to reproduce?

@agoston
Author

agoston commented Nov 15, 2017

I've checked ReentrantLock vs. NonReentrantLock, but in the end they both end up in AbstractQueuedSynchronizer.acquire(), which is where the problem occurs.

All the stack traces are on this code path; there is no difference. I do get a deadlock warning from the JVM (this is the full output; the other two threads are not relevant here, as they are waiting on locks held by the two deadlocked threads above):

Found one Java-level deadlock:
=============================
"ajp-0.0.0.0-8109-128 (-jIkyghhBePEsLll9ic8a19e79f354)":
  waiting to lock monitor 0x00007f73f881a308 (object 0x0000000508b1d908, a com.bol.order.CustomOrder),
  which is held by "ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)"
"ajp-0.0.0.0-8109-78 (-jIkyghhBePEsLll9i5dnGr65Dx8PfahGe2gAABxE)":
  waiting for ownable synchronizer 0x000000032a600bb8, (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync),
  which is held by "ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)"
"ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)":
  waiting to lock monitor 0x00007f73d40846a8 (object 0x000000048855c020, a com.bol.order.CustomOrder),
  which is held by "ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)"
"ajp-0.0.0.0-8109-59 (P6V7aMaZbbXkfaYzYzXxID0tvHylFQJumanAAABmA)":
  waiting for ownable synchronizer 0x000000032a600bb8, (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync),
  which is held by "ajp-0.0.0.0-8109-36 (P6V7aMaZbbXkfaYzYzkTBzHO7PeW5fQUKx8QAAAao)"

JVM version is 1.8.0u131.

Cache configuration:

    @Bean
    public Cache myFluffyCache() {
        return Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(1, TimeUnit.DAYS)
                .recordStats()
                .build();
    }

We cannot reproduce this bug; it only happens in production, very sporadically (a few times a week, then nothing for months, then back to a few times a week).

I was thinking of replacing the evictionLock construct with a simple synchronized(this). I'm not sure what the upside of using a ReentrantLock is, but it is surely more error-prone. Make sure you check out this issue:

https://bugs.openjdk.java.net/browse/JDK-8156765

which, in turn, is caused by a serious design flaw in java.util.concurrent that is only fixed by an epic hack in Java 9:

https://bugs.openjdk.java.net/browse/JDK-7011862

@ben-manes
Owner

I've checked ReentrantLock vs. NonReentrantLock, but in the end they both end up in AbstractQueuedSynchronizer.acquire(), which is where the problem occurs.

Yes, but perhaps my NonReentrantLock has a bug that is hard to trigger. It is a slimmed-down ReentrantLock, since I don't expect recursive locking, and I tried to build everything out upfront before I got demotivated by the weight of the project's complexity. While it is faster in a dedicated microbenchmark, it isn't a hot path and probably should be removed as unnecessary.

I was thinking of replacing the evictionLock construct with a simple synchronized(this). I'm not sure what the upside of using a ReentrantLock is, but it is surely more error-prone.

Unfortunately, synchronized does not support try-lock semantics. When scheduling is performed (scheduleDrainBuffers), a tryLock is used to skip past the work if another thread is already taking care of it. That avoids unnecessary blocking and is simpler than trying to emulate it using synchronized and a CAS'd boolean, which might work but feels a little error-prone.
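
For comparison, a minimal sketch of that emulation (illustrative only; the drain work is a placeholder): the AtomicBoolean plays the role of tryLock, and any path that forgets to reset it silently disables maintenance, which is the error-prone part.

    import java.util.concurrent.atomic.AtomicBoolean;

    final class SynchronizedTryLockSketch {
      private final AtomicBoolean draining = new AtomicBoolean();

      void scheduleDrainIfNeeded() {
        if (draining.compareAndSet(false, true)) {  // emulated tryLock
          try {
            synchronized (this) {
              drainBuffers();
            }
          } finally {
            draining.set(false);                    // emulated unlock
          }
        }
        // else: another thread is already draining; skip without blocking
      }

      private void drainBuffers() {
        // placeholder for the maintenance work
      }
    }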

All the stack traces are on this code path; there is no difference.

Do your logs include any warning like Exception thrown when submitting maintenance task? That would appear if the executor threw an exception unexpectedly, such as a RejectedExecutionException. That shouldn't occur on a FJP unless the system is in a bad state, as it allows for up to 64M tasks. A problem like OutOfMemoryError might occur, though the tests should cover that case. But I don't see how the lock could be left in a locked state, resulting in a pile-up, assuming the bug is within BoundedLocalCache.

So I'd still recommend trying the trick to force a ReentrantLock and seeing if this recurs. But other than that guess, there isn't enough information for me to debug with.

@agoston
Author

agoston commented Nov 20, 2017

Do your logs include any warning like Exception thrown when submitting maintenance task?

No, I see nothing of the sort.

Forcing the ReentrantLock is not so simple, as the code in BoundedLocalCache goes like this:

    evictionLock = (builder.getExecutor() instanceof ForkJoinPool)
        ? new NonReentrantLock()
        : new ReentrantLock();

So the recommended FJP::commonPool() would still use a NonReentrantLock.

@ben-manes
Owner

Nope, it would be an anonymous executor and use a ReentrantLock.

$ javarepl
Welcome to JavaREPL version 425 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type expression to evaluate, :help for more options or press tab to auto-complete.
Connected to local instance at http://localhost:53166
java> import java.util.concurrent.*
Imported java.util.concurrent.*
java> Executor e = ForkJoinPool.commonPool()::execute;
java.util.concurrent.Executor e = Evaluation$7adjq3cgmtr50ps9z6b4$$Lambda$211/634150796@6a757ee
java> e instanceof ForkJoinPool
java.lang.Boolean res1 = false
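
For completeness, a minimal sketch of wiring that into the builder (the size and expiry settings here are placeholders):

    import java.util.concurrent.Executor;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class ReentrantLockWorkaround {
      public static void main(String[] args) {
        // The method reference hides the ForkJoinPool type from the builder,
        // so the cache falls back to a plain ReentrantLock internally.
        Executor executor = ForkJoinPool.commonPool()::execute;

        Cache<String, String> cache = Caffeine.newBuilder()
            .executor(executor)
            .maximumSize(10_000)
            .expireAfterWrite(1, TimeUnit.DAYS)
            .build();

        cache.put("key", "value");
      }
    }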

@agoston
Author

agoston commented Nov 20, 2017

👏, neat :)

I will push a test to production. Note that since this issue is not reproducible manually, I won't be able to say if it's fixed or not.

@ben-manes
Owner

Sounds good. I'll switch over regardless and close this when released, assuming this doesn't occur again. If it does occur afterwards, we should revisit with the new stacktrace. I won't get a chance to look at this until after the holidays, so that should give us some time to see if it reproduces.

@levischuckeats

Commenting here, as we are also affected. It seems to only happen in production and at odd hours. Today's occurrence (the third event) happened on a server with little to no load at those hours.

The stack trace is the same as the one above, as far as I can tell. This was with version 2.5.5.

  - parking to wait for  <0x00000002cdaff260> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
        at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1017)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1655)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1602)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)

We use an async weigher that uses a pessimistic estimate (the largest object of that class seen so far) at first; then, in a thread pool, it re-inserts the object with a thread-local set so that it is actually weighed within that thread pool. So most objects are inserted twice under the same key.

            AsyncWeigher<Object, Object> weigher = asyncReference.get();
            cache = Caffeine.newBuilder()
                    .maximumWeight(1024L * 1024L * 1024L * GIGS_OF_CACHE)
                    .weigher(weigher)
                    .build();
            weigher.setCache(cache);
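
Not our real code, but roughly what that pattern looks like (the names and the fallback estimate below are made up for illustration):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Weigher;

    final class AsyncWeigherSketch implements Weigher<Object, Object> {
      /** Set by the re-weighing thread pool just before it re-puts the entry. */
      static final ThreadLocal<Integer> measuredWeight = new ThreadLocal<>();

      private volatile Cache<Object, Object> cache;

      void setCache(Cache<Object, Object> cache) {
        this.cache = cache;
      }

      @Override
      public int weigh(Object key, Object value) {
        Integer measured = measuredWeight.get();
        return (measured != null)
            ? measured                       // accurate weight supplied by the pool
            : pessimisticEstimate(value);    // largest-seen-so-far guess on first insert
      }

      private int pessimisticEstimate(Object value) {
        return 1024; // placeholder: real code would track the largest instance per class
      }
    }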

@ben-manes
Owner

ben-manes commented Nov 21, 2017

Can you also try changing to use a ReentrantLock so we can see if it reoccurs?

Caffeine.newBuilder().executor(ForkJoinPool.commonPool()::execute)...

Between being unable to reproduce it and eyeballing the code, I haven't stumbled upon the likely cause. So I don't have much advice yet except to bypass the most likely culprit.

Do you see IllegalMonitorStateException in your logs? Also, can you list which methods you use on the cache? Perhaps the exclusiveOwnerThread is not being properly set/unset, causing it to block. There are explicit lock() calls in a few non-critical paths, so perhaps it's a race between lock and tryLock.

@levischuckeats

Hello @ben-manes
Due to the holidays, we won't be deploying significant code changes this week, so I won't be able to try this out.

As for IllegalMonitorStateException, there are no error logs that include that in the stacks from any of our servers.

We currently use the following methods (duplicated for the specific types we use, if it makes a difference):

  • invalidate(Long)
  • invalidate(Object)
  • put(String, Object)
  • put(Object, Object)
  • invalidateAll()
  • getIfPresent(Long)
  • getIfPresent(String)
  • getIfPresent(Object)
  • getAllPresent(Collection)
  • .asMap().remove(Long)

@ben-manes
Owner

Yep, due to the holidays I don't plan on fixing it this week either =)

Thanks. invalidateAll() does use lock(), since clearing is an expensive operation and the alternative path would be slower. So the next step would be to see if the Stresser test can reproduce it with a mix of calls. However, that mix is exercised by MultiThreadedTest, so I'd have expected it to fail during a CI run in the past.

@levischuckeats

invalidateAll in our code is rarely used and really only comes into play in a single-threaded situation, where the server has just finished running migrations before becoming available to the outside world.

So I wouldn't focus on that.

@ben-manes
Owner

Right, I'm trying to think of lock interactions that could be stress-tested in an attempt to reproduce the problem. Maybe I should write a dedicated stress test for the lock class itself to see if I can break it, so we are sure about where the bug lies.

@levischuckeats

While preparing a censored jstack to share (if you wish to see it, reach out to [email protected]), here are the counts I saw:

  • 378 threads were stuck on com.github.benmanes.caffeine.cache.LocalManualCache.put
  • 136 threads were stuck on com.github.benmanes.caffeine.cache.BoundedLocalCache.remove
  • 4 threads were stuck on com.github.benmanes.caffeine.cache.LocalManualCache.invalidate
  • 9 threads were stuck on com.github.benmanes.caffeine.cache.BoundedLocalCache.replace

There was one thread in a ForkJoinPool that was tasked with doing a cleanup, but it didn't get anywhere:

	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
	at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1096)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2963)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:2952)

@levischuckeats

Interestingly, as I was censoring the jstack and gathering those numbers, the same server deadlocked with Caffeine again.

@ben-manes
Owner

I'm sorry for the inconvenience this is causing. I'm surprised it's happening all of a sudden, since that code wasn't changed recently and it has had heavy usage and stress testing. I wonder if it's a JDK change around memory fences, perhaps a newer optimization triggering this bug.

@levischuckeats

It's really odd, because we have not updated in a while, but it only started happening this past week.

These servers are running with about 10 GB of cache (GIGS_OF_CACHE is set to 10 in the builder above), and java -version reports:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

@jbduncan

@levischuckeats Am I right to assume that this Java version you're running is some sort of build of OpenJDK? :)

@levischuckeats

@jbduncan

I don't think so. The folder our Java binaries are in is named java-8-oracle, the COPYRIGHT file mentions Oracle by name, and there's a release file that lists commercial under build type. The README just points to java.com.

@ben-manes
Owner

When I run the Stresser in write mode on 8_144, I see the lock holder change over time.

---------- 00:01:34 ----------
Pending reads: 0; writes: 1,024
Drain status = Processing -> Required (3)
Evictions = 325,998,568
Size = 5,136 (max: 4,096)
Lock = [Locked by thread pool-2-thread-1]
Pending tasks = 2
Max Memory = 3,817,865,216 bytes
Free Memory = 158,083,696 bytes
Allocated Memory = 481,820,672 bytes

---------- 00:01:39 ----------
Pending reads: 16; writes: 153
Drain status = Processing -> Required (3)
Evictions = 343,609,572
Size = 5,494 (max: 4,096)
Lock = [Locked by thread pool-2-thread-13]
Pending tasks = 0
Max Memory = 3,817,865,216 bytes
Free Memory = 155,875,704 bytes
Allocated Memory = 318,767,104 bytes

There are stacks of blocked threads, but that is because the write buffer was filled. Rather than being unbounded, it adds back pressure by blocking threads and having them help perform a full cleanup. In the common case writes are processed asynchronously to minimize user-facing latency, and in a write storm it throttles when overwhelmed. The throttling improves throughput by reducing context switching.

"pool-2-thread-7" #24 daemon prio=1 os_prio=31 tid=0x00007fcc0584a000 nid=0x7003 waiting on condition [0x0000700011e01000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006c0ca3650> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at com.github.benmanes.caffeine.cache.NonReentrantLock$Sync.lock(NonReentrantLock.java:315)
	at com.github.benmanes.caffeine.cache.NonReentrantLock.lock(NonReentrantLock.java:78)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1100)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1021)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1658)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1606)
	at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
	at com.github.benmanes.caffeine.cache.Stresser.lambda$run$2(Stresser.java:103)
	at com.github.benmanes.caffeine.cache.Stresser$$Lambda$8/517380410.call(Unknown Source)
	at com.github.benmanes.caffeine.testing.ConcurrentTestHarness.lambda$timeTasks$0(ConcurrentTestHarness.java:83)
	at com.github.benmanes.caffeine.testing.ConcurrentTestHarness$$Lambda$9/1681433494.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I would have expected the lock holder to be stuck and, potentially, the throughput to drop as all threads become blocked. So this is frustrating, since I cannot reproduce it...

@levischuckeats

I understand the back pressure, though when CPU usage on all cores is near 0, I don't think any work is being done 😃

For our use case (an intermediate cache for entities, which are backed by the database), we'd prefer things getting dropped, rather than being blocked.

@ben-manes
Owner

@agoston By any chance, are you running this on Tomcat? In a private stack trace from @levischuckeats, the threads are from Apache Catalina (Tomcat's servlet pool). I know the Tomcat class loader can cause problems, especially when shared classes like the FJP are in the mix. I wonder if it is an interaction caused by the container...

@ben-manes
Owner

Yep, if the system isn't generating writes then we'd all expect the cache to catch up and quickly become idle.

A write cannot be dropped, since those are add/update/remove operations already performed on the hash table. Think of it as a database write-ahead log: the writes are applied in memory, appended to a log, and replayed on the persisted form. In this way the expensive operations (random fsyncs) are batched and asynchronous, helping to scale throughput. The cache is similar, where the locking around the eviction policy is expensive because every read and write is a policy mutation. We can and do drop reads, which is the common case and non-critical state. But writes may affect things, such as requiring an eviction.

@agoston
Author

agoston commented Nov 22, 2017

@ben-manes
Correct, we are running under Tomcat (JBoss, to be even more precise). The thread names already give it away (ajp-* are Tomcat's AJP worker threads).

@ben-manes
Owner

Any chance you have clearReferencesStopThreads enabled? That uses Thread.stop() and I’m wondering if a redeployment causes the lock to be left in a bad state due to halted threads.

Also what version of Tomcat and did you recently upgrade?

@levischuckeats

On our end, we actually kill the Java process when we redeploy, so each time Tomcat starts it is in a fresh JVM.

@agoston
Author

agoston commented Nov 22, 2017

I wonder if we work at the same company, @levischuckeats :) We do the same thing: rolling restarts only, never a redeploy.

@ben-manes
Owner

ben-manes commented Nov 22, 2017

I'm stuck then, because I can't work out the failure conditions.

  • All usages of the lock use the correct lock-then-try/finally-unlock idiom
  • No calls make recursive acquisitions, excluding foreign code
  • The configurations shown do not have foreign code that recurses under the lock
  • Tests against the cache (Stresser, MultiThreadedTest) don't show problems
  • An isolated stress test against the lock shows progress
  • No weird behavior like Thread.stop() to explain it away
  • The failures are independent, so it is not a user problem (e.g. an infinite loop)

The only similarity so far is Tomcat, which can be invasive, but not so much in your usages. I haven't used Tomcat in many years (preferring embedded Jetty), so I'm not good at debugging it anymore. But it also feels like a cop-out to presume it's a weird interaction bug.

I'm back to having to wait until sometime after the holidays when you both can test using ReentrantLock to see if the problem disappears or a similar stack emerges. If you have any better suggestions please let me know.

@ben-manes
Owner

It would also be helpful if you tracked the cache stats. Then we could know for certain that eviction halted. Otherwise, my testing only shows threads backing up while progress continues. If I add a shutdown hook, they all quickly complete normally. I cannot reproduce any hanging.
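
A minimal sketch of what that tracking could look like, assuming recordStats() is enabled; the scheduling and output format are illustrative only. A flat evictionCount() over time, while the cache sits above its maximum size, would confirm that eviction halted.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.CacheStats;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class CacheStatsReporter {
      public static void main(String[] args) {
        Cache<String, Object> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .recordStats()
            .build();

        // Periodically log the eviction progress and hit rate
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
          CacheStats stats = cache.stats();
          System.out.printf("size=%d evictions=%d hitRate=%.2f%n",
              cache.estimatedSize(), stats.evictionCount(), stats.hitRate());
        }, 0, 1, TimeUnit.MINUTES);
      }
    }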

@ben-manes
Owner

I’m not sure how to help, unfortunately. What kernel and JDK versions are you on?

@agoston
Author

agoston commented Jul 16, 2018

Nah, I know there's not much we can do; this is more to keep track of it for other unfortunate souls.
I still believe this is a JVM/kernel bug, not a Caffeine bug.

For us, java version is 1.8.0_161 and the kernel version is 2.6.32-696.23.1.el6.x86_64.

@johnou

johnou commented Jul 16, 2018

@agoston What JVM options are you using? Are you monitoring your heap? Could an OOM have occurred?

@agoston
Author

agoston commented Jul 23, 2018

The JVM options are really quite complex (a CMS GC tuned to the teeth), but I'm assured there was no OOM situation.

@johnou

johnou commented Jul 23, 2018

Can you dump the JVM options here so we can run the Stresser with them?

@BunnyNichalos

BunnyNichalos commented Jan 8, 2021

I still have this problem: Caffeine version 2.8.6, Java version 1.8.0_201, and kernel version Linux 3.10.0-862.14.4.el7.x86_64 x86_64.

"XNIO-1 task-11" #1460 prio=5 os_prio=0 tid=0x00007fd2fc001800 nid=0x5ab waiting on condition [0x00007fd25e53a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c361ca80> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1445)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1366)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2007)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1955)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:130)

@ben-manes
Owner

Sorry, that isn't enough detail. You probably need to send all of the stack traces so we can see the other blocked threads.

@ben-manes
Owner

The afterWrite is a hint, though. It looks like the write buffer exceeded its capacity (128 * NCPUs), which is quite generous. This means a lot of write operations are queued up and not being run. The cache then blocks writers on the eviction lock in order to deprioritize writes and have them help clean up. This indicates some starvation, where the owner of this lock never progresses and everything stays blocked.

Do you have a thread performing a compute? Maybe you have a live-lock on eviction due to recursive computations, so the entry cannot be removed.

@BunnyNichalos

Thank you for your reply. Indeed, my application has a lot of write operations, but they were blocked, and I could not find the owner of this lock. I used jstack to export the thread info:
thread stack.
@ben-manes
Owner

The ForkJoinWorkerThread is blocked waiting on the lock, as are all of the other threads. I don't see any ConcurrentHashMap operations, anything performing work under the lock, etc. My guess is that it is a JVM / Linux / hardware bug like this one, where the unlock signal is never communicated. That is too low-level for me to debug...

Are you using OpenJDK or a custom distribution? If it is a custom one, then you might ask them, since it might be a bug there.

@BunnyNichalos

Thanks. I'm using the Oracle JDK. Here is my JDK version:

java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

@ben-manes
Owner

hmm.. then I really don't know. I'm sorry, but there is not much to go on...

You could try using Caffeine.executor(Runnable::run) to see if that helps. It will perform the maintenance work on the caller threads.
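
A minimal sketch of that configuration (the size and expiry settings are placeholders):

    import java.util.concurrent.TimeUnit;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class SameThreadMaintenance {
      public static void main(String[] args) {
        // Runnable::run is a same-thread executor, so maintenance (eviction,
        // expiration) runs inline on the calling thread instead of being
        // handed off to the ForkJoinPool.
        Cache<String, String> cache = Caffeine.newBuilder()
            .executor(Runnable::run)
            .maximumSize(10_000)
            .expireAfterWrite(1, TimeUnit.DAYS)
            .build();

        cache.put("key", "value");  // any needed cleanup happens here, inline
      }
    }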

@BunnyNichalos

OK. Thank you for your advice, I will give it a try.

@ben-manes
Owner

Is that Linux kernel version old enough to be subject to the lost wake-up bug mentioned in this thread? It looks like it, and if so, you need to upgrade the OS.

@ben-manes
Owner

Kernel 3.10 is from June 2013, and the bug is confirmed in 3.19, which is from February 2015. I don't know much more, but you should try to use a modern distribution.

@BunnyNichalos

BunnyNichalos commented Jan 8, 2021

Yes, I'm using CentOS 7, and its kernel version is Linux 3.10.0, so maybe it is a bug in the OS. I think there must be some case in my code that triggers this bug; in general, ordinary use of ReentrantLock should not trigger it. I will try to reproduce the case. My guess is that synchronizing on a String.intern() result and calling Caffeine's put inside that block triggers this. If I cannot reproduce it, I will try upgrading the OS.

@BunnyNichalos

When I switched to the Guava cache, the problem still existed. It should not be Caffeine's problem.

"DubboServerHandler-10.198.65.150:20880-thread-957" #1243 daemon prio=5 os_prio=0 tid=0x000000000195f800 nid=0x4d0 waiting on condition [0x00007f2e53f6a000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000c2abca68> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2771)
	at com.google.common.cache.LocalCache.put(LocalCache.java:4192)
	at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4876)

All threads are waiting for the lock, but I cannot find any thread holding the lock.

@ben-manes
Owner

Yes, this is what Gerrit also experienced, with either library. I don't know, but it's come up multiple times with different implementations and with nothing in the code to explain it. There is some hardware / OS / JVM bug causing havoc.

@BunnyNichalos

Yes, this is what Gerrit also experienced, with either library. I don't know, but it's come up multiple times with different implementations and with nothing in the code to explain it. There is some hardware / OS / JVM bug causing havoc.

In my case, I think this problem was caused by a StackOverflowError in a thread. When the problem happened, I dumped the thread stack info and the heap info. In the heap I found the following (I'm now using Guava, but the same applies to Caffeine):

(screenshot: heap_dump_lock)

I searched the logs, and this thread had thrown a StackOverflowError because of recursive operations:

2021-04-22 10:02:02.823 ERROR 1 --- [XNIO-1 task-2] com.li.dm.web.aop.LogAspect :

java.lang.StackOverflowError: null
	at com.google.common.cache.LocalCache$Segment.runLockedCleanup(LocalCache.java:3455)
	at com.google.common.cache.LocalCache$Segment.cleanUp(LocalCache.java:3444)
	at com.google.common.cache.LocalCache$Segment.postReadCleanup(LocalCache.java:3422)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2078)
	at com.google.common.cache.LocalCache.getIfPresent(LocalCache.java:3967)
	at com.google.common.cache.LocalCache$LocalManualCache.getIfPresent(LocalCache.java:4853)
	at com.li.dm.service.service.impl.DialogCacheHandler.getKeyword(DialogCacheHandler.java:114)
	at com.li.dm.service.service.impl.KeywordMatcher.findKeywords(KeywordMatcher.java:95)
	at com.li.dm.service.service.impl.KeywordMatcher.matchKeyword(KeywordMatcher.java:61)
	at com.li.dm.service.service.impl.KeywordMatcher.matchKeyword(KeywordMatcher.java:56)

The thread was a worker in a thread pool. So I guess this was caused by the StackOverflowError: because of the error, the thread bailed out without unlocking, so the state of the ReentrantLock stayed at 1 with no chance of being reset to 0. Then other threads that operated on the cache called lock() and blocked.
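
A minimal sketch (illustrative only, not Guava or Caffeine code) of the hazard: unlike a synchronized monitor, a java.util.concurrent lock is not released when the owning thread fails or dies before reaching its unlock, so the lock stays held forever and every later lock() call parks.

    import java.util.concurrent.locks.ReentrantLock;

    public class LostUnlockSketch {
      private static final ReentrantLock lock = new ReentrantLock();

      public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
          lock.lock();
          // If an Error (e.g. StackOverflowError) escapes here, before the
          // try/finally that would have unlocked, unlock() is never called.
          throw new StackOverflowError("simulated failure while holding the lock");
        });
        worker.start();
        worker.join();  // the worker is dead, but the lock state is still 1

        System.out.println("still locked by the dead thread? " + lock.isLocked()); // true
        lock.lock();    // any other thread now parks here forever
      }
    }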

@ben-manes
Owner

Good find! Maybe this will help @kentwang929 who recently had a similar problem.

@kentwang929

Interesting, we also observe StackOverflowErrors in our logs. Good direction to debug, thanks!

@ben-manes
Owner

FYI, I believe JEP 270 should have resolved that, but it is Java 9 onward.

@kentwang929

We are on Java 14. This might be an issue in our own code when the query is too large.
