
:caffeine:stress throws OOM on v2.9.3 #694

Closed · bscSCORM opened this issue Mar 25, 2022 · 8 comments
@bscSCORM

➜  caffeine git:(05a040c2) ./gradlew :caffeine:stress
checksum-dependency-plugin is disabled since checksumIgnore property is present

> Task :caffeine:stress
---------- 00:00 ----------
Pending reads: 0; writes: 0
Drain status = Idle (0)
Evictions = 1,044,480
Size = 4,096 (max: 4,096)
Lock = [Unlocked]
Pending tasks = 0
Max Memory = 7,635,730,432 bytes
Free Memory = 554,430,384 bytes
Allocated Memory = 649,592,832 bytes


.... snipped similar blocks


---------- 00:02:21 ----------
Pending reads: 1; writes: 1,105
Drain status = Processing -> Required (3)
Evictions = 33,995,088
Size = 6,122 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-7]
Pending tasks = 32,148,931
Max Memory = 7,635,730,432 bytes
Free Memory = 957,449,576 bytes
Allocated Memory = 7,635,730,432 bytes

---------- 00:02:36 ----------
Pending reads: 0; writes: 855
Drain status = Processing -> Required (3)
Evictions = 34,006,320
Size = 5,832 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-10]
Pending tasks = 32,165,066
Max Memory = 7,635,730,432 bytes
Free Memory = 958,969,640 bytes
Allocated Memory = 7,635,730,432 bytes

---------- 00:03:16 ----------
Pending reads: 1; writes: 56
Drain status = Processing -> Required (3)
Evictions = 34,022,425
Size = 4,311 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-6]
Pending tasks = 32,187,931
Max Memory = 7,635,730,432 bytes
Free Memory = 955,241,008 bytes
Allocated Memory = 7,635,730,432 bytes

Exception in thread "pool-2-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.github.benmanes.caffeine.cache.LocalLoadingCache$$Lambda$11/214200852.get$Lambda(Unknown Source)
        at java.lang.invoke.LambdaForm$DMH/13348746.invokeStatic_LJL3_L(LambdaForm$DMH)

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-2-thread-1"
Exception in thread "main" NG [21m 54s]
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment Corretto-8.322.06.1 (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM Corretto-8.322.06.1 (build 25.322-b06, mixed mode)

On macOS 11.6.4 (20G417)

@bscSCORM (Author)

Of course I'm not primarily interested in getting the stress tool to run without error; rather, in an attempted first deployment to production using Caffeine, we're seeing many threads stuck here:

java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1461)
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1382)
com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2481)
com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
....

From reading some other discussions, it does seem relevant that our loader involves a DB hit and is therefore relatively slow, so we should try increasing the initial capacity and/or using AsyncCache. I was having trouble reproducing this particular problem locally, though, so I thought the stress tool might help determine whether there were relevant differences between production and dev (besides the load applied), and instead saw this behavior, which would seem to indicate a problem.
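A minimal sketch of the direction mentioned above (AsyncCache plus a larger initial capacity), assuming a slow, DB-backed loader; loadFromDatabase, the key/value types, and the sizes are placeholders rather than details from this issue:

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.CompletableFuture;

class SlowLoaderCache {
  // Placeholder for the slow, DB-backed loader described above.
  static String loadFromDatabase(String key) {
    return "value-for-" + key;
  }

  static void example() {
    AsyncLoadingCache<String, String> cache = Caffeine.newBuilder()
        .initialCapacity(10_000)   // placeholder; pre-sizing reduces resizing under heavy write load
        .maximumSize(100_000)      // placeholder bound
        .buildAsync(key -> loadFromDatabase(key));

    // The slow load runs asynchronously and callers receive a future, so the
    // computation does not block other map operations while it is in flight.
    CompletableFuture<String> value = cache.get("some-key");
  }
}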

@ben-manes (Owner)

I think this is a side effect of working around JDK-8274349. It looks like writes are being processed, but a pileup of tasks occurs as we try to compensate. I'm not sure what we can do, as we try to avoid duplicate submissions but also have to protect against an executor that ignored our task. Under stress, I think this leads to a buildup that wouldn't occur naturally.
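As a purely hypothetical illustration (not Caffeine's actual workaround) of the kind of defensive choice a cache might make when the common pool cannot be trusted to run a submitted task, a probe-and-fallback could look like:

import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

final class ExecutorChoice {
  // If the common pool reports no usable parallelism, submitted maintenance tasks
  // may never run, so fall back to executing on the calling thread instead.
  static Executor maintenanceExecutor() {
    return (ForkJoinPool.getCommonPoolParallelism() > 1)
        ? ForkJoinPool.commonPool()
        : Runnable::run;
  }
}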

@bscSCORM (Author)

To make sure I have what you're saying right -- it's expected that the stress test will fail in this way due to the workaround for JDK-8274349. The workaround, not the issue itself -- note that the CPU in question here is a 2.3 GHz 8-core Intel Core i9, so I'd expect availableProcessors to be 8.

And the stress tool creates such an abnormal level of stress it's not relevant to the production issue, where AsyncCache and/or a larger initial capacity are still likely to work.

@ben-manes (Owner)

I believe so, but I would have to investigate. There have been multiple bugs over the years where that thread pool has dropped or ignored tasks. For example JDK-8078490 was a multi-core race condition that bit us.

Currently we schedule the work based on a state machine flag that avoids duplicates. However, once reset by any thread it allows a new task submission. Assuming the work is fast, a few duplicates would be flushed quickly. Since we've improved our fallback handling to compensate, maybe we should make this stricter again to only allow a single in-flight task. If that task is never run then we are degraded to always relying on our fallback, but the system is already likely in a problematic state regardless, as its own work might have been ignored. So stricter task scheduling logic is likely the right fix for the stress case to avoid flooding, while also having writers assist when the pool is broken.
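A minimal sketch of my reading of the above (an illustration, not Caffeine's code): a CAS flag that permits only a single in-flight maintenance task, so a slow or broken executor cannot accumulate a backlog of duplicate submissions.

import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

final class MaintenanceScheduler {
  private final AtomicBoolean inFlight = new AtomicBoolean();

  void schedule(Executor executor, Runnable maintenance) {
    // Permit at most one outstanding task, rather than resubmitting whenever the
    // state flag is reset.
    if (inFlight.compareAndSet(false, true)) {
      executor.execute(() -> {
        try {
          maintenance.run();
        } finally {
          inFlight.set(false);  // allow the next submission once this one has run
        }
      });
    }
    // Otherwise a task is already pending; as described above, writers would then
    // assist by performing the maintenance themselves if the pool never runs it.
  }
}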

@ben-manes (Owner)

For your production use-case, please try AsyncCache. That should be unrelated to this issue and sounds more like the case we added a warning for recently.

@ben-manes (Owner)

I haven't had a chance to look into this yet, but one possible reason is refresh tasks. In that commit, the utility defaults to performing refresh operations (merely the last touch). Unfortunately, in 2.x this may dispatch a new CompletableFuture per call to LoadingCache.refresh(key), and the detection of duplicates was not robust. In that version it was primarily enough to ensure that implicit reloads from refreshAfterWrite were not duplicated. That was a pretty big change, so the improvements went into 3.x in case of compatibility differences. In that rewrite we coordinate around a dedicated map of in-flight futures, so subsequent calls to refresh(key) will no-op like Guava does. This could explain why the ForkJoinPool was overwhelmed, and it would be distinct from our cleanup task.

private static final Operation operation = Operation.REFRESH;
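A minimal sketch of the 3.x-style idea described above, coordinating around a map of in-flight futures so repeated refresh(key) calls for the same key reuse one reload instead of dispatching a new CompletableFuture each time (an illustration, not Caffeine's implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class RefreshDeduper<K, V> {
  private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

  CompletableFuture<V> refresh(K key, Function<K, V> loader) {
    CompletableFuture<V> future = new CompletableFuture<>();
    CompletableFuture<V> existing = inFlight.putIfAbsent(key, future);
    if (existing != null) {
      return existing;  // a reload for this key is already in flight; no-op and reuse it
    }
    CompletableFuture.supplyAsync(() -> loader.apply(key))  // reload on a placeholder executor
        .whenComplete((value, error) -> {
          inFlight.remove(key, future);  // permit a later refresh once this one settles
          if (error != null) {
            future.completeExceptionally(error);
          } else {
            future.complete(value);
          }
        });
    return future;
  }
}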

@ben-manes (Owner)

Confirmed. This is due to the naive LoadingCache.refresh(key) implementation in 2.x. That is expected, as 3.x does not exhibit this runaway growth. It correctly bounds to the number of distinct keys (1 million).

The drains from reads or writes do not show this problem in either version. That could be made stricter, as some duplicate tasks are scheduled, but only up to 3-4 on my machine.

I should clean up this utility to be nicer, using picocli and parameter flags, so it is clearer what operations it is running, etc.

Thanks for the heads up on this, glad it is a non-issue.

@ben-manes (Owner)

On master, the stress tester is now parameterized so it can be run using:

./gradlew :caffeine:stress --workload=read

I'm sure the tool could be made better, but hopefully it won't be printing confusing details because of a hard-coded default.
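Presumably the other workload values mirror the operations the Stresser previously exercised; for example (these flag values are my guess and are not confirmed in this thread):

./gradlew :caffeine:stress --workload=write
./gradlew :caffeine:stress --workload=refresh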
