
:caffeine:stress throws OOM on v2.9.3 #694

Closed · bscSCORM opened this issue Mar 25, 2022 · 8 comments
@bscSCORM

➜  caffeine git:(05a040c2) ./gradlew :caffeine:stress
checksum-dependency-plugin is disabled since checksumIgnore property is present

> Task :caffeine:stress
---------- 00:00 ----------
Pending reads: 0; writes: 0
Drain status = Idle (0)
Evictions = 1,044,480
Size = 4,096 (max: 4,096)
Lock = [Unlocked]
Pending tasks = 0
Max Memory = 7,635,730,432 bytes
Free Memory = 554,430,384 bytes
Allocated Memory = 649,592,832 bytes


.... snipped similar blocks


---------- 00:02:21 ----------
Pending reads: 1; writes: 1,105
Drain status = Processing -> Required (3)
Evictions = 33,995,088
Size = 6,122 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-7]
Pending tasks = 32,148,931
Max Memory = 7,635,730,432 bytes
Free Memory = 957,449,576 bytes
Allocated Memory = 7,635,730,432 bytes

---------- 00:02:36 ----------
Pending reads: 0; writes: 855
Drain status = Processing -> Required (3)
Evictions = 34,006,320
Size = 5,832 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-10]
Pending tasks = 32,165,066
Max Memory = 7,635,730,432 bytes
Free Memory = 958,969,640 bytes
Allocated Memory = 7,635,730,432 bytes

---------- 00:03:16 ----------
Pending reads: 1; writes: 56
Drain status = Processing -> Required (3)
Evictions = 34,022,425
Size = 4,311 (max: 4,096)
Lock = [Locked by thread ForkJoinPool.commonPool-worker-6]
Pending tasks = 32,187,931
Max Memory = 7,635,730,432 bytes
Free Memory = 955,241,008 bytes
Allocated Memory = 7,635,730,432 bytes

Exception in thread "pool-2-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.github.benmanes.caffeine.cache.LocalLoadingCache$$Lambda$11/214200852.get$Lambda(Unknown Source)
        at java.lang.invoke.LambdaForm$DMH/13348746.invokeStatic_LJL3_L(LambdaForm$DMH)

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-2-thread-1"
Exception in thread "main" NG [21m 54s]
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment Corretto-8.322.06.1 (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM Corretto-8.322.06.1 (build 25.322-b06, mixed mode)

On macOS 11.6.4 (20G417)

@bscSCORM (Author)

Of course I'm not primarily interested in getting the stress tool to run without error; rather, in an attempted first deployment to production using Caffeine, we're seeing many threads stuck here:

java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1461)
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1382)
com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2481)
com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2394)
com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
....

From reading some other discussions, it does seem relevant that our loader involves a DB hit and is therefore relatively slow, so we should try increasing the initial capacity and/or using AsyncCache. I was having trouble reproducing this particular problem locally, though, so I thought the stress tool might help determine whether there were relevant differences between production and dev (besides the load applied), and instead saw this behavior, which would seem to indicate a problem.
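A minimal sketch of the direction mentioned above (AsyncCache plus a larger initial capacity), assuming a slow, DB-backed loader; loadFromDatabase, the key/value types, and the sizes are placeholders rather than details from this issue:

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.CompletableFuture;

class SlowLoaderCache {
  // Placeholder for the slow, DB-backed loader described above.
  static String loadFromDatabase(String key) {
    return "value-for-" + key;
  }

  static void example() {
    AsyncLoadingCache<String, String> cache = Caffeine.newBuilder()
        .initialCapacity(10_000)   // placeholder; pre-sizing reduces resizing under heavy write load
        .maximumSize(100_000)      // placeholder bound
        .buildAsync(key -> loadFromDatabase(key));

    // The slow load runs asynchronously and callers receive a future, so the
    // computation does not block other map operations while it is in flight.
    CompletableFuture<String> value = cache.get("some-key");
  }
}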

@ben-manes (Owner)

I think this is a side effect of working around JDK-8274349. It looks like writes are being processed, but a pileup of tasks occurs as we try to compensate. I'm not sure what we can do, as we try to avoid duplicate submissions but also have to protect against an executor that ignored our task. Under stress, I think this leads to a buildup that wouldn't occur naturally.
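As a purely hypothetical illustration (not Caffeine's actual workaround) of the kind of defensive choice a cache might make when the common pool cannot be trusted to run a submitted task, a probe-and-fallback could look like:

import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

final class ExecutorChoice {
  // If the common pool reports no usable parallelism, submitted maintenance tasks
  // may never run, so fall back to executing on the calling thread instead.
  static Executor maintenanceExecutor() {
    return (ForkJoinPool.getCommonPoolParallelism() > 1)
        ? ForkJoinPool.commonPool()
        : Runnable::run;
  }
}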

@bscSCORM (Author)

To make sure I have what you're saying right -- it's expected that the stress test will fail in this way due to the workaround for JDK-8274349. The workaround, not the issue itself -- note that the CPU in question here is a 2.3 GHz 8-core Intel Core i9, so I'd expect availableProcessors to be 8.

And the stress tool creates such an abnormal level of stress it's not relevant to the production issue, where AsyncCache and/or a larger initial capacity are still likely to work.

@ben-manes (Owner)

I believe so, but I would have to investigate. There have been multiple bugs over the years where that thread pool has dropped or ignored tasks. For example JDK-8078490 was a multi-core race condition that bit us.

Currently we schedule the work based on a state machine flag that avoids duplicates. However, once reset by any thread it allows a new task submission. Assuming the work is fast, a few duplicates would be flushed quickly. Since we've improved our fallback handling to compensate, maybe we should make this stricter again to only allow a single in-flight task. If that task is never run then we are degraded to always relying on our fallback, but the system is already likely in a problematic state regardless, as its own work might have been ignored. So stricter task scheduling logic is likely the right fix for the stress case to avoid flooding, while also having writers assist when the pool is broken.
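A minimal sketch of my reading of the above (an illustration, not Caffeine's code): a CAS flag that permits only a single in-flight maintenance task, so a slow or broken executor cannot accumulate a backlog of duplicate submissions.

import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

final class MaintenanceScheduler {
  private final AtomicBoolean inFlight = new AtomicBoolean();

  void schedule(Executor executor, Runnable maintenance) {
    // Permit at most one outstanding task, rather than resubmitting whenever the
    // state flag is reset.
    if (inFlight.compareAndSet(false, true)) {
      executor.execute(() -> {
        try {
          maintenance.run();
        } finally {
          inFlight.set(false);  // allow the next submission once this one has run
        }
      });
    }
    // Otherwise a task is already pending; as described above, writers would then
    // assist by performing the maintenance themselves if the pool never runs it.
  }
}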

@ben-manes (Owner)

For your production use-case, please try AsyncCache. That should be unrelated to this issue and sounds more like the case we added a warning for recently.

@ben-manes (Owner)

I haven't had a chance to look into this yet, but one possible reason is refresh tasks. In that commit, the utility defaults to performing refresh operations (merely the last touch). Unfortunately, in 2.x this may dispatch a new CompletableFuture per call to LoadingCache.refresh(key), and the detection of duplicates was not robust. In that version it was primarily enough to ensure that implicit reloads from refreshAfterWrite were not duplicated. That was a pretty big change, so the improvements went into 3.x in case of compatibility differences. In that rewrite we coordinate around a dedicated map of in-flight futures, so subsequent calls to refresh(key) will no-op like Guava does. This could explain why the ForkJoinPool was overwhelmed, and it would be distinct from our cleanup task.

private static final Operation operation = Operation.REFRESH;
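A minimal sketch of the 3.x-style idea described above, coordinating around a map of in-flight futures so repeated refresh(key) calls for the same key reuse one reload instead of dispatching a new CompletableFuture each time (an illustration, not Caffeine's implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class RefreshDeduper<K, V> {
  private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

  CompletableFuture<V> refresh(K key, Function<K, V> loader) {
    CompletableFuture<V> future = new CompletableFuture<>();
    CompletableFuture<V> existing = inFlight.putIfAbsent(key, future);
    if (existing != null) {
      return existing;  // a reload for this key is already in flight; no-op and reuse it
    }
    CompletableFuture.supplyAsync(() -> loader.apply(key))  // reload on a placeholder executor
        .whenComplete((value, error) -> {
          inFlight.remove(key, future);  // permit a later refresh once this one settles
          if (error != null) {
            future.completeExceptionally(error);
          } else {
            future.complete(value);
          }
        });
    return future;
  }
}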

@ben-manes (Owner)

Confirmed. This is due to the naive LoadingCache.refresh(key) implementation in 2.x. That is expected, as 3.x does not exhibit this runaway growth. It correctly bounds to the number of distinct keys (1 million).

The drains from reads or writes do not show this problem in either version. That could be made stricter, as some duplicate tasks are scheduled, but only up to 3-4 on my machine.

I should clean up this utility to be nicer, using picocli and parameter flags, so it is clearer what operations it is running, etc.

Thanks for the heads up on this, glad it is a non-issue.

@ben-manes (Owner)

On master, the stress tester is now parameterized so it can be run using:

./gradlew :caffeine:stress --workload=read

I'm sure the tool could be made better, but hopefully it won't be printing confusing details because of a hard-coded default.
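Presumably the other workload values mirror the operations the Stresser previously exercised; for example (these flag values are my guess and are not confirmed in this thread):

./gradlew :caffeine:stress --workload=write
./gradlew :caffeine:stress --workload=refresh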
