
Leaking caffeine cache #103

Closed
Spikhalskiy opened this issue Aug 2, 2016 · 7 comments

Comments

@Spikhalskiy

Spikhalskiy commented Aug 2, 2016

Hi Ben,

We have a leak problem with the Caffeine cache in our project; it looks like it can't clean up for an unclear reason.
We initialize the cache this way:

            caffeineCache = Caffeine.newBuilder()
                .expireAfterAccess(expirationInMinutes, TimeUnit.MINUTES)
                .maximumSize(maxSize)
                .build();

We use the cache in two places. In one place it works and respects maxSize absolutely fine; in the other place, from a heap dump, we see that Caffeine retains about 12 million objects with maxSize = 100_000 specified.

Here you can find the relevant part of our heap dump.
[screenshot: heap dump histogram, 2016-08-01]
The difference between the 12273k SStAMS instances and the 11873k Xt3 instances (the real keys/values) is because 400k is the size of the second cache, which behaves fine.

Looking into the heap, we see that 99% of the SStAMS instances have a 'key' pointing to "SSt#DEAD_STRONG_KEY".

Could you please help us troubleshoot this problem?

We use

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

So we could potentially be suffering from a JDK bug. Please confirm whether it looks related.

@ben-manes
Owner

Yes, the JDK bug would be my first guess too. I would also suggest being on my latest release, which tries to be a little more resistant if it detects the problem. I'm sorry to hear you're running into this, and hopefully an upgrade will resolve it.

@Spikhalskiy
Author

We already use the latest, 2.3.1. OK, we will let you know if we have any updates.

@ben-manes
Owner

I've been unable to reproduce a memory leak when experimenting with different configurations using Stresser, v2.3.1 and 1.8.0_92. Locally I'm also printing out the access/write order deque sizes (used for size/expiration ordering), in case there was a dangling reference there. Other than the deques and the hash table there shouldn't be any other reference holders. Since the metrics are correct and the GC stays stable under a write load, I'm hopeful that upgrading will resolve your problems.

@Spikhalskiy
Author

It looks like the reason for this issue is entirely different. We sometimes modified the key after inserting it into the cache. It appears that after that Caffeine can mark the entry as dead and eligible for cleanup, but can't actually clean it up, maybe because it does a lookup at that point, can't find the element, and does nothing. We didn't dig in deeply, because the main issue is clear.
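
To illustrate that mistake, here is a minimal, self-contained sketch (the MutableKey class and its id field are hypothetical, not from the reporter's project): once the key's hashCode/equals change after insertion, neither user lookups nor eviction's internal removal can find the entry.

    // A sketch of the user error described above; MutableKey and "id" are
    // invented for illustration only.
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class MutatedKeyExample {
      static final class MutableKey {
        long id;
        MutableKey(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
          return (o instanceof MutableKey) && ((MutableKey) o).id == id;
        }
      }

      public static void main(String[] args) {
        Cache<MutableKey, String> cache = Caffeine.newBuilder()
            .maximumSize(100_000)
            .build();

        MutableKey key = new MutableKey(1);
        cache.put(key, "value");

        key.id = 2; // the key is mutated while it lives inside the cache

        // Both lookups now miss: the stored entry sits in the bucket for the old
        // hash but no longer equals the original key, while the mutated key
        // hashes to a different bucket. Eviction's internal remove(key) fails
        // for the same reason, so the dead node can linger in the table.
        System.out.println(cache.getIfPresent(new MutableKey(1))); // null
        System.out.println(cache.getIfPresent(key));               // null
      }
    }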

@ben-manes
Owner

That makes perfect sense. I was concerned that the entry was dead but not removed, and forgot about that classic user error. For eviction we know the entry via a linked list and then use the key to remove it from the table. Because the hash map isn't under an exclusive lock, if a remove(key) fails we assume we raced with an explicit user operation (e.g. cache.invalidate(key)). Some hash tables might be able to detect this type of corruption, but at the cost of performance. Since this is usually a novice mistake that the Java Collection Framework doesn't defend against, we don't either.
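
As a rough illustration of the removal pattern described above (a sketch under simplified assumptions, not Caffeine's actual implementation; Node, data, and evict are invented names):

    // Sketch only: eviction finds the node via a linked list, then removes it
    // from the hash table using the key reference stashed at insertion time.
    import java.util.concurrent.ConcurrentHashMap;

    final class EvictionSketch<K, V> {
      static final class Node<K, V> {
        final K key;       // the key reference stashed at insertion time
        volatile V value;
        Node(K key, V value) { this.key = key; this.value = value; }
      }

      final ConcurrentHashMap<K, Node<K, V>> data = new ConcurrentHashMap<>();

      void evict(Node<K, V> node) {
        boolean removed = data.remove(node.key, node);
        if (!removed) {
          // Without an exclusive lock on the map there is no way to tell a
          // benign race (the user already called invalidate(key)) apart from
          // corruption caused by a mutated key, so the failed removal is
          // tolerated rather than treated as an error.
        }
      }
    }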

I've been meaning to look and see if we should null out the value after marking the node as dead. It would have masked the memory leak for a lot longer, as your heavy-weight Xt3 value wouldn't have been retained but the nodes would keep accumulating. Do you think it is friendlier to fail faster due to the leak or slower such that it may not be easily detected?

Closing as the issue is solved, but I am interested in ways to help minimize the pain.

ben-manes added a commit that referenced this issue Aug 3, 2016
In #103 a user broke the map contract by accidentally modifying the
key, causing lookups to not find the entry. This includes eviction,
which finds the entry on a linked list ("Node") and removes it from the
hash table with the stashed key. While the node can be marked as dead,
the removal might fail in practice if racing with an explicit
invalidation.

When a weak/soft value entry is dead we already clear the reference to
reduce the GC burden. In the common case null'ing out the field would
have no benefit, because the GC would trace live objects and not need
the hint. In the above case it reduces the impact of a memory leak by
collecting the heavy user objects. The node instances would still
accumulate and the leak may take longer to fail. But since we do it for
reference caching and it could reduce the likelihood of outages, it
does seem friendlier to null the strong value field.
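
A minimal sketch of the idea in this commit, with invented names rather than Caffeine's internal classes; it only shows why clearing the strong value field limits the damage of a leaked node:

    // Illustrative sketch; StrongNodeSketch and die() are not Caffeine's API.
    final class StrongNodeSketch<V> {
      static final Object DEAD_STRONG_KEY = new Object();

      volatile Object key;   // the user's key, or the dead sentinel
      volatile V value;

      StrongNodeSketch(Object key, V value) { this.key = key; this.value = value; }

      void die() {
        // Weak/soft value nodes already cleared their reference here. Clearing
        // the strong value as well means that even if remove(key) failed and
        // the node leaks, the heavy user value can still be garbage collected.
        key = DEAD_STRONG_KEY;  // seen in the heap dump as "SSt#DEAD_STRONG_KEY"
        value = null;
      }
    }
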
@Spikhalskiy
Author

@ben-manes I think it doesn't make a lot of difference how fast we get an OOM :) It would be much better to fail, or at least print loud logs, when we can't remove a dead entry from the map. With the current design that looks like it could be hard. At the same time, I see an option to add an extra flag like "deleted" to the key and set it even when we don't find the entry in the map (consider it already deleted for now), and then do something if we encounter a "deleted" entry on any later list traversal.

@ben-manes
Owner

The internal lists no longer hold a deleted node; only the hash table does. That means we could currently only detect it on a clear() or an asMap() iteration, which users rarely call and which we have no other need for. So, short of a periodic consistency check, there is no easy hook. Unfortunately, proper eager detection would require embedding directly into the hash table to assert that the entry pointers are null when we mark the node dead. That's a clean solution, but a large undertaking.
