
Leaking caffeine cache #103

Closed
Spikhalskiy opened this issue Aug 2, 2016 · 7 comments

Comments

@Spikhalskiy

Spikhalskiy commented Aug 2, 2016

Hi Ben,

We have a leak problem with the Caffeine cache in our project; it looks like it can't clean up for an unclear reason.
We initialize the cache this way:

            caffeineCache = Caffeine.newBuilder()
                .expireAfterAccess(expirationInMinutes, TimeUnit.MINUTES)
                .maximumSize(maxSize)
                .build();

We use the cache in two places. In one place it works and respects maxSize absolutely fine; in the other place, from a heap dump, we see that Caffeine retains about 12 million objects with maxSize = 100_000 specified.

Here you can find the relevant part of our heap dump.
[screenshot: heap dump histogram, 2016-08-01]
The difference between the 12273k SStAMS instances and the 11873k Xt3 instances (the real keys/values) is because 400k is the size of the second cache, which behaves fine.

Looking into the heap, we see that 99% of the SStAMS instances have a 'key' pointing to "SSt#DEAD_STRONG_KEY".

Could you please help us troubleshoot this problem?

We use

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

So we could potentially be suffering from a JDK bug. Please confirm whether it looks related.

@ben-manes
Owner

Yes, the JDK bug would be my first guess too. I would also suggest being on my latest release, which tries to be a little more resistant if it detects the problem. I'm sorry to hear you're running into this, and hopefully an upgrade will resolve it.

@Spikhalskiy
Author

We already use the latest, 2.3.1. OK, we will let you know if we have any updates.

@ben-manes
Owner

I've been unable to reproduce a memory leak when experimenting with different configurations using Stresser, v2.3.1 and 1.8.0_92. Locally I'm also printing out the access/write order deque sizes (used for size/expiration ordering), in case there was a dangling reference there. Other than the deques and the hash table there shouldn't be any other reference holders. Since the metrics are correct and the GC stays stable under a write load, I'm hopeful that upgrading will resolve your problems.

@Spikhalskiy
Author

It looks like the reason for this issue is entirely different. We sometimes modified the key after inserting it into the cache. It appears that after that Caffeine can mark the entry as dead and eligible for cleanup, but can't actually clean it up, maybe because it does a lookup at that point, can't find the element, and does nothing. We didn't dig in deeply, because the main issue is clear.
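
To illustrate that mistake, here is a minimal, self-contained sketch (the MutableKey class and its id field are hypothetical, not from the reporter's project): once the key's hashCode/equals change after insertion, neither user lookups nor eviction's internal removal can find the entry.

    // A sketch of the user error described above; MutableKey and "id" are
    // invented for illustration only.
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public class MutatedKeyExample {
      static final class MutableKey {
        long id;
        MutableKey(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
          return (o instanceof MutableKey) && ((MutableKey) o).id == id;
        }
      }

      public static void main(String[] args) {
        Cache<MutableKey, String> cache = Caffeine.newBuilder()
            .maximumSize(100_000)
            .build();

        MutableKey key = new MutableKey(1);
        cache.put(key, "value");

        key.id = 2; // the key is mutated while it lives inside the cache

        // Both lookups now miss: the stored entry sits in the bucket for the old
        // hash but no longer equals the original key, while the mutated key
        // hashes to a different bucket. Eviction's internal remove(key) fails
        // for the same reason, so the dead node can linger in the table.
        System.out.println(cache.getIfPresent(new MutableKey(1))); // null
        System.out.println(cache.getIfPresent(key));               // null
      }
    }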

@ben-manes
Owner

That makes perfect sense. I was concerned that the entry was dead but not removed, and forgot about that classic user error. For eviction we know the entry via a linked list and then use the key to remove it from the table. Because the hash map isn't under an exclusive lock, if a remove(key) fails we assume we raced with an explicit user operation (e.g. cache.invalidate(key)). Some hash tables might be able to detect this type of corruption, but at the cost of performance. Since this is usually a novice mistake that the Java Collection Framework doesn't defend against, we don't either.
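
As a rough illustration of the removal pattern described above (a sketch under simplified assumptions, not Caffeine's actual implementation; Node, data, and evict are invented names):

    // Sketch only: eviction finds the node via a linked list, then removes it
    // from the hash table using the key reference stashed at insertion time.
    import java.util.concurrent.ConcurrentHashMap;

    final class EvictionSketch<K, V> {
      static final class Node<K, V> {
        final K key;       // the key reference stashed at insertion time
        volatile V value;
        Node(K key, V value) { this.key = key; this.value = value; }
      }

      final ConcurrentHashMap<K, Node<K, V>> data = new ConcurrentHashMap<>();

      void evict(Node<K, V> node) {
        boolean removed = data.remove(node.key, node);
        if (!removed) {
          // Without an exclusive lock on the map there is no way to tell a
          // benign race (the user already called invalidate(key)) apart from
          // corruption caused by a mutated key, so the failed removal is
          // tolerated rather than treated as an error.
        }
      }
    }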

I've been meaning to look and see if we should null out the value after marking the node as dead. It would have masked the memory leak for a lot longer, as your heavy-weight Xt3 value wouldn't have been retained but the nodes would keep accumulating. Do you think it is friendlier to fail faster due to the leak or slower such that it may not be easily detected?

Closing as the issue is solved, but I am interested in ways to help minimize the pain.

ben-manes added a commit that referenced this issue Aug 3, 2016
In #103 a user broke the map contract by accidentally modifying the
key, causing lookups to not find the entry. This includes eviction,
which finds the entry on a linked list ("Node") and removes it from the
hash table with the stashed key. While the node can be marked as dead,
the removal might fail in practice if racing with an explicit
invalidation.

When a weak/soft value entry is dead we already clear the reference to
reduce the GC burden. In the common case null'ing out the field would
have no benefit, because the GC would trace live objects and not need
the hint. In the above case it reduces the impact of a memory leak by
collecting the heavy user objects. The node instances would still
accumulate and the leak may take longer to fail. But since we do it for
reference caching and it could reduce the likelihood of outages, it
does seem friendlier to null the strong value field.
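
A minimal sketch of the idea in this commit, with invented names rather than Caffeine's internal classes; it only shows why clearing the strong value field limits the damage of a leaked node:

    // Illustrative sketch; StrongNodeSketch and die() are not Caffeine's API.
    final class StrongNodeSketch<V> {
      static final Object DEAD_STRONG_KEY = new Object();

      volatile Object key;   // the user's key, or the dead sentinel
      volatile V value;

      StrongNodeSketch(Object key, V value) { this.key = key; this.value = value; }

      void die() {
        // Weak/soft value nodes already cleared their reference here. Clearing
        // the strong value as well means that even if remove(key) failed and
        // the node leaks, the heavy user value can still be garbage collected.
        key = DEAD_STRONG_KEY;  // seen in the heap dump as "SSt#DEAD_STRONG_KEY"
        value = null;
      }
    }
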
@Spikhalskiy
Author

@ben-manes I think it doesn't make a lot of difference how fast we get an OOM :) It would be much better to fail, or at least print loud logs, when we can't remove a dead entry from the map. With the current design that looks like it could be hard. At the same time, I see an option to add an extra flag like "deleted" to the key and set it even when we don't find the entry in the map (consider it already deleted for now), and then do something if we encounter a "deleted" entry on any later list traversal.

@ben-manes
Owner

The internal lists no longer hold a deleted node; only the hash table does. That means we could currently only detect it on a clear() or an asMap() iteration, which users rarely call and which we have no other need for. So, short of a periodic consistency check, there is no easy hook. Unfortunately, proper eager detection would require embedding directly into the hash table to assert that the entry pointers are null when we mark the node dead. That's a clean solution, but a large undertaking.
