
Multiple Threads BLOCKED at BoundedLocalCache.removeWithWriter #340

Closed

fericit-bostan opened this issue Aug 30, 2019 · 5 comments

@fericit-bostan

We are using version 2.5.2 of Caffeine and are seeing an issue with threads becoming BLOCKED when calling BoundedLocalCache.remove.

We are initializing the cache with an ExecutorService created by Executors.newFixedThreadPool with a size of 2. I don't know whether this has any bearing on the situation, but I thought I'd mention it.

http-STAGING-10-181646-3 #476 daemon prio=5 os_prio=0 tid=0x00007f67b48af800 nid=0x4c92 waiting for monitor entry [0x00007f6031de4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.removeWithWriter(BoundedLocalCache.java:1810)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.remove(BoundedLocalCache.java:1734)
        at com.github.benmanes.caffeine.cache.LocalManualCache.invalidate(LocalManualCache.java:74)

This occurs when load is high on our server with 6000 concurrent users. We are performing a get and are able to locate the object container. We then check the expirationTime of the object and if it is expired we then call remove. This is where the blocking occurs and appears to be related to the BoundedLocalCache.removeWithWriter method.
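Roughly, our access pattern looks like the sketch below; the CachedItem class and its field are simplified placeholders, not our real code.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class ManualExpirationExample {
  // Placeholder value type; our real objects differ but carry their own expiration time.
  static final class CachedItem {
    final long expirationTime;  // epoch millis
    CachedItem(long expirationTime) { this.expirationTime = expirationTime; }
  }

  final Cache<String, CachedItem> cache = Caffeine.newBuilder().maximumSize(10_000).build();

  CachedItem get(String key) {
    CachedItem item = cache.getIfPresent(key);
    if ((item != null) && (item.expirationTime <= System.currentTimeMillis())) {
      // Manual expiration check outside of Caffeine, followed by an explicit removal.
      // cache.invalidate -> LocalManualCache.invalidate -> BoundedLocalCache.remove
      cache.invalidate(key);
      item = null;
    }
    return item;
  }
}

The blocking happens inside BoundedLocalCache.removeWithWriter; the relevant portion in the version we are running (2.5.2) is: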

long now = expirationTicker().read();
RemovalCause[] cause = new RemovalCause[1];

data.computeIfPresent(nodeFactory.newLookupKey(key), (k, n) -> {
  synchronized (n) {
    oldValue[0] = n.getValue();
    if (oldValue[0] == null) {
      cause[0] = RemovalCause.COLLECTED;
    } else if (hasExpired(n, now)) {
      cause[0] = RemovalCause.EXPIRED;
    } else {
      cause[0] = RemovalCause.EXPLICIT;
    }
    writer.delete(castKey, oldValue[0], cause[0]);
    n.retire();
  }
  node[0] = n;
  return null;
});

if (cause[0] != null) {
  afterWrite(new RemovalTask(node[0]), now);
  if (hasRemovalListener()) {
    notifyRemoval(castKey, oldValue[0], cause[0]);
  }
}


The above code reads the current timestamp (now) before entering the synchronized block, which seems problematic if multiple threads are blocked on the same object monitor.

Our theory is this: multiple threads request the same cache key, whose entry has expired, so each thread calls BoundedLocalCache.remove, which in turn calls BoundedLocalCache.removeWithWriter. When the synchronized block is reached, only one thread proceeds while the others wait on the monitor. While this is happening, a new object is loaded into the cache for that key. As each remaining BLOCKED thread eventually acquires the monitor, it removes the object that was just added, because the expiration check uses the timestamp (now) captured before the synchronized block.

I can't imagine that our situation is unique, but this is the only place in our execution path that contains a synchronized block.

Thanks for the assistance.

@ben-manes
Owner

Can you please upgrade to the latest release? In current versions the timestamp is read within the computation, which should address part of your concern. I think this fixes it, but you'd have to monitor to determine whether it shifts the problem to another area to investigate. This appears to have been fixed in b3e5412 for #212 and released in v2.6.1. Please take a look at the code from the current release below.

If it resolves your issue, that would be wonderful. If not, let's gather more information after the upgrade and investigate a fix. The subsequent versions should be backwards compatible, containing only improvements and bug fixes.

We are performing a get and are able to locate the object container. We then check the expirationTime of the object and if it is expired we then call remove. This is where the blocking occurs and appears to be related to the BoundedLocalCache.removeWithWriter method.

Does this mean you are maintaining an external expiration mechanism in addition to expiration support in Caffeine? It seems irrelevant to this problem, but I am curious about your setup in general.

@Nullable V removeWithWriter(Object key) {
  @SuppressWarnings("unchecked")
  K castKey = (K) key;
  @SuppressWarnings({"unchecked", "rawtypes"})
  Node<K, V>[] node = new Node[1];
  @SuppressWarnings("unchecked")
  V[] oldValue = (V[]) new Object[1];
  RemovalCause[] cause = new RemovalCause[1];

  data.computeIfPresent(nodeFactory.newLookupKey(key), (k, n) -> {
    synchronized (n) {
      oldValue[0] = n.getValue();
      if (oldValue[0] == null) {
        cause[0] = RemovalCause.COLLECTED;
      } else if (hasExpired(n, expirationTicker().read())) {
        cause[0] = RemovalCause.EXPIRED;
      } else {
        cause[0] = RemovalCause.EXPLICIT;
      }
      writer.delete(castKey, oldValue[0], cause[0]);
      n.retire();
    }
    node[0] = n;
    return null;
  });

  if (cause[0] != null) {
    afterWrite(new RemovalTask(node[0]));
    if (hasRemovalListener()) {
      notifyRemoval(castKey, oldValue[0], cause[0]);
    }
  }
  return (cause[0] == RemovalCause.EXPLICIT) ? oldValue[0] : null;
}

@fericit-bostan
Author

Thank you for the response and thanks for pointing out the commit that may fix our issue. I'll upgrade to version 2.8.0 and exercise the code.

Yes, the code maintains the object expiration itself; not that I implemented it - I just inherited it. Based on the builder configuration, it does not use Caffeine's expiration functionality. The following is from our Cache Monitor, which initializes Caffeine.

public LocalCacheMonitorImpl(Weigher<K, V> weigher, int maximumWeight, ExecutorService threadPool) {
  this.subscribers = new ConcurrentHashMap<>();
  this.lruCache = Caffeine.<CacheKey<K>, V>newBuilder()
      .weigher((Weigher<CacheKey<K>, V>) (key, value) -> weigher.weigh(key.getKey(), value))
      .maximumWeight(maximumWeight)
      .writer(new NotifySubscribersOnRemoval<>(subscribers))
      .executor(threadPool)
      .build();
}

The implementation seems odd to me, as it does not appear to utilize Caffeine for the cache itself but rather for notifying subscribers. The Cache Monitor is invoked for all operations but doesn't appear to return any of the values retrieved from Caffeine. There is a consumer of the Cache Monitor which actually performs the caching and retrieval of objects. (Very odd)
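For reference, the writer registered above implements Caffeine's CacheWriter interface. A stripped-down sketch of what NotifySubscribersOnRemoval presumably does is below; the Subscriber type and the notification logic are my simplification, not the real class.

import java.util.Map;
import com.github.benmanes.caffeine.cache.CacheWriter;
import com.github.benmanes.caffeine.cache.RemovalCause;

// Simplified sketch: fan out removal events to subscribers. The real class differs in detail.
final class NotifySubscribersOnRemoval<K, V> implements CacheWriter<K, V> {
  interface Subscriber<K, V> {  // placeholder subscriber contract
    void onRemoval(K key, V value, RemovalCause cause);
  }

  private final Map<String, Subscriber<K, V>> subscribers;

  NotifySubscribersOnRemoval(Map<String, Subscriber<K, V>> subscribers) {
    this.subscribers = subscribers;
  }

  @Override public void write(K key, V value) {
    // No-op on writes in this sketch; only removals are broadcast.
  }

  @Override public void delete(K key, V value, RemovalCause cause) {
    // Called synchronously by the cache while it holds the entry's monitor,
    // e.g. from the removeWithWriter code quoted earlier in this thread.
    subscribers.values().forEach(s -> s.onRemoval(key, value, cause));
  }
}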

Thanks again for the assistance on this issue.

@ben-manes
Owner

That is a bit odd. If you do not use the built-in expiration then that stale timestamp wouldn't matter, as the cache never expires an entry.

However, you did say earlier that the cache has just 2 entries. You might be running into #218 (fixed in 2.6.2), where I incorrectly set the initial capacity to 0. If you read that issue, you'll see that ConcurrentHashMap serializes all writes through a single lock until it has resized to a larger table. This could be causing all of your writes to block and be slow. The new default of 16 may still not be ideal for you, so you could set initialCapacity to ensure more locks are available in the underlying map, as in the sketch below.
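For example (the capacity value here is only a placeholder to show the knob, not a recommendation):

this.lruCache = Caffeine.<CacheKey<K>, V>newBuilder()
    .initialCapacity(1_024)  // placeholder; sizes ConcurrentHashMap so it starts with more bins/locks
    .weigher((Weigher<CacheKey<K>, V>) (key, value) -> weigher.weigh(key.getKey(), value))
    .maximumWeight(maximumWeight)
    .writer(new NotifySubscribersOnRemoval<>(subscribers))
    .executor(threadPool)
    .build();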

I would recommend using the cache's expiration logic (fixed or variable) rather than manually removing the stale entry. By default the removal occurs lazily, when other activity on the cache triggers maintenance. If you have promptness requirements, such as needing notifications even when there is no cache activity, then you can use Caffeine.scheduler to provide a background scheduling thread (e.g. Scheduler.systemScheduler()).
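A minimal sketch of the fixed-duration variant with a scheduler for prompt removal (the duration is a placeholder; the scheduler method requires 2.8.0+, and Scheduler.systemScheduler() needs Java 9+ or it degrades to a disabled scheduler):

import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

class PromptExpirationExample {
  // Fixed expiration with prompt removal; the 10-minute duration is a placeholder.
  final Cache<String, String> cache = Caffeine.newBuilder()
      .expireAfterWrite(10, TimeUnit.MINUTES)
      .scheduler(Scheduler.systemScheduler())
      .build();
}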

A slight nit: you can remove that cast and let the cache builder infer the types, e.g.

this.lruCache = Caffeine.newBuilder()
    .weigher((CacheKey<K> key, V value) -> weigher.weigh(key.getKey(), value))

It's also good to keep reasonably up to date on dependency versions. If you use Gradle, you might find my gradle-versions-plugin helpful; Maven has this built in, and other build tools have similar features. Just a good practice.

@fericit-bostan
Author

I think the reason for not using the cache's expiration logic is that each cache item can have a different expiration value. However, I was just reviewing the Caffeine docs and see that it does indeed support a variable expiration policy. I think this would work for us, but it would require some changes on our end to define a common interface for all objects placed into the cache. (more work to do)
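Something like the sketch below is what I have in mind; the Expirable interface and its millisecond accessor are assumptions about how we might retrofit our objects, not existing code.

import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;

class VariableExpirationSketch {
  // Hypothetical common interface our cached objects would implement.
  interface Expirable {
    long remainingMillis();  // how long this particular item should stay cached
  }

  final Cache<String, Expirable> cache = Caffeine.newBuilder()
      .expireAfter(new Expiry<String, Expirable>() {
        @Override public long expireAfterCreate(String key, Expirable value, long currentTime) {
          return TimeUnit.MILLISECONDS.toNanos(value.remainingMillis());
        }
        @Override public long expireAfterUpdate(String key, Expirable value, long currentTime,
            long currentDuration) {
          return TimeUnit.MILLISECONDS.toNanos(value.remainingMillis());
        }
        @Override public long expireAfterRead(String key, Expirable value, long currentTime,
            long currentDuration) {
          return currentDuration;  // reads do not extend the lifetime in this sketch
        }
      })
      .build();
}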

The cache contains many more entries than 2; it is the thread pool size that was set to 2.

Thanks for the information on the scheduler. That does look interesting and may allow us to evict items in a non-lazy fashion. I'll investigate it further.

We are currently using Maven, so the dependencies should be pulled in properly. It appears that version 2.5.2 was the latest version when it was added to the project, but it has never been updated since. (Surprise!) I'll need to add it to our list of libraries that should be maintained on a regular basis.

Thank you again for all of your assistance.

@ben-manes
Owner

In addition to evaluating Expiry, there are some ad hoc methods available via cache.policy(). That gives us an escape hatch / dumping ground for methods like put(key, value, duration) that only make sense in an expirable cache. We add more specialized methods based on requests, so while not preferred, it is there so users are not blocked from getting their job done.
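Roughly, such a call goes through the optional variable-expiration policy; a sketch (key, value, and duration here are placeholders):

import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;

class PolicyEscapeHatchExample {
  // Sketch only: expireVariably() is present only when the cache was built with expireAfter(Expiry).
  static void putWithDuration(Cache<String, String> cache, String key, String value) {
    cache.policy().expireVariably()
        .ifPresent(policy -> policy.put(key, value, 30, TimeUnit.MINUTES));
  }
}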

For Maven, the versions:display-dependency-updates goal is incredibly helpful. It's useful to run periodically or, if no one normally pays attention, to set up a CI job that emails the report once a quarter.

I'm going to close this for now, but feel free to re-open or create new issues. I'm hoping that upgrading, gaining more familiarity with the code, and some refactoring will resolve this naturally. If I can be of any further help, please do ping me.
