Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread lock in LocalCache.lockedGetOrLoad #3602

Closed
paladox opened this issue Sep 11, 2019 · 25 comments
Closed

Thread lock in LocalCache.lockedGetOrLoad #3602

paladox opened this issue Sep 11, 2019 · 25 comments
Labels
P4 no SLO package=cache status=triaged type=other Miscellaneous activities not covered by other type= labels

Comments

@paladox
Copy link

paladox commented Sep 11, 2019

Hi, Wikimedia use gerrit 2.15.14. Since earlier this year we have experenced alot of thread locks on LocalCache.lockedGetOrLoad which to resolve we have to restart gerrit.

We use version 24.1.1-jre.

https://github.com/GerritCodeReview/gerrit/blob/stable-2.15/lib/guava.bzl#L1

We are not the only one to experence this issue (they tried replacing guava with caffeine but had to revert due to other problems) (https://bugs.chromium.org/p/gerrit/issues/detail?id=7645#c17).

Our task downstream https://phabricator.wikimedia.org/T224448.

Thread dump https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDkvMTEvLS1nZXJyaXQtanN0YWNrLnR4dC0tMTEtNTEtNDI=

I'm not sure if a newer version fixes this.

Also i'm not sure what other information you would like us to collect?

@hashar
Copy link

hashar commented Sep 11, 2019

We notably have the issue with threads handling http requests and the ones emitting emails to wait for a ReentrantLock of some sort.

I could recheck from all the thread dumps we have, but the last methods called from Guava is always com.google.common.cache.LocalCache$Segment.lockedGetOrLoad.

"HTTP-12344" #12344 prio=5 os_prio=0 tid=0x00007fdd0c269000 nid=0x7a8f waiting on condition [0x00007fdcd8f88000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000056eb8d848> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
    at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
...

We obviously haven't managed to reproduce the issue. It occurs roughly once per week, but there is no firm frequencies, that happened 3 times today for now good reason we can tell.

I do not know anything about java, but could it be that a lock is not released? For example if a thread is abruptly aborted?

@cpovirk
Copy link
Member

cpovirk commented Sep 11, 2019

they tried replacing guava with caffeine but had to revert due to other problems

For reference: GerritCodeReview/gerrit@b4cb044

@ben-manes
Copy link
Contributor

I believe the right answer, regardless of the caching library, is to not rely on the cache's internal locking. Gerrit's caches have circular dependencies which can become problematic, e.g. ConcurrentHashMap does not support this in its computations (may livelock or throw an exception).

It was unclear whether this problem is due to actual application deadlock due to circular dependencies, or was due to too coarse grained locking / bugs in the cache implementation. If they have an application induced deadlock (A req B, C req A, B req C) this has to be fixed within their logic. The cache cannot do anything but try to detect it if possible.

If the resource are loaded in a safe order, then there are two options that they might consider.

Future-based cache

This is similar to how Guava's cache works, which has hashmap locking and entry locking. In Caffeine this can be done using AsyncCache / AsyncLoadingCache, which stores a CompletableFuture<V> as the map entry. The intent would be to have the map only lock for adding the entry and, outside of this, lock the entry for loading. A timeout can be applied to a future to cancel it,, which might be used as a failsafe to break a deadlock.

Use Striped for locking

The locking can be managed entirely by the application and the cache treated as a simple, bounded Map. The lock is determined by the key, e.g. an array of locks to hash into or a Map<K, Lock>. Optionally a tryLock can be used for a timeout to cancel and break a deadlock. This could be written as,

final Striped<Lock> locks = Striped.lazyWeakLock(1024);

// Avoid locking on a cache hit
V value = cache.getIfPresent(key);
if (value != null) {
  return value;
}

Lock lock = locks.get(key);
lock.lock();
try {
  // Raced and computed by other caller
  value = cache.getIfPresent(key);
  if (value != null) {
    return value;
  }

  // Load / compute the value
  value = ...
  cache.put(key, value);
  return value;
} finally {
  lock.unlock();
}

@cgdecker cgdecker added P4 no SLO package=cache status=triaged type=other Miscellaneous activities not covered by other type= labels labels Sep 12, 2019
@cgdecker
Copy link
Member

FWIW, we don't have anyone who's an expert on cache and are unlikely to be able to make any changes that would address this, though it's not necessarily impossible if it were possible to demonstrate very clearly what the issue is and that it is in fact a bug in cache and not a usage issue.

We've been generally pointing people toward caffeine since it's better maintained and Ben does know this stuff.

@paladox
Copy link
Author

paladox commented Oct 3, 2019

@ben-manes hi! Is there anything we can do to https://github.com/GerritCodeReview/gerrit/blob/stable-2.15/gerrit-server/src/main/java/com/google/gerrit/server/account/AccountCacheImpl.java#L85 in your opinion to try and make sure it dosen't deadlock? Or if it tries to, a timeout happens?

@hashar
Copy link

hashar commented Oct 3, 2019

@ben-manes thank you for the detailed explanation, unfortunately it is really above my league.

From a today thread dump, I noticed threads were blocked on that thread:

"SendEmail-2" #243 prio=5 os_prio=0 tid=0x00007fd830004800 nid=0x2740 waiting on condition [0x00007fd8e4321000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000002c0218158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - <0x00000002ca2807c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

And there is no occurrence of the lock it is waiting for (<0x00000002c0218158> , a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject ).

I was hoping to eventually found a good old deadlock between threads as you have mentioned: A req B, C req A, B req C. But the JVM never notices any deadlock and I have no idea how to figure out what could have set that object.

Given you recommend and

Given @cgdecker comment that cache is less maintained nowadays and everyone pointed at Caffeine, we will find a way to get work done toward using Caffeine in Gerrit 2.16 (2.15 is EOL soon). Ie revive the commit @cpovirk pointed. I have been chatting with @paladox and that seems to be the best solution :-]

I would like to thank you all for the very quick triage, useful leads and information. That is very much appreciated.

@paladox
Copy link
Author

paladox commented Oct 3, 2019

I wonder if changing it to:

  @Override
  public AccountState get(Account.Id accountId) {
    V value = cache.getIfPresent(accountId);
    if (value != null) {
      return value;
    }

   return missing(accountId);
  }

will workaround the issue?

I'm not sure if it still needs the try {} catch part too?

@ben-manes ^

@ben-manes
Copy link
Contributor

ben-manes commented Oct 3, 2019

@paladox That change would mean the cache never loads the value. Currently it calls get(key) which performs a cache load on a miss. In Caffeine we don't support recursive computations into the same hash table, as this is disallowed by ConcurrentHashMap.

You could either take control of the locking yourself, e.g. using Striped, to support your recursive logic. That could be most optimal performance-wise, but adds complexity.

@hashar, @paladox
If you remove atomic computations and allow racy computes, e.g.

  @Override
  public AccountState get(Account.Id accountId) {
    V value = cache.getIfPresent(accountId);
    if (value != null) {
      return value;
    }
    value = // loadById(accountId)
    if (value != null) {
      cache.put(accountId, value);
      return value;
    }
    return missing(accountId);
  }

This would mean multiple threads could perform the expensive load at once, but avoids deadlocking. That's likely the simplest solution and good enough?

@paladox
Copy link
Author

paladox commented Oct 3, 2019

Thank you sooooo much @ben-manes!!

Question about the second part, i doin't see 'loadById' in the code to use?

@ben-manes
Copy link
Contributor

ben-manes commented Oct 3, 2019

Somewhere the LoadingCache is being setup with a loading function. This is injected:

@Named(BYID_NAME) LoadingCache<Account.Id, Optional<AccountState>> byId

Instead you would need to call that loading function, e.g. loadById. Reading the code, I guess that is the ByIdLoader class. Instead of having a LoadingCache call it, use a normal Cache and invoke the ByIdLoader#load(id) directly.

@paladox
Copy link
Author

paladox commented Oct 3, 2019

@ben-manes when doing:

  @Override
  public AccountState get(Account.Id accountId) {
    AccountState value = byId.getIfPresent(accountId);
    if (value != null) {
      return value;
    }
    value = byId.load(accountId);
    if (value != null) {
      byId.put(accountId, value);
      return value;
    }
    return missing(accountId);
  }

i get:

gerrit-server/src/main/java/com/google/gerrit/server/account/AccountCacheImpl.java:88: error: cannot find symbol
    value = byId.load(accountId);
                ^
  symbol:   method load(Id)
  location: variable byId of type LoadingCache<Id,Optional<AccountState>>
gerrit-server/src/main/java/com/google/gerrit/server/account/AccountCacheImpl.java:90: error: incompatible types: AccountState cannot be converted to Optional<AccountState>
      byId.put(accountId, value);

Using loadById also failed.

Also using ByIdLoader directly failed too.

@ben-manes
Copy link
Contributor

You probably want someone who is familiar with the Gerrit code to make the changes.

In short, you want to add ByIdLoader as an injected field to the class's constructor. Then call the field's load method.

@paladox
Copy link
Author

paladox commented Oct 3, 2019

@ben-manes i managed to figure out how to get it to work:

https://gerrit-review.googlesource.com/c/gerrit/+/239436

@ben-manes
Copy link
Contributor

perfect. Then I think this can be closed from the Guava side as won't fix, per @cgdecker's comment. Hope that solves your problems.

@paladox
Copy link
Author

paladox commented Oct 3, 2019

Yup! Thanks!

@paladox paladox closed this as completed Oct 3, 2019
@paladox
Copy link
Author

paladox commented Oct 4, 2019

@ben-manes hi, i'm wondering if you know how i can convert:

  @Override
  public Optional<AccountState> get(Account.Id accountId) {
    try {
      return Optional.ofNullable(byId.get(accountId));
    } catch (ExecutionException e) {
      if (!(e.getCause() instanceof AccountNotFoundException)) {
        logger.atWarning().withCause(e).log("Cannot load AccountState for %s", accountId);
      }
      return Optional.empty();
    }
  }

To use your workaround? (this is for gerrit master which this file differs to 2.15)

I tried:

  @Override
  public Optional<AccountState> get(Account.Id accountId) {
    try {
      AccountState value = byId.getIfPresent(accountId);
      if (value != null) {
        return Optional.of(value);
      }
      value = byIdLoader.load(accountId);
      if (value != null) {
        byId.put(accountId, value);
        return Optional.of(value);
      }

      return Optional.empty();
    } catch (ExecutionException e) {
      if (!(e.getCause() instanceof AccountNotFoundException)) {
        logger.atWarning().withCause(e).log("Cannot load AccountState for %s", accountId);
      }
      return Optional.empty();
    }
  }

But the tests fail with:

  1. getCache(com.google.gerrit.acceptance.rest.config.GetCacheIT)
    expected not to be: null
    at com.google.gerrit.acceptance.rest.config.GetCacheIT.getCache(GetCacheIT.java:36)
  2. listCaches(com.google.gerrit.acceptance.rest.config.ListCachesIT)
    expected not to be: null
    at com.google.gerrit.acceptance.rest.config.ListCachesIT.listCaches(ListCachesIT.java:45)

I'm guessing i'm missing something obvious, or maybe i did it wrong slightly?

@ben-manes
Copy link
Contributor

assertThat(result.averageGet).isNotNull();

This is because you are no longer loading through the cache, so CacheStats does not produce an averageLoadPenalty() and similar values. Presumably if absent this becomes null in the REST API. I would remove those assertions as no longer valid or capture the statistics manually.

@paladox
Copy link
Author

paladox commented Oct 4, 2019

Thank you!

https://gerrit-review.googlesource.com/c/gerrit/+/239494 <-- updated that :)

@XuPengfei-1020
Copy link

May I ask, is there any conclusion about this issue?

@hashar
Copy link

hashar commented Apr 28, 2021

@flying1020 at Wikimedia we encountered the issue in Gerrit (a java based code review system written by Google).

The issue disappeared for us after we have we have migrated to a new hardware AND upgraded the underlying Debian operating system (from Jessie to Buster). We probably raised the JVM heap at the same time.

Maybe it is has been caused by a Linux kernel issue, the openjdk or a system library. But surely we haven't seen it again!

@XuPengfei-1020
Copy link

@hashar Thank you for you reply. I encountered this issue today, the stacks of thread as same as you comment above, thousands of thread is blocked at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad, and the thread that hold the lock has done it's job and returnd threadpool. I hive no idea why the lock is not released correctly.
It's not jvm or os bug, I think, its guava's bug.

@davido
Copy link

davido commented Apr 28, 2021

The issue disappeared for us after we have we have migrated to a new hardware AND upgraded the underlying Debian operating system (from Jessie to Buster).

Note, though, that guava cache backend was replaced in gerrit with caffeine: [1].

There is one exception, though, recursive cache loader: PatchListLoader wasn't replaced and is tracked here: [2].

[1] https://gerrit-review.googlesource.com/c/gerrit/+/244612
[2] https://crbug.com/gerrit/11900

@ben-manes
Copy link
Contributor

One user found that recursive operations that lead to a StackOverflowError left the lock in a bad state. JEP 270 should have resolved this, though. Regardless neither cache supports recursive computations.

@MapleMaster
Copy link

yes, we have StackOverflowError and lock not release, just same as hashar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 no SLO package=cache status=triaged type=other Miscellaneous activities not covered by other type= labels
Projects
None yet
Development

No branches or pull requests

10 participants