
org.elasticsearch.common.cache.Cache lock contention severely #69646

Closed
maosuhan opened this issue Feb 26, 2021 · 10 comments
Labels
:Core/Infra/Core Core issues without another label >enhancement feedback_needed Team:Core/Infra Meta label for core/infra team

Comments


maosuhan commented Feb 26, 2021

In our production environment, when we run a load test that sends 10k+ QPS of queries to the ES cluster, every data node reaches almost 100% CPU usage, the search queue fills up, and requests start being rejected.
A jstack of the data node process shows that more than half of the search threads are waiting on the cache lock.

Test env: ES 7.6.2
Cluster size: 12 client nodes and 24 data nodes

According to jstack, there are 73 search threads in total and 45 of them are waiting on the cache lock, like below:

"elasticsearch[10.149.56.78(1)][search][T#73]" #494 daemon prio=5 os_prio=0 cpu=1160039.32ms elapsed=30864.17s tid=0x00007f8704134000 nid=0x1636d4 waiting on condition  [0x00007f0e33291000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000010f8c64e10> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
        at org.elasticsearch.common.cache.Cache.promote(Cache.java:731)
        at org.elasticsearch.common.cache.Cache.get(Cache.java:369)
        at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:391)
        at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
        at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1406)
        at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1340)
        at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:348)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:370)
        at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:355)
        at org.elasticsearch.search.SearchService$$Lambda$3776/0x00000008017ee040.apply(Unknown Source)
        at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
        at org.elasticsearch.action.ActionListener$$Lambda$3226/0x000000080164d840.accept(Unknown Source)
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
        at org.elasticsearch.action.ActionRunnable$$Lambda$3787/0x00000008017f0c40.accept(Unknown Source)
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:830)

More than half of the search threads are waiting there, so the search pool fills up easily and requests get rejected.

In org.elasticsearch.common.cache.Cache, the lock is needed to modify the LRU list and the hash map, and it has already become a performance bottleneck. Shall we adopt lock-free programming in that part?
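
To illustrate the contention, here is a minimal, simplified sketch of the pattern the stack trace points at: the map lookup itself is concurrent, but every cache hit has to promote the entry in the LRU list under a single ReentrantLock. This is only an illustration of the pattern, not the actual Elasticsearch code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class LruCacheSketch<K, V> {
    static final class Node<V> { V value; Node<V> prev, next; }

    private final Map<K, Node<V>> map = new ConcurrentHashMap<>();
    private final ReentrantLock lruLock = new ReentrantLock(); // one lock guards the LRU list
    private Node<V> head;                                      // most recently used entry

    V get(K key) {
        Node<V> node = map.get(key);   // the lookup itself does not block
        if (node == null) {
            return null;
        }
        promote(node);                 // but every *hit* serializes here
        return node.value;
    }

    private void promote(Node<V> node) {
        lruLock.lock();                // under load, most search threads park on this lock
        try {
            if (node == head) {
                return;
            }
            unlink(node);
            node.next = head;
            if (head != null) {
                head.prev = node;
            }
            head = node;
        } finally {
            lruLock.unlock();
        }
    }

    private void unlink(Node<V> node) {
        if (node.prev != null) node.prev.next = node.next;
        if (node.next != null) node.next.prev = node.prev;
        node.prev = null;
        node.next = null;
    }
}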

My proposal is to leverage the design of Caffeine to implement a high-performance cache:
https://github.com/ben-manes/caffeine
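
For reference, a rough sketch of what a Caffeine-backed request cache could look like. The key/value types, the weight limit, and the loader are made-up placeholders, not a worked-out ES integration:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class CaffeineRequestCacheSketch {
    private final Cache<String, byte[]> cache = Caffeine.newBuilder()
            .maximumWeight(512L * 1024 * 1024)                    // size-based eviction, e.g. 512 MB
            .weigher((String key, byte[] value) -> value.length)  // weigh entries by payload size
            .build();

    byte[] getOrCompute(String key) {
        // reads and loads do not serialize behind a single LRU lock;
        // Caffeine buffers access events and applies the eviction policy asynchronously
        return cache.get(key, this::loadFromShard);
    }

    private byte[] loadFromShard(String key) {
        return new byte[0]; // placeholder for executing the query phase and serializing the result
    }
}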

@maosuhan maosuhan added >enhancement needs:triage Requires assignment of a team area label labels Feb 26, 2021
@maosuhan maosuhan changed the title org.elasticsearch.common.cache lock contention severely org.elasticsearch.common.cache.Cache lock contention severely Feb 26, 2021
@pgomulka pgomulka added the :Performance All issues related to Elasticsearch performance including regressions and investigations label Mar 1, 2021
@elasticmachine elasticmachine added the Team:Performance Meta label for performance team label Mar 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-perf (Team:Performance)

@pgomulka pgomulka removed the needs:triage Requires assignment of a team area label label Mar 1, 2021
@ben-manes

Please see the benchmarks provided in #16802. At that time, this cache was slower than a synchronized LinkedHashMap (2.5M vs 7.5M ops/s). The ES cache is still included in Caffeine's JMH benchmarks and trace simulator if you want to verify those findings.


maosuhan commented Mar 2, 2021

@ben-manes Thanks for the information.
In #16802, Caffeine performed much better than the current cache, but it seems there was no follow-up after the benchmark.

Do you think it would be beneficial to switch to Caffeine now?

@ben-manes

I am biased, but I believe so, given that you hit a production issue. The earlier argument against it was reasonable: the author did not believe the ES cache is used in any critical sections that require high performance. If that were the case, the cache might be over-engineered compared to a simpler synchronized LinkedHashMap, but it would not be worth the engineering effort to replace. Your experience shows otherwise, which justifies some kind of improvement here.

@pgomulka pgomulka added the :Core/Infra/Core Core issues without another label label Mar 2, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Mar 2, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@pgomulka pgomulka removed the :Performance All issues related to Elasticsearch performance including regressions and investigations label Mar 2, 2021
@elasticmachine elasticmachine removed the Team:Performance Meta label for performance team label Mar 2, 2021

pgomulka commented Mar 3, 2021

@maosuhan this is an interesting finding.
Would you be able to give us detailed steps to reproduce it, ideally using https://github.com/elastic/rally?
We specifically need to know what type of queries you use.

Also note that the Caffeine library makes extensive use of Unsafe, which we forbid in ES; we won't use libraries that depend on that class.

@ben-manes

@pgomulka Caffeine v3.0 does not require Unsafe (but it does require JDK 11 for VarHandles).
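
For context, a minimal sketch of the JDK VarHandle API that replaces Unsafe for this kind of lock-free field access; this illustrates the JDK mechanism only, not Caffeine's actual code:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Counter {
    private volatile long count;

    // VarHandle gives the same CAS/ordered-access primitives Unsafe used to provide,
    // but through a supported, type-checked API
    private static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup().findVarHandle(Counter.class, "count", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void increment() {
        long current;
        do {
            current = (long) COUNT.getVolatile(this);
        } while (!COUNT.compareAndSet(this, current, current + 1)); // lock-free compare-and-set loop
    }
}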


maosuhan commented Mar 4, 2021

@pgomulka We have a benchmark toolkit that replays our online lightweight aggregation queries against ES. The queries are essentially filter + bucket aggregations.
The issue occurs when the workload on ES becomes very high, around 90% CPU usage, with QPS reaching 16k.
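
For illustration, the general shape of such a query expressed with the 7.x Java high-level REST client would be roughly as follows; the index and field names are made up, and the actual benchmark queries are not included in this issue:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

class FilterBucketAggQuerySketch {
    SearchRequest buildRequest() {
        return new SearchRequest("logs-*")                     // hypothetical index pattern
                .source(new SearchSourceBuilder()
                        .size(0)                               // aggregation-only, so the shard-level request cache applies
                        .query(QueryBuilders.boolQuery()
                                .filter(QueryBuilders.termQuery("status", "200"))
                                .filter(QueryBuilders.rangeQuery("@timestamp").gte("now-15m/m")))
                        .aggregation(AggregationBuilders.terms("by_host").field("host.keyword").size(10)));
    }
}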

@pgomulka
Contributor

@maosuhan so is this reproducible? Can you share the specific details of that benchmark?

@pgomulka
Contributor

Closing due to lack of feedback.
