
org.elasticsearch.common.cache.Cache lock contention severely #69646

Closed
maosuhan opened this issue Feb 26, 2021 · 10 comments
Labels
:Core/Infra/Core Core issues without another label >enhancement feedback_needed Team:Core/Infra Meta label for core/infra team

Comments


maosuhan commented Feb 26, 2021

In our production environment, when we run a load test that sends 10k+ QPS of queries to the ES cluster, every data node reaches almost 100% CPU usage, the search queue fills up, and requests start being rejected.
A jstack of the data node process shows that more than half of the search threads are waiting on the cache lock.

Test env: ES 7.6.2
Cluster size: 12 client nodes and 24 data nodes

According to jstack, there are 73 search threads in total and 45 of them are waiting on the cache lock, like below:

"elasticsearch[10.149.56.78(1)][search][T#73]" #494 daemon prio=5 os_prio=0 cpu=1160039.32ms elapsed=30864.17s tid=0x00007f8704134000 nid=0x1636d4 waiting on condition  [0x00007f0e33291000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000010f8c64e10> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
        at org.elasticsearch.common.cache.Cache.promote(Cache.java:731)
        at org.elasticsearch.common.cache.Cache.get(Cache.java:369)
        at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:391)
        at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
        at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1406)
        at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1340)
        at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:348)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:370)
        at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:355)
        at org.elasticsearch.search.SearchService$$Lambda$3776/0x00000008017ee040.apply(Unknown Source)
        at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
        at org.elasticsearch.action.ActionListener$$Lambda$3226/0x000000080164d840.accept(Unknown Source)
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
        at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
        at org.elasticsearch.action.ActionRunnable$$Lambda$3787/0x00000008017f0c40.accept(Unknown Source)
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:830)

More than half of the search threads are waiting there, so the search pool fills up easily and requests get rejected.

In org.elasticsearch.common.cache.Cache, the lock is needed to modify the LRU list and the hash map, and it has already become a performance bottleneck. Shall we adopt lock-free programming in that part?
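
To illustrate the contention, here is a minimal, simplified sketch of the pattern the stack trace points at: the map lookup itself is concurrent, but every cache hit has to promote the entry in the LRU list under a single ReentrantLock. This is only an illustration of the pattern, not the actual Elasticsearch code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class LruCacheSketch<K, V> {
    static final class Node<V> { V value; Node<V> prev, next; }

    private final Map<K, Node<V>> map = new ConcurrentHashMap<>();
    private final ReentrantLock lruLock = new ReentrantLock(); // one lock guards the LRU list
    private Node<V> head;                                      // most recently used entry

    V get(K key) {
        Node<V> node = map.get(key);   // the lookup itself does not block
        if (node == null) {
            return null;
        }
        promote(node);                 // but every *hit* serializes here
        return node.value;
    }

    private void promote(Node<V> node) {
        lruLock.lock();                // under load, most search threads park on this lock
        try {
            if (node == head) {
                return;
            }
            unlink(node);
            node.next = head;
            if (head != null) {
                head.prev = node;
            }
            head = node;
        } finally {
            lruLock.unlock();
        }
    }

    private void unlink(Node<V> node) {
        if (node.prev != null) node.prev.next = node.next;
        if (node.next != null) node.next.prev = node.prev;
        node.prev = null;
        node.next = null;
    }
}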

My proposal is to leverage the design of Caffeine to implement a high-performance cache:
https://github.com/ben-manes/caffeine
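
For reference, a rough sketch of what a Caffeine-backed request cache could look like. The key/value types, the weight limit, and the loader are made-up placeholders, not a worked-out ES integration:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class CaffeineRequestCacheSketch {
    private final Cache<String, byte[]> cache = Caffeine.newBuilder()
            .maximumWeight(512L * 1024 * 1024)                    // size-based eviction, e.g. 512 MB
            .weigher((String key, byte[] value) -> value.length)  // weigh entries by payload size
            .build();

    byte[] getOrCompute(String key) {
        // reads and loads do not serialize behind a single LRU lock;
        // Caffeine buffers access events and applies the eviction policy asynchronously
        return cache.get(key, this::loadFromShard);
    }

    private byte[] loadFromShard(String key) {
        return new byte[0]; // placeholder for executing the query phase and serializing the result
    }
}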

@maosuhan maosuhan added >enhancement needs:triage Requires assignment of a team area label labels Feb 26, 2021
@maosuhan maosuhan changed the title org.elasticsearch.common.cache lock contention severely org.elasticsearch.common.cache.Cache lock contention severely Feb 26, 2021
@pgomulka pgomulka added the :Performance All issues related to Elasticsearch performance including regressions and investigations label Mar 1, 2021
@elasticmachine elasticmachine added the Team:Performance Meta label for performance team label Mar 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-perf (Team:Performance)

@pgomulka pgomulka removed the needs:triage Requires assignment of a team area label label Mar 1, 2021
@ben-manes

Please see the benchmarks provided in #16802. At that time, this cache was slower than a synchronized LinkedHashMap (2.5M vs 7.5M ops/s). The ES cache is still included in Caffeine's JMH benchmarks and trace simulator if you want to verify those findings.


maosuhan commented Mar 2, 2021

@ben-manes Thanks for the information.
In #16802, Caffeine performed much better than the current cache, but it seems there was no follow-up after the benchmark.

Do you think it would be beneficial to switch to Caffeine now?

@ben-manes

I am biased, but I believe so, given that you hit a production issue. The earlier argument against it was reasonable: the author did not believe the ES cache is used in any critical sections that require high performance. If that were the case, the cache might be over-engineered compared to a simpler synchronized LinkedHashMap, but it would not be worth the engineering effort to replace. Your experience shows otherwise, which justifies some kind of improvement here.

@pgomulka pgomulka added the :Core/Infra/Core Core issues without another label label Mar 2, 2021
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Mar 2, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@pgomulka pgomulka removed the :Performance All issues related to Elasticsearch performance including regressions and investigations label Mar 2, 2021
@elasticmachine elasticmachine removed the Team:Performance Meta label for performance team label Mar 2, 2021

pgomulka commented Mar 3, 2021

@maosuhan this is an interesting finding.
Would you be able to give us detailed steps to reproduce it, ideally using https://github.com/elastic/rally?
We specifically need to know what type of queries you use.

Also note that the Caffeine library makes extensive use of Unsafe, which we forbid in ES; we won't use libraries that depend on that class.

@ben-manes

@pgomulka Caffeine v3.0 does not require Unsafe (but it does require JDK 11 for VarHandles).
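
For context, a minimal sketch of the JDK VarHandle API that replaces Unsafe for this kind of lock-free field access; this illustrates the JDK mechanism only, not Caffeine's actual code:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Counter {
    private volatile long count;

    // VarHandle gives the same CAS/ordered-access primitives Unsafe used to provide,
    // but through a supported, type-checked API
    private static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup().findVarHandle(Counter.class, "count", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void increment() {
        long current;
        do {
            current = (long) COUNT.getVolatile(this);
        } while (!COUNT.compareAndSet(this, current, current + 1)); // lock-free compare-and-set loop
    }
}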


maosuhan commented Mar 4, 2021

@pgomulka We have a benchmark toolkit that replays our online lightweight aggregation queries against ES. The queries are essentially filter + bucket aggregations.
The issue occurs when the workload on ES becomes very high, around 90% CPU usage, with QPS reaching 16k.
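
For illustration, the general shape of such a query expressed with the 7.x Java high-level REST client would be roughly as follows; the index and field names are made up, and the actual benchmark queries are not included in this issue:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

class FilterBucketAggQuerySketch {
    SearchRequest buildRequest() {
        return new SearchRequest("logs-*")                     // hypothetical index pattern
                .source(new SearchSourceBuilder()
                        .size(0)                               // aggregation-only, so the shard-level request cache applies
                        .query(QueryBuilders.boolQuery()
                                .filter(QueryBuilders.termQuery("status", "200"))
                                .filter(QueryBuilders.rangeQuery("@timestamp").gte("now-15m/m")))
                        .aggregation(AggregationBuilders.terms("by_host").field("host.keyword").size(10)));
    }
}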

@pgomulka
Contributor

@maosuhan so is this reproducible? Can you share the specific details of that benchmark?

@pgomulka
Contributor

Closing due to lack of feedback.
