Severe lock contention in org.elasticsearch.common.cache.Cache #69646
Comments
Pinging @elastic/es-perf (Team:Performance)
Please see the benchmarks that were provided in #16802. At that time, this cache was slower than a synchronized
@ben-manes Thanks for the information. Do you think it would be beneficial to use Caffeine instead now?
I am biased and believe so, as you experienced a production issue. The prior argument against it was reasonable: the author did not believe that the ES cache is used in any critical sections that require high performance. If that were the case, it might be over-engineered compared to a simpler LHM, but otherwise it would not be worth the engineering effort to replace. Your experience shows otherwise, which justifies some type of improvement here.
Pinging @elastic/es-core-infra (Team:Core/Infra)
@maosuhan it is an interesting find. Also note that the Caffeine library is extensively using
@pgomulka v3.0 does not require Unsafe (but does require JDK 11 for VarHandles)
@pgomulka We have a benchmark toolkit that replays online lightweight aggregation queries against ES. The queries are mostly of the form filter + bucket agg.
@maosuhan So is this reproducible? Can you share the specific details of that benchmark?
Closing due to lack of feedback.
In our production environment, when we run a stress test that sends 10k+ QPS of queries to the ES cluster, every data node reaches almost 100% CPU usage, the search queue fills up, and requests start to be rejected.
When we check the jstack output of a data node process, we find that more than half of the search threads are waiting on the cache lock.
Test Env: ES 7.6.2
Cluster size: 12 client nodes and 24 data nodes
According to jstack, there are 73 search threads in total, and 45 of them are waiting on the cache lock, like below:
More than half of the search threads are waiting there, so the search pool becomes full easily and request rejection occurs.
In org.elasticsearch.common.cache.Cache, the lock is needed to modify the LRU list and the hash map. It appears to have become a performance bottleneck. Should we adopt lock-free programming in this part?
My proposal is to leverage the design of Caffeine to implement a high-performance cache.
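To illustrate why this kind of lock hurts read-heavy workloads, here is a minimal sketch of an LRU cache guarded by a single lock. It is not the Elasticsearch `Cache` class; the name `GloballyLockedLruCache` is hypothetical and the code uses only the JDK's `LinkedHashMap` in access-order mode. The point is that a read also reorders the recency list, so reads have to take the same lock as writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; NOT the Elasticsearch Cache implementation. */
final class GloballyLockedLruCache<K, V> {
    private final LinkedHashMap<K, V> map;

    GloballyLockedLruCache(int maxEntries) {
        // accessOrder = true: every get() moves the entry to the tail of the
        // internal linked list, so a read is also a mutation of shared state.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // All operations serialize on one monitor; under many concurrent readers
    // this is where threads would pile up.
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

With many search threads calling `get` concurrently, every call queues on the same monitor, which matches the kind of blocked threads described above.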
https://github.com/ben-manes/caffeine
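As a rough sketch of the idea behind that design: reads are served from a concurrent hash map, the access is recorded in a buffer, and the recency order is updated later by whichever thread manages to acquire the policy lock without blocking. The code below is a simplified illustration under stated assumptions, not Caffeine's actual implementation. The class name `BufferedLruCache` is hypothetical, the buffer here is an unbounded `ConcurrentLinkedQueue` (Caffeine uses bounded, lossy ring buffers), the policy is plain LRU, and writes still take the lock directly.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of buffered access recording; NOT Caffeine's code. */
final class BufferedLruCache<K, V> {
    private final int maxEntries;
    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
    // Reads append here instead of touching the recency list directly.
    private final ConcurrentLinkedQueue<K> accessBuffer = new ConcurrentLinkedQueue<>();
    private final ReentrantLock policyLock = new ReentrantLock();
    // Recency order of keys; only accessed while holding policyLock.
    private final LinkedHashMap<K, Boolean> recency = new LinkedHashMap<>(16, 0.75f, true);

    BufferedLruCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    V get(K key) {
        V value = data.get(key);          // no lock on the read path
        if (value != null) {
            accessBuffer.add(key);        // record the access for later replay
            if (policyLock.tryLock()) {   // never block a reader on the policy lock
                try {
                    drainAccesses();
                } finally {
                    policyLock.unlock();
                }
            }
        }
        return value;
    }

    void put(K key, V value) {
        policyLock.lock();
        try {
            drainAccesses();
            data.put(key, value);
            recency.put(key, Boolean.TRUE);
            while (recency.size() > maxEntries) {
                // The first key in access order is the least recently used.
                Iterator<K> eldest = recency.keySet().iterator();
                data.remove(eldest.next());
                eldest.remove();
            }
        } finally {
            policyLock.unlock();
        }
    }

    int size() {
        return data.size();
    }

    // Replays buffered reads against the recency list; caller holds policyLock.
    private void drainAccesses() {
        K key;
        while ((key = accessBuffer.poll()) != null) {
            recency.get(key);             // moves the key to most-recent if still present
        }
    }
}
```

The trade-off in this sketch is that the recency order may lag slightly behind the actual reads, in exchange for readers never waiting on the lock.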