Factor GC overhead in circuit breakers #40115
Comments
Pinging @elastic/es-core-infra
Summarizing the proposal
@not-napoleon Please share your thoughts on this.
Hi @not-napoleon, please let me know what you think of the proposal. I'll be more than happy to work on the PR.
@jaymode Would you mind commenting on this? I labeled it when it came in, but it's not really in my focus area and I don't have a good sense of what the right direction is.
Hi @Bukhtawar, thanks for opening the issue and providing an idea. We plan to discuss this as a team and will summarize our thoughts afterwards. @danielmitterdorfer @dakrone you may be interested in this based on the work you've done on circuit breakers in the past.
Hi @jaymode @danielmitterdorfer @dakrone, did we get a chance to review this? I'm awaiting a response on the proposal.
We discussed it in FixitFriday and made the following points:
As a consequence, we decided not to move forward on this and to close the issue. Thank you for bringing this up though, @Bukhtawar!
Recovery requests are likely to trip the circuit breaker under heavy load, which might lead to unfortunate side-effects in nodes under pressure. A node temporarily under high load will be less likely to fail recovery leading to permanent changes in allocation.

Moreover, the requests sent by the recovery mechanism are all bound in byte size by design. If they are sent in rapid succession, the memory used by strong references to them will be bounded, but the memory used by unreferenced objects that resulted from them could fluctuate heavily. This makes the real-memory circuit breaker's memory-use estimation, which does not account for GC, particularly inefficient when it comes to recoveries (elastic#40115).

Turn off the circuit breaker for recoveries. Users have other means of limiting the memory use of recoveries: setting the recovery chunk size and parallelism. Given the bounded amount of memory used by recoveries, a user can either lower the amount of resources allocated to recoveries in the settings or adjust the real-memory circuit breaker limits slightly to account for this change.

I think it's a fair assumption that the number of clusters that would see nodes running out of memory as a result of this change is small. It would also be a subset of those clusters that currently see recovery failures as a result of the circuit breaker, and those should be fixed regardless.

Unfortunately, I don't think we can use the same solution for the case of replication requests tripping the circuit breaker, as those are not naturally bounded in size.

Relates elastic#44484
Problem Description
There are a bunch of circuit breakers that track the estimated bytes consumed by major memory contributors, but there is still room for unaccounted memory (#35564, #20837, #20250, etc.). The newly introduced real-memory circuit breaker (#31767) isn't foolproof: large time gaps between reservation and actual allocation can let a few bulky concurrent requests cause an OOM.
Proposal
A large GC overhead is an early sign of a struggling GC and of a non-collectible heap that may keep growing. The proposal is to factor in GC overhead, in addition to actual heap consumption, based on a user-configurable setting, `gc_overhead_threshold`, to detect symptoms early. If the node is already running a high GC overhead, based on a weighted average of the past few GC runs, we start to trip requests on the node.
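
To make the weighted-average idea concrete, here is a minimal sketch in plain Java. It is not Elasticsearch code and not an existing API: the class `GcOverheadBreaker`, its methods, and the smoothing weight `alpha` are hypothetical names chosen for illustration, and an exponentially weighted moving average is only one possible reading of "weighted average of past few GC runs". The `threshold` argument stands in for the proposed `gc_overhead_threshold` setting, expressed here as a fraction of wall-clock time spent in GC.

```java
/**
 * Hypothetical sketch of the proposal: track a weighted average of recent
 * GC overhead samples and signal when it exceeds a configured threshold.
 */
final class GcOverheadBreaker {
    // Stand-in for the proposed gc_overhead_threshold, as a fraction in [0, 1].
    private final double threshold;
    // Weight given to the newest sample (assumption: exponential smoothing).
    private final double alpha;
    private double weightedOverhead;
    private boolean hasSample;

    GcOverheadBreaker(double threshold, double alpha) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.threshold = threshold;
        this.alpha = alpha;
    }

    /** Records one observation: time spent in GC during an interval of wall-clock time. */
    synchronized void recordGc(long gcMillis, long intervalMillis) {
        if (gcMillis < 0 || intervalMillis <= 0) {
            throw new IllegalArgumentException("invalid GC sample");
        }
        double sample = Math.min(1.0, (double) gcMillis / intervalMillis);
        // The first sample seeds the average; later samples are blended in.
        weightedOverhead = hasSample
            ? alpha * sample + (1.0 - alpha) * weightedOverhead
            : sample;
        hasSample = true;
    }

    /** Current weighted GC overhead as a fraction of wall-clock time. */
    synchronized double weightedOverhead() {
        return weightedOverhead;
    }

    /** True when new requests should be rejected because GC overhead is too high. */
    synchronized boolean shouldTrip() {
        return hasSample && weightedOverhead > threshold;
    }
}
```

With `threshold = 0.5` and `alpha = 0.5`, a single sample of 100 ms of GC in a 1000 ms interval leaves the breaker closed, while two subsequent samples of 900 ms push the weighted overhead to roughly 0.7 and trip it. A real implementation would also need a source for the samples and a way to combine this signal with actual heap consumption, neither of which the proposal specifies.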