Should the CircuitBreakingException to RestStatus mapping be more fine-grained? #31986
Pinging @elastic/es-core-infra
So here are the breakers we have:
So out of those, the ones that spring to mind as "retryable" to me are For the others, For What do you think?
Your reasoning makes sense to me. Looking at the HTTP status codes, I think we should probably always use 503 and only indicate retryable conditions with the Retry-After header.
It is tricky indeed. For a moment I thought we could make this dependent on the child circuit breaker that causes the parent to break, but I think we should not do this because this could be just pure coincidence and thus be misleading. Your suggestion of defaulting to retryable seems reasonable to me, especially considering that the new default is to base this on real memory usage.
We discussed this in Fix-it Friday. The main points are:
With this commit we disable the real-memory circuit breaker in REST tests as this breaker is based on real memory usage over which we have no (full) control in tests and the REST client is not yet ready to retry on circuit breaker exceptions. This is only meant as a temporary measure to avoid spurious test failures while we ensure that the REST client can handle those situations appropriately. Closes #32050 Relates #31767 Relates #31986 Relates #32074
I would say that this is what currently happens, as 503s are retried by our clients.
That part of the discussion was around search, hence
The Python/.NET clients also have bulk helpers that will retry on operations returning a 429. The goal of the failover in the regular API mappings is to fail fast, though: none of the clients provide a backoff mechanism in the 1-1 API call mappings, and therefore none of the clients retry those calls. For circuit breaker conditions unlikely to change (in the near future), perhaps returning a response that signals this explicitly would be appropriate.
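As an illustration of the helper behaviour described above, here is a minimal sketch of a bulk retry loop, assuming a hypothetical `BulkSender`/`BulkResponse` API rather than any real client's interface: only a 429 is retried, with exponential backoff between attempts.

```java
import java.util.concurrent.TimeUnit;

public class BulkRetrySketch {

    /** Hypothetical response type: only the HTTP status code matters for this sketch. */
    interface BulkResponse {
        int statusCode();
    }

    /** Hypothetical sender abstraction standing in for a client's bulk API. */
    interface BulkSender {
        BulkResponse sendBulk(byte[] payload) throws Exception;
    }

    /**
     * Retries a bulk request with exponential backoff, but only while the
     * server answers with HTTP 429 (Too Many Requests). Any other status is
     * returned as-is because the helper does not consider it retryable.
     */
    static BulkResponse bulkWithRetry(BulkSender sender, byte[] payload) throws Exception {
        int maxRetries = 5;
        long backoffMillis = 100;
        BulkResponse response = sender.sendBulk(payload);
        for (int attempt = 0; attempt < maxRetries && response.statusCode() == 429; attempt++) {
            TimeUnit.MILLISECONDS.sleep(backoffMillis);
            backoffMillis *= 2; // double the wait before each new attempt
            response = sender.sendBulk(payload);
        }
        return response;
    }
}
```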
I put some thought into how we can classify circuit breaker exceptions.

Definitions

There are two classes regarding the persistence of a tripped circuit breaker: transient (the condition may heal itself and a retry may succeed later) and permanent (the condition requires intervention by an operator).
Child circuit breakers

We can attribute one of these two classes to each child circuit breaker.

Parent circuit breaker

For the parent circuit breaker the assignment to one or the other class is not so clear. The following is tracked by the (real memory) parent circuit breaker:

- memory reserved by child circuit breakers of class transient
- memory reserved by child circuit breakers of class permanent
- memory that is not tracked by any child circuit breaker
This itemisation is just conceptual; in reality the real memory circuit breaker only measures the total of all three. The crux of the matter is that we know nothing about untracked memory, but our goal is still to categorize it. We also want to avoid false positives (the actual class is permanent but we report transient, causing pointless retries). There are several strategies for attacking this:
1. Always classify a tripped parent circuit breaker as transient, i.e. default to retryable. This will lead to false positives and thus too many retries by clients.
2. The parent breaker is always called in the context of one of the child circuit breakers. Therefore we could derive whether it is of class transient or permanent from that child circuit breaker.
3. If the parent circuit breaker trips, we check the relative reserved memory of all child circuit breakers: the trip is classified as transient if the memory tracked by transient child breakers outweighs the memory tracked by permanent ones, and as permanent otherwise.

The rationale behind this formula is: we only know the composition of the tracked memory, i.e. whether the currently tracked transient memory usage dominates the tracked permanent usage. Consequently we assume for the real memory parent circuit breaker that the untracked memory has the same ratio as the tracked memory of circuit breakage classes transient and permanent.
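As a rough sketch of strategy 3 (not the exact formula from this discussion): the parent breaker compares the memory reserved by transient child breakers with that reserved by permanent ones and classifies the trip accordingly; the untracked remainder is assumed to follow the same ratio, so it drops out of the comparison. All names here are illustrative.

```java
public class ParentBreakerClassificationSketch {

    enum Durability { TRANSIENT, PERMANENT }

    /**
     * Classifies a parent (real memory) circuit breaker trip based on the
     * memory currently reserved by its child circuit breakers.
     *
     * The real memory breaker also sees memory that no child breaker tracks;
     * that untracked share is assumed to be split in the same ratio as the
     * tracked reservations, so comparing the tracked values is sufficient.
     */
    static Durability classifyParentTrip(long transientChildBytes, long permanentChildBytes) {
        return transientChildBytes >= permanentChildBytes ? Durability.TRANSIENT : Durability.PERMANENT;
    }

    public static void main(String[] args) {
        // Mostly request-scoped memory reserved: likely to free up soon.
        System.out.println(classifyParentTrip(800, 200)); // TRANSIENT
        // Mostly long-lived memory reserved: unlikely to go away without intervention.
        System.out.println(classifyParentTrip(100, 900)); // PERMANENT
    }
}
```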
In the previous discussion HTTP status codes 429 and 503 have been mentioned as potentially appropriate response status codes when a circuit breaker trips.

Definitions

RFC 7231 defines HTTP status code 503 (Service Unavailable) as: "The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request."
RFC 6585 defines HTTP status code 429 (Too Many Requests) as: "The 429 status code indicates that the user has sent too many requests in a given amount of time ('rate limiting'). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request."
Analysis

Let's do a thought experiment to decide whether 429 or 503 is more appropriate. Consider Elasticsearch as a black box. Any client request may succeed (status code 2xx) or fail (status code >= 400). Suppose a client gets status code 429 after sending their first request: from the client's perspective this is puzzling, because it has certainly not sent "too many requests in a given amount of time".

Now let's consider Elasticsearch as a white box and assume that multiple clients pushed the server close to overload. When the request from our previous example arrived, it tripped the in-flight requests circuit breaker because Elasticsearch was busy processing multiple large bulk requests at that time. Detecting this state clearly requires "global" knowledge that only the server but no individual client can have. In other words: this state is emergent behavior on the server side and is not caused by any individual client doing something "wrong" (e.g. sending too many or too large requests). This brings us into status code 5xx land. Out of the available status codes, 503 seems most appropriate: "The 503 status code indicates that the server is currently unable to handle the request due to a temporary overload". This is exactly what we want to convey to the client: your request is probably fine but we just cannot handle it at the moment; please come back later.

Therefore, I argue we should stick to HTTP 503. For permanent conditions we will omit the Retry-After header.
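To make this concrete, here is a minimal sketch of the mapping argued for above, using hypothetical types rather than Elasticsearch's actual REST layer: always 503, and a Retry-After header only for transient conditions (the delay value is made up).

```java
import java.util.Optional;

public class BreakerStatusMappingSketch {

    enum Durability { TRANSIENT, PERMANENT }

    /** Minimal stand-in for an HTTP response: status code plus an optional Retry-After value. */
    record HttpResponse(int status, Optional<String> retryAfterSeconds) { }

    /**
     * Maps a tripped circuit breaker to a response as argued above: always 503,
     * with a Retry-After header only when the condition is expected to pass.
     */
    static HttpResponse toResponse(Durability durability) {
        if (durability == Durability.TRANSIENT) {
            // The concrete delay is a placeholder; a real server would pick something sensible.
            return new HttpResponse(503, Optional.of("5"));
        }
        // Permanent conditions: no Retry-After, signalling that retrying will not help.
        return new HttpResponse(503, Optional.empty());
    }
}
```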
I thought about this a bit over the weekend and I couldn't find a situation where the formula didn't work, so it sounds like a reasonable way forward. I'm definitely in agreement with the RFC findings for keeping a 503 response with an optional Retry-After header.
I think your analysis and thought experiment are spot on @danielmitterdorfer, but not the most pragmatic out in the wild. I would like to propose we use 429 instead. This also ties into our existing retry behaviour across the clients and the ingest products.

Retry behaviour

☑️ = fast failover/retry, no exponential backoff. Client helpers are methods coordinating multiple requests, e.g. bulk helpers and scroll helpers.
Note:
Issuing a
Therefore if
I personally don't mind which status code/header we use, but 429s are currently handled by bulk helpers, which are part of the high-level REST clients. These helpers are API specific, while I think in this issue we are trying to come up with a generic mechanism that could be applied to low-level REST clients (like the retry with back-off that we already have). Or are we discussing a mechanism to be applied to specific APIs only (e.g. search)? When we generally talk about retries in the low-level clients we mean that the node which returned a 503 error will be blacklisted and the request will be retried on another node, and so on. What is suggested in this issue for temporary failures is very different: retry that same request only on that same node, after a certain amount of time (possibly returned by the server). I am not convinced though that retrying on the same node only is a good strategy. Also, should that node be blacklisted or should it keep on receiving other requests in the meantime? I would consider reducing the scope of this issue by first addressing the temporary vs. permanent failure distinction and doing the right thing out-of-the-box in the low-level clients, which is already not a trivial task. I would leave returning a proper retry interval to be applied in the clients for later, if we still want to do that. We should fix the current behaviour as our low-level clients currently end up retrying the same request on all nodes and marking nodes dead for both temporary and permanent failures, which is not a good behaviour either way.
+1 for the proposal. That said, I think there is a justification for using it as well. Would it also be possible to return a circuit breaker categorisation, e.g. in a response header? As I understand, not all operations are equal and some may be safe to send to the node (e.g. ingest) whilst others might not be (e.g. search). If we have the categorisation we could potentially be smarter in the clients about how we route requests and keep the nodes 'alive' in the client for longer. Of course, we can ignore this header, but it would give us a choice in the future. How is the back-pressure / circuit breaker information currently surfaced to the clients? Maybe this information is useful there as well. Just my 2 cents.
This is exactly what I set out to achieve with my proposal.
The rules I am proposing would apply to all APIs in the lifetime of a single request, whether from a low-level or high-level client.
In the scope of a single request, the clients do not have a backoff period. We retry a request by failing over to the next node immediately. When a node is marked dead it is not considered as a target for subsequent requests for a configured dead duration (see the sketch after this comment).
Clients should never do exponential retries. Helpers we ship with the clients or our ingest products should/could/will, which is why I listed these as a separate entity.
This is a point that needs discussing. From the clients' perspective I am of the mind that the client should not do this OOTB. Again, our helpers or ingest products should. @elastic/es-clients please weigh in. Since 503 is already a fast-failover condition in the clients, classifying a 503 without a retry header as permanent will be hard.
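For illustration, here is a minimal sketch of the fast-failover behaviour described above, with a hypothetical `Node` abstraction and placeholder timing values rather than any client's real API: a failing node is skipped for a dead period and the request immediately moves on to the next node, without backoff.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class FastFailoverSketch {

    /** Hypothetical node abstraction: sending a request returns an HTTP status code. */
    interface Node {
        int send(byte[] request) throws Exception;
    }

    private final Deque<Node> nodes = new ArrayDeque<>();
    private final Map<Node, Long> deadUntil = new HashMap<>();
    private final long deadTimeMillis = 60_000; // placeholder dead time

    FastFailoverSketch(Iterable<Node> initialNodes) {
        initialNodes.forEach(nodes::add);
    }

    /** Tries each node at most once, failing over immediately; there is no backoff between attempts. */
    int execute(byte[] request) throws Exception {
        Exception lastFailure = null;
        for (int i = 0, n = nodes.size(); i < n; i++) {
            Node node = nodes.pollFirst();
            nodes.addLast(node); // round-robin rotation
            Long until = deadUntil.get(node);
            if (until != null && until > System.currentTimeMillis()) {
                continue; // node is currently marked dead, skip it
            }
            try {
                int status = node.send(request);
                if (status != 503) {
                    return status;
                }
                // A 503 marks the node dead and triggers immediate failover to the next node.
                deadUntil.put(node, System.currentTimeMillis() + deadTimeMillis);
            } catch (Exception e) {
                deadUntil.put(node, System.currentTimeMillis() + deadTimeMillis);
                lastFailure = e;
            }
        }
        if (lastFailure != null) {
            throw lastFailure;
        }
        throw new IllegalStateException("all nodes failed or are marked dead");
    }
}
```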
After discussions with the clients team we settled on using HTTP 429 for those exceptions in general. The durability (permanent vs. transient) will be returned in a dedicated new field.
With this commit we differentiate between permanent circuit breaking exceptions (which require intervention from an operator and should not be automatically retried) and transient ones (which may heal themselves eventually and should be retried). Furthermore, the parent circuit breaker will categorize a circuit breaking exception as either transient or permanent based on the categorization of memory usage of its child circuit breakers. Closes #31986 Relates #34460
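Assuming the durability ends up being exposed to clients as a simple transient/permanent value alongside the 429 (the exact field name and its location in the error body are not specified here), client-side handling could look roughly like this:

```java
public class DurabilityHandlingSketch {

    /**
     * Decides whether a failed request is worth retrying, based on the HTTP
     * status and the durability value reported alongside the circuit breaker error.
     */
    static boolean shouldRetry(int httpStatus, String durability) {
        // 429 signals a tripped circuit breaker; only transient trips may heal on their own.
        return httpStatus == 429 && "TRANSIENT".equalsIgnoreCase(durability);
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(429, "TRANSIENT"));  // true: back off and retry later
        System.out.println(shouldRetry(429, "PERMANENT"));  // false: requires operator intervention
        System.out.println(shouldRetry(500, "TRANSIENT"));  // false: not a circuit breaker response
    }
}
```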
We spent some time on this issue differentiating between failures which should be retried and those that should not. This was all from the perspective of clients outside of Elasticsearch, but would it not also make sense to consider internal retry policies across replicas?
Currently CircuitBreakingException is mapped to HTTP status code 503 (Service Unavailable). Speaking to @dakrone, the original intention of this was that fixing the cause of a circuit breaker tripping requires human intervention on the server side by an administrator / developer (e.g. when field data are too large). However, we have several more circuit breakers, e.g. the in-flight request circuit breaker, where the condition is only temporary because it depends on the current load on the system. In that case we could either map it to a different status code or still return HTTP 503 but provide a Retry-After header in the response indicating that this condition is only temporary (see RFC 2616). I am bringing this up for discussion because, due to #31767, we expect that Elasticsearch will exercise back-pressure in more situations instead of dying with OutOfMemoryError. Therefore clients should get a hint from the server how to handle this situation.
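Since the discussion revolves around the Retry-After header, here is a small, self-contained sketch of how a client could interpret it; per the HTTP specification the value is either a delay in seconds or an HTTP date, and the sketch handles both.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class RetryAfterParsingSketch {

    /**
     * Parses a Retry-After header value into a wait duration. Per the HTTP
     * specification the value is either a delay in seconds or an HTTP date.
     */
    static Duration parseRetryAfter(String headerValue) {
        String value = headerValue.trim();
        try {
            return Duration.ofSeconds(Long.parseLong(value));
        } catch (NumberFormatException e) {
            // Not a plain number of seconds, so interpret it as an HTTP date.
            ZonedDateTime date = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
            Duration delay = Duration.between(Instant.now(), date.toInstant());
            return delay.isNegative() ? Duration.ZERO : delay;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRetryAfter("120"));                           // PT2M
        System.out.println(parseRetryAfter("Fri, 31 Dec 1999 23:59:59 GMT")); // PT0S (date is in the past)
    }
}
```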