
Should the CircuitBreakingException to RestStatus mapping be more fine-grained? #31986

Closed
danielmitterdorfer opened this issue Jul 12, 2018 · 15 comments
Assignees
Labels
:Core/Infra/REST API REST infrastructure and utilities >enhancement

Comments

@danielmitterdorfer
Member

Currently CircuitBreakingException is mapped to HTTP status code 503 (Service Unavailable). Speaking to @dakrone, the original intention was that fixing the cause of a circuit breaker tripping requires human intervention on the server side by an administrator / developer (e.g. when field data are too large).

However, we have several more circuit breakers, e.g. the in-flight requests circuit breaker, where the condition is only temporary because it depends on the current load on the system. In that case we could either map it to a different status code or still return HTTP 503 but provide a Retry-After header in the response indicating that this condition is only temporary (see RFC 2616).

I am bringing this up for discussion because, due to #31767, we expect that Elasticsearch will exercise back-pressure in more situations instead of dying with an OutOfMemoryError. Therefore, clients should get a hint from the server on how to handle this situation.
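For context, a simplified sketch of today's behaviour (illustrative only, not the exact Elasticsearch source): every CircuitBreakingException is reported with the same status code, regardless of which breaker tripped or whether the condition is temporary.

```java
import org.elasticsearch.common.breaker.CircuitBreakingException;
import org.elasticsearch.rest.RestStatus;

final class CurrentMappingSketch {
    // Today, all circuit breaker trips surface as 503 (Service Unavailable).
    static RestStatus statusOf(CircuitBreakingException e) {
        return RestStatus.SERVICE_UNAVAILABLE; // 503
    }
}
```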

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@dakrone
Member

dakrone commented Jul 12, 2018

So here are the breakers we have:

  • total
  • fielddata
  • request
  • accounting
  • inflight requests

So out of those, the ones that spring to mind as "retryable" to me are request and inflight_requests, since those are likely to be decremented by ES automatically rather than requiring admin intervention to lower them.

For the others, fielddata is not expired by default, so it'd require someone to clear the cache (or change mappings, etc) to fix. accounting is lucene memory, so it won't change unless the shards on the node are merged or relocated in some way. That's why I figured they are good candidates for staying in 503 response code territory.

For total, it's a little tricky and I could go either way, because it depends on what is actually causing it to trip; we could maybe default to it being retryable.

What do you think?

@danielmitterdorfer
Member Author

Your reasoning makes sense to me. Looking at the HTTP status codes I think we should probably always use 503 and only indicate retryable conditions with the Retry-After header instead of using a different status code.

For total, it's a little tricky and I could go either way, because it depends on what is actually causing it to trip; we could maybe default to it being retryable.

It is tricky indeed. For a moment I thought we could make this dependent on the child circuit breaker that causes the parent to break, but I think we should not do this because it could be pure coincidence and thus misleading. Your suggestion of defaulting to retryable seems reasonable to me, especially considering that the new default is to base this on real memory usage.

@danielmitterdorfer
Member Author

We discussed this in Fix-it Friday. The main points are:

  • We need to be careful about which circuit breakers we want to allow retries for, because if clients start to retry requests that are not retryable, this puts more load on the cluster than necessary.
  • The 429 status code is probably preferable because (some of) the language clients handle this already (for example the Java client if you use the bulk processor; see the BulkProcessor sketch after this list).
  • We also need to coordinate with the clients team so they are aware of this change.
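As a minimal sketch of the 429 handling mentioned above (assuming the 6.x high-level REST client; the BulkProcessor.Listener setup is elided), bulk items rejected with HTTP 429 are retried according to the configured backoff policy:

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;

final class BulkBackoffSketch {
    static BulkProcessor build(RestHighLevelClient client, BulkProcessor.Listener listener) {
        return BulkProcessor.builder(
                (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            // bulk items rejected with 429 (EsRejectedExecutionException) are retried with exponential backoff
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
            .build();
    }
}
```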

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Jul 16, 2018
With this commit we disable the real-memory circuit breaker in REST
tests as this breaker is based on real memory usage over which we have
no (full) control in tests and the REST client is not yet ready to retry
on circuit breaker exceptions.

This is only meant as a temporary measure to avoid spurious test
failures while we ensure that the REST client can handle those
situations appropriately.

Closes elastic#32050
Relates elastic#31767
Relates elastic#31986
danielmitterdorfer added a commit that referenced this issue Jul 16, 2018
With this commit we disable the real-memory circuit breaker in REST
tests as this breaker is based on real memory usage over which we have
no (full) control in tests and the REST client is not yet ready to retry
on circuit breaker exceptions.

This is only meant as a temporary measure to avoid spurious test
failures while we ensure that the REST client can handle those
situations appropriately.

Closes #32050
Relates #31767
Relates #31986 
Relates #32074
@javanna
Member

javanna commented Jul 16, 2018

We need to be careful about which circuit breakers we want to allow retries for, because if clients start to retry requests that are not retryable, this puts more load on the cluster than necessary.

I would say that this is what currently happens, as 503s are retried by our clients.

The 429 status code is probably preferable because (some of) the language clients handle this already (for example the Java client if you use the bulk processor)

That part of the discussion was around search, so the BulkProcessor does not come into the picture, as it is only used when indexing. We said it sounds better not to retry on all nodes a request that will fail on any node. The hard part is to determine which circuit breaker exceptions should be retried and which ones should not. The clients generally retry 5xx error codes, and whenever a node returns a 5xx it is blacklisted for a while and retried a while later (applying some back-off).

@Mpdreamz
Member

Mpdreamz commented Jul 16, 2018

The python/.net clients also have bulk helpers that will retry on operations returning a 429 with exponential backoffs.

The goal of the failover in the regular API mappings is to fail fast, though. None of the clients provide a backoff mechanism in the 1-to-1 API call mappings, and therefore none of them retry 429 responses OOTB, since the assumption is that the cluster will not be immediately free to handle the request again (see also #21141).

Returning a 429 seems like the most appropriate response for requests that are likely to succeed in the very near future.

For circuit breaker conditions unlikely to change (in the near future), perhaps returning a 500 is more appropriate, since the clients & Logstash retry on 503.

Retry-After on 503 and 429 would be massively useful and might warrant its own ticket. E.g. Logstash and client helpers can use it to seed the initial retry value, and client API calls can use it to decide NOT to retry.

@danielmitterdorfer
Member Author

danielmitterdorfer commented Jul 24, 2018

I put some thought into how we can classify CircuitBreakingException as either temporary or persistent. At this point I am not talking about how this is mapped to HTTP status codes or headers as this is a separate problem.

Definitions

There are two classes regarding the persistence of a tripped circuit breaker:

  • Ct: the condition is temporary.
  • Cp: the condition is permanent, i.e. fixing it requires human intervention.

CircuitBreakingExceptions of class Ct should be retried for a "reasonable" number of times with a backoff, whereas CircuitBreakingExceptions of class Cp should lead to an immediate error.

Child circuit breakers

We can attribute the class Ct to the request and inflight requests circuit breakers and the class Cp to fielddata and accounting.

Parent circuit breaker

For the parent circuit breaker the assignment to one or another class is not so clear. The following is tracked by the (real memory) parent circuit breaker:

  1. Memory usage attributed to class Ct (by the child circuit breakers request and inflight requests)
  2. Memory usage attributed to class Cp (by the child circuit breakers fielddata and accounting)
  3. Memory usage not tracked by any of the child circuit breakers. We cannot reason about that type of memory usage so we cannot know whether it is of class Ct or class Cp.

This itemisation is just conceptual; in reality the real memory circuit breaker only measures the total of all three.

The crux of the matter is that we know nothing about untracked memory but our goal is still to categorize it. We also want to avoid false positives (actual class is Cp but we categorize as Ct) and false negatives (actual class is Ct but we categorize as Cp).

There are several strategies for how we can attack this:

  1. A tripped parent circuit breaker is always of class Ct

This will lead to false positives and thus too many retries by clients.

  2. The parent's class depends on the current child circuit breaker

The parent breaker is always called in the context of one of the child circuit breakers. Therefore we could provide the information whether it is of class Ct or Cp to the parent circuit breaker and use that. However, this penalizes the current circuit breaker and does not take into consideration how much memory is taken up at this point.

  3. The parent's class depends on current usage of child circuit breakers

If the parent circuit breaker trips, we check the relative reserved memory of all child circuit breakers:

  • Let R be the amount of memory (including overhead) that is about to be reserved (in bytes)
  • Let Uparent be the total memory usage determined by the parent circuit breaker (in bytes)
  • Let Ut be the sum of the memory tracked by all breakers of class Ct (in bytes)
  • Let Up be the sum of the memory tracked by all breakers of class Cp (in bytes)

The CircuitBreakingException thrown by the parent circuit breaker is of class Ct if and only if:

R <= Uparent * (Ut / Up)

The rationale behind this formula is: If the currently tracked temporary memory usage (Ut) makes up more than we are about to reserve, then that reservation should succeed in the future once Ut is lower. In the other case (memory usage is dominated by Up), future attempts to reserve memory will also not succeed and thus the condition is of class Cp.

Consequently, we assume for the real memory parent circuit breaker that the untracked memory has the same ratio as the tracked memory (of circuit breakage classes Ct and Cp). Note that this formula also applies to the special case of the classic circuit breaker; we just don't consider real memory usage in that case.
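As a minimal sketch of this classification (the names are illustrative, not the actual implementation), the parent breaker's exception is of class Ct iff R <= Uparent * (Ut / Up):

```java
final class ParentBreakerDurabilitySketch {
    // true => class Ct (temporary, retryable), false => class Cp (permanent).
    static boolean isTransient(long bytesToReserve,        // R, including overhead
                               long parentUsedBytes,       // Uparent
                               long transientChildBytes,   // Ut: request + in-flight requests
                               long permanentChildBytes) { // Up: fielddata + accounting
        if (permanentChildBytes == 0) {
            return true; // nothing permanent is tracked, so assume the condition is temporary
        }
        double ratio = (double) transientChildBytes / permanentChildBytes;
        return bytesToReserve <= parentUsedBytes * ratio;
    }
}
```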

@danielmitterdorfer
Member Author

In the previous discussion HTTP status codes 429 and 503 have been mentioned as potentially appropriate response status codes on CircuitBreakingException. I propose that we always use 503 for CircuitBreakingException. If and only if the condition is temporary (according to the definitions in my comment above), we should also set the Retry-After header.

Definitions

RFC 7231 defines HTTP status code 503 (Service Unavailable) as:

The 503 (Service Unavailable) status code indicates that the server
is currently unable to handle the request due to a temporary overload
or scheduled maintenance, which will likely be alleviated after some
delay.

RFC 6585 defines HTTP status code 429 (Too Many Requests):

The 429 status code indicates that the user has sent too many
requests in a given amount of time.

Analysis

Let's do a thought experiment to decide whether 429 or 503 is more appropriate.

Consider Elasticsearch as a black box. Any client request may succeed (status code 2xx) or fail (status code >= 400). Suppose a client gets status code 429 after sending their first request: GET /. This is clearly puzzling behavior, as this status code indicates that the client has sent too many requests. From the client's perspective nothing was wrong with that request. The behavior is non-deterministic, and there is no explicit contract between client and server that would tell the client how many requests it is allowed to send in a given time period.

Now let's consider Elasticsearch as a white box and assume that multiple clients pushed the server close to overload. When the request from our previous example arrived, it tripped the in-flight requests circuit breaker because Elasticsearch was busy processing multiple large bulk requests at that time. Detecting this state clearly requires "global" knowledge that only the server, but no individual client, can have. In other words: this state is emergent behavior on the server side and not caused by any individual client doing something "wrong" (e.g. sending too many or too large requests). This brings us into status code 5xx land. Out of the available status codes, 503 seems most appropriate: "The 503 status code indicates that the server is currently unable to handle the request due to a temporary overload". This is exactly what we want to convey to the client: your request is probably fine but we just cannot handle it at the moment. Please come back later.

Therefore, I argue we should stick to HTTP 503. For permanent conditions we will omit Retry-After, for temporary ones we will set it.
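A hedged sketch of what this would mean for a caller of the low-level REST client (the handling shown is an assumption about how clients could react, not agreed behaviour; only the integer-seconds form of Retry-After is parsed):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

final class RetryAfterSketch {
    static void execute(RestClient restClient) throws Exception {
        try {
            Response response = restClient.performRequest(new Request("GET", "/_search"));
            // ... handle the successful response
        } catch (ResponseException e) {
            int status = e.getResponse().getStatusLine().getStatusCode();
            String retryAfter = e.getResponse().getHeader("Retry-After"); // null if absent
            if (status == 503 && retryAfter != null) {
                long delaySeconds = Long.parseLong(retryAfter); // RFC 7231 also allows an HTTP date here
                // temporary condition: schedule a retry of the same request after delaySeconds
            } else {
                throw e; // permanent condition: fail fast
            }
        }
    }
}
```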

@dakrone
Member

dakrone commented Jul 30, 2018

I thought about this a bit over the weekend and couldn't find a situation where the formula didn't work, so I think it sounds like a reasonable way forward.

I'm definitely in agreement with the RFC findings for keeping a 503 response with an optional Retry-After header. I think consistent response codes will also help (I imagine a number of people will never look at the Retry-After header and would be confused if we sometimes changed the status code).

@Mpdreamz
Member

Mpdreamz commented Jul 31, 2018

I think your analysis and thought experiment are spot on @danielmitterdorfer, but not the most pragmatic choice out in the wild.

I would like to propose we use 429 as a general overload HTTP status code. There is precedent for this in other APIs, e.g.:

https://docs.newrelic.com/docs/apis/rest-api-v2/requirements/api-overload-protection-handling-429-errors

This also ties into our existing retry behaviour across the clients and the ingest products.

Retry behaviour

☑️ = fast failover/retry, no exponential backoff
✅ = failover/retry with exponential backoff

Client helpers are methods coordinating multiple requests, e.g. bulk helpers and scroll helpers.

| Status | Clients | Ingest | Client Helpers |
|--------|---------|--------|----------------|
| 429    |         |        |                |
| 500    |         |        |                |
| 501    | ☑️      | ✅☑️    | ☑️             |
| 502    | ☑️      | ✅☑️    | ☑️             |
| 503    | ☑️      | ✅☑️    | ☑️             |
| 504    | ☑️      | ✅☑️    | ☑️             |

Note:

  • ✅☑️ => if the ingest product uses an official client, the call will fail over and retry fast before returning, and then the product may retry exponentially outside of the client
  • 503s are currently retried even in the Cp case. This needs fixing.
  • 503s are also quite often and liberally returned by proxies in the wild, and these proxies often strip headers, which makes basing decisions on the nature of the 503 hard to do.

Issuing a 429 will tie in with our existing handling.

| Status | Behaviour |
|--------|-----------|
| 429    | Exponential backoff retries in ingest products and client helpers |
| 500    | Return immediately |
| 501    | Failover and retry |
| 502    | Failover and retry |
| 503    | Failover and retry |
| 504    | Failover and retry |

Therefore, if CircuitBreakingException returns a 429 with Retry-After for the Ct case and a 500 for the Cp case, it doesn't complicate our current handling.
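Purely as an illustration of what this proposal would mean for the mapping (this is not the implementation that was eventually merged):

```java
import org.elasticsearch.rest.RestStatus;

final class ProposedMappingSketch {
    static RestStatus statusFor(boolean conditionIsTemporary) {
        return conditionIsTemporary
            ? RestStatus.TOO_MANY_REQUESTS      // 429, plus a Retry-After header
            : RestStatus.INTERNAL_SERVER_ERROR; // 500, returned fast and not retried by clients
    }
}
```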

@javanna
Member

javanna commented Jul 31, 2018

I personally don't mind which status code/header we use, but 429s are currently handled by bulk helpers, which are part of the high-level REST clients. These helpers are API-specific, while I think in this issue we are trying to come up with a generic mechanism that could be applied to the low-level REST clients (like the retry with back-off that we already have). Or are we discussing some mechanism to be applied to specific APIs only (e.g. search)?

When we generally talk about retries in the low-level clients, we mean that the node which returned a 503 error will be blacklisted and the request will be retried on another node, and so on. What is suggested in this issue for temporary failures is very different: retry that same request only on that same node, after a certain amount of time (possibly returned by the server). I am not convinced, though, that retrying on the same node only is a good strategy. Also, should that node be blacklisted or should it keep on receiving other requests in the meantime?

I would consider reducing the scope of this issue by addressing first the temporary vs permanent failure and doing the right thing out-of-the-box in the low-level clients, which is already not a trivial task. I would leave returning a proper retry interval to be applied in the clients for later, if we still want to do that. We should fix the current behaviour as our low-level clients currently end up retrying the same request on all nodes and marking nodes dead for both temporary and permanent failures, which is not a good behaviour either way.

@codebrain
Contributor

+1 for Retry-After, and +1 for using 503.

That said, I think there is a justification for using 429 on bulk operations. A bulk operation feels different from sending a flurry of individual requests, which may originate from multiple clients that are unaware of each other.

Would it also be possible to return a circuit breaker categorisation, e.g. a Retry-Category header or something similar? I'm not usually a fan of custom headers, but if we have additional information it could be useful to include it.

As I understand, not all operations are equal and some may be safe to send to the node (e.g. ingest) whilst others might not (e.g. search). If we have the categorisation we could potentially be smarter in the clients about how we route requests and keep the nodes 'alive' in the client for longer. Of course, we can ignore this header, but it would give us a choice in the future.

How is the back-pressure / circuit breaker information currently surfaced to the clients? Maybe this information is useful in a Retry-Category header.

Just my 2 cents.

@Mpdreamz
Member

I would consider reducing the scope of this issue by addressing first the temporary vs permanent failure and doing the right thing out-of-the-box in the clients, which is already not a trivial task.

This is exactly what I set out to achieve with my proposal.

I personally don't mind which status code/header we use, but an important point is that 429s are currently handled by bulk helpers, which are part of high-level REST clients. These helpers are API-specific, while I think in this issue we are trying to come up with a generic mechanism that could be applied to low-level REST clients (like the retry with back-off that we already have). Or are we discussing some mechanism to be applied to specific APIs only (e.g. search)?

The rules I am proposing would apply to all APIs in the lifetime of a single request, whether from a low- or high-level client.

  • low-level client: clients sending bytes/streams/dynamic objects, maps, hashes.
  • high-level client: clients sending and receiving typed responses.
  • helpers: various useful extensions that help coordinate requests and do the right thing.

(like the retry with back-off that we already have)

In the scope of a single request, the clients do not have a backoff period. We retry a request by failing over to the next node immediately. When a node is marked dead, it is not considered as a target for subsequent requests for a duration X. When that duration finishes the node is resurrected, but if it still fails we exponentially increase X. This is not a retry with back-off.
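A hedged sketch of that failover behaviour (the numbers and names are illustrative, not the clients' actual defaults): a node marked dead is skipped for a timeout that grows exponentially with each consecutive failure, up to a cap.

```java
import java.util.concurrent.TimeUnit;

final class DeadNodeTimeoutSketch {
    // How long to skip a dead node before trying to resurrect it.
    static long timeoutMillis(int consecutiveFailures) {
        long base = TimeUnit.MINUTES.toMillis(1);
        long max = TimeUnit.MINUTES.toMillis(30);
        // double the timeout for each consecutive failure, capped at `max`
        return Math.min(max, base * (1L << Math.min(consecutiveFailures - 1, 10)));
    }
}
```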

I would leave returning a proper retry interval to be applied in the clients for later.

Clients should never do exponential retries. Helpers we ship with the clients or our ingest products should/could/will, which is why I listed these as a separate entity.

What we want to achieve here with temporary failures is rather different: retry that same request only on that same node, after a certain amount of time (possibly returned by the server).

This is a point that needs discussing. From the client's perspective I am of the mind that the client should not do this OOTB. Again, our helpers or ingest products should. @elastic/es-clients please weigh in.

Since 503 is already a fast failover condition in the clients, and classifying a 503 without a retry header as permanent will be hard, using 429 will make the clients return fast, while our helpers and ingest products can decide the appropriate course of action (likely retry). A 500 is also not retried and will return fast from the client, and again we can yield the decision on what to do to the helpers and ingest products (likely not retry).

@danielmitterdorfer
Member Author

After discussions with the clients team we settled on using HTTP 429 for those exceptions in general. The durability (permanent vs. transient) will be returned in a dedicated new field.
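As a sketch of the agreed direction (the enum and method names below are illustrative; the actual names are in the linked change): the exception always maps to 429 and carries its durability separately, so clients and helpers can decide whether a retry makes sense.

```java
import org.elasticsearch.rest.RestStatus;

final class AgreedMappingSketch {
    enum Durability { TRANSIENT, PERMANENT }

    // The HTTP status no longer depends on the durability...
    static RestStatus status() {
        return RestStatus.TOO_MANY_REQUESTS; // 429 for both transient and permanent trips
    }

    // ...which is instead reported in a dedicated field of the error response.
    static Durability durabilityOf(boolean conditionIsTemporary) {
        return conditionIsTemporary ? Durability.TRANSIENT : Durability.PERMANENT;
    }
}
```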

@danielmitterdorfer danielmitterdorfer self-assigned this Oct 15, 2018
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Oct 15, 2018
With this commit we differentiate between permanent circuit breaking
exceptions (which require intervention from an operator and should not
be automatically retried) and transient ones (which may heal themselves
eventually and should be retried). Furthermore, the parent circuit
breaker will categorize a circuit breaking exception as either transient
or permanent based on the categorization of memory usage of its child
circuit breakers.

Closes elastic#31986
danielmitterdorfer added a commit that referenced this issue Nov 2, 2018
With this commit we differentiate between permanent circuit breaking
exceptions (which require intervention from an operator and should not
be automatically retried) and transient ones (which may heal themselves
eventually and should be retried). Furthermore, the parent circuit
breaker will categorize a circuit breaking exception as either transient
or permanent based on the categorization of memory usage of its child
circuit breakers.

Closes #31986
Relates #34460
@markharwood
Contributor

We spent some time on this issue differentiating between failures which should be retried and those that should not. This was all from the perspective of clients outside of Elasticsearch, but would it not also make sense to consider internal retry policies across replicas?
It looks like we retry failed searches exhaustively across all replicas, and I can't see any distinction drawn between failures that might be safe to retry on another node and those that might be dangerous (e.g. memory hogs).
The advantage of automated replica retries is that they can deliver success if only a single node is experiencing memory pressure, but the disadvantage is that the impact of an abusive query is amplified across the cluster. Have we thought about differentiating the replica-retry policy for different types of failure?
