
Elasticsearch throws 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS #38586

Closed
jainankitk opened this issue Feb 7, 2019 · 9 comments
Labels
>bug :Core/Infra/REST API REST infrastructure and utilities

Comments

@jainankitk
Contributor

jainankitk commented Feb 7, 2019

Elasticsearch version (bin/elasticsearch --version): 6.2.3

Plugins installed: []

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): 4.14.62-65.117.amzn1.x86_64

Description of the problem including expected versus actual behavior:
Elasticsearch returns 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS if EsRejectedExecutionException happens during the fetch phase.
This is not a server-side issue, and 429 is the right status code to let the client know about throttling.
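For illustration, a minimal sketch of why the distinction matters to a client, assuming a plain java.net.HttpURLConnection caller: a 429 can be retried with backoff, whereas a 503 is usually surfaced as a server-side failure. The endpoint, retry count, and backoff values below are arbitrary.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchWithBackoff {

    // Hypothetical helper: retries the request only when the cluster signals
    // throttling (429). A 503 is treated as a server-side failure and returned
    // immediately, which is why the status code matters to callers.
    static int searchWithRetry(String endpoint, int maxRetries) throws IOException, InterruptedException {
        long backoffMillis = 100;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status == 429 && attempt < maxRetries) {
                Thread.sleep(backoffMillis); // back off before retrying on throttling
                backoffMillis *= 2;          // exponential backoff
            } else {
                return status;               // success, 503, or retries exhausted
            }
        }
        return 429; // unreachable: the last iteration always returns
    }

    public static void main(String[] args) throws Exception {
        System.out.println("final status: "
            + searchWithRetry("http://localhost:9200/test-index/_search", 3));
    }
}
```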

Steps to reproduce:

Since the error is hard to reproduce without sending a lot of requests, the following approach can be used to mock the issue.

  1. Add the following code snippet at line 99 of EsThreadPoolExecutor.java:
// Reject any task submitted as part of the fetch phase, regardless of
// thread pool and queue size, by inspecting the calling stack trace.
for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
    if (ste.toString().contains("org.elasticsearch.action.search.FetchSearchPhase")) {
        throw new EsRejectedExecutionException("Thread pool exception during fetch phase");
    }
}

This ensures that execution of the fetch phase is rejected irrespective of thread pool and queue size.

  2. Compile the jar using Gradle and start the Elasticsearch cluster with the new jar.
  3. Create test-index, add some data, and run a query requesting documents from stored fields.
  4. A 503 Service Unavailable error is returned:
* Connected to localhost (127.0.0.1) port 9200 (#0)
> POST /test-index/_search?request_cache=false&pretty HTTP/1.1
> Host: localhost:9200
> User-Agent: curl/7.53.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 343
> 
} [343 bytes data]
* upload completely sent off: 343 out of 343 bytes
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 340
< 
{ [340 bytes data]
* Connection #0 to host localhost left intact
{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "",
    "phase" : "fetch",
    "grouped" : true,
    "failed_shards" : [ ],
    "caused_by" : {
      "type" : "es_rejected_execution_exception",
      "reason" : "Thread pool exception during fetch phase"
    }
  },
  "status" : 503
}

Provide logs (if relevant):
Below is the stack trace:

[2019-02-07T17:52:43,917][WARN ][r.suppressed             ] path: /test-index/_search, params: {pretty=, index=test-index, request_cache=false}
org.elasticsearch.action.search.SearchPhaseExecutionException:
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:92) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:662) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:109) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor.doExecute(QueueResizingEsThreadPoolExecutor.java:88) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:289) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:81) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:146) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:140) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:251) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:239) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:49) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:212) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.maybeFork(InitialSearchPhase.java:171) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$000(InitialSearchPhase.java:49) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase$2.innerOnResponse(InitialSearchPhase.java:212) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:45) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:29) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:68) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:36) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:46) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:518) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1091) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1160) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1150) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1139) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$6$1.onResponse(SearchTransportService.java:376) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$6$1.onResponse(SearchTransportService.java:372) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.3.jar:6.2.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: Thread pool exception during fetch phase
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:101) ~[elasticsearch-6.2.3.jar:6.2.3]
        ... 37 more
@gwbrown gwbrown added >bug :Core/Infra/REST API REST infrastructure and utilities labels Feb 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@jimczi
Contributor

jimczi commented Mar 6, 2019

Regardless of the fix proposed in #39200, I wonder if we should not force the execution of the fetch phase on the coordinating node, like we do for the search phase.


I don't think we should fail the execution just because we fork a thread to reduce the results coming from different shards. Rejection should happen at the shard level, not at the coordinating level where execution of the query has already started. @jpountz, @s1monw, what do you think?
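As a rough sketch of what forcing execution could look like, assuming the existing contract where EsAbortPolicy force-queues an AbstractRunnable that reports isForceExecution() == true instead of rejecting it; the class name and wiring below are hypothetical, not the actual fetch-phase code:

```java
import org.elasticsearch.common.util.concurrent.AbstractRunnable;

// Hypothetical wrapper for a fetch-phase task that opts out of rejection on the
// coordinating node by overriding isForceExecution().
final class ForcedFetchPhaseRunnable extends AbstractRunnable {

    private final Runnable fetchPhase;

    ForcedFetchPhaseRunnable(Runnable fetchPhase) {
        this.fetchPhase = fetchPhase;
    }

    @Override
    public boolean isForceExecution() {
        // The query phase has already done most of the work by the time we get
        // here, so queue the task even when the pool is saturated.
        return true;
    }

    @Override
    protected void doRun() {
        fetchPhase.run();
    }

    @Override
    public void onFailure(Exception e) {
        // A real integration would fail the search phase here; printing keeps
        // the sketch self-contained.
        e.printStackTrace();
    }
}
```

The trade-off is that a force-executed task bypasses back-pressure on the coordinating node entirely, which is the concern raised in the next comment.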

@jainankitk
Contributor Author

jainankitk commented Mar 6, 2019

On a similar note, I find query validation happening on non-coordinator nodes kind of confusing and inefficient.

@jpountz
Contributor

jpountz commented Mar 6, 2019

@jimczi I agree that it feels wrong to reject fetch phases when we have likely already done most of the work in the query phase. I'm wondering whether we should only allow rejections on the first phase that does something significant (i.e. not can_match)? I need to think more about it, but never rejecting on the coordinating node feels potentially dangerous to me.

@atris

atris commented Mar 7, 2019

+1 on rejecting earlier than in FetchPhase. It is way too expensive to discard all of the QueryPhase's processing.

On that note, would it help if we gave FetchPhase a higher priority than SearchPhase during execution, since both are scheduled on the same thread pool? This situation would typically occur in a search-heavy workload, so the coordinating node has to keep all the shard responses in memory until the FetchPhase completes. If FetchPhase were to complete faster, the probability of throttling triggering during a FetchPhase's lifetime would decrease.

Thoughts?
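To make the priority idea above concrete, here is a minimal self-contained sketch in plain Java, not Elasticsearch's actual executor: a single-worker pool drains a priority queue in which a lower number means higher priority, so a queued fetch task runs before a queued query task. All class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PrioritizedPhaseExecutorSketch {

    // Illustrative task wrapper: lower value = higher priority.
    static final class PhaseTask implements Runnable {
        final int priority;
        final Runnable delegate;

        PhaseTask(int priority, Runnable delegate) {
            this.priority = priority;
            this.delegate = delegate;
        }

        @Override
        public void run() {
            delegate.run();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Comparator<Runnable> byPriority =
            Comparator.comparingInt(r -> ((PhaseTask) r).priority);
        // Single worker so that queued tasks are drained strictly in priority order.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new PriorityBlockingQueue<>(16, byPriority));

        // Occupy the only worker so the next two tasks actually queue up.
        executor.execute(new PhaseTask(2, () -> sleepQuietly(200)));
        // The query-phase task is queued first but has lower priority (1) than the
        // fetch-phase task (0), so "fetch phase" is printed before "query phase".
        executor.execute(new PhaseTask(1, () -> System.out.println("query phase")));
        executor.execute(new PhaseTask(0, () -> System.out.println("fetch phase")));

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that tasks go through execute() rather than submit() here; submit() would wrap them in FutureTask instances that the priority comparator cannot cast.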

@jpountz
Contributor

jpountz commented Mar 7, 2019

Giving fetch tasks a higher priority sounds desirable to me too. I had opened #29366 with data nodes in mind, but it probably makes sense for coordinating nodes as well.

@atris

atris commented Mar 7, 2019

@jpountz I can help take a crack at both, if that helps. It seems we would have the same priority-assignment mechanism across both the coordinating and data nodes.

Per our discussion on the other issue, the task right now is to generate a significant search workload so that this issue becomes relevant. I can imagine some ways of artificially simulating the scenario, but I doubt that is the right way to test.

Any thoughts on generating the workload? Maybe a more intensive nyc_taxis track would help?

javanna pushed a commit that referenced this issue Mar 8, 2019
When EsRejectedExecutionException gets thrown on the coordinating node while trying to fetch hits, the resulting exception holds no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause, if available, whenever there are no shard failures, instead of blindly returning `503` like we currently do.

Closes #38586
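For reference, a rough sketch of the status resolution that commit message describes, assuming Elasticsearch's ExceptionsHelper.status helper; the actual change is in #39200 and may differ in detail:

```java
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.action.search.ShardSearchFailure;
import org.elasticsearch.rest.RestStatus;

final class StatusFromCauseSketch {

    // Sketch only: when there are no shard failures (e.g. a rejection on the
    // coordinating node during the fetch phase), derive the REST status from the
    // cause instead of defaulting to 503, so EsRejectedExecutionException maps to 429.
    static RestStatus resolveStatus(ShardSearchFailure[] shardFailures, Throwable cause) {
        if (shardFailures.length == 0) {
            return cause == null ? RestStatus.SERVICE_UNAVAILABLE : ExceptionsHelper.status(cause);
        }
        // Otherwise keep reporting a status derived from the shard failures; this
        // sketch simply picks the numerically highest one.
        RestStatus status = RestStatus.OK;
        for (ShardSearchFailure failure : shardFailures) {
            if (failure.status().getStatus() > status.getStatus()) {
                status = failure.status();
            }
        }
        return status;
    }
}
```

With that in place, the rejection shown in the curl output above would surface as `429` rather than `503`.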
@atris

atris commented Mar 8, 2019

@jpountz @javanna Should I open a separate issue for the discussion above?

@jpountz
Contributor

jpountz commented Mar 8, 2019

Please do.

javanna pushed a commit that referenced this issue Mar 11, 2019
When EsRejectedExecutionException gets thrown on the coordinating node while trying to fetch hits, the resulting exception holds no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause, if available, whenever there are no shard failures, instead of blindly returning `503` like we currently do.

Closes #38586