
Elasticsearch throws 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS #38586

Closed
jainankitk opened this issue Feb 7, 2019 · 9 comments
Labels
>bug :Core/Infra/REST API REST infrastructure and utilities

Comments

@jainankitk
Contributor

jainankitk commented Feb 7, 2019

Elasticsearch version (bin/elasticsearch --version): 6.2.3

Plugins installed: []

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): 4.14.62-65.117.amzn1.x86_64

Description of the problem including expected versus actual behavior:
Elasticsearch returns 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS if EsRejectedExecutionException happens during the fetch phase.
This is not a server-side issue, and 429 is the right status code to let the client know about throttling.
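For illustration, a minimal sketch of why the distinction matters to a client, assuming a plain java.net.HttpURLConnection caller: a 429 can be retried with backoff, whereas a 503 is usually surfaced as a server-side failure. The endpoint, retry count, and backoff values below are arbitrary.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchWithBackoff {

    // Hypothetical helper: retries the request only when the cluster signals
    // throttling (429). A 503 is treated as a server-side failure and returned
    // immediately, which is why the status code matters to callers.
    static int searchWithRetry(String endpoint, int maxRetries) throws IOException, InterruptedException {
        long backoffMillis = 100;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status == 429 && attempt < maxRetries) {
                Thread.sleep(backoffMillis); // back off before retrying on throttling
                backoffMillis *= 2;          // exponential backoff
            } else {
                return status;               // success, 503, or retries exhausted
            }
        }
        return 429; // unreachable: the last iteration always returns
    }

    public static void main(String[] args) throws Exception {
        System.out.println("final status: "
            + searchWithRetry("http://localhost:9200/test-index/_search", 3));
    }
}
```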

Steps to reproduce:

Since the error is hard to reproduce without sending a lot of requests, the following approach can be used to mock the issue.

  1. Add the following code snippet at line 99 of EsThreadPoolExecutor.java:
// Reject any task submitted as part of the fetch phase, regardless of
// thread pool and queue size, by inspecting the calling stack trace.
for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
    if (ste.toString().contains("org.elasticsearch.action.search.FetchSearchPhase")) {
        throw new EsRejectedExecutionException("Thread pool exception during fetch phase");
    }
}

This ensures that execution of the fetch phase is rejected irrespective of thread pool and queue size.

  2. Compile the jar using Gradle and start the Elasticsearch cluster with the new jar.
  3. Create test-index, add some data, and run a query requesting documents from stored fields.
  4. A 503 Service Unavailable error is returned:
* Connected to localhost (127.0.0.1) port 9200 (#0)
> POST /test-index/_search?request_cache=false&pretty HTTP/1.1
> Host: localhost:9200
> User-Agent: curl/7.53.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 343
> 
} [343 bytes data]
* upload completely sent off: 343 out of 343 bytes
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 340
< 
{ [340 bytes data]
* Connection #0 to host localhost left intact
{
  "error" : {
    "root_cause" : [ ],
    "type" : "search_phase_execution_exception",
    "reason" : "",
    "phase" : "fetch",
    "grouped" : true,
    "failed_shards" : [ ],
    "caused_by" : {
      "type" : "es_rejected_execution_exception",
      "reason" : "Thread pool exception during fetch phase"
    }
  },
  "status" : 503
}

Provide logs (if relevant):
Below is the stack trace:

[2019-02-07T17:52:43,917][WARN ][r.suppressed             ] path: /test-index/_search, params: {pretty=, index=test-index, request_cache=false}
org.elasticsearch.action.search.SearchPhaseExecutionException:
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:92) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:662) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:109) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor.doExecute(QueueResizingEsThreadPoolExecutor.java:88) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:289) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:81) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:146) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:140) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:251) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:239) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:49) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:212) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.maybeFork(InitialSearchPhase.java:171) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase.access$000(InitialSearchPhase.java:49) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.InitialSearchPhase$2.innerOnResponse(InitialSearchPhase.java:212) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:45) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:29) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:68) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:36) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:46) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:518) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1091) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1160) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1150) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1139) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$6$1.onResponse(SearchTransportService.java:376) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.action.search.SearchTransportService$6$1.onResponse(SearchTransportService.java:372) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.3.jar:6.2.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: Thread pool exception during fetch phase
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:101) ~[elasticsearch-6.2.3.jar:6.2.3]
        ... 37 more
@gwbrown gwbrown added >bug :Core/Infra/REST API REST infrastructure and utilities labels Feb 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@jimczi
Contributor

jimczi commented Mar 6, 2019

Regardless of the fix proposed in #39200, I wonder if we should not force the execution of the fetch phase on the coordinating node, like we do for the search phase.


I don't think we should fail the execution just because we fork a thread to reduce the results coming from different shards. Rejection should happen at the shard level, not at the coordinating level where execution of the query has already started. @jpountz, @s1monw, what do you think?
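As a rough sketch of what forcing execution could look like, assuming the existing contract where EsAbortPolicy force-queues an AbstractRunnable that reports isForceExecution() == true instead of rejecting it; the class name and wiring below are hypothetical, not the actual fetch-phase code:

```java
import org.elasticsearch.common.util.concurrent.AbstractRunnable;

// Hypothetical wrapper for a fetch-phase task that opts out of rejection on the
// coordinating node by overriding isForceExecution().
final class ForcedFetchPhaseRunnable extends AbstractRunnable {

    private final Runnable fetchPhase;

    ForcedFetchPhaseRunnable(Runnable fetchPhase) {
        this.fetchPhase = fetchPhase;
    }

    @Override
    public boolean isForceExecution() {
        // The query phase has already done most of the work by the time we get
        // here, so queue the task even when the pool is saturated.
        return true;
    }

    @Override
    protected void doRun() {
        fetchPhase.run();
    }

    @Override
    public void onFailure(Exception e) {
        // A real integration would fail the search phase here; printing keeps
        // the sketch self-contained.
        e.printStackTrace();
    }
}
```

The trade-off is that a force-executed task bypasses back-pressure on the coordinating node entirely, which is the concern raised in the next comment.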

@jainankitk
Contributor Author

jainankitk commented Mar 6, 2019

On a similar note, I find query validation happening on non-coordinator nodes kind of confusing and inefficient.

@jpountz
Contributor

jpountz commented Mar 6, 2019

@jimczi I agree that it feels wrong to reject fetch phases when we have likely already done most of the work in the query phase. I'm wondering whether we should only allow rejections on the first phase that does something significant (i.e. not can_match)? I need to think more about it, but never rejecting on the coordinating node feels potentially dangerous to me.

@atris

atris commented Mar 7, 2019

+1 on rejecting earlier than in FetchPhase. It is way too expensive to discard all of the QueryPhase's processing.

On that note, would it help if we gave FetchPhase a higher priority than SearchPhase during execution, since both are scheduled on the same thread pool? This situation would typically occur in a search-heavy workload, so the coordinating node has to keep all the shard responses in memory until the FetchPhase completes. If FetchPhase were to complete faster, the probability of throttling triggering during a FetchPhase's lifetime would decrease.

Thoughts?
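To make the priority idea above concrete, here is a minimal self-contained sketch in plain Java, not Elasticsearch's actual executor: a single-worker pool drains a priority queue in which a lower number means higher priority, so a queued fetch task runs before a queued query task. All class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PrioritizedPhaseExecutorSketch {

    // Illustrative task wrapper: lower value = higher priority.
    static final class PhaseTask implements Runnable {
        final int priority;
        final Runnable delegate;

        PhaseTask(int priority, Runnable delegate) {
            this.priority = priority;
            this.delegate = delegate;
        }

        @Override
        public void run() {
            delegate.run();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Comparator<Runnable> byPriority =
            Comparator.comparingInt(r -> ((PhaseTask) r).priority);
        // Single worker so that queued tasks are drained strictly in priority order.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new PriorityBlockingQueue<>(16, byPriority));

        // Occupy the only worker so the next two tasks actually queue up.
        executor.execute(new PhaseTask(2, () -> sleepQuietly(200)));
        // The query-phase task is queued first but has lower priority (1) than the
        // fetch-phase task (0), so "fetch phase" is printed before "query phase".
        executor.execute(new PhaseTask(1, () -> System.out.println("query phase")));
        executor.execute(new PhaseTask(0, () -> System.out.println("fetch phase")));

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that tasks go through execute() rather than submit() here; submit() would wrap them in FutureTask instances that the priority comparator cannot cast.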

@jpountz
Contributor

jpountz commented Mar 7, 2019

Giving fetch tasks a higher priority sounds desirable to me too. I had opened #29366 with data nodes in mind, but it probably makes sense for coordinating nodes as well.

@atris

atris commented Mar 7, 2019

@jpountz I can help take a crack at both, if that helps. It seems we would have the same priority-assignment mechanism across both the coordinating and data nodes.

Per our discussion on the other issue, the task right now is to generate a significant search workload so that this issue becomes relevant. I can imagine some ways of artificially simulating the scenario, but I doubt that is the right way to test.

Any thoughts on generating the workload? Maybe a more intensive nyc_taxis track would help?

javanna pushed a commit that referenced this issue Mar 8, 2019
When EsRejectedExecutionException gets thrown on the coordinating node while trying to fetch hits, the resulting exception holds no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause, if available, whenever there are no shard failures, instead of blindly returning `503` like we currently do.

Closes #38586
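For reference, a rough sketch of the status resolution that commit message describes, assuming Elasticsearch's ExceptionsHelper.status helper; the actual change is in #39200 and may differ in detail:

```java
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.action.search.ShardSearchFailure;
import org.elasticsearch.rest.RestStatus;

final class StatusFromCauseSketch {

    // Sketch only: when there are no shard failures (e.g. a rejection on the
    // coordinating node during the fetch phase), derive the REST status from the
    // cause instead of defaulting to 503, so EsRejectedExecutionException maps to 429.
    static RestStatus resolveStatus(ShardSearchFailure[] shardFailures, Throwable cause) {
        if (shardFailures.length == 0) {
            return cause == null ? RestStatus.SERVICE_UNAVAILABLE : ExceptionsHelper.status(cause);
        }
        // Otherwise keep reporting a status derived from the shard failures; this
        // sketch simply picks the numerically highest one.
        RestStatus status = RestStatus.OK;
        for (ShardSearchFailure failure : shardFailures) {
            if (failure.status().getStatus() > status.getStatus()) {
                status = failure.status();
            }
        }
        return status;
    }
}
```

With that in place, the rejection shown in the curl output above would surface as `429` rather than `503`.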
@atris

atris commented Mar 8, 2019

@jpountz @javanna Should I open a separate issue for the discussion above?

@jpountz
Contributor

jpountz commented Mar 8, 2019

Please do.

javanna pushed a commit that referenced this issue Mar 11, 2019
When EsRejectedExecutionException gets thrown on the coordinating node while trying to fetch hits, the resulting exception holds no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause, if available, whenever there are no shard failures, instead of blindly returning `503` like we currently do.

Closes #38586