Elasticsearch throws 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS #38586
Comments
Pinging @elastic/es-core-infra
Regardless of the fix proposed in #39200, I wonder if we should not force the execution of the fetch phase on the coordinating node like we do for the search phase: elasticsearch/server/src/main/java/org/elasticsearch/action/search/InitialSearchPhase.java, line 176 in 9e70696.
I don't think we should fail the execution just because we fork a thread to reduce the results coming from different shards; rejection should happen at the shard level, not at the coordinating level where the execution of the query has already started. @jpountz, @s1monw, what do you think?
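For readers unfamiliar with what "forcing the execution" means here, the sketch below shows the general mechanism, assuming the existing `AbstractRunnable#isForceExecution` hook; the wrapper class and method names are illustrative, not the actual fix:

```java
// Sketch: submitting a coordinating-node task so the SEARCH pool cannot
// reject it once the query phase has already run. Only AbstractRunnable,
// ThreadPool, and isForceExecution() are real Elasticsearch APIs here.
import org.elasticsearch.common.util.concurrent.AbstractRunnable;
import org.elasticsearch.threadpool.ThreadPool;

final class ForcedFetchRunner {
    private final ThreadPool threadPool;

    ForcedFetchRunner(ThreadPool threadPool) {
        this.threadPool = threadPool;
    }

    void run(Runnable fetchWork, java.util.function.Consumer<Exception> onFailure) {
        threadPool.executor(ThreadPool.Names.SEARCH).execute(new AbstractRunnable() {
            @Override
            public boolean isForceExecution() {
                // The rejection handler honors this flag and queues the task
                // even when the queue is full, which is how per-shard query
                // tasks avoid rejections for work that must proceed.
                return true;
            }

            @Override
            protected void doRun() {
                fetchWork.run();
            }

            @Override
            public void onFailure(Exception e) {
                onFailure.accept(e);
            }
        });
    }
}
```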
On a similar note, I find query validation happening on non-coordinator nodes kind of confusing and inefficient.
@jimczi I agree that it feels wrong to reject fetch phases when we likely already did most of the work in the query phase. I'm wondering whether we should only allow rejections on the first phase that does something significant (i.e. not can_match)? I need to think more about it, but never rejecting on the coordinating node feels potentially dangerous to me.
+1 on rejecting earlier than in FetchPhase; it is way too expensive to discard all of the QueryPhase's processing. On that note, would it help if we gave FetchPhase a higher priority than SearchPhase during execution, since both are scheduled on the same threadpool? This situation would typically occur in a search-heavy workload, so the coordinating node has to keep all the shard responses in memory until the FetchPhase completes. If FetchPhase were to complete faster, the probability of throttling triggering during a FetchPhase's lifetime would decrease. Thoughts?
Giving fetch tasks a higher priority sounds desirable to me too. I had opened #29366 with data nodes in mind, but it probably makes sense for coordinating nodes as well.
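As a rough, hypothetical illustration of the priority idea (this is plain `java.util.concurrent`, not the Elasticsearch scheduler; all names below are made up): when both phases share one pool, letting queued fetch tasks sort ahead of queued query tasks allows in-flight searches to finish and release their buffered shard responses sooner.

```java
// Standalone demo: FETCH tasks overtake QUERY tasks in the shared queue.
// Priority only affects tasks that are actually queued, so a single worker
// is used here to force queueing.
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class PrioritizedSearchPoolDemo {
    enum Kind { FETCH, QUERY } // lower ordinal = higher priority

    static final class Task implements Runnable, Comparable<Task> {
        final Kind kind;
        final Runnable delegate;

        Task(Kind kind, Runnable delegate) {
            this.kind = kind;
            this.delegate = delegate;
        }

        @Override public void run() { delegate.run(); }

        @Override public int compareTo(Task other) {
            return Integer.compare(kind.ordinal(), other.kind.ordinal());
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());
        for (int i = 0; i < 3; i++) {
            final int n = i;
            pool.execute(new Task(Kind.QUERY, () -> System.out.println("query " + n)));
            pool.execute(new Task(Kind.FETCH, () -> System.out.println("fetch " + n)));
        }
        pool.shutdown();
    }
}
```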
@jpountz I can help take a crack at both, if it helps. It sems we would have the same priority assignment mechanism across both the coordinator and data nodes. Per our discussion on the other issue, the task right now is to generate significant search workload so that this issue becomes relevant. I can imagine some ways of artifically simulating the scenario, but I doubt if that is the right way to test. Any thoughts on generating the workload? Maybe a more intensive nyc_taxis track would help? |
Please do. |
When `EsRejectedExecutionException` gets thrown on the coordinating node while trying to fetch hits, the resulting exception will hold no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause, if available, whenever there are no shard failures, instead of blindly returning `503` like we currently do. Closes #38586
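A condensed sketch of the status-derivation logic that description implies (`ExceptionsHelper.status` and `ShardSearchFailure#status` are existing APIs; the wrapper class is illustrative, not the actual patch):

```java
// When there are no shard failures, derive the REST status from the cause
// (e.g. EsRejectedExecutionException -> 429) instead of defaulting to 503.
import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.action.search.ShardSearchFailure;
import org.elasticsearch.rest.RestStatus;

final class SearchStatusResolver {
    static RestStatus status(Throwable cause, ShardSearchFailure[] shardFailures) {
        if (shardFailures.length == 0) {
            // The failure happened on the coordinating node itself.
            return cause == null ? RestStatus.SERVICE_UNAVAILABLE : ExceptionsHelper.status(cause);
        }
        // Otherwise report the most severe shard-level status.
        RestStatus status = shardFailures[0].status();
        for (int i = 1; i < shardFailures.length; i++) {
            if (shardFailures[i].status().getStatus() >= 500) {
                status = shardFailures[i].status();
            }
        }
        return status;
    }
}
```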
Elasticsearch version (`bin/elasticsearch --version`): 6.2.3
Plugins installed: []
JVM version (`java -version`): 10.0.2
OS version (`uname -a` if on a Unix-like system): 4.14.62-65.117.amzn1.x86_64

Description of the problem including expected versus actual behavior:
Elasticsearch returns 503 SERVICE_UNAVAILABLE instead of 429 TOO_MANY_REQUESTS if EsRejectedExecutionException happens during the fetch phase.
This is not a server-side issue, and 429 is the right status code to let the client know about throttling.
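To make the distinction concrete, here is a hypothetical client-side sketch (plain `java.net`; the URL handling and backoff policy are illustrative): a 429 can safely be retried with backoff, whereas a 503 is conventionally read as the service being unavailable.

```java
// Retry only on 429; surface any other status (including 503) to the caller.
import java.net.HttpURLConnection;
import java.net.URL;

final class ThrottleAwareClient {
    static int getWithRetry(String url, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status == 429 && attempt < maxRetries) {
                Thread.sleep((1L << attempt) * 100); // exponential backoff
                continue;
            }
            return status;
        }
    }
}
```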
Steps to reproduce:
Since the error is hard to reproduce without sending a lot of requests, the approach below can be used to mock the issue.
This ensures that execution of the fetch phase is rejected irrespective of thread_pool and queue size.
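One hypothetical way to mock this (not necessarily the reporter's exact approach) is an executor that rejects every task, substituted for the fetch-phase executor in a test; `EsRejectedExecutionException` is the real Elasticsearch class, everything else here is made up:

```java
// Always-rejecting executor: simulates a saturated SEARCH pool so the fetch
// phase is rejected no matter how thread_pool.search is configured.
import java.util.concurrent.Executor;
import org.elasticsearch.common.util.concurrent.EsRejectedExecutionException;

final class AlwaysRejectingExecutor implements Executor {
    @Override
    public void execute(Runnable command) {
        throw new EsRejectedExecutionException("simulated rejection");
    }
}
```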
Provide logs (if relevant):
Below is the stack trace