discuss: ES query cancellation #1384
Comments
Here are some test scripts I wrote to experiment with this setting: …

From what I can see, enabling …

I think it's also important to ensure that the timeout is not purely cosmetic on the ES side. If the CPU is still being used then this is essentially useless, but if the timeout in fact cancels the worker thread then this is promising.
One final thought is that the …
Nice, I knew the ES team had been talking about this for a long time; it's good to see they actually made some progress. Cancelling requests after a timeout is definitely the best solution here, if it actually works. It looks like the low-level cancellations are enabled as of ES 7.3/8.0, if I'm reading the issue labels correctly, so we'll have to wait a while longer to take advantage of it. But perhaps there's no harm in setting timeout values now?
I think they've been available (albeit off by default) since v5, so we could take advantage of them immediately.
That's cool, a timeout is definitely a good idea. I'm a bit worried about the impact on phrase queries: in the benchmark, they seem to be a bit slower from the 50th to the 99th percentile, and they are used in many queries... On the other hand, if we have a lot of queries that use CPU for no reason... 🤔 I think it would be interesting to know the intended timeout value, the percentage of requests that would exceed it, and by how much (e.g. 2x the timeout, 10x the timeout...).
For years now we've been having issues with long-running search queries 'bouncing around the cluster' and causing performance issues for all other queries executed within the same timeframe.
The prevailing thinking has been that this is unavoidable due to the lack of a proper cancellation method within elasticsearch itself which would allow these renegade queries to be cancelled.
As a result, we have some configuration options such as `"requestTimeout": "120000"` (within pelias/config). The idea here is that we pass the delay back to the user in order to prevent retries before the ES cluster has finished processing the request.
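For reference, a minimal sketch of where that lives in `pelias.json` (assuming the usual pelias/config layout, where the `esclient` options are passed through to the elasticsearch-js client):

```
{
  "esclient": {
    "requestTimeout": "120000"
  }
}
```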
This always felt wrong to me, so I decided to revisit the topic last night and see what the current state of play is...
It turns out that elasticsearch does in fact have an internal query-cancellation method, documented here, which allows us to specify a timeout either globally (via `search.default_search_timeout`) or per-request (using the `timeout` query property). If the query times out you get a message in the response:
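A sketch of the per-request form (the index name `pelias` and the tiny timeout are illustrative, not what we'd actually use):

```
POST /pelias/_search
{
  "timeout": "10ms",
  "query": { "match_all": {} }
}
```

A request that exceeds the timeout comes back with a flag set, alongside whatever partial results were collected before the deadline:

```
{
  "took": 12,
  "timed_out": true,
  ...
}
```

The global equivalent, `search.default_search_timeout`, is a cluster-level setting, so presumably it could be applied without changing the queries we generate.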
This is great in theory but if you try it out you'll find that queries can run well over the prescribed timeout and there are lots of people online complaining about this behaviour.
My understanding of the reason for this is that, by default, the setting is only enforced at 'segment boundaries': the query continues to run until it reaches a segment boundary, at which point it checks the elapsed time, and only then does it error out or continue processing.
For fairly obvious reasons, that means the query can continue to run for a lot longer than the prescribed timeout on large segments. In fact, if you do a force merge with `_forcemerge?max_num_segments=1` after generating a build (as is recommended for performance by the ES team) then you get no timeout at all, because there's only one segment!
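(For context, that force-merge call is a sketch along these lines; again the index name is illustrative:)

```
POST /pelias/_forcemerge?max_num_segments=1
```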
Luckily there is another setting which can be enabled, `search.low_level_cancellation`, which "improves cancellation responsiveness". My understanding is that, in addition to the segment-boundary checks, further checks are performed at the completion of more granular operations, such as sub-query sections like `match` or `terms` (this seems to be confirmed in the benchmarks linked below). The caveat is that the setting comes with a warning about performance penalties, which would obviously apply when adding these sorts of checks all over the place.
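If we wanted it on an older cluster where it's off by default, the setting appears to be dynamic, so (a sketch, untested) it could be flipped on via the cluster settings API:

```
PUT /_cluster/settings
{
  "persistent": {
    "search.low_level_cancellation": true
  }
}
```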
So that's almost enough to scare me away from the idea, but then I found some old GitHub issues, going back several years, from the elasticsearch team discussing whether they wanted to turn `search.low_level_cancellation` on by default, and asking how severe the impact of this setting might be once enabled. This issue contains a suite of benchmarks performed by one of the ES team members which appear to show that the impact of enabling this setting is indistinguishable from test noise.
The resolution was that they finally enabled this setting by default earlier this year in this PR.
So... the question for the group is: can we change the way we handle timeouts so that we can better protect the cluster from long-running queries, and in the process tighten up some of these long timeouts which have been in our config for years?
cc/ @pelias/core @pelias/contributors