Add a hard time limit for the entire search request #30897
@fantapsody thanks for raising this point, it makes a lot of sense to me. However, it looks very similar to an ongoing discussion about default search timeouts that we are already having in issues like #26308. Can you take a look at that discussion and see if it matches your needs? If so, I would suggest closing this issue as a duplicate since it is already tracked.
Pinging @elastic/es-search-aggs
@cbuescher thanks for your reply. I have read the original and related issues (#26238, #26258, #25776) you mentioned above, and briefly read the patch in #25776. It seems to me that the timeout discussed in those topics is the execution time on each shard, while the time limit I am suggesting is the total service time of the entire search request. For example, if the search queue of a node has hundreds of pending requests, it may take tens of seconds before the search starts executing on that node, and even if the search itself takes only 1ms on the shard, it may still take tens of seconds before the client receives the response. As a result, the search timeout seems meaningless to users, since it gives no promise of when the response returns.
@fantapsody thanks for checking, will leave this open and discuss it further with the team then.
@simonw suggested that taking queue time into account should be easy by computing the current time when the shard-level request is received.
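One way to read this suggestion (a minimal sketch with hypothetical names, not Elasticsearch code) is that the coordinator stamps the shard-level request with the time it was sent, and the data node compares that timestamp with the current time when it takes the request off the queue, shrinking the remaining timeout accordingly:

```java
// Minimal sketch, not the actual Elasticsearch implementation: the coordinator's send
// time travels with the shard-level request, and the data node subtracts the time the
// request spent queued (and in transit) from the remaining timeout before executing.
final class ShardSearchTiming {

    // Hypothetical request fields: coordinator send timestamp and the user-supplied timeout.
    record ShardRequest(long coordinatorSendTimeMillis, long timeoutMillis) {}

    /** Timeout left for the shard query, or -1 if it already expired while queued. */
    static long remainingTimeoutMillis(ShardRequest request, long nowMillis) {
        long waited = Math.max(0, nowMillis - request.coordinatorSendTimeMillis());
        long remaining = request.timeoutMillis() - waited;
        return remaining > 0 ? remaining : -1;
    }

    public static void main(String[] args) {
        // Pretend the request waited ~800ms in the search queue against a 1s timeout.
        ShardRequest request = new ShardRequest(System.currentTimeMillis() - 800, 1_000);
        long remaining = remainingTimeoutMillis(request, System.currentTimeMillis());
        if (remaining < 0) {
            System.out.println("Timed out while queued; return a partial/timed-out response immediately");
        } else {
            System.out.println("Run the shard query with a remaining timeout of " + remaining + " ms");
        }
    }
}
```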
Hi @jpountz @cbuescher, after having been bothered by this problem for quite a while, we decided to improve the query timeout mechanism ourselves. After deploying our solution to our production environment, it has worked quite well, so I would like to share our work and experience with you, and discuss whether it could be a general, long-term solution to the problem. The key change in our solution is that the timer for a search starts when the request first reaches the coordinator node. It therefore covers many more phases than just the shard queries, including pre-filter, dfs, shard queries (including queue time), fetch, reduce, etc. It also includes the time spent in message transmission between nodes. We believe this provides a more meaningful semantic of search timeout to users. The mechanism is designed on a best-effort basis, which means the timeout is guaranteed to be triggered after the given timeout value, but could be delayed significantly by various factors such as GC, network, etc. The key ideas of the implementation are:
Compared with the current query timeout implementation from the community, requests do not accumulate in queues in our solution, the cluster has a better chance to recover from a batch of anomalously slow requests, and users experience clearer search timeout behaviour. Of course, the timer registered by the search task introduces extra cost. However, since the number of search requests per second is at most several thousand, and the time complexity of adding and removing a task in the scheduler is O(log(n)), the impact on performance is negligible. Nevertheless, users with stricter demands on timeout accuracy, such as services that enforce a timeout for each RPC request, may have to release the query context on the client side before a successful or timed-out search response returns. In that case, the server-side timeout mechanism is basically used as a way to release server-side search contexts asynchronously.
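To make the mechanism concrete, here is a minimal, self-contained sketch of the coordinator-side deadline idea (my own illustration with made-up names, not the actual patch we deployed): a shared scheduler registers a deadline when the search starts, cancels the running task if the deadline fires first, and deregisters the timer when the search completes in time.

```java
import java.util.concurrent.*;

public class CoordinatorDeadlineDemo {

    // Shared scheduler for all in-flight searches on the coordinator.
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs a search with a hard, best-effort deadline measured from when it is submitted. */
    static <T> T executeWithDeadline(Callable<T> search, long timeoutMillis) throws Exception {
        ExecutorService searchPool = Executors.newSingleThreadExecutor();
        Future<T> searchFuture = searchPool.submit(search);
        // Registering/removing the deadline is an O(log n) operation on the scheduler's delay queue.
        ScheduledFuture<?> deadline = SCHEDULER.schedule(
                () -> searchFuture.cancel(true), timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return searchFuture.get();      // throws CancellationException if the deadline fired
        } finally {
            deadline.cancel(false);         // deregister the timer if the search finished in time
            searchPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // A "search" that takes 5s against a 1s deadline: it is cancelled, freeing the worker.
            executeWithDeadline(() -> { Thread.sleep(5_000); return "hits"; }, 1_000);
        } catch (CancellationException e) {
            System.out.println("Search exceeded the coordinator-level deadline and was cancelled");
        } finally {
            SCHEDULER.shutdown();
        }
    }
}
```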
Hi @jpountz @cbuescher @jimczi,
Thanks @fantapsody for the detailed description of your solution. We agree search timeout & cancellation is an important topic with some ongoing discussions.
From #54056: Currently the timeout is only applied at the shard level. Shards are expected to wrap up ongoing work as soon as they hit the timeout and return a partial response. This is usually good enough, unless shard requests spend time in the search queue. For instance, if a timeout of 1s is configured but the shard search request waits for 10s in the search queue, then Elasticsearch wouldn't return a response in less than 10s despite the timeout. This hasn't been much of an issue until now, but I believe that the introduction of slower options like searchable snapshots and schema on read is going to make this issue worse, as it'll make it more likely to have shard requests waiting in the queue.
We at the @elastic/kibana-app-arch team are working on adding a dedicated search timeout to Kibana. However, due to performance improvements in Elasticsearch, we are going to set that timeout to quite a high value in 7.10 (TBD) and support disabling it altogether (which is going to be the default starting in 7.11).
Several of our APIs, including the search API, now support automatic cancellation on connection close. That means the client can set a timeout on its request and close the connection when it expires. Elasticsearch will react by cancelling all the ongoing shard requests caused by the request that was cancelled. This achieves the original goal of this issue, which was killing a search request if it takes longer than a certain timeout.
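For reference, this is roughly what the pattern looks like from the client side. The sketch below uses the plain JDK HTTP client rather than an Elasticsearch client library, and the host, index name, and query are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class ClientSideSearchTimeout {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_search"))
                .timeout(Duration.ofSeconds(5))   // hard client-side deadline
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"query\":{\"match_all\":{}}}"))
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (HttpTimeoutException e) {
            // The client abandons the exchange and the connection is closed; Elasticsearch
            // then cancels the ongoing shard requests belonging to this search.
            System.out.println("Search abandoned after 5s; server-side cancellation takes over");
        }
    }
}
```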
I wonder if there is any way to limit the service time of the entire search request, respond to the client with a timeout error, and cancel all remaining search requests on shards automatically. I have done some research on the official documentation, issues and discussions, and even the source code, and found that the search timeout parameter only limits the execution of the search on each shard, not the entire service time of the search request (time in the queue, time parsing the query, time rendering the response, etc.).
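For context, this is the per-shard `timeout` parameter referred to above (a minimal illustration with a placeholder index and query): it bounds the query phase on each shard and marks the response as `timed_out`, but it does not bound queue, parse, fetch, or reduce time, so the end-to-end latency seen by the client can still be much larger.

```java
public class ShardTimeoutExample {
    public static void main(String[] args) {
        // Request body sent as POST /my-index/_search; shards that hit the 1s limit
        // stop early and return partial hits, but queue time is not counted against it.
        String searchBody = """
            {
              "timeout": "1s",
              "query": { "match": { "message": "error" } }
            }
            """;
        System.out.println(searchBody);
    }
}
```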
This feature is extremely important for online services using Elasticsearch as the data engine. Imagine an online service serving tens of thousands of requests per second, where each request makes a search request to the Elasticsearch cluster. The service request comes with a timeout, such as 5 seconds, and if the time needed to fulfill the request exceeds the timeout, another request is made to retry. In this case, even if only 1% of search requests run longer than the timeout, say 10 seconds, it can cause big trouble for the whole service cluster: if we limit the number of outstanding search requests on each application server, the search request pool may be saturated quickly by retries, as the long service requests never succeed and outstanding search requests take a long time to release, even though their results are useless. If we don't limit the number of outstanding search requests on each application server, the heavy search requests may accumulate in the Elasticsearch cluster, which can impact normal search requests or even bring down nodes. In fact, this is what happened to our cluster. I think this could be improved by a hard search time limit mechanism that releases all resources of the search contexts after the timeout.
So, do you have any plan for this, a hard time limit for the search request? Thanks!