Add a hard time limit for the entire search request #30897
@fantapsody thanks for raising this point, it makes a lot of sense to me. However, it looks very similar to an ongoing discussion about default search timeouts that we are already having in issues like #26308. Can you take a look at that discussion and see if it matches your needs? If so, I would suggest closing this issue as a duplicate since it is already tracked.
Pinging @elastic/es-search-aggs
@cbuescher thanks for your reply. I have read the original and related issues (#26238, #26258, #25776) you mentioned above, and briefly read the patch in #25776. It seems to me that the timeout discussed in those topics is the execution time on each shard, while the time limit I am suggesting is the total service time of the entire search request. For example, if the search queue of a node has hundreds of pending requests, it may take tens of seconds before the search starts executing on that node, and even if the search itself takes only 1ms on the shard, it may still take tens of seconds before the client receives the response. As a result, the search timeout seems meaningless to users, since it gives no promise of when the response returns.
@fantapsody thanks for checking, will leave this open and discuss it further with the team then.
@simonw suggested that taking queue time into account should be easy by computing the current time when the shard-level request is received.
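One way to read this suggestion (a minimal sketch with hypothetical names, not Elasticsearch code) is that the coordinator stamps the shard-level request with the time it was sent, and the data node compares that timestamp with the current time when it takes the request off the queue, shrinking the remaining timeout accordingly:

```java
// Minimal sketch, not the actual Elasticsearch implementation: the coordinator's send
// time travels with the shard-level request, and the data node subtracts the time the
// request spent queued (and in transit) from the remaining timeout before executing.
final class ShardSearchTiming {

    // Hypothetical request fields: coordinator send timestamp and the user-supplied timeout.
    record ShardRequest(long coordinatorSendTimeMillis, long timeoutMillis) {}

    /** Timeout left for the shard query, or -1 if it already expired while queued. */
    static long remainingTimeoutMillis(ShardRequest request, long nowMillis) {
        long waited = Math.max(0, nowMillis - request.coordinatorSendTimeMillis());
        long remaining = request.timeoutMillis() - waited;
        return remaining > 0 ? remaining : -1;
    }

    public static void main(String[] args) {
        // Pretend the request waited ~800ms in the search queue against a 1s timeout.
        ShardRequest request = new ShardRequest(System.currentTimeMillis() - 800, 1_000);
        long remaining = remainingTimeoutMillis(request, System.currentTimeMillis());
        if (remaining < 0) {
            System.out.println("Timed out while queued; return a partial/timed-out response immediately");
        } else {
            System.out.println("Run the shard query with a remaining timeout of " + remaining + " ms");
        }
    }
}
```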
Hi @jpountz @cbuescher, after having been bothered by this problem for quite a while, we decided to improve the query timeout mechanism ourselves. After deploying our solution to our production environment, it has worked quite well, so I would like to share our work and experience with you, and discuss whether it could be a general, long-term solution to the problem. The key change in our solution is that the timer for a search starts when the request first reaches the coordinator node. It therefore covers many more phases than just the shard queries, including pre-filter, dfs, shard queries (including queue time), fetch, reduce, etc. It also includes the time spent in message transmission between nodes. We believe this provides a more meaningful semantic of search timeout to users. The mechanism is designed on a best-effort basis, which means the timeout is guaranteed to be triggered after the given timeout value, but could be delayed significantly by various factors such as GC, network, etc. The key ideas of the implementation are:
Compared with the current query timeout implementation from the community, requests do not accumulate in queues in our solution, the cluster has a better chance to recover from a batch of anomalously slow requests, and users experience clearer search timeout behaviour. Of course, the timer registered by the search task introduces extra cost. However, since the number of search requests per second is at most several thousand, and the time complexity of adding and removing a task in the scheduler is O(log(n)), the impact on performance is negligible. Nevertheless, users with stricter demands on timeout accuracy, such as services that enforce a timeout for each RPC request, may have to release the query context on the client side before a successful or timed-out search response returns. In that case, the server-side timeout mechanism is basically used as a way to release server-side search contexts asynchronously.
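To make the mechanism concrete, here is a minimal, self-contained sketch of the coordinator-side deadline idea (my own illustration with made-up names, not the actual patch we deployed): a shared scheduler registers a deadline when the search starts, cancels the running task if the deadline fires first, and deregisters the timer when the search completes in time.

```java
import java.util.concurrent.*;

public class CoordinatorDeadlineDemo {

    // Shared scheduler for all in-flight searches on the coordinator.
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs a search with a hard, best-effort deadline measured from when it is submitted. */
    static <T> T executeWithDeadline(Callable<T> search, long timeoutMillis) throws Exception {
        ExecutorService searchPool = Executors.newSingleThreadExecutor();
        Future<T> searchFuture = searchPool.submit(search);
        // Registering/removing the deadline is an O(log n) operation on the scheduler's delay queue.
        ScheduledFuture<?> deadline = SCHEDULER.schedule(
                () -> searchFuture.cancel(true), timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return searchFuture.get();      // throws CancellationException if the deadline fired
        } finally {
            deadline.cancel(false);         // deregister the timer if the search finished in time
            searchPool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // A "search" that takes 5s against a 1s deadline: it is cancelled, freeing the worker.
            executeWithDeadline(() -> { Thread.sleep(5_000); return "hits"; }, 1_000);
        } catch (CancellationException e) {
            System.out.println("Search exceeded the coordinator-level deadline and was cancelled");
        } finally {
            SCHEDULER.shutdown();
        }
    }
}
```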
Hi @jpountz @cbuescher @jimczi,
Thanks @fantapsody for the detailed description of your solution. We agree search timeout & cancellation is an important topic with some ongoing discussions.
From #54056: Currently the timeout is only applied at the shard level. Shards are expected to wrap up ongoing work as soon as they hit the timeout and return a partial response. This is usually good enough, unless shard requests spend time in the search queue. For instance, if a timeout of 1s is configured but the shard search request waits for 10s in the search queue, then Elasticsearch wouldn't return a response in less than 10s despite the timeout. This hasn't been much of an issue until now, but I believe that the introduction of slower options like searchable snapshots and schema on read is going to make this issue worse, as it'll make it more likely to have shard requests waiting in the queue.
We at the @elastic/kibana-app-arch team are working on adding a dedicated search timeout to Kibana. However, due to performance improvements in Elasticsearch, we are going to set that timeout to quite a high value in 7.10 (TBD) and support disabling it altogether (which is going to be the default starting in 7.11).
Several of our APIs, including the search API, now support automatic cancellation on connection close. That means the client can set a timeout on its request and close the connection when it expires. Elasticsearch will react by cancelling all the ongoing shard requests caused by the request that was cancelled. This achieves the original goal of this issue, which was killing a search request if it takes longer than a certain timeout.
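For reference, this is roughly what the pattern looks like from the client side. The sketch below uses the plain JDK HTTP client rather than an Elasticsearch client library, and the host, index name, and query are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class ClientSideSearchTimeout {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_search"))
                .timeout(Duration.ofSeconds(5))   // hard client-side deadline
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"query\":{\"match_all\":{}}}"))
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (HttpTimeoutException e) {
            // The client abandons the exchange and the connection is closed; Elasticsearch
            // then cancels the ongoing shard requests belonging to this search.
            System.out.println("Search abandoned after 5s; server-side cancellation takes over");
        }
    }
}
```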
I wonder if there is any way to limit the service time of the entire search request, respond to the client with a timeout error, and cancel all remaining search requests on shards automatically. I have done some research on the official documentation, issues and discussions, and even the source code, and found that the search timeout parameter only limits the execution of the search on each shard, not the entire service time of the search request (time in the queue, time parsing the query, time rendering the response, etc.).
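For context, this is the per-shard `timeout` parameter referred to above (a minimal illustration with a placeholder index and query): it bounds the query phase on each shard and marks the response as `timed_out`, but it does not bound queue, parse, fetch, or reduce time, so the end-to-end latency seen by the client can still be much larger.

```java
public class ShardTimeoutExample {
    public static void main(String[] args) {
        // Request body sent as POST /my-index/_search; shards that hit the 1s limit
        // stop early and return partial hits, but queue time is not counted against it.
        String searchBody = """
            {
              "timeout": "1s",
              "query": { "match": { "message": "error" } }
            }
            """;
        System.out.println(searchBody);
    }
}
```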
This feature is extremely important for online services using Elasticsearch as the data engine. Imagine an online service serving tens of thousands of requests per second, where each request makes a search request to the Elasticsearch cluster. The service request comes with a timeout, such as 5 seconds, and if the time needed to fulfill the request exceeds the timeout, another request is made to retry. In this case, even if only 1% of search requests run longer than the timeout, say 10 seconds, it can cause big trouble for the whole service cluster: if we limit the number of outstanding search requests on each application server, the search request pool may be saturated quickly by retries, as the long service requests never succeed and outstanding search requests take a long time to release, even though their results are useless. If we don't limit the number of outstanding search requests on each application server, the heavy search requests may accumulate in the Elasticsearch cluster, which can impact normal search requests or even bring down nodes. In fact, this is what happened to our cluster. I think this could be improved by a hard search time limit mechanism that releases all resources of the search contexts after the timeout.
So, do you have any plan for this, a hard time limit for the search request? Thanks!