Feature Request: add search request queue flush API #67129

Closed
etki opened this issue Jan 7, 2021 · 5 comments
Labels: >enhancement, feedback_needed, :Search/Search, Team:Search

Comments

@etki
Contributor

etki commented Jan 7, 2021

Sometimes we have the following situation in production: something breaks, we get a retry storm, ES receives a ton of requests, starts queueing them, and then starts rejecting them (I think that's a common situation). Even after the load is shut off completely, ES takes some time to process all the queued queries, which are usually irrelevant by then; that can increase incident response time (i.e. if the cause was fixed faster than the queues were drained). My suggestion is to add a new POST _cluster/??? endpoint that would tell all nodes to flush everything they have in their queues.

etki added the >enhancement and needs:triage labels Jan 7, 2021
gwbrown added the :Search/Search label and removed the needs:triage label Jan 13, 2021
elasticmachine added the Team:Search label Jan 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@mayya-sharipova
Contributor

mayya-sharipova commented Jan 14, 2021

@etki Thank you for submitting a feature request.
We automatically cancel search requests when the corresponding HTTP channels get closed. We also have a mechanism to cancel running tasks.
There is also a way to cancel async searches, which are meant to be long-running searches.
All other searches are supposed to be fast and should be fast.
Are these options not enough for your use case?

I can see several problems with flushing the search queue on a particular node. It will cause a flurry of errors as a coordination node is waiting for responses from these nodes. Also, even if we moved requests for system indices to a dedicated thread pool, there may be requests from other system applications (Kibana, monitoring, etc.) that go through the search thread pool and that we don't want to kill.

@mayya-sharipova
Contributor

mayya-sharipova commented Jan 15, 2021

We've discussed this within the team and would like to know more about your setup.

  • Which version of Elasticsearch are you using? We ask because from v7.4 we automatically cancel search requests when an HTTP channel gets closed. Cancelling a queued search task should be very fast.
  • What exactly is breaking in "something breaks"? Is it the client losing its connection to the server? Is it possible that you are keeping a broken connection open too long before dropping it, and should perhaps dial down the corresponding keep alive and keep_count settings?

@etki
Contributor Author

etki commented Jan 25, 2021

@mayya-sharipova

Sorry for the late reply. We're on 6.7 as for now, so behavior I've seen may be way outdated.

It will cause a flurry of errors as a coordination node is waiting for responses from these nodes.

I see, I didn't think it could cause that much work beyond the feature itself. Following is a question for my curiosity only: do I understand it right that there is no possibility to cancel search from coordinator node? I've heard that actually executing search task is uninterruptible and has to complete.

something breaks

I meant the client application there. IIRC, in that very case it was a retry storm from the client application generating excessive load, while the application receiving those requests was passing them down to ES successfully (until the queue filled up and rejections started). I don't remember whether we preventively restarted the application talking to ES at the time, so the HTTP channels could still have been open.

All other searches are supposed to be fast and should be fast.

Yes, but in reality developers usually face worst-case scenarios with heavy queries (e.g. with lots of nested documents) or just abnormal and unexpected loads.

We also have a mechanism to cancel running tasks

Do I get it right that

POST _tasks/_cancel?actions=*search

is functionally the same as what I proposed? To be clear, I'm not familiar with the cancel API and previously thought it applied only to maintenance tasks.

@mayya-sharipova
Contributor

@etki Please find the answers to your questions below.

Following is a question for my curiosity only: do I understand it right that there is no possibility to cancel search from coordinator node? I've heard that actually executing search task is uninterruptible and has to complete.

No, that's not correct. The _tasks/_cancel API is designed specifically to cancel tasks. If you cancel a task on the coordinating node, all the child tasks it spawned on other nodes will also be cancelled.
In version 6.8 you can also enable the cluster setting search.low_level_cancellation to get lower-level checks for cancellation.
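
As a rough illustration of the cluster setting mentioned above, the following Python sketch only builds the request one might send to the cluster settings API; it does not contact a cluster. The function name is hypothetical, and the choice of a transient rather than a persistent setting is an assumption made for the example.

```python
import json
from typing import Tuple


def low_level_cancellation_request(enabled: bool = True) -> Tuple[str, str, str]:
    """Build (method, path, JSON body) for toggling search.low_level_cancellation.

    Assumption: the setting is applied as a transient cluster setting.
    """
    body = {"transient": {"search.low_level_cancellation": enabled}}
    return "PUT", "_cluster/settings", json.dumps(body)


method, path, body = low_level_cancellation_request()
print(method, path)
print(body)
```

Sending the resulting request with any HTTP client would be equivalent to issuing PUT _cluster/settings with that body from the console.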

One thing to note, though, is that a request is first put into the queue of the search thread pool. Only when its turn comes to be dequeued and processed do we create a task for it or check for cancellation.
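
The queueing behavior described above can be pictured with a toy model. This is not Elasticsearch's implementation, only a sketch under the assumption that each queued request carries a cancellation flag that is inspected when the request is dequeued; all names here are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class QueuedRequest:
    name: str
    cancelled: bool = False  # e.g. set when the client's HTTP channel closes


def drain(queue: Deque[QueuedRequest]) -> Tuple[List[str], List[str]]:
    """Process requests in FIFO order, checking cancellation only at dequeue time."""
    executed, skipped = [], []
    while queue:
        request = queue.popleft()
        if request.cancelled:
            skipped.append(request.name)   # cheap: no search work is done
        else:
            executed.append(request.name)  # stands in for running the search
    return executed, skipped


def demo() -> Tuple[List[str], List[str]]:
    queue = deque(QueuedRequest(f"search-{i}") for i in range(5))
    # Cancel three requests while they are still waiting in the queue.
    for request in list(queue)[1:4]:
        request.cancelled = True
    return drain(queue)


print(demo())
```

In this model a cancelled request still occupies its queue slot until its turn comes, but it is discarded without doing any search work once it is dequeued.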

Do I get it right that ...POST _tasks/_cancel?actions=*search

Yes, the cancel API is designed to cancel tasks, including search tasks.
But you should not cancel all search tasks, as there could be system search tasks running. You can use the X-Opaque-Id header to cancel only the tasks started by a specific client.
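
A minimal sketch of that selective cancellation, assuming the task list API (GET _tasks?actions=*search&detailed) reports each task's request headers under a `headers` field. The helper names, the sample response, and the opaque ID values are hypothetical, and nothing here contacts a cluster; the code only works out which cancel requests one would send.

```python
from typing import Any, Dict, List


def tasks_for_client(task_list: Dict[str, Any], opaque_id: str) -> List[str]:
    """Return the IDs of tasks whose X-Opaque-Id header equals opaque_id."""
    matches = []
    for node in task_list.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("headers", {}).get("X-Opaque-Id") == opaque_id:
                matches.append(task_id)
    return sorted(matches)


def cancel_paths(task_ids: List[str]) -> List[str]:
    """Build one cancel request line per task ID."""
    return [f"POST _tasks/{task_id}/_cancel" for task_id in task_ids]


# Hypothetical, heavily trimmed task list response used only for illustration.
SAMPLE_TASK_LIST = {
    "nodes": {
        "nodeA": {"tasks": {
            "nodeA:101": {"action": "indices:data/read/search",
                          "headers": {"X-Opaque-Id": "storefront"}},
            "nodeA:102": {"action": "indices:data/read/search",
                          "headers": {"X-Opaque-Id": "monitoring"}},
        }},
        "nodeB": {"tasks": {
            "nodeB:7": {"action": "indices:data/read/search",
                        "headers": {"X-Opaque-Id": "storefront"}},
            "nodeB:8": {"action": "indices:data/read/search", "headers": {}},
        }},
    }
}

for line in cancel_paths(tasks_for_client(SAMPLE_TASK_LIST, "storefront")):
    print(line)
```

Filtering on the client side like this leaves tasks from other clients, including any system searches, untouched.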

We're on 6.7 as for now, so behavior I've seen may be way outdated.

We have made some changes since then, including automatic task cancellation from v7.4 when the connection gets closed.

As there are ways to cancel search tasks, I am closing this issue.
