Automatically cancel aggregation if search task is cancelled #71021
Comments
Pinging @elastic/es-analytics-geo (Team:Analytics)
Pinging @elastic/es-search (Team:Search)
@danielwhsu would you mind running hot_threads after the task is cancelled and before the reduce phase ends and posting the results here?
@imotov Sure, here is the hot_threads API result after I cancel the task using the task api and before the final reduce finishes:
@imotov I have actually already made a change in my company's own fork of the Elasticsearch repo. We've tested this change, and it lets us stop searches during the reduce phase if the task is cancelled. If it would help, I could create a pull request with the changes that we have been using. What do you think?
If it has tests in it, it will most likely speed things up.
@imotov I've created my PR here (#71714), and I've included a unit test that raises an exception if the task has been cancelled. A few weeks ago, in my company's fork of Elasticsearch, I attempted to add an integration test that cancels the SearchTask on a running search aggregation query, but the test was flaky because it was difficult to control the timing so that the cancel request would arrive during the reduce phase. Please let me know if my unit test is sufficient, or where you'd like to see more testing.
@danielwhsu yeah, an integration test is a bit tricky here: you basically need to block in the reduce phase before cancelling and unblock after the cancellation is propagated to all shards. So I suspect the simplest way to achieve this reliably would be with something like ScriptedBlockPlugin called in the reduce_script of a scripted metric aggregation. We would block there, cancel the task, ensure the cancellation is propagated, and then unblock.
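For readers following along, the block, cancel, unblock sequence described above can be sketched outside Elasticsearch with plain `java.util.concurrent` primitives. This is only an illustration of the timing control, not the actual test: the latches stand in for what ScriptedBlockPlugin would do, and `BlockingReduceSketch`, `cancelDuringBlockedReduce`, and the `cancelled` flag are hypothetical names that do not exist in the Elasticsearch code base.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: block inside the "reduce", cancel, then unblock. */
class BlockingReduceSketch {

    /** Returns true if the simulated reduce failed because of the cancellation. */
    static boolean cancelDuringBlockedReduce() {
        CountDownLatch reduceStarted = new CountDownLatch(1); // reduce has been entered
        CountDownLatch unblock = new CountDownLatch(1);       // test releases the reduce
        AtomicBoolean cancelled = new AtomicBoolean(false);   // stand-in for the task's cancelled state
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> search = executor.submit(() -> {
                reduceStarted.countDown();   // signal that we are inside the reduce
                unblock.await();             // block here, as a blocking reduce script would
                if (cancelled.get()) {       // the cancellation check under test
                    throw new CancellationException("task cancelled");
                }
                return "response";
            });
            reduceStarted.await();           // 1. wait until the reduce is blocked
            cancelled.set(true);             // 2. cancel while it is blocked
            unblock.countDown();             // 3. let the reduce continue
            search.get(10, TimeUnit.SECONDS);
            return false;                    // the reduce completed despite the cancellation
        } catch (ExecutionException e) {
            return e.getCause() instanceof CancellationException;
        } catch (InterruptedException | TimeoutException e) {
            throw new IllegalStateException("scenario did not complete", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Because the cancel happens strictly between the two latches, the ordering is deterministic, which is the property the flaky timing-based test lacked.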
@imotov Got it, I'll take a look at using the script for the integration test. Before I start writing more tests though, I think it might be useful if we discuss whether the approach in the pull request looks reasonable?
I think the cancellation check itself is reasonable, but the current way of plumbing the search task to it has a race condition. SearchPhaseController is a singleton, so using it to pass a search task will lead to all sorts of weird issues on busy systems, where several searches are reduced concurrently. We need to find some other way of making the current search task available in the reduce phase.
The integration test that I suggested is very generic, so it should help regardless of the actual implementation we end up using. I think it is a good place to start working on this issue.
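To illustrate the plumbing concern, here is a minimal sketch of one way to keep a shared singleton free of per-search state: the cancellation check is handed to each reduce call as an argument instead of being stored on the singleton. This is an assumption about a possible shape, not a description of what the PR does; `SketchReduceController` and its `reduce` method are hypothetical and are not the real SearchPhaseController API.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical singleton that holds no per-search state. */
class SketchReduceController {
    static final SketchReduceController INSTANCE = new SketchReduceController();

    private SketchReduceController() {}

    /**
     * Sums per-shard counts. The cancellation check belongs to the caller's
     * search, so concurrent searches cannot overwrite each other's task.
     */
    long reduce(List<Long> shardCounts, BooleanSupplier isTaskCancelled) {
        long total = 0;
        for (long count : shardCounts) {
            if (isTaskCancelled.getAsBoolean()) {
                throw new IllegalStateException("task cancelled");
            }
            total += count;
        }
        return total;
    }
}
```

Had the task been kept in a field on the singleton, a second search starting mid-reduce could replace it, and the first search would then be checking the wrong task.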
Hi @imotov , I've picked up this task again and I have a few questions about the integration tests for it. I'm trying to get an integration test to work with the ScriptedBlockPlugin and scripted metric aggregation, but the aggregation is not triggering the ScriptedBlockPlugin for some reason. This is the integration test I added in SearchCancellationIT.java:
For debugging purposes I have made every single script field of the scripted aggregation metric use
I don't understand why there are zero plugin hits though, because my aggregation should have at the very least run the ScriptedBlockPlugin script from
I'm also wondering if using the ScriptedBlockPlugin script in reduce_script will allow us to test a potential code change. Because regardless of whichever code change we use, the code change will include some logic to cancel the aggregation during the reduce phase. However if we replace the reduce phase of the aggregation with a ScriptedBlockPlugin script, then the actual code change in the reduce phase will never be triggered...?
@danielwhsu could you push these changes into your PR so I can take a look?
…71714) This change raises a TaskCancelledException to stop the search query if it is detected that the SearchTask has been cancelled during the reduce phase. Issue: #71021 Co-authored-by: Daniel Hsu <[email protected]> Co-authored-by: Igor Motov <[email protected]>
Closed by #71714
SearchCancellationIT#testCancellationDuringAggregation only works when a real reduce takes place, and therefore needs at least 2 shards to be present. Relates to elastic#71021
…led (#78583) This change raises a TaskCancelledException to stop the search query if it is detected that the SearchTask has been cancelled during the reduce phase. Issue: #71021 Co-authored-by: Daniel Hsu <[email protected]> Co-authored-by: Igor Motov <[email protected]>
Elasticsearch version:
7.6.2
Description of the problem including expected versus actual behavior:
This is a followup to #70347
We often run search requests that have long-running aggregations, and we would like to be able to terminate the search while it's in the middle of calculating the aggregations.
We would like to be able to terminate a long-running aggregation of a search request in two ways: by closing the client connection, or by cancelling the search task.
Currently if we close the connection or cancel the search task during the reduce phase, the search task is cancelled but the aggregation search still runs to completion. We would like to change this behavior so that if we close the connection or cancel the search task during the reduce phase, the aggregation will immediately terminate and send a response to the user.
Steps to reproduce:
Query body:
Request params:
Provide logs (if relevant):
If I let the request run to completion without cancelling the task, the response is
If I cancel the search task during the reduce phase, I observe the following logs on the server
and the response is
So even though I have cancelled the search task during the reduce phase, the long-running aggregation still runs to completion instead of terminating.
Proposed change:
Pass the SearchTask into InternalAggregation.java's ReduceContext.
Throw a TaskCancelledException in InternalAggregation.java's consumeBucketsAndMaybeBreak() if the SearchTask is cancelled.
This should allow the aggregation reduce to terminate the entire search aggregation request immediately if, while consuming buckets, it detects that the task has been cancelled. And since closing the client connection cancels the task (#43332), this change will also allow closing the client connection to terminate the aggregation.
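A rough sketch of the proposed check, for illustration only. The classes below are simplified, hypothetical stand-ins (`SketchReduceContext`, `SketchTaskCancelledException`) and not the real Elasticsearch ReduceContext or TaskCancelledException; the sketch also assumes the task's state is exposed as a `BooleanSupplier`, which is my assumption rather than something stated in this issue.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for Elasticsearch's TaskCancelledException. */
class SketchTaskCancelledException extends RuntimeException {
    SketchTaskCancelledException(String message) {
        super(message);
    }
}

/** Hypothetical, simplified reduce context that knows whether its task was cancelled. */
class SketchReduceContext {
    private final BooleanSupplier isCancelled; // assumed view of the search task's state
    private long consumedBuckets = 0;

    SketchReduceContext(BooleanSupplier isCancelled) {
        this.isCancelled = isCancelled;
    }

    /** Called whenever the reduce produces buckets; aborts if the task was cancelled. */
    void consumeBucketsAndMaybeBreak(int size) {
        if (isCancelled.getAsBoolean()) {
            throw new SketchTaskCancelledException("task cancelled during reduce");
        }
        consumedBuckets += size;
    }

    long consumedBuckets() {
        return consumedBuckets;
    }
}
```

Since the reduce calls this method repeatedly as buckets are produced, the check gives a long-running reduce frequent opportunities to notice the cancellation and stop.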