Repo test kit to use more than 1 snapshot thread #92520
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
Hmm, this sounds like it could be a genuine bug with one thread tho - at least, increasing to two threads shouldn't fix this, because we don't interrupt … 1B/s is unrealistically slow - we check for cancellation on every read from the simulated stream, but we're using …
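To make the mechanism being discussed concrete, here is a minimal sketch (illustrative names only, not the test kit's actual classes) of a stream wrapper that checks for cancellation on every read. Note that even with such a check, the worker can still sit blocked inside a rate limiter *between* reads, which is the failure mode in question.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.BooleanSupplier;

// Illustrative sketch: an InputStream that checks a cancellation flag before
// every read, so a cancelled task fails fast on its next read. It cannot help
// while the thread is parked inside a rate limiter between two reads.
class CancellationCheckingInputStream extends FilterInputStream {
    private final BooleanSupplier isCancelled; // e.g. cancellableTask::isCancelled

    CancellationCheckingInputStream(InputStream in, BooleanSupplier isCancelled) {
        super(in);
        this.isCancelled = isCancelled;
    }

    private void ensureNotCancelled() throws IOException {
        if (isCancelled.getAsBoolean()) {
            throw new IOException("task cancelled");
        }
    }

    @Override
    public int read() throws IOException {
        ensureNotCancelled();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        ensureNotCancelled();
        return super.read(b, off, len);
    }
}
```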
... or use a …
Hi @DaveCTurner! Thanks for the comment. Indeed, you're right. I had tried many variations (see #90749), but ultimately discussed with @original-brownbear that simply setting it to 2 threads would be preferable, to make sure the test really tests a very (unrealistically) slow repo. I think tinkering with the speeds would also make it prone to future errors if H/W characteristics change how fast things execute. At the time I had a really hard time finding settings that made it work. The main cause is … I understand the scary part is the stuck thread, though. It is stuck due to the 1B/s rate limit. Your idea of …
You're right, this is something of a general problem. However, recoveries already use …
Oh I see, we time out in the transport layer, so we consider this worker to have completed even though it's still running. So really, as well as the …
... and note that the default for …
OK, @DaveCTurner, I've been reading the code around the repo analysis, the cancellable tasks, and what you mentioned. I see that the …
Oh good point, I missed that in this specific case we have a workaround for #66992. This will ban new tasks, mark all child tasks as cancelled, and then wait for them to complete. If each child were to register a listener with …
@DaveCTurner, I think we are getting closer :) I did try that, by wrapping the main functionality inside the beginning of the …
However, I did not see it triggered, so I think I am missing something. I still haven't found the spot in the code that …
I still haven't found where this is happening …
Hmm, it seems I had to register the listener on the parent task rather than the child task itself. Because this works:
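The snippet originally posted here isn't preserved in this extract. As a rough illustration of registering on the parent task, here is a minimal sketch; it assumes Elasticsearch's `CancellableTask.addListener(CancellationListener)` API, and `closeOnCancel`/`blockedStream` are hypothetical names, not code from this PR.

```java
import java.io.Closeable;
import java.io.IOException;

import org.elasticsearch.tasks.CancellableTask;
import org.elasticsearch.tasks.Task;

// Hypothetical helper: close a blocked resource when the parent task is cancelled.
static void closeOnCancel(Task parentTask, Closeable blockedStream) {
    if (parentTask instanceof CancellableTask cancellable) {
        // Assumption: addListener fires onCancelled() once the task is cancelled.
        cancellable.addListener(() -> {
            try {
                blockedStream.close(); // wakes a worker stuck on the 1B/s rate limiter
            } catch (IOException e) {
                // best effort: the task is being torn down anyway
            }
        });
    }
}
```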
I was mistakenly thinking that the cancellation of the parent task would send "cancellation requests" to the child tasks. I'll re-work the PR.
It's here - when the receiving node receives the ban request, it cancels all the running children of the banned parent: elasticsearch/server/src/main/java/org/elasticsearch/tasks/TaskCancellationService.java, lines 320 to 322 at commit 2d2b82b.
I think this will only work if the parent and child tasks are on the same node. Otherwise …
OK, so my thinking was correct. Then there is some bug or some issue I'm overlooking, because adding a listener on the child task did not work. Maybe there's some sort of optimization for local tasks? I'll try to figure out why I'm not getting the notification by working my way through the code from the method you mentioned (thanks!).
Hmm, that code path you mentioned is not called. Specifically, I see the following order of execution in the log: …
You will notice that the …
Huh, yes, this looks like a bug.
I discussed further with @DaveCTurner. We think it makes sense to build some small infrastructure to also send a new Action to cancel a remote child task when … There's a more general conversation about extending such logic to generic tasks, but we thought that was too complex to handle for now.
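As a sketch of the idea only (all names below are hypothetical; the real infrastructure landed in #92588): keying registered children by parent task id plus a per-child request id would let a node cancel exactly one remote child without banning the whole parent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical request carrying enough information to identify one remote child.
record CancelChildRequest(String parentTaskId, long childRequestId, String reason) {}

// Hypothetical node-local registry of cancellable children.
class ChildTaskRegistry {
    private final Map<String, Runnable> children = new ConcurrentHashMap<>();

    private static String key(String parentTaskId, long childRequestId) {
        return parentTaskId + "/" + childRequestId;
    }

    void register(String parentTaskId, long childRequestId, Runnable cancel) {
        children.put(key(parentTaskId, childRequestId), cancel);
    }

    // Invoked when a CancelChildRequest arrives from the parent's node.
    void handle(CancelChildRequest request) {
        Runnable cancel = children.remove(key(request.parentTaskId(), request.childRequestId()));
        if (cancel != null) {
            cancel.run(); // cancels just this child, leaving siblings untouched
        }
    }
}
```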
Closing in favor of #92588
Fixes #90353
The original issue surfaced when we started testing with 1 processor in 8.5+. That resulted in a single SNAPSHOT thread and made the test time out, as explained in this comment.
At some point in 8.6 the SNAPSHOT thread pool was raised to a maximum of 10 threads, so the issue stopped occurring. It was fixed on the 8.5 branch by #91226, which made the test use 2 SNAPSHOT threads.
But #92392 made it possible for there to be a single SNAPSHOT thread on machines with low max memory, so the issue resurfaced on the 8.6 branch and main. This PR makes the test use 2 SNAPSHOT threads in 8.6 and main.
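For illustration, pinning the pool size inside the test class might look roughly like the sketch below. This assumes the `ESIntegTestCase.nodeSettings` override and the scaling-pool setting keys `thread_pool.snapshot.core`/`thread_pool.snapshot.max`; it is not necessarily the exact change in this PR.

```java
import org.elasticsearch.common.settings.Settings;

// Sketch: force at least 2 SNAPSHOT threads so that one worker blocked on the
// unrealistically slow (1B/s) blob store cannot starve the whole pool.
@Override
protected Settings nodeSettings(int nodeOrdinal, Settings otherSettings) {
    return Settings.builder()
        .put(super.nodeSettings(nodeOrdinal, otherSettings))
        .put("thread_pool.snapshot.core", 2) // assumption: scaling-pool setting keys
        .put("thread_pool.snapshot.max", 2)
        .build();
}
```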