
Repo test kit to use more than 1 snapshot thread #92520

Closed

Conversation

kingherc
Contributor

Fixes #90353

The original issue surfaced when we started testing with 1 processor in 8.5+. That resulted in a single SNAPSHOT thread and caused the test to time out, as explained in this comment.

At some point in 8.6 the maximum number of SNAPSHOT threads was raised to 10, so the issue stopped occurring there. For the 8.5 branch, it was solved with #91226, which made the test use 2 SNAPSHOT threads.

But #92392 made it possible to end up with a single SNAPSHOT thread in cases of low max memory, so the issue resurfaced in the 8.6 branch and main. This PR makes the test use 2 SNAPSHOT threads in 8.6 and main.

@kingherc added the labels :Distributed Coordination/Snapshot/Restore, >test-failure, Team:Distributed (Obsolete), auto-backport-and-merge, v8.6.0, and v8.7.0 on Dec 22, 2022
@kingherc kingherc self-assigned this Dec 22, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

Hmm, this sounds like it could be a genuine bug with one thread though. At least, increasing to two threads shouldn't fix this, because we don't interrupt that sleep() anywhere; so if this does fix things, then we're not waiting for that upload to complete for some reason.

1B/s is unrealistically slow. We check for cancellation on every read from the simulated stream, but we're using Streams.copy, which has an 8 KiB buffer, so we'll only check for cancellation every ~8k seconds. Could we instead set the minimum upload speed to something that does allow the timeout to work even with a single thread?
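
To make the arithmetic concrete, here is a minimal sketch of that copy-loop shape in plain Java (hypothetical code, not the actual Streams.copy implementation): an 8 KiB buffered copy only observes cancellation between reads, and at 1B/s each read takes ~8192 seconds.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.function.BooleanSupplier;

    // Hypothetical sketch: a buffered copy that polls for cancellation once per
    // read. With an 8 KiB buffer over a stream throttled to 1 B/s, roughly 8192
    // seconds elapse between consecutive cancellation checks.
    final class BufferedCopySketch {
        private static final int BUFFER_SIZE = 8 * 1024;

        static long copy(InputStream source, OutputStream target, BooleanSupplier cancelled) throws IOException {
            final byte[] buffer = new byte[BUFFER_SIZE];
            long copied = 0;
            int read;
            while ((read = source.read(buffer)) != -1) { // at 1 B/s this read blocks for hours
                if (cancelled.getAsBoolean()) {          // only reached between reads
                    throw new IOException("copy cancelled");
                }
                target.write(buffer, 0, read);
                copied += read;
            }
            return copied;
        }
    }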

@DaveCTurner
Contributor

... or use a CancellableThreads to interrupt the sleep. That'd be nice IMO.

@kingherc
Contributor Author

Hi @DaveCTurner! Thanks for the comment. Indeed you're right. I had tried many variations (see #90749), but ultimately discussed with @original-brownbear that simply setting 2 threads would be preferable, to make sure the test really tests a very (unrealistically) slow repository.

I think tinkering with the speeds would also make the test prone to future failures if hardware characteristics change how fast things execute. At the time, I had a really hard time finding settings that made it work.

The main cause is that org.elasticsearch.repositories.blobstore.testkit.RepositoryAnalyzeAction.AsyncAction#onWorkerCompletion() handles the timeout exception via the SNAPSHOT thread pool, which in this case has a single thread that is stuck in sleep() in the RateLimiter, as you mentioned. We could use another thread pool here as an alternative solution?
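
To illustrate that shape with a hypothetical, self-contained example (plain java.util.concurrent, not the test kit's actual classes): with a 1-thread pool, the queued timeout handler can never run while the pool's only thread is parked in a sleep.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical illustration: the handler that should deal with the timeout
    // is queued behind a worker blocked in a rate-limiter-style sleep.
    public class SingleThreadPoolStall {
        public static void main(String[] args) throws Exception {
            ExecutorService snapshotPool = Executors.newFixedThreadPool(1);

            // Worker "uploading" at 1B/s: effectively parked in sleep(), never interrupted.
            snapshotPool.execute(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // The timeout handler needs the pool's only thread, so it never runs.
            snapshotPool.execute(() -> System.out.println("handling timeout"));

            TimeUnit.SECONDS.sleep(2);
            System.out.println("handler still queued after 2s");
            snapshotPool.shutdownNow(); // interrupts the stuck worker and drains the queue
        }
    }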

I understand the scary part is the stuck thread, though. It is stuck due to the 1B/s rate limit. Your idea of CancellableThreads is interesting; I will investigate the code to see how these work and whether we can use them. But, just to mention, the test is unrealistic, and I am not sure this can happen in real life. If it could, it would be a problem for all threads using a RateLimiter, such as recoveries and snapshots, not only for the repository analysis component, right? Theoretically all those threads could get stuck in the RateLimiter's sleep(), I fear. So I am also not sure how much effort these unrealistic scenarios are worth.

@DaveCTurner
Contributor

You're right, this is something of a general problem. However recoveries already use CancellableThreads to abort IO (and rate-limiting sleeps) if necessary, and snapshots don't really have such a strong notion of prompt cancellation, so I think it's not as bad as it might first appear.
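
For reference, the interrupt-on-cancel pattern that CancellableThreads provides can be sketched generically like this (a simplified, hypothetical sketch, not the real org.elasticsearch.common.util.CancellableThreads):

    import java.util.concurrent.atomic.AtomicReference;

    // Simplified sketch: remember which thread is doing blocking work, and
    // interrupt it on cancellation so sleep() and interruptible IO throw
    // instead of waiting out the rate limiter. (The real class also handles
    // races and exception wrapping that this sketch omits.)
    final class InterruptOnCancelSketch {
        private final AtomicReference<Thread> worker = new AtomicReference<>();
        private volatile boolean cancelled;

        void execute(InterruptibleWork work) throws InterruptedException {
            worker.set(Thread.currentThread());
            try {
                if (cancelled) {
                    throw new InterruptedException("cancelled before execution");
                }
                work.run(); // e.g. a rate-limited write that may sleep()
            } finally {
                worker.set(null);
            }
        }

        void cancel() {
            cancelled = true;
            Thread thread = worker.get();
            if (thread != null) {
                thread.interrupt(); // breaks out of sleep()/interruptible IO
            }
        }

        @FunctionalInterface
        interface InterruptibleWork {
            void run() throws InterruptedException;
        }
    }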

@DaveCTurner
Contributor

Oh I see, we time out in the transport layer, so we consider this worker to have completed even though it's still running. So really, as well as the CancellableThreads change, we'd need to address #66992.

@DaveCTurner
Contributor

... and note that the default for ?concurrency in this API is 10, which is also the snapshot threadpool size, so if this is a problem in this test then it will be a problem in production too (although in practice we don't sleep for very long in prod, and IO should never get stuck forever).

@kingherc
Contributor Author

OK, @DaveCTurner, I've been reading the code around the repository analysis, cancellable tasks, and what you mentioned. I see that RepositoryAnalyzeAction basically spawns a bunch of BlobAnalyzeAction child tasks (using the transportService.sendChildRequest method). When the 1-second timeout is reached, the parent RepositoryAnalyzeAction task's handleException() method is called, which then calls cancelTaskAndDescendants() on the parent task. However, if I understand correctly, this method does NOT send any cancellation messages to the child tasks, right? It seems only to ban new tasks and wait for the ongoing tasks to complete.

@DaveCTurner
Contributor

Oh good point, I missed that in this specific case we have a workaround for #66992. This will ban new tasks, mark all child tasks as cancelled and then wait for them to complete. If each child were to register a listener with org.elasticsearch.tasks.CancellableTask#addListener then those listeners would be called on cancellation, which would be the right spot to interrupt any blocked threads.

@kingherc
Contributor Author

@DaveCTurner , I think we are getting closer :)

I did try that, by wrapping the main functionality at the beginning of the org.elasticsearch.repositories.blobstore.testkit.BlobAnalyzeAction.BlobAnalysis#run method with the following:

            final CancellableThreads cancellableThreads = new CancellableThreads();
            task.addListener(() -> {
                logger.warn("RECEIVED CANCELLATION");
                cancellableThreads.cancel(task.getReasonCancelled());
            });
            cancellableThreads.execute(
                () -> writeRandomBlob(
                    request.readEarly || request.getAbortWrite() || (request.targetLength <= MAX_ATOMIC_WRITE_SIZE && random.nextBoolean()),
                    true,
                    this::onLastReadForInitialWrite,
                    write1Step
                )
            );

However, I did not see it triggered, so I think I am missing something. I still haven't found the spot in the code where cancelTaskAndDescendants() actually sends some sort of cancellation message to remote child tasks...?

@kingherc
Contributor Author

> mark all child tasks as cancelled

I still haven't found where this is happening

@kingherc
Contributor Author

Hmm, it seems I had to register the listener on the parent task rather than the child task itself. Because this works:

            final CancellableThreads cancellableThreads = new CancellableThreads();
            CancellableTask parentTask = (CancellableTask) taskManager.getTask(task.getParentTaskId().getId());
            parentTask.addListener(() -> {
                cancellableThreads.cancel(task.getReasonCancelled());
            });
            cancellableThreads.execute(
                () -> writeRandomBlob(
                    request.readEarly || request.getAbortWrite() || (request.targetLength <= MAX_ATOMIC_WRITE_SIZE && random.nextBoolean()),
                    true,
                    this::onLastReadForInitialWrite,
                    write1Step
                )
            );

I was mistakenly thinking that cancelling the parent task would send "cancellation requests" to the child tasks.

I'll re-work the PR.

@DaveCTurner
Contributor

> I still haven't found where this is happening

It's here: when the receiving node receives the ban request, it cancels all the running children of the banned parent:

    for (CancellableTask childTask : childTasks) {
        cancelTaskAndDescendants(childTask, request.reason, request.waitForCompletion, listener);
    }

> Hmm, it seems I had to register the listener on the parent task rather than the child task itself. Because this works:

I think this will only work if the parent and child tasks are on the same node. Otherwise taskManager.getTask(task.getParentTaskId().getId()) won't find the parent task.

@kingherc
Contributor Author

OK, so my thinking was correct. Then there is some bug, or some issue I'm overlooking, because adding a listener on the child task did not work. Maybe there's some sort of optimization for local tasks? I'll try to figure out why I'm not getting the notification by working my way through the code from the method you mentioned (thanks!).

@kingherc
Contributor Author

Hmm, that code path you mentioned is not called. Specifically, I see the following order of execution in the log:

[2022-12-22T17:13:13,732][TRACE][o.e.r.b.t.RepositoryAnalyzeAction] [yamlRestTest-0] processing [VerifyBlobTask{node={yamlRestTest-0}{1FYznhT7RsebJrEr-S-3dg}{FWA_LR_SSY-BPlzEyWMJdQ}{yamlRestTest-0}{127.0.0.1}{127.0.0.1:37789}{cdfhilmrstw}{testattr=test, ml.allocated_processors_double=1.0, ml.machine_memory=66789920768, xpack.installed=true, ml.max_jvm_size=536870912, ml.allocated_processors=1}, request=BlobAnalyzeAction.Request{blob analysis [test_repo_slow:temp-analysis-C9WIbkqxSUCjZQwKZf1Uhg/test-blob-0-xLBn58ElSpOtVrBNpKZVWg, length=2147483648, seed=-1818594403331045661, readEarly=false, writeAndOverwrite=false, abortWrite=false]}}]
[2022-12-22T17:13:13,733][TRACE][o.e.t.TaskManager        ] [yamlRestTest-0] register child connection [local node connection] task [148]
[2022-12-22T17:13:13,733][TRACE][o.e.t.T.tracer           ] [yamlRestTest-0] [44][cluster:admin/repository/analyze/blob] sent to [{yamlRestTest-0}{1FYznhT7RsebJrEr-S-3dg}{FWA_LR_SSY-BPlzEyWMJdQ}{yamlRestTest-0}{127.0.0.1}{127.0.0.1:37789}{cdfhilmrstw}{testattr=test, ml.allocated_processors_double=1.0, ml.machine_memory=66789920768, xpack.installed=true, ml.max_jvm_size=536870912, ml.allocated_processors=1}] (timeout: [1s])
[2022-12-22T17:13:13,733][TRACE][o.e.t.T.tracer           ] [yamlRestTest-0] [44][cluster:admin/repository/analyze/blob] received request
[2022-12-22T17:13:13,733][TRACE][o.e.t.TaskManager        ] [yamlRestTest-0] unregister task for id: 146
[2022-12-22T17:13:13,816][TRACE][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [yamlRestTest-0] checking jobs [2022-12-22T15:13:13.816925827Z]
[2022-12-22T17:13:14,317][TRACE][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [yamlRestTest-0] checking jobs [2022-12-22T15:13:14.317303862Z]
[2022-12-22T17:13:14,551][TRACE][o.e.x.t.t.s.TransformScheduler] [yamlRestTest-0] Processing scheduled tasks started
[2022-12-22T17:13:14,551][TRACE][o.e.x.t.t.s.TransformScheduler] [yamlRestTest-0] Processing scheduled tasks finished, took 0ms
[2022-12-22T17:13:14,733][TRACE][o.e.t.TaskManager        ] [yamlRestTest-0] unregister child connection [local node connection] task [148]
[2022-12-22T17:13:14,734][DEBUG][o.e.r.b.t.RepositoryAnalyzeAction] [yamlRestTest-0] failed [VerifyBlobTask{node={yamlRestTest-0}{1FYznhT7RsebJrEr-S-3dg}{FWA_LR_SSY-BPlzEyWMJdQ}{yamlRestTest-0}{127.0.0.1}{127.0.0.1:37789}{cdfhilmrstw}{testattr=test, ml.allocated_processors_double=1.0, ml.machine_memory=66789920768, xpack.installed=true, ml.max_jvm_size=536870912, ml.allocated_processors=1}, request=BlobAnalyzeAction.Request{blob analysis [test_repo_slow:temp-analysis-C9WIbkqxSUCjZQwKZf1Uhg/test-blob-0-xLBn58ElSpOtVrBNpKZVWg, length=2147483648, seed=-1818594403331045661, readEarly=false, writeAndOverwrite=false, abortWrite=false]}}] org.elasticsearch.transport.ReceiveTimeoutTransportException: [yamlRestTest-0][127.0.0.1:37789][cluster:admin/repository/analyze/blob] request_id [44] timed out after [1001ms]
[2022-12-22T17:13:14,734][TRACE][o.e.t.TaskCancellationService] [yamlRestTest-0] cancelling task [1FYznhT7RsebJrEr-S-3dg:148] and its descendants
[2022-12-22T17:13:14,734][TRACE][o.e.t.TaskCancellationService] [yamlRestTest-0] child tasks of parent [1FYznhT7RsebJrEr-S-3dg:148] are completed
[2022-12-22T17:13:14,734][TRACE][o.e.t.TaskManager        ] [yamlRestTest-0] cancelling task with id 148

You will notice that the unregister child connection happens BEFORE the failure of the parent task that calls cancelTaskAndDescendants(). That means the parent thinks it has no children and does not send ban messages. I wonder why this asynchrony is happening.

@DaveCTurner
Contributor

Huh, yes, this looks like a bug.

@kingherc
Contributor Author

I discussed further with @DaveCTurner. We think it makes sense to build some small infrastructure to also send a new action that cancels a remote child task when org.elasticsearch.tasks.TaskManager.CancellableTaskHolder#unregisterChildConnection() is called with a failure (e.g., a timeout). I will see if I can implement this easily.

There's a more general conversation about extending such logic to generic tasks, but we thought that was too complex to handle for now.
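
A rough sketch of the proposed shape, where every name is a hypothetical stand-in rather than an actual Elasticsearch API:

    // Hypothetical types illustrating the proposal only: when a child connection
    // is unregistered with a failure (e.g. a transport timeout), actively request
    // cancellation of the remote child instead of only banning new children.
    interface ChildCancellationTransport {
        void sendCancelChildRequest(String nodeId, long childTaskId, String reason);
    }

    final class ChildConnectionTracker {
        private final ChildCancellationTransport transport;

        ChildConnectionTracker(ChildCancellationTransport transport) {
            this.transport = transport;
        }

        void unregisterChildConnection(String nodeId, long childTaskId, Exception failure) {
            if (failure != null) {
                // Proposed addition: the child may still be running remotely, so cancel it.
                transport.sendCancelChildRequest(nodeId, childTaskId, failure.getMessage());
            }
        }
    }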

@kingherc
Contributor Author

Closing in favor of #92588

@kingherc kingherc closed this Dec 28, 2022