Child requests proactively cancel children tasks #92588

Merged

Conversation

kingherc
Contributor

To make this possible, we modify the CancellableTasksTracker to also track child tasks by request ID. That way, we can send an action to cancel a child based on the parent task and the request ID.

This is especially useful when a parent's child requests time out on the parent's side.
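
As a rough illustration of the idea only, here is a minimal sketch of a tracker that indexes child tasks by parent task and request ID. Every name in it (ChildTaskTrackerSketch, ChildTask, register, unregister, cancelChild) is hypothetical and chosen for this example; it does not reproduce the real CancellableTasksTracker or its API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in; NOT the real CancellableTasksTracker.
class ChildTaskTrackerSketch {

    // Minimal stand-in for a cancellable child task (an assumption for this sketch).
    static final class ChildTask {
        private volatile boolean cancelled;

        public void cancel() {
            cancelled = true;
        }

        public boolean isCancelled() {
            return cancelled;
        }
    }

    // parent task ID -> (request ID -> child task)
    private final Map<String, Map<Long, ChildTask>> childrenByParent = new ConcurrentHashMap<>();

    // Record a child task under its parent and the request ID that started it.
    public void register(String parentTaskId, long requestId, ChildTask task) {
        childrenByParent.compute(parentTaskId, (k, byRequest) -> {
            Map<Long, ChildTask> m = byRequest == null ? new ConcurrentHashMap<>() : byRequest;
            m.put(requestId, task);
            return m;
        });
    }

    // Forget a child task, dropping the parent entry once it has no children left.
    public void unregister(String parentTaskId, long requestId) {
        childrenByParent.computeIfPresent(parentTaskId, (k, byRequest) -> {
            byRequest.remove(requestId);
            return byRequest.isEmpty() ? null : byRequest;
        });
    }

    // Cancel the child identified by (parent task, request ID).
    // Returns false when no such child is tracked, for example because it already finished.
    public boolean cancelChild(String parentTaskId, long requestId) {
        Map<Long, ChildTask> byRequest = childrenByParent.get(parentTaskId);
        ChildTask task = byRequest == null ? null : byRequest.get(requestId);
        if (task == null) {
            return false;
        }
        task.cancel();
        return true;
    }
}
```

In this sketch, a parent whose request timed out would call cancelChild with its own task ID and the ID of the timed-out request, so only that one child is cancelled and its siblings keep running.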

The motivation for this PR is fixing test failure #90353. While discussing the simple solution of PR #92520, @DaveCTurner and I decided that the best way to fix the test failure would be to solve #66992. Unfortunately, that issue may require substantial effort. For the moment, we thought it would be easier to cancel child requests on timeout, since we already have infrastructure for tracking child tasks (the CancellableTasksTracker).

Fixes #90353
Relates #66992

@kingherc kingherc added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.7.0 labels Dec 28, 2022
@kingherc kingherc self-assigned this Dec 28, 2022
@elasticsearchmachine
Collaborator

Hi @kingherc, I've created a changelog YAML for you.

@kingherc kingherc changed the title Failed tasks proactively cancel children tasks Children requests proactively cancel children tasks Dec 28, 2022
@kingherc kingherc changed the title Children requests proactively cancel children tasks Child requests proactively cancel children tasks Dec 28, 2022
@kingherc kingherc force-pushed the enhancement/90353-66992-cancel-child-on-timeout branch from 075c129 to 5b04545 Compare December 28, 2022 15:04
@kingherc kingherc marked this pull request as ready for review December 29, 2022 10:22
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@kingherc
Contributor Author

Hi @DaveCTurner, @original-brownbear, this is ready for review now.

@kingherc kingherc requested a review from tlrx January 9, 2023 14:39
@kingherc
Contributor Author

kingherc commented Jan 9, 2023

Requesting @tlrx's review, since he may have more time to review this these days.

@kingherc
Contributor Author

Ping for reviews

@fcofdez fcofdez self-requested a review January 24, 2023 16:11
Contributor

@fcofdez fcofdez left a comment

This looks like it is going in the right direction. I left a few comments.

for (DiscoveryNode node : nodes) {
    TransportService transportService = internalCluster().getInstance(TransportService.class, node.getName());
    for (ThreadPoolStats.Stats stat : transportService.getThreadPool().stats()) {
        assertEquals(0, stat.getActive());
Contributor

This is a tricky assertion: something might get enqueued just after it runs. But I think ensureBansAndCancellationsConsistency covers most of what we want to assert here.

Contributor Author

This is necessary because otherwise the test infrastructure may get stuck. If I remember correctly, that is probably because the checks at the end of the test sometimes find things still running (servers that received a cancellation but have not processed it yet). That is why I wait for the thread pools to be empty. I do not think anything else can get enqueued afterwards, but if anything does, I would expect it to be done by the end of the test procedure.
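
For readers following the thread, the "wait for the thread pools to be empty" idea can be sketched with plain JDK classes as below. This is an assumption-laden illustration, not the test code from this PR: the class and method names (WaitForIdlePoolSketch, awaitIdle) are hypothetical, and it polls a java.util.concurrent.ThreadPoolExecutor rather than the Elasticsearch thread pool stats. It also has the limitation raised in the review: the check is only a snapshot, so work may still be enqueued after it returns.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the PR or of Elasticsearch.
class WaitForIdlePoolSketch {

    // Polls until the pool reports no active threads and an empty queue,
    // or until the timeout elapses. Returns true if the pool looked idle in time.
    // Note: getActiveCount() is documented as an approximate value.
    public static boolean awaitIdle(ThreadPoolExecutor pool, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (pool.getActiveCount() == 0 && pool.getQueue().isEmpty()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // back off briefly before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Retrying the check, instead of asserting once, tolerates nodes that have received a cancellation but have not finished processing it yet.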

@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
Contributor Author

@kingherc kingherc left a comment

Hi, thanks for the comments @fcofdez! See the improvements. I invite you to re-review. @tlrx, feel free to also take a look if you have time. Thanks!

@kingherc kingherc requested a review from fcofdez March 10, 2023 14:56
Contributor

@fcofdez fcofdez left a comment

LGTM, thanks for the clarifications 👍

Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.8.0
Development

Successfully merging this pull request may close these issues.

[CI] SnapshotRepoTestKitClientYamlTestSuiteIT test {p0=/10_analyze/Timeout with large blobs} failing
4 participants