Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Actions] Fixed ad-hoc actions tasks remain as "running" when they timeout by adding cancellation support #120853

Merged

Conversation

YulNaumenko
Copy link
Contributor

@YulNaumenko YulNaumenko commented Dec 8, 2021

Summary

Provided cancel function with the proper logging for the actions task.
Resolves #119647
Resolves #64148

Testing steps:

  1. Create (by adding timeout) long running action execution method which is longer than the default timeout 5m.
  2. Create rule with alerts and attach the previously created action.
    (before)
  3. Check the task manager index and observer that task is forever in the running state and cancellation call was not cancel the task.
    (after)
  4. The task was removed after the default timeout 5m and the proper logs was added to the Kibana log:
[2021-12-08T15:31:09.894-08:00][WARN ][plugins.taskManager] Cancelling task actions:.server-log "3867c0d0-587e-11ec-8a9c-ffc4592af071" as it expired at 2021-12-08T23:31:09.882Z after running for 05m 00s (with timeout set at 5m).
[2021-12-08T15:31:09.894-08:00][DEBUG ][plugins.actions] Cancelling action task for test action with id 2 - execution error due to timeout.

And new event log record:

"hits" : [
      {
        "_index" : ".kibana-event-log-8.1.0-000001",
        "_id" : "Xsk6p30BCxUbiwMDdIHo",
        "_score" : 0.0,
        "_source" : {
          "@timestamp" : "2021-12-11T02:02:45.607Z",
          "event" : {
            "provider" : "actions",
            "action" : "execute-timeout",
            "kind" : "action"
          },
          "message" : "action: .server-log:f74a6ac0-5a22-11ec-9ce8-b5e69b25eb4f: 'test' execution cancelled due to timeout - exceeded default timeout of \"5m\"",
          "kibana" : {
            "task" : {
              "scheduled" : "2021-12-11T01:57:41.051Z"
            },
            "saved_objects" : [
              {
                "rel" : "primary",
                "type" : "action",
                "id" : "f74a6ac0-5a22-11ec-9ce8-b5e69b25eb4f",
                "type_id" : ".server-log",
                "namespace" : "default"
              },
              {
                "rel" : "primary",
                "type" : "alert",
                "id" : "f99719e0-5a22-11ec-9ce8-b5e69b25eb4f",
                "type_id" : ".index-threshold"
              }
            ],
            "server_uuid" : "5b2de169-2785-441b-ae8c-186a1936b17d",
            "version" : "8.1.0"
          },
          "ecs" : {
            "version" : "1.8.0"
          }
        }
      },

@YulNaumenko YulNaumenko added bug Fixes for quality problems that affect the customer experience v8.0.0 release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RuleActions Issues related to the Actions attached to Rules on the Alerting Framework v8.1.0 v8.2.0 labels Dec 8, 2021
@YulNaumenko YulNaumenko self-assigned this Dec 8, 2021
@YulNaumenko YulNaumenko requested a review from a team as a code owner December 8, 2021 23:07
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Copy link
Contributor

ymao1 commented Dec 9, 2021

We are adding an execute-timeout event log entry when this happens for rule execution. Should we add one here as well?

@pmuellr
Copy link
Member

pmuellr commented Dec 9, 2021

I'm a little confused how just implementing a basically no-op cancel() function actually fixes this!

This appears to be the code that TM uses itself to cancel tasks on timeout:

private async cancelTask(task: TaskRunner) {
try {
this.logger.debug(`Cancelling task ${task.toString()}.`);
this.tasksInPool.delete(task.taskExecutionId);
await task.cancel();
} catch (err) {
this.logger.error(`Failed to cancel task ${task.toString()}: ${err}`);
}
}

And here is the code in the task runner that will end up getting called from that (same code in the ephemeral task runner):

public async cancel() {
const { task } = this;
if (task?.cancel) {
this.task = undefined;
return task.cancel();
}
this.logger.debug(`The task ${this} is not cancellable.`);
}

So it looks like setting this.task = undefined is the bit that actually causes the task to get marked as no longer running? Very obscure!

Or maybe I'm missing something. But if that was it, I think adding a comment to that indicating that it will cause the task state of running to be cleared would be good.

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko YulNaumenko requested a review from pmuellr December 13, 2021 03:19
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we could simplify the logic that's used for getting the actionInfo that's used for the event log entry

@YulNaumenko YulNaumenko requested a review from ymao1 December 14, 2021 21:17
@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@YulNaumenko
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @YulNaumenko

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; works as expected

@mikecote - thinking we should create a follow-up issue to allow a connector executor to determine if the task has been cancelled, and to provide the cancellable es client - same stuff we did for rule executors; low priority, since we aren't expecting timeouts for most connectors ...

@mikecote
Copy link
Contributor

@mikecote - thinking we should create a follow-up issue to allow a connector executor to determine if the task has been cancelled, and to provide the cancellable es client - same stuff we did for rule executors; low priority, since we aren't expecting timeouts for most connectors ...

Agreed, I'll try and create one unless you beat me to it! I think we have one already for the research that we could re-purposed to also make connector types leverage cancellation as mentioned ^^.

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@YulNaumenko YulNaumenko merged commit 158a9a5 into elastic:main Jan 19, 2022
YulNaumenko added a commit to YulNaumenko/kibana that referenced this pull request Jan 19, 2022
…meout by adding cancellation support (elastic#120853)

* [Actions] Fixed ad-hoc actions tasks remain as "running" when they timeout by adding cancellation support

* fixed test

* fixed tests

* fixed test

* removed test data

* fixed typechecks

* fixed typechecks

* fixed typechecks

* fixed tests

* fixed typechecks

* fixed tests

* fixed typechecks

* fixed test

* fixed tests

* fixed tests

* changed unit tests

* fixed tests

* fixed jest tests

* fixed typechecks

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 158a9a5)
YulNaumenko added a commit that referenced this pull request Jan 19, 2022
…meout by adding cancellation support (#120853) (#123417)

* [Actions] Fixed ad-hoc actions tasks remain as "running" when they timeout by adding cancellation support

* fixed test

* fixed tests

* fixed test

* removed test data

* fixed typechecks

* fixed typechecks

* fixed typechecks

* fixed tests

* fixed typechecks

* fixed tests

* fixed typechecks

* fixed test

* fixed tests

* fixed tests

* changed unit tests

* fixed tests

* fixed jest tests

* fixed typechecks

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit 158a9a5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting/RuleActions Issues related to the Actions attached to Rules on the Alerting Framework release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.0.0 v8.1.0 v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ad-hoc tasks remain as "running" when they timeout [Alerting] Add cancel to alert and action tasks
7 participants