Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Initial implementation of alerting task cancel() #114289

Merged
merged 40 commits into from
Nov 16, 2021

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Oct 7, 2021

Resolves #113459

Summary

With this PR, the alerting task runner implements a cancel() function (currently only called by the task manager when an alerting task exceeds its timeout value).

  • writes an Event Log document with action execute-timeout
  • updates the execution status in the rule saved object to reflect the timeout with status:error, reason:timeout.
  • adds alerting config cancelAlertsOnRuleTimeout to control action skipping behavior at framework level, defaults to true
  • adds rule type config cancelAlertsOnRuleTimeout to control action skipping behavior at rule type level. if not set, defaults to whatever the value of the alerting config cancelAlertsOnRuleTimeout is
  • skips scheduling actions if the execution is cancelled AND the cancelAlertsOnRuleTimeout in kibana.yml is true AND the rule type cancelAlertsOnRuleTimeout is true

To Verify

  1. Modify the Elasticsearch query rule type x-pack/plugins/stack_alerts/server/alert_types/es_query/alert_type.ts:
@@ -148,6 +148,7 @@ export function getAlertType(logger: Logger): AlertType<
     isExportable: true,
     executor,
     producer: STACK_ALERTS_FEATURE_ID,
+    ruleTaskTimeout: '1m',
   };

   async function executor(
@@ -170,6 +171,8 @@ export function getAlertType(logger: Logger): AlertType<
       throw new Error(getInvalidComparatorError(params.thresholdComparator));
     }

+    await new Promise((resolve) => setTimeout(resolve, 120000));

This sets the ruleTaskTimeout to 1m and adds a 2 minute delay to the executor

  1. Create an Elasticsearch query rule (make sure it is one that will fire!) with an attached action and let it run.
  2. After 1 minute, you should see this warning in the logs:
[WARN ][plugins.taskManager] Cancelling task alerting:.es-query "f15847e0-2a8e-11ec-b690-17b293411601" as it expired at 2021-10-11T12:30:58.026Z after running for 01m 02s (with timeout set at 1m).
  1. In the rule management view, the rule should be in an error state

Screen Shot 2021-10-11 at 8 35 50 AM

  1. Verify in the kibana index that the rule SO execution status has error status with reason: timeout:
Rule SO execution status
{
  "executionStatus" : {
    "status" : "error",
    "lastExecutionDate" : "2021-10-11T12:36:01.073Z",
    "error" : {
      "reason" : "timeout",
      "message" : "rule execution cancelled due to timeout: \".es-queryf08d2fb0-2a8e-11ec-b690-17b293411601\""
    }
  }
}
  1. Verify in the event log index there is a document for execute-timeout:
Event log doc
{
	"@timestamp": "2021-10-11T12:31:00.923Z",
	"event": {
		"provider": "alerting",
		"action": "execute-timeout",
		"kind": "alert",
		"category": [
			"stackAlerts"
		]
	},
	"message": "rule execution cancelled due to timeout: \".es-queryf08d2fb0-2a8e-11ec-b690-17b293411601\"",
	"kibana": {
		"saved_objects": [{
			"rel": "primary",
			"type": "alert",
			"id": "f08d2fb0-2a8e-11ec-b690-17b293411601",
			"type_id": ".es-query"
		}],
		"server_uuid": "5b2de169-2785-441b-ae8c-186a1936b17d",
		"version": "8.0.0"
	},
	"rule": {
		"id": "f08d2fb0-2a8e-11ec-b690-17b293411601",
		"license": "basic",
		"category": ".es-query",
		"ruleset": "stackAlerts"
	},
	"ecs": {
		"version": "1.8.0"
	}
}
  1. Verify you receive no action

  2. Now add xpack.alerting.cancelAlertsOnRuleTimeout: false to your kibana config and repeat steps 1-7. Verify that all the timeout docs/statuses get written as expected, but you also receive your action at the end of rule execution (after timeout).

  3. Now remove xpack.alerting.cancelAlertsOnRuleTimeout: false from your kibana config and add cancelAlertsOnRuleTimeout: false to the Elasticsearch rule type definition in x-pack/plugins/stack_alerts/server/alert_types/es_query/alert_type.ts. Repeat steps 1-7 and verify that all the timeout docs/statuses get written as expected, but you also receive your action at the end of rule execution (after timeout).

Checklist

Delete any items that are not applicable to this PR.

@ymao1 ymao1 changed the title Alerting/cancel alerting tasks [Alerting] Initial implementation of alerting task cancel() Oct 8, 2021
@ymao1
Copy link
Contributor Author

ymao1 commented Oct 11, 2021

@elasticmachine merge upstream

@ymao1 ymao1 self-assigned this Oct 11, 2021
@ymao1 ymao1 added Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes v7.16.0 v8.0.0 labels Oct 11, 2021
@ymao1 ymao1 marked this pull request as ready for review October 11, 2021 12:52
@ymao1 ymao1 requested review from a team as code owners October 11, 2021 12:52
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@@ -0,0 +1,703 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a new file for the cancel tests bc task_runner.test.ts is enormous

Copy link
Member

@jbudz jbudz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kibana-docker LGTM

@chrisronline
Copy link
Contributor

@ymao1

I'm not seeing the execution status updating properly:

Screen Shot 2021-10-11 at 1 57 13 PM

{
  "executionStatus": {
    "status": "active",
    "lastExecutionDate": "2021-10-11T17:55:49.057Z",
    "error": null
  }
}

but I also see the execute-timeout in the event log:

POST .kibana-event-log-*/_search
{
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "event.action": {
        "value": "execute-timeout"
      }
    }
  }
}

-->

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : ".kibana-event-log-8.0.0-000001",
        "_id" : "bh2AcHwBEGeKXpWr6gf5",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2021-10-11T17:57:49.084Z",
          "event" : {
            "provider" : "alerting",
            "action" : "execute-timeout",
            "kind" : "alert",
            "category" : [
              "stackAlerts"
            ]
          },
          "message" : "rule execution cancelled due to timeout: \".es-query96e49850-2aba-11ec-96ec-5f9c1e6661e6\"",
          "kibana" : {
            "saved_objects" : [
              {
                "rel" : "primary",
                "type" : "alert",
                "id" : "96e49850-2aba-11ec-96ec-5f9c1e6661e6",
                "type_id" : ".es-query"
              }
            ],
            "server_uuid" : "5b2de169-2785-441b-ae8c-186a1936b17d",
            "version" : "8.0.0"
          },
          "rule" : {
            "id" : "96e49850-2aba-11ec-96ec-5f9c1e6661e6",
            "license" : "basic",
            "category" : ".es-query",
            "ruleset" : "stackAlerts"
          },
          "ecs" : {
            "version" : "1.8.0"
          }
        },
        "sort" : [
          1633975069084
        ]
      }
    ]
  }
}

Maybe I missed a step in the setup? I'm going to dive in a little more but wanted to give you an early heads up

@chrisronline
Copy link
Contributor

chrisronline commented Oct 11, 2021

I added a console.trace({ attributes }) to this line and this is what I see when the problem above happens:

[2021-10-11T14:58:04.941-04:00][WARN ][plugins.taskManager] Cancelling task alerting:.es-query "05fd6820-2ac5-11ec-8e22-e9237d2e050c" as it expired at 2021-10-11T18:58:04.920Z after running for 01m 00s (with timeout set at 1m).
Trace: {
  attributes: {
    executionStatus: {
      lastExecutionDate: '2021-10-11T18:58:04.941Z',
      status: 'error',
      error: [Object]
    }
  }
}
    at TaskRunner.updateRuleExecutionStatus (/Users/chris/dev/repos/kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts:215:13)
    at TaskRunner.cancel (/Users/chris/dev/repos/kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts:735:16)
    at TaskManagerRunner.cancel (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/task_running/task_runner.ts:407:19)
    at TaskPool.cancelTask (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/task_pool.ts:207:18)
    at TaskPool.cancelExpiredTasks (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/task_pool.ts:198:14)
    at TaskPool.availableWorkers (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/task_pool.ts:89:10)
    at TaskPool.run (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/task_pool.ts:112:74)
    at run (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/polling_lifecycle.ts:264:40)
    at onOk (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/lib/fill_pool.ts:88:23)
    at map (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/lib/result_type.ts:104:25)
    at MergeMapSubscriber.project (/Users/chris/dev/repos/kibana/x-pack/plugins/task_manager/server/lib/fill_pool.ts:75:11)
    at MergeMapSubscriber._tryNext (/Users/chris/dev/repos/kibana/node_modules/rxjs/src/internal/operators/mergeMap.ts:131:21)
    at MergeMapSubscriber._next (/Users/chris/dev/repos/kibana/node_modules/rxjs/src/internal/operators/mergeMap.ts:121:12)
    at MergeMapSubscriber.Subscriber.next (/Users/chris/dev/repos/kibana/node_modules/rxjs/src/internal/Subscriber.ts:99:12)
    at TapSubscriber._next (/Users/chris/dev/repos/kibana/node_modules/rxjs/src/internal/operators/tap.ts:125:22)
    at TapSubscriber.Subscriber.next (/Users/chris/dev/repos/kibana/node_modules/rxjs/src/internal/Subscriber.ts:99:12)


Trace: {
  attributes: {
    executionStatus: {
      lastExecutionDate: '2021-10-11T18:58:07.960Z',
      status: 'active',
      error: null
    }
  }
}
    at TaskRunner.updateRuleExecutionStatus (/Users/chris/dev/repos/kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts:215:13)
    at TaskRunner.run (/Users/chris/dev/repos/kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts:651:18)

@ymao1
Copy link
Contributor Author

ymao1 commented Oct 11, 2021

@chrisronline Thanks! I'll take a look. First thought is maybe a race condition when the rule timeout is the same as the schedule interval 🤔

@ymao1
Copy link
Contributor Author

ymao1 commented Nov 10, 2021

@elasticmachine merge upstream

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; left a bunch of comments, mainly questions/nits - I think docs is the only thing that we need to have, out of all the comments.

x-pack/plugins/alerting/server/task_runner/task_runner.ts Outdated Show resolved Hide resolved
x-pack/plugins/alerting/server/plugin.ts Outdated Show resolved Hide resolved
x-pack/plugins/alerting/server/plugin.ts Outdated Show resolved Hide resolved
@@ -199,6 +199,7 @@ kibana_vars=(
xpack.alerting.invalidateApiKeysTask.interval
xpack.alerting.invalidateApiKeysTask.removalDelay
xpack.alerting.defaultRuleTaskTimeout
xpack.alerting.cancelAlertsOnRuleTimeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be adding docs for this config as well, right? Presumably we'll create a PR to cloud-enable this as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create a cloud PR for this config when this PR gets merged but I didn't think user docs make sense for this config?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think user docs make sense for this config?

Why not?

I think the one-liner that you added makes sense and is good enough :-)

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far! I had a couple of questions to start

@elastic elastic deleted a comment from kibanamachine Nov 15, 2021
Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work!

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes LGTM; thanks!

@@ -199,6 +199,7 @@ kibana_vars=(
xpack.alerting.invalidateApiKeysTask.interval
xpack.alerting.invalidateApiKeysTask.removalDelay
xpack.alerting.defaultRuleTaskTimeout
xpack.alerting.cancelAlertsOnRuleTimeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think user docs make sense for this config?

Why not?

I think the one-liner that you added makes sense and is good enough :-)

@ymao1 ymao1 added the backport:skip This commit does not require backporting label Nov 16, 2021
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
alerting 251 252 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
triggersActionsUi 777.5KB 777.8KB +312.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alerting 36.6KB 36.6KB +20.0B
Unknown metric groups

API count

id before after diff
alerting 259 260 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 905ac6c into elastic:main Nov 16, 2021
jloleysens added a commit to jloleysens/kibana that referenced this pull request Nov 16, 2021
…igrate-away-from-injected-css-js

* 'main' of github.com:elastic/kibana: (221 commits)
  [Reporting] Add log level to config (elastic#118319)
  [Metrics UI] Skip failing waffle chart color palette test (elastic#118527)
  [APM] chore: Unify naming of 'apm/scripts/**/*' with snake_case (elastic#118328)
  skip flaky suites (elastic#99581, elastic#118356, elastic#118474)
  [Alerting] Initial implementation of alerting task `cancel()` (elastic#114289)
  chore(NA): creates pkg_npm_types bazel rule (elastic#116465)
  skip flaky suite (elastic#116892)
  Bump chromedriver to 95.0.0 (elastic#116724)
  [Data visualizer] Improve design of expanded rows (elastic#118125)
  [APM] prefer ECS field names for HTTP and URL (elastic#118485)
  Update query_debugging_in_development_and_production.md (elastic#118491)
  [Uptime] adjust Elastic Synthetics integration functional tests (elastic#118163)
  [kbn/rule-data-utils] add submodules and require public use them (elastic#117963)
  [DOCS] Refresh APM correlation screenshots (elastic#116723) (elastic#118577)
  Handles ns to ms conversion for event loop delay metrics (elastic#118447)
  [Cases] Rename functional tests folder (elastic#118490)
  dummy commit
  skip flaky suite (elastic#118593)
  Improve workpad schema validation (elastic#97838)
  skip flaky suite (elastic#118584)
  ...

# Conflicts:
#	src/plugins/dashboard/public/application/embeddable/viewport/dashboard_viewport.tsx
mbondyra pushed a commit to mbondyra/kibana that referenced this pull request Nov 19, 2021
…c#114289)

* Added cancel() to alerting task runner and writing event log document

* Updating rule saved object with timeout execution status

* Skip scheduling actions and logging event log for alerts if rule execution is cancelled

* Adding config for disabling skipping actions

* Fixing types

* Adding flag for rule types to opt out of skipping acitons

* Using task runner uuid to differentiate between task instances

* Adding functional test

* Default to timestamp when startedAt is not available

* Reverting previous change and updating task pool filter instead

* Fixing functional test

* Adding debug logging

* Fixing unit tests

* Fixing unit tests

* Adding rule name to event log doc and rule type timeout to log messages

* Simplifying register logic and adding check to see if already cancelled

* Updating task uuid based on PR comments

* Removing observable

* Fixing functional test

* Adding to docs

Co-authored-by: Kibana Machine <[email protected]>
dmlemeshko pushed a commit that referenced this pull request Nov 29, 2021
* Added cancel() to alerting task runner and writing event log document

* Updating rule saved object with timeout execution status

* Skip scheduling actions and logging event log for alerts if rule execution is cancelled

* Adding config for disabling skipping actions

* Fixing types

* Adding flag for rule types to opt out of skipping acitons

* Using task runner uuid to differentiate between task instances

* Adding functional test

* Default to timestamp when startedAt is not available

* Reverting previous change and updating task pool filter instead

* Fixing functional test

* Adding debug logging

* Fixing unit tests

* Fixing unit tests

* Adding rule name to event log doc and rule type timeout to log messages

* Simplifying register logic and adding check to see if already cancelled

* Updating task uuid based on PR comments

* Removing observable

* Fixing functional test

* Adding to docs

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Alerting] Implement alerting task cancel() hook
7 participants