[ResponseOps][Actions] Improve Task Manager’s retry logic for ad-hoc tasks #143860

Merged
doakalexi merged 7 commits into elastic:main from the alerting/improve-retry-logic branch on Oct 31, 2022

@doakalexi doakalexi commented Oct 24, 2022

Resolves #143048

Summary

Updates the retry logic in Task Manager. If the first attempt fails, the task is retried 30 seconds later. If the second attempt also fails, the next retry runs 5 minutes after it, and that delay doubles after each further failure. The schedule looks like this:

Attempt 1: now
Attempt 2: 30s after the first attempt
Attempt 3: 5m after the second attempt
Attempt 4: 10m after the third attempt
Attempt 5: 20m after the fourth attempt
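
The schedule above can be expressed as a small delay function. This is a minimal sketch, not the code merged in this PR: the function name `calculateRetryDelayMs` and both constants are hypothetical, and it assumes the delay keeps doubling from a 5-minute base, as the list implies.

```typescript
// Hypothetical helper illustrating the schedule listed above.
// Names and structure are assumptions, not taken from the PR diff.
const FIRST_RETRY_DELAY_MS = 30 * 1000; // 30s before attempt 2
const BASE_BACKOFF_MS = 5 * 60 * 1000; // 5m before attempt 3

/**
 * Returns the delay in milliseconds before the next attempt,
 * given how many attempts have already failed (1-based).
 */
function calculateRetryDelayMs(failedAttempts: number): number {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError('failedAttempts must be a positive integer');
  }
  if (failedAttempts === 1) {
    return FIRST_RETRY_DELAY_MS;
  }
  // 5m after the second failure, then doubling: 10m, 20m, ...
  return BASE_BACKOFF_MS * Math.pow(2, failedAttempts - 2);
}
```

Under these assumptions, calling the function with 1, 2, 3, and 4 failed attempts yields 30s, 5m, 10m, and 20m, matching the delays before attempts 2 through 5.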

Checklist

To verify

  • Create a rule and then force a retry failure
  • Verify that the retries follow the pattern described above
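
One way to check the observed timing during manual verification is to compare the gaps between attempt timestamps with the expected delays. The sketch below is illustrative only: `gapsFollowPattern`, `EXPECTED_GAPS_MS`, and the tolerance parameter are hypothetical, and it assumes a retry may start somewhat late (for example, because of polling) but never early.

```typescript
// Hypothetical verification helper; names and the tolerance are assumptions.
// Expected gaps between consecutive attempts: 30s, 5m, 10m, 20m.
const EXPECTED_GAPS_MS: number[] = [
  30 * 1000,
  5 * 60 * 1000,
  10 * 60 * 1000,
  20 * 60 * 1000,
];

/**
 * Returns true when each gap between consecutive attempt timestamps
 * (in epoch milliseconds) is at least the expected delay and at most
 * the expected delay plus toleranceMs.
 */
function gapsFollowPattern(attemptTimesMs: number[], toleranceMs: number): boolean {
  for (let i = 1; i < attemptTimesMs.length; i++) {
    const expected = EXPECTED_GAPS_MS[i - 1];
    if (expected === undefined) {
      return false; // more attempts than the listed schedule covers
    }
    const actual = attemptTimesMs[i] - attemptTimesMs[i - 1];
    // Assumption: a retry can be picked up late, but not before its scheduled time.
    if (actual < expected || actual > expected + toleranceMs) {
      return false;
    }
  }
  return true;
}
```

For example, attempts observed at 0s, 31s, and 332s pass with a 5-second tolerance, while a second attempt observed only 10 seconds after the first does not.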

@doakalexi doakalexi changed the title from "Improving task manager retry logic" to "[ResponseOps][Alerting] Improve Task Manager’s retry logic for ad-hoc tasks" on Oct 24, 2022
@doakalexi doakalexi added the Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams) and Feature:Actions/Framework (Issues related to the Actions Framework) labels on Oct 24, 2022
@doakalexi doakalexi changed the title from "[ResponseOps][Alerting] Improve Task Manager’s retry logic for ad-hoc tasks" to "[ResponseOps][Actions] Improve Task Manager’s retry logic for ad-hoc tasks" on Oct 24, 2022
@doakalexi doakalexi marked this pull request as ready for review October 24, 2022 17:18
@doakalexi doakalexi requested a review from a team as a code owner October 24, 2022 17:18
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 left a comment

LGTM. Verified that I see actions retry after 30 seconds and then in 5 minute increments

@@ -338,9 +338,9 @@ export default function ({ getService }: FtrProviderContext) {

await retry.try(async () => {
const scheduledTask = await currentTask(task.id);
expect(scheduledTask.attempts).to.be.greaterThan(0);
expect(scheduledTask.attempts).to.be.greaterThan(1);
expect(Date.parse(scheduledTask.runAt)).to.be.greaterThan(

Can we run this through the flaky test runner to make sure it's not flaky? I think we've had issues with these types of date based tests before.

Yes, I can do that!

@kibana-ci

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id                before  after  diff
osquery                1      2    +1

ESLint disabled line counts

id                before  after  diff
enterpriseSearch      19     21    +2
fleet                 57     63    +6
osquery              103    108    +5
securitySolution     439    443    +4
total                              +17

Total ESLint disabled count

id                before  after  diff
enterpriseSearch      20     22    +2
fleet                 65     71    +6
osquery              104    110    +6
securitySolution     516    520    +4
total                              +18

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ersin-erdal ersin-erdal left a comment

LGTM

@doakalexi doakalexi merged commit 6f1df84 into elastic:main Oct 31, 2022
@kibanamachine kibanamachine added the v8.6.0 and backport:skip (This commit does not require backporting) labels on Oct 31, 2022
@doakalexi doakalexi deleted the alerting/improve-retry-logic branch December 6, 2022 19:18
Labels

  • backport:skip (This commit does not require backporting)
  • Feature:Actions/Framework (Issues related to the Actions Framework)
  • release_note:enhancement
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)
  • v8.6.0