
[Alerting] Smarter retry interval for ES Connectivity errors #123642

Merged · 4 commits · Jan 27, 2022

Conversation

@ymao1 (Contributor) commented Jan 24, 2022

Resolves #122390

Summary

When the alerting task runner throws an error, we check whether it is an instance of the ES Unavailable error introduced in this PR and adjust the retry interval accordingly. The default connectivity retry is 5m: if the alerting rule's schedule interval is shorter than 5 minutes, we retry on the rule's schedule; otherwise we retry in 5m.
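The retry-interval selection described above can be sketched as follows. This is a minimal illustration, not the actual Kibana implementation; `parseDuration`, `getEsErrorRetryInterval`, and the constant name are all hypothetical.

```typescript
// Default retry when Elasticsearch is unreachable (illustrative constant).
const CONNECTIVITY_RETRY_INTERVAL = '5m';

// Parse a simple duration string like '10s', '5m', or '1h' into milliseconds.
function parseDuration(duration: string): number {
  const value = parseInt(duration, 10);
  const unit = duration.slice(String(value).length);
  switch (unit) {
    case 's':
      return value * 1000;
    case 'm':
      return value * 60 * 1000;
    case 'h':
      return value * 60 * 60 * 1000;
    default:
      throw new Error(`Invalid duration: ${duration}`);
  }
}

// If the rule's schedule is shorter than the default connectivity retry,
// retry on the rule's own schedule; otherwise cap the retry at 5m.
function getEsErrorRetryInterval(ruleScheduleInterval: string): string {
  return parseDuration(ruleScheduleInterval) <
    parseDuration(CONNECTIVITY_RETRY_INTERVAL)
    ? ruleScheduleInterval
    : CONNECTIVITY_RETRY_INTERVAL;
}

console.log(getEsErrorRetryInterval('10s')); // '10s' — shorter than 5m
console.log(getEsErrorRetryInterval('10m')); // '5m' — capped at the default
```

This keeps fast rules responsive after a transient outage while preventing slow rules from retrying any sooner than their normal cadence would.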


@ymao1 ymao1 changed the title Checking for es connectivity errors and adjusting retry interval acco… [Alerting] Smarter retry interval for ES Connectivity errors Jan 24, 2022
@ymao1 ymao1 self-assigned this Jan 24, 2022
@ymao1 ymao1 added backport:skip This commit does not require backporting Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.1.0 labels Jan 24, 2022
@ymao1 ymao1 marked this pull request as ready for review January 24, 2022 19:20
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

```typescript
const runnerResult = await taskRunner.run();
expect(runnerResult.schedule!.interval).toEqual('10s');
```
@ersin-erdal (Contributor) commented Jan 25, 2022


Just a nitpick: maybe we can use mockedTaskInstance.schedule?.interval rather than a hardcoded string. It took me some time to figure out where this number comes from :)
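A minimal sketch of the suggested change, assuming `mockedTaskInstance` is the test's mock task fixture (the name and shape here are illustrative, not the exact fixture from the PR):

```typescript
// Illustrative stand-in for the test's mock task instance.
const mockedTaskInstance = { schedule: { interval: '10s' } };

// Instead of hardcoding the expected interval:
//   expect(runnerResult.schedule!.interval).toEqual('10s');
// derive it from the mock, so the assertion documents where the value comes from:
const expectedInterval = mockedTaskInstance.schedule?.interval;
console.log(expectedInterval); // logs the interval taken from the mock
```

Tying the expectation to the fixture means a future change to the mock's schedule cannot silently diverge from the hardcoded string.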

@ymao1 (Author)

Updated in this commit: 1e5e9f9

@ersin-erdal ersin-erdal left a comment


LGTM


ymao1 commented Jan 26, 2022

@elasticmachine merge upstream

@kibana-ci (Collaborator)

💚 Build Succeeded

Metrics [docs]: ✅ unchanged

To update your PR or re-run it, comment: @elasticmachine merge upstream

cc @ymao1


pmuellr commented Jan 26, 2022

I can't imagine there's any way to build a functional test for this, is there? I wonder if anyone else in Kibana has functional tests that require ES connectivity errors? I'm thinking even if we could, we could only reasonably test that intervals < 5m would run before the default 5m timeout; don't want to wait 5m in a functional test to see if intervals > 5m would run again in 5m :-)


ymao1 commented Jan 26, 2022

> I can't imagine there's any way to build a functional test for this, is there? I wonder if anyone else in Kibana has functional tests that require ES connectivity errors? I'm thinking even if we could, we could only reasonably test that intervals < 5m would run before the default 5m timeout; don't want to wait 5m in a functional test to see if intervals > 5m would run again in 5m :-)

I double-checked the original core PR for adding the connectivity error type and there are no functional tests for it. I imagine it would be difficult to mimic connectivity errors in the tests.

@pmuellr pmuellr left a comment


LGTM

@ymao1 ymao1 merged commit 0d951bc into elastic:main Jan 27, 2022
@ymao1 ymao1 deleted the alerting/smarter-retry-interval branch January 27, 2022 00:29