[Synthetics] copy alert state to alert context and implement alert recovery #128693

Merged


dominiqueclarke
Contributor

@dominiqueclarke dominiqueclarke commented Mar 28, 2022

Summary

Resolves #128760
Resolves #128761

This PR takes the first step towards transitioning Uptime alerting towards preferring context over state. For more information about why this change is important, please visit the ticket.

For more information on the overall stages of the transition, please see #126280
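As a rough illustration of what "copying state to context" means here, the sketch below writes the same monitor fields to both places, so legacy `state` templates and newer `context` templates can both resolve. `AlertLike`, `MonitorSummary`, and `scheduleWithStateAndContext` are hypothetical stand-ins for illustration, not the types or helpers used in this PR; the field names are taken from the message templates in this description.

```typescript
// Hypothetical shape of the monitor status fields; the names mirror the
// variables used in the alert message templates in this description.
interface MonitorSummary {
  monitorName: string;
  monitorUrl: string;
  observerLocation: string;
  statusMessage: string;
  latestErrorMessage: string;
}

// Minimal stand-in for an alert instance (an assumption, not the real
// Kibana alerting type).
interface AlertLike<State, Context> {
  replaceState(state: State): void;
  scheduleActions(actionGroup: string, context: Context): void;
}

// Keep writing the summary to state for legacy templates, and pass a copy
// of the same fields as context for templates that prefer context.
function scheduleWithStateAndContext(
  alert: AlertLike<MonitorSummary, MonitorSummary>,
  actionGroup: string,
  summary: MonitorSummary
): void {
  alert.replaceState(summary);
  alert.scheduleActions(actionGroup, { ...summary });
}
```

Passing a copy rather than the same object keeps later changes to state from leaking into the context that actions render.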

Testing that State has been copied over to Context

  1. Connect to ES via oblt-cli. This ensures you will have at least one monitor that is failing and one with an aging or expired certificate.
  2. Navigate to Uptime overview and create a monitor status rule from the alert flyout, making sure to give the rule a name.
  3. Select an alert connector. The easiest connector to configure is a server log. Click save.
  4. Ensure the alert arrives in your server log with the appropriate alert message.
  5. Navigate to Uptime overview and create a TLS rule from the alert flyout, making sure to give the rule a name.
  6. Select the server log connector. Click save.
  7. Ensure the TLS alert arrives in your server log with the appropriate alert message.

Note
You'll likely want to set up control rules to confirm that the alert messages match our original implementation and still work for legacy users who use state instead of context.
To do so, create two more rules, one for monitor status and one for TLS. Set each rule's alert message to the matching template below, then confirm that the control output (state) matches the test output (context).

Monitor Status

Monitor {{state.monitorName}} with url {{{state.monitorUrl}}} from {{state.observerLocation}} {{{state.statusMessage}}} The latest error message is {{{state.latestErrorMessage}}}

TLS

Detected TLS certificate {{state.commonName}} from issuer {{state.issuer}} is {{state.status}}. Certificate {{state.summary}}
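When comparing control and test rules, it can help to derive the context version of a message mechanically from the state version. The helper below is a minimal sketch, assuming the context fields carry exactly the same names as the state fields (which is the intent of copying state to context); `stateTemplateToContext` is a hypothetical name and not part of this PR.

```typescript
// Rewrite Mustache variables that read from `state.` so they read from
// `context.` instead. Handles both {{double}} and {{{triple}}} braces.
// Assumption: every state field has a context field with the same name.
function stateTemplateToContext(template: string): string {
  return template.replace(/(\{\{\{?\s*)state\./g, '$1context.');
}

const controlMessage =
  'Detected TLS certificate {{state.commonName}} from issuer {{state.issuer}} is {{state.status}}. Certificate {{state.summary}}';

// The same message, reading every variable from context.
const testMessage = stateTemplateToContext(controlMessage);
```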

Testing alert resolution

Monitor Status

  1. Start your own failing monitor via the synthetics service or running heartbeat from source. This ensures you're able to control your monitor's up and down state.
  2. Create a monitor status alert. To make things simple, I recommend creating an alert for when the monitor has failed more than once within the last 3 minutes, and removing the availability check.
  3. Before saving your alert, ensure that you add an action for the Recovery action group. First select a connector type (I like to use Server logs). Then add the default action for Run When: Uptime Monitor Down. Click Add action again. This time, change Run When to Recovered. Confirm that the default content for recovery matches the AC defined in [Uptime] Specify alert recovery context #128761.


  4. Wait for the monitor alert to trigger.
  5. Force the monitor to resolve by either A.) stopping heartbeat from running or deleting the monitor or B.) changing the monitor config to a value that would trigger an UP status.
  6. Confirm that you receive the alert recovery message to your specified action connector.
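The recovery behavior exercised by these steps can be pictured roughly as follows: once an alert is no longer active, the rule attaches a context to the recovered alert so the Recovered action has values to render. This is only a sketch under stated assumptions; `RecoveredAlertLike` and `setRecoveryContext` are hypothetical stand-ins defined here and do not reproduce the alerting framework's real types or this PR's implementation.

```typescript
// Minimal stand-in for a recovered alert (an assumption, not the real
// alerting framework type).
interface RecoveredAlertLike {
  getId(): string;
  setContext(context: Record<string, string>): void;
}

// Hypothetical helper: for each recovered alert, look up the last summary
// recorded for it and attach a copy as the recovery context.
// Returns how many alerts received a context.
function setRecoveryContext(
  recoveredAlerts: RecoveredAlertLike[],
  lastKnownSummaries: Record<string, Record<string, string>>
): number {
  let updated = 0;
  for (const alert of recoveredAlerts) {
    const id = alert.getId();
    // Skip alerts we have no recorded summary for.
    if (!Object.prototype.hasOwnProperty.call(lastKnownSummaries, id)) continue;
    alert.setContext({ ...lastKnownSummaries[id] });
    updated += 1;
  }
  return updated;
}
```

Alerts without a recorded summary are skipped rather than given an empty context, so a missing lookup is visible in the return value instead of producing a blank recovery message.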

TLS

  1. Run your own monitors, or connect to ES via oblt-cli.
  2. Navigate to Uptime Settings
  3. Change age limit for certificates to 1 day


  4. Create a TLS status alert. Make sure you add an action for when the alert is triggered as well as for when the alert is recovered. Confirm that the default content for recovery matches the AC defined in https://github.com//issues/128761.
  5. Wait for the alert to trigger.
  6. Navigate back to Uptime Settings. Change the age limit back to 730 days.
  7. Confirm that you receive the alert recovery message to your specified action connector.
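The reason toggling the age limit works as a trigger and recovery switch can be sketched with a simple threshold check. This is an illustrative assumption about the comparison (age measured in whole days from the certificate's start of validity, and "too old" meaning strictly greater than the limit); `certificateAgeInDays` and `exceedsAgeLimit` are hypothetical names, not functions from this PR.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Age of a certificate in whole days, measured from its notBefore date.
function certificateAgeInDays(notBefore: Date, now: Date): number {
  return Math.floor((now.getTime() - notBefore.getTime()) / MS_PER_DAY);
}

// Assumption: a certificate is "too old" once its age exceeds the limit.
function exceedsAgeLimit(notBefore: Date, now: Date, ageLimitDays: number): boolean {
  return certificateAgeInDays(notBefore, now) > ageLimitDays;
}

const issued = new Date('2022-01-01T00:00:00Z');
const checkedAt = new Date('2022-04-27T00:00:00Z');

// With a 1 day limit, nearly any certificate is too old, so the alert fires.
const firesWithOneDayLimit = exceedsAgeLimit(issued, checkedAt, 1);
// With the limit back at 730 days, the same certificate no longer qualifies,
// which is what allows the alert to recover.
const firesWithDefaultLimit = exceedsAgeLimit(issued, checkedAt, 730);
```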

@dominiqueclarke dominiqueclarke force-pushed the feature/uptime-alert-context branch from df965ff to a8269d2 Compare March 29, 2022 02:00
@dominiqueclarke dominiqueclarke marked this pull request as ready for review March 29, 2022 14:34
@dominiqueclarke dominiqueclarke requested a review from a team as a code owner March 29, 2022 14:34
@dominiqueclarke dominiqueclarke added the Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability label Mar 29, 2022
@elasticmachine
Contributor

Pinging @elastic/uptime (Team:uptime)

@dominiqueclarke dominiqueclarke added v8.2.0 enhancement New value added to drive a business result release_note:enhancement labels Mar 29, 2022
@dominiqueclarke dominiqueclarke changed the title copy alert state to alert context [Uptime] copy alert state to alert context Mar 29, 2022
@dominiqueclarke
Contributor Author

@elasticmachine merge upstream

@dominiqueclarke
Contributor Author

@elasticmachine merge upstream

@dominiqueclarke dominiqueclarke changed the title [Uptime] copy alert state to alert context [Synthetics] copy alert state to alert context Apr 25, 2022
@dominiqueclarke dominiqueclarke changed the title [Synthetics] copy alert state to alert context [Synthetics] copy alert state to alert context and implement alert recovery Apr 25, 2022
@dominiqueclarke
Contributor Author

@elasticmachine merge upstream

],
state: [...commonMonitorStateI18, ...commonStateTranslations],
},
isExportable: true,
minimumLicenseRequired: 'basic',
doesSetRecoveryContext: true,
Contributor

Does this need to be set for other alert types?

Contributor

@shahzad31 shahzad31 left a comment

Tested e2e and everything looks fine !!

Great work on this !!

@dominiqueclarke
Contributor Author

@elasticmachine merge upstream

@dominiqueclarke
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
synthetics 792.4KB 792.4KB +16.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
synthetics 22.0KB 23.2KB +1.3KB

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@dominiqueclarke dominiqueclarke merged commit 2b5de74 into elastic:main May 9, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 9, 2022
@dominiqueclarke dominiqueclarke deleted the feature/uptime-alert-context branch May 9, 2022 15:55
kertal pushed a commit to kertal/kibana that referenced this pull request May 24, 2022
…covery (elastic#128693)

* copy alert state to alert context

* adjust alert translations

* uptime - implement alert recovery

* adjust tests

* [CI] Auto-commit changed files from 'node scripts/eslint --no-cache --fix'

* remove unused constant

* update snapshot

* add default recovery messages

* update snapshot

* add doesSetRecoveryContext to uptime duration anomaly alert

Co-authored-by: Kibana Machine <[email protected]>
Labels
backport:skip This commit does not require backporting
enhancement New value added to drive a business result
release_note:enhancement
Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability
v8.3.0
Development

Successfully merging this pull request may close these issues.

[Uptime] Specify alert recovery context
[Uptime] Alerting - Copy Uptime alert state to context
5 participants