
Implement Automatic Double Testing On Test Run Failure #792

Closed
drewpost opened this issue Jun 21, 2023 · 19 comments
Labels: Team:Uptime (Label for the Uptime team)

@drewpost

drewpost commented Jun 21, 2023

Synthetic Monitors are considered the source of truth for SREs and Ops Engineers. When an error is created and an alert is sent, the product is saying that there is an active issue that needs to be dealt with. Synthetic monitoring removes the variability of browser, hardware and connection, and should only highlight genuine issues with an application.

The reality of the public internet, however, means that tests will sometimes fail for entirely transient reasons. Our current implementation of creating an error on a single failed test means that users are experiencing a higher-than-acceptable level of false alerts. This is critical to resolve. In order to make our product more reliable, we want to implement automatic double testing. This involves:

  • When a scheduled OR ad-hoc synthetic test run fails, an automatic second test is then triggered from the same location
  • If that second test passes, no error is created and the normal testing schedule continues
  • If that second test fails, an error is created and the normal testing schedule continues
  • The second test runs are treated like any other test run and show on charts and tables in the synthetics UI
  • As alerts are triggered off errors, the existing alert-generating behaviour will persist and the alert will be sent once the error is opened.
  • Default alert content should be updated to reflect the double testing verification of the failure
  • Double tests are chargeable like any other scheduled test run
  • This applies to both ping and browser monitors

More implementation details:

  • When a previously UP monitor is DOWN, a RETEST will automatically be run
    • For browser and lightweight checks
    • For the service and private locations
    • If the monitor was previously DOWN, a RETEST won't occur (this is to stop repeatedly retesting when we know the site is DOWN)
  • Heartbeat continues to send full data from all runs, but new schema additions to mark retries vs. failed initial attempt
    • i.e. a distinction between an initial run and a retest
  • Run now UI may need to be enhanced to show retests (needs UX/UI work)
    • If on the edit screen, a retest won't be run
    • If on the Overview page (run test), a retest will be triggered if the initial run is down
  • Add a retest badge / identifier to test history and last 10 runs for retested runs, for example:
    [screenshot]
  • Ensure Kibana charts are using status, not latest runs
    • e.g. Availability calculation is based on states (not runs)
    • e.g. Overview cards based on state
    • For this implementation, this will still be based on the number of executions (not time based)
    • i.e. same as it is now, but for retests, it only considers the retest (not the initial test)
  • Error (and corresponding alert) is only triggered or created on retest type
  • Add toggle for retest behaviour, default to on
    • If off, then error will be triggered on first test and there is no retest
    • This also ensures older stacks don’t do the retest (important as Kibana won’t know about retest types)
    • Can set the toggle
      • In the UI
        • Where should this go, under Advanced settings?
      • For Project Monitors (global synthetics.config.ts and per monitor)
        • The default will be on even if no configuration has been set (e.g. retest: true; see the config sketch below)
          • For lightweight and browser based monitors
        • Consider whether Synthetics should check the Kibana version and only send retest: true when Kibana is on a version that supports this configuration, or whether the setting will simply be ignored by older Kibana versions (to ensure behaviour doesn't change for older versions)
  • Leave HB timeout at ~15 minutes, but double the platform timeout
  • Document limit of 15 mins per test in the service
  • Ideally release the change in the service before billing begins
  • There is some mechanism in the schema for linking the initial run and the retest (i.e. knowing which test triggered a given retest)
    • For potential UI updates in the future

Ensure there is a follow-up release note and/or blog post that explains the new behaviour, how to disable it, and how retests, errors and alerts relate to each other.
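
To make the Project Monitors toggle above concrete, here is a rough synthetics.config.ts sketch. It uses the retest: true example name from the list above; the shipped option name and placement may differ, and the schedule/location values are placeholders.

```ts
// synthetics.config.ts — illustrative sketch only; `retest` follows the example
// name used in this issue and may not match the final option name.
export default {
  monitor: {
    schedule: 10,            // minutes between scheduled runs (placeholder)
    locations: ['us_east'],  // example service location
    retest: true,            // assumed global default: on even when not set
  },
};
```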

Addendum

For configs with no double testing config, that is to say ones where none is configured in the UI / Project monitors, we automatically perform a double test on transitions but do not index the initial failed runs. This would amount to a seamless update for users and would let us bring double testing forward backend-first, without needing UI support initially.

drewpost added the Team:Uptime label on Jun 21, 2023
@andrewvc
Contributor

Have you considered triple testing? The reason I ask is that if 1 fails and 1 succeeds, it could be that the site is unstable. Adding a third test and requiring 2/3 to pass seems more robust to me.

Another question: do we count the failed tests toward any metrics if the check as a whole succeeds? I will say that doing that is a lot more work versus just throwing out the failing test if a subsequent run succeeds.

I would say, at a minimum we should add a new field, something like summary.retest: {up: 1, down: 1} to track what's going on here, but I'm not sure we should index the failing test given the pervasive updates to the Synthetics UI that would require.

That being said, if we do discard that previous test it will require re-engineering how heartbeat talks to ES. Right now it streams data as it goes; if we decide to discard the initial failing run, we'll need to buffer full runs and index them only once we know there's no retest in store. Once we get the product reqs straight we should have a zoom to discuss the details, as there are a number of tradeoffs to balance here.
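
As a rough illustration of what that might look like on a summary document (field names here are hypothetical, not the final Heartbeat schema):

```ts
// Hypothetical shape only — not the final Heartbeat mapping.
interface Summary {
  up: number;
  down: number;
  // The suggestion above: counts across the initial run and the retest.
  retest?: { up: number; down: number };
  // Assumed field for linking a retest back to the run that triggered it
  // (illustrative name, e.g. the check group of the initial failed run).
  triggered_by?: string;
}
```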

@paulb-elastic
Contributor

I'm not sure triple testing would be so useful. The frequency should be set to an appropriate level so that the user is notified in a reasonable amount of time based on their sensitivity to an outage, but the double test at least protects against transient issues.

I'm not sure we should throw away the result either - it's a useful data point, but I understand the idea is that the error itself won't be created until a second test (the retest) fails, and the alert will be triggered by the error.

I guess by metrics, you are referring to availability. Ideally I'm expecting this should be based on the error trigger, but appreciate there may be complexity here. Let's discuss this further.

@paulb-elastic
Contributor

paulb-elastic commented Jun 26, 2023

Some points from a discussion earlier between @andrewvc @drewpost and @paulb-elastic:

  • Heartbeat continues to send full data from all runs, but new schema additions to mark retries vs. failed initial attempt
    • i.e. a distinction between an initial run and a retest
  • UI queries audited and upgraded to ignore failed initial attempt(s)
    • Won't be necessary - test results / history etc. will show all test runs (initial run and retests)
  • Run now UI may need to be enhanced to show retests (needs UX/UI work)
    • If on the edit screen, a retest won't be run
    • If on the Overview page (run test), a retest will be triggered if the initial run is down
  • Add a retest badge / identifier to test history and last 10 runs for retested runs, for example:
    [screenshot]
  • Ensure availability calculation is based on states (not runs)
    • For this implementation, this will still be based on the number of executions (not time based)
    • i.e. same as it is now, but for retests, it only considers the retest (not the initial test)
  • Errors only triggered or created on retest type
  • Add toggle for retest behaviour, default to on
    • If off, then error will be triggered on first test
    • This also ensures older stacks don’t do the retest (important as Kibana won’t know about retest types)
    • In the UI
    • For Project Monitors (global config.ts and per monitor)
  • Leave HB timeout at ~15 minutes, but double the platform timeout
  • Document limit of 15 mins per test in the service
  • Ideally release the change in the service before billing begins

Added after

  • Mechanism for linking the initial run and a retest (i.e. knowing which test triggered a given retest)
    • For potential UI updates in the future
  • Question for @drewpost
    • Should we not retest if there is already an error open for the monitor (i.e. for a permanently down monitor we don't continually double test)?

@drewpost
Author

Question for @drewpost
Should we not retest if there is already an error open for the monitor (i.e. for a permanently down monitor we don't continually double test)

Agreed. It should only be the initial failed test that triggers the retest.
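
Putting the agreed rules together (retest only when a previously up monitor fails, never retest a retest or an already-down monitor, and open the error only when the retest also fails), a minimal sketch with assumed names — not the actual Heartbeat implementation:

```ts
type Status = 'up' | 'down';

interface RunContext {
  previousStatus: Status;  // monitor status before this run
  isRetest: boolean;       // was this run itself triggered as a retest?
  retestEnabled: boolean;  // the toggle discussed above, default true
}

// Sketch of the decision rules agreed in this thread — not the real code.
function shouldRetest(result: Status, ctx: RunContext): boolean {
  if (!ctx.retestEnabled) return false;             // toggle off: no retest
  if (ctx.isRetest) return false;                   // never retest a retest
  if (ctx.previousStatus === 'down') return false;  // already known down: don't keep retesting
  return result === 'down';                         // previously up + failed run → retest
}

// With retesting on, only a failed retest opens the error (and hence the alert);
// with it off, the first failure does. An error already open for a still-down
// monitor simply stays open and is not modelled here.
function shouldOpenError(result: Status, ctx: RunContext): boolean {
  if (result === 'up') return false;
  return ctx.retestEnabled ? ctx.isRetest : true;
}
```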

@vigneshshanmugam
Member

Leave HB timeout at ~15 minutes, but double the platform timeout

I am not able to understand why we would need an increased timeout at the platform level. Asking because if the idea of a retest is to identify flakiness and make the test more robust, an increased timeout is not going to solve the problem.

Plus we should be mindful of retesting as a whole; this could backfire in specific scenarios where the webapp is experiencing an increased surge of traffic and our synthetic monitoring retest feature makes it even worse.

@andrewvc
Contributor

@vigneshshanmugam the reason is that we'd potentially run multiple tests back to back in the same container. Heartbeat would still enforce the timeout normally, but if you had, say, a 10-minute test, two back-to-back runs could need around 20 minutes, so you potentially couldn't fit the retest within a 15-minute platform timeout.

@paulb-elastic
Contributor

Added the details to the main description

@paulb-elastic
Contributor

@drewpost, as well as adding a (retest) description to the status label, @andrewvc also suggested maybe showing the intention of a retest, which makes it clear when one will occur, and when one won't.

For example, when there is a single failure:
[screenshot]

Compared to when something is in permanent error:
[screenshot]

I like this idea; maybe we should consider this as a stretch goal when implementing?

@drewpost
Author

drewpost commented Jun 30, 2023

I like the intent of this suggestion, as well.

What I'm struggling with a bit are the words used and placement.

  • "retest" is an adjective that describes the type of test run, not the state. It feels strange attaching it to the state.
  • "will retest" is unclear from a point-in-time perspective. As a user I'm left wondering if this means it's going to retest again, or if I need to do something to make it retest.

At the risk of going too much into solution mode, I wonder if this should be attached to the date/time string instead of the status.
cc @andrewvc

@paulb-elastic
Contributor

Yeah, that's a fair point, Drew, and tbh we were looking at this from the perspective of not introducing (at this stage) a new column describing the test type.

You're right though, maybe it's too disjointed from the status (which is the most important info there). Putting it alongside the date/time looks something like this (with a couple of different widths to see the impact of wrapping):

[screenshots at three different widths]

@andrewvc
Contributor

andrewvc commented Jul 5, 2023

@paulb-elastic @drewpost the intention of the "will retest" label is to let you know that the failure will be ignored. Maybe it should be "Result ignored due to retest policy, will retest". Thoughts?

@paulb-elastic
Contributor

I'm personally OK with (will retest), especially moving away from the status badge, but keen to hear @drewpost's thoughts.

@henrikno

henrikno commented Jul 7, 2023

I think this could be solved by only alerting if > n locations report failed checks.
It would also handle the case where, for example, the provider is having network issues in asia-northeast1, which if rerun will still be down and generate an alert. But if the US and Europe locations are saying the site is up, we probably don't want to alert.
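
For illustration, a location-quorum rule might look roughly like this (a sketch of the idea only, not how the Uptime rule is actually implemented; the names and threshold semantics are assumptions):

```ts
// Sketch only — alert when more than `minDownLocations` locations report the check down.
function shouldAlert(
  statusByLocation: Record<string, 'up' | 'down'>,
  minDownLocations: number, // the "n" from the comment above
): boolean {
  const down = Object.values(statusByLocation).filter((s) => s === 'down').length;
  return down > minDownLocations;
}

// Example: a provider issue in asia-northeast1 alone would not alert,
// but asia-northeast1 and us-east both down would.
shouldAlert({ 'asia-northeast1': 'down', 'us-east': 'up', 'europe-west': 'up' }, 1);   // false
shouldAlert({ 'asia-northeast1': 'down', 'us-east': 'down', 'europe-west': 'up' }, 1); // true
```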

@andrewvc
Contributor

Thanks @henrikno, that's actually how checks in the Uptime app work (you can even still set them today). I think we do have room for further refinement here. Pinging @drewpost on this as well. ^^

@andrewvc
Contributor

After further discussion we'd like to add one more AC to this: for configs with no double testing config, that is to say ones where none is configured in the UI / Project monitors, we automatically perform a double test on transitions but do not index the initial failed runs. This would amount to a seamless update for users and would let us bring double testing forward backend-first, without needing UI support initially.

andrewvc self-assigned this on Jul 20, 2023
andrewvc added a commit to elastic/beats that referenced this issue Aug 31, 2023
Adds retries to Heartbeat monitors. Part of elastic/synthetics#792

This refactors a ton of code around summarizing events, and cleans up a lot of tech debt as well.
@paulb-elastic
Contributor

paulb-elastic commented Sep 1, 2023

Linking to elastic/beats#36147 for the Heartbeat change
UI work is being addressed via elastic/kibana#162399

@paulb-elastic
Contributor

@andrewvc my understanding is that the behaviour you added in the Addendum of the main description is no longer being done, is that right (cc @drewpost)?

@shahzad31
Contributor

shahzad31 commented Sep 4, 2023

We are going to add a new column to indicate if it's a retest.
In the monitor config there will be a switch with the label:

  • Enable retest on failure

@paulb-elastic
Contributor

This has been released and tested in production, ready for the upcoming 8.11 release
