
Implement Automatic Double Testing On Test Run Failure #792

Closed
drewpost opened this issue Jun 21, 2023 · 19 comments
Labels: Team:Uptime (Label for the Uptime team)

@drewpost

drewpost commented Jun 21, 2023

Synthetic Monitors are considered the source of truth for SREs and Ops Engineers. When an error is created and an alert is sent, the product is saying that there is an active issue that needs to be dealt with. Synthetic monitoring removes the variability of browser, hardware and connection, and should only highlight genuine issues with an application.

The reality of the public internet, however, means that tests will sometimes fail for entirely transient reasons. Our current implementation of creating an error on a single failed test means that users are experiencing a higher-than-acceptable level of false alerts. This is critical to resolve. In order to make our product more reliable, we want to implement automatic double testing. This involves:

  • When a scheduled OR ad-hoc synthetic test run fails, an automatic second test is then triggered from the same location
  • If that second test passes, no error is created and the normal testing schedule continues
  • If that second test fails, an error is created and the normal testing schedule continues
  • The second test runs are treated like any other test run and show on charts and tables in the synthetics UI
  • As alerts are triggered off errors, the existing alert-generating behaviour will persist and the alert will be sent once the error is opened.
  • Default alert content should be updated to reflect the double testing verification of the failure
  • Double tests are chargeable like any other scheduled test run
  • This applies to both ping and browser monitors

More implementation details:

  • When a previously UP monitor is DOWN, a RETEST will automatically be run
    • For browser and lightweight checks
    • For the service and private locations
    • If the monitor was previously DOWN, a RETEST won't occur (this is to stop repeatedly retesting when we know the site is DOWN)
  • Heartbeat continues to send full data from all runs, but new schema additions to mark retries vs. failed initial attempt
    • i.e. a distinction between an initial run and a retest
  • Run now UI may need to be enhanced to show retests (needs UX/UI work)
    • If on the edit screen, a retest won't be run
    • If on the Overview page (run test), a retest will be triggered if the initial run is down
  • Add a retest badge / identifier to test history and last 10 runs for retested runs, for example:
    [screenshot]
  • Ensure Kibana charts are using status, not latest runs
    • e.g. Availability calculation is based on states (not runs)
    • e.g. Overview cards based on state
    • For this implementation, this will still be based on the number of executions (not time based)
    • i.e. same as it is now, but for retests, it only considers the retest (not the initial test)
  • Error (and corresponding alert) is only triggered or created on retest type
  • Add toggle for retest behaviour, default to on
    • If off, then error will be triggered on first test and there is no retest
    • This also ensures older stacks don’t do the retest (important as Kibana won’t know about retest types)
    • Can set the toggle
      • In the UI
        • Where should this go, under Advanced settings?
      • For Project Monitors (global synthetics.config.ts and per monitor)
        • The default will be on even if no configuration has been set (e.g. retest: true; see the config sketch below)
          • For lightweight and browser based monitors
        • Consider whether Synthetics should check the Kibana version and only send retest: true when Kibana is on a version that supports this configuration, or whether the setting will simply be ignored by older Kibana versions (to ensure behaviour doesn't change for older versions)
  • Leave HB timeout at ~15 minutes, but double the platform timeout
  • Document limit of 15 mins per test in the service
  • Ideally release the change in the service before billing begins
  • There is some mechanism in the schema for linking the initial run and the retest (i.e. knowing which test triggered a given retest)
    • For potential UI updates in the future

Ensure there is a follow-up release note and/or blog post that explains the new behaviour, how to disable it, and how retests, errors and alerts relate to each other.
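
To make the Project Monitors toggle above concrete, here is a rough synthetics.config.ts sketch. It uses the retest: true example name from the list above; the shipped option name and placement may differ, and the schedule/location values are placeholders.

```ts
// synthetics.config.ts — illustrative sketch only; `retest` follows the example
// name used in this issue and may not match the final option name.
export default {
  monitor: {
    schedule: 10,            // minutes between scheduled runs (placeholder)
    locations: ['us_east'],  // example service location
    retest: true,            // assumed global default: on even when not set
  },
};
```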

Addendum

For configs with no double testing config, that is to say ones where none is configured in the UI / Project monitors, we automatically perform a double test on transitions but do not index the initial failed runs. This would amount to a seamless update for users and would let us bring double testing forward backend-first, without needing UI support initially.

drewpost added the Team:Uptime label on Jun 21, 2023
@andrewvc
Contributor

Have you considered triple testing? The reason I ask is that if 1 fails and 1 succeeds, it could be that the site is unstable. Adding a third test and requiring 2/3 to pass seems more robust to me.

Another question: do we count the failed tests toward any metrics if the check as a whole succeeds? I will say that doing that is a lot more work versus just throwing out the failing test if a subsequent run succeeds.

I would say, at a minimum we should add a new field, something like summary.retest: {up: 1, down: 1} to track what's going on here, but I'm not sure we should index the failing test given the pervasive updates to the Synthetics UI that would require.

That being said, if we do discard that previous test it will require re-engineering how heartbeat talks to ES. Right now it streams data as it goes; if we decide to discard the initial failing run, we'll need to buffer full runs and index them only once we know there's no retest in store. Once we get the product reqs straight we should have a zoom to discuss the details, as there are a number of tradeoffs to balance here.
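
As a rough illustration of what that might look like on a summary document (field names here are hypothetical, not the final Heartbeat schema):

```ts
// Hypothetical shape only — not the final Heartbeat mapping.
interface Summary {
  up: number;
  down: number;
  // The suggestion above: counts across the initial run and the retest.
  retest?: { up: number; down: number };
  // Assumed field for linking a retest back to the run that triggered it
  // (illustrative name, e.g. the check group of the initial failed run).
  triggered_by?: string;
}
```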

@paulb-elastic
Contributor

I'm not sure triple testing would be so useful. The frequency should be set to an appropriate level so that the user is notified in a reasonable amount of time based on their sensitivity to an outage, but the double test at least protects against transient issues.

I'm not sure we should throw away the result either - it's a useful data point, but I understand the idea is that the error itself won't be created until a second test (the retest) fails, and the alert will be triggered by the error.

I guess by metrics, you are referring to availability. Ideally I'm expecting this should be based on the error trigger, but appreciate there may be complexity here. Let's discuss this further.

@paulb-elastic
Contributor

paulb-elastic commented Jun 26, 2023

Some points from a discussion earlier between @andrewvc @drewpost and @paulb-elastic:

  • Heartbeat continues to send full data from all runs, but new schema additions to mark retries vs. failed initial attempt
    • i.e. a distinction between an initial run and a retest
  • UI queries audited and upgraded to ignore failed initial attempt(s)
    • Won't be necessary - test results / history etc. will show all test runs (initial run and retests)
  • Run now UI may need to be enhanced to show retests (needs UX/UI work)
    • If on the edit screen, a retest won't be run
    • If on the Overview page (run test), a retest will be triggered if the initial run is down
  • Add a retest badge / identifier to test history and last 10 runs for retested runs, for example:
    [screenshot]
  • Ensure availability calculation is based on states (not runs)
    • For this implementation, this will still be based on the number of executions (not time based)
    • i.e. same as it is now, but for retests, it only considers the retest (not the initial test)
  • Errors only triggered or created on retest type
  • Add toggle for retest behaviour, default to on
    • If off, then error will be triggered on first test
    • This also ensures older stacks don’t do the retest (important as Kibana won’t know about retest types)
    • In the UI
    • For Project Monitors (global config.ts and per monitor)
  • Leave HB timeout at ~15 minutes, but double the platform timeout
  • Document limit of 15 mins per test in the service
  • Ideally release the change in the service before billing begins

Added after

  • Mechanism for linking the initial run and a retest (i.e. knowing which test triggered a given retest)
    • For potential UI updates in the future
  • Question for @drewpost
    • Should we not retest if there is already an error open for the monitor (i.e. for a permanently down monitor we don't continually double test)?

@drewpost
Author

Question for @drewpost
Should we not retest if there is already an error open for the monitor (i.e. for a permanently down monitor we don't continually double test)

Agreed. It should only be the initial failed test that triggers the retest.
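
Putting the agreed rules together (retest only when a previously up monitor fails, never retest a retest or an already-down monitor, and open the error only when the retest also fails), a minimal sketch with assumed names — not the actual Heartbeat implementation:

```ts
type Status = 'up' | 'down';

interface RunContext {
  previousStatus: Status;  // monitor status before this run
  isRetest: boolean;       // was this run itself triggered as a retest?
  retestEnabled: boolean;  // the toggle discussed above, default true
}

// Sketch of the decision rules agreed in this thread — not the real code.
function shouldRetest(result: Status, ctx: RunContext): boolean {
  if (!ctx.retestEnabled) return false;             // toggle off: no retest
  if (ctx.isRetest) return false;                   // never retest a retest
  if (ctx.previousStatus === 'down') return false;  // already known down: don't keep retesting
  return result === 'down';                         // previously up + failed run → retest
}

// With retesting on, only a failed retest opens the error (and hence the alert);
// with it off, the first failure does. An error already open for a still-down
// monitor simply stays open and is not modelled here.
function shouldOpenError(result: Status, ctx: RunContext): boolean {
  if (result === 'up') return false;
  return ctx.retestEnabled ? ctx.isRetest : true;
}
```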

@vigneshshanmugam
Member

Leave HB timeout at ~15 minutes, but double the platform timeout

I am not able to understand why we would need an increased timeout at the platform level. Asking because if the idea of a retest is to identify flakiness and make the test more robust, an increased timeout is not going to solve the problem.

Plus we should be mindful of retesting as a whole; this could backfire in specific scenarios where the webapp is experiencing an increased surge of traffic and our synthetic monitoring retest feature makes it even worse.

@andrewvc
Contributor

@vigneshshanmugam the reason is that we'd potentially run multiple tests back to back in the same container. Heartbeat would still enforce the timeout normally, but if you had, say, a 10-minute test, two back-to-back runs could need around 20 minutes, so you potentially couldn't fit the retest within a 15-minute platform timeout.

@paulb-elastic
Contributor

Added the details to the main description

@paulb-elastic
Contributor

@drewpost, as well as adding a (retest) description to the status label, @andrewvc also suggested maybe showing the intention of a retest, which makes it clear when one will occur, and when one won't.

For example, when there is a single failure:
[screenshot]

Compared to when something is in permanent error:
[screenshot]

I like this idea; maybe we should consider this as a stretch goal when implementing?

@drewpost
Author

drewpost commented Jun 30, 2023

I like the intent of this suggestion, as well.

What I'm struggling with a bit are the words used and placement.

  • "retest" is an adjective that describes the type of test run, not the state. It feels strange attaching it to the state.
  • "will retest" is unclear from a point-in-time perspective. As a user I'm left wondering if this means it's going to retest again, or if I need to do something to make it retest.

At the risk of going too much into solution mode, I wonder if this should be attached to the date/time string instead of the status.
cc @andrewvc

@paulb-elastic
Contributor

Yeah, that's a fair point, Drew, and tbh we were looking at this from the perspective of not introducing (at this stage) a new column describing the test type.

You're right though, maybe it's too disjointed from the status (which is the most important info there). Putting it alongside the date/time looks something like this (with a couple of different widths to see the impact of wrapping):

[screenshots at three different widths]

@andrewvc
Contributor

andrewvc commented Jul 5, 2023

@paulb-elastic @drewpost the intention of the "will retest" label is to let you know that the failure will be ignored. Maybe it should be "Result ignored due to retest policy, will retest". Thoughts?

@paulb-elastic
Contributor

I'm personally OK with (will retest), especially moving away from the status badge, but keen to hear @drewpost's thoughts.

@henrikno

henrikno commented Jul 7, 2023

I think this could be solved by only alerting if > n locations report failed checks.
It would also handle the case where, for example, the provider is having network issues in asia-northeast1, which if rerun will still be down and generate an alert. But if the US and Europe locations are saying the site is up, we probably don't want to alert.
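
For illustration, a location-quorum rule might look roughly like this (a sketch of the idea only, not how the Uptime rule is actually implemented; the names and threshold semantics are assumptions):

```ts
// Sketch only — alert when more than `minDownLocations` locations report the check down.
function shouldAlert(
  statusByLocation: Record<string, 'up' | 'down'>,
  minDownLocations: number, // the "n" from the comment above
): boolean {
  const down = Object.values(statusByLocation).filter((s) => s === 'down').length;
  return down > minDownLocations;
}

// Example: a provider issue in asia-northeast1 alone would not alert,
// but asia-northeast1 and us-east both down would.
shouldAlert({ 'asia-northeast1': 'down', 'us-east': 'up', 'europe-west': 'up' }, 1);   // false
shouldAlert({ 'asia-northeast1': 'down', 'us-east': 'down', 'europe-west': 'up' }, 1); // true
```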

@andrewvc
Contributor

Thanks @henrikno, that's actually how checks in the Uptime app work (you can even still set them today). I think we do have room for further refinement here. Pinging @drewpost on this as well. ^^

@andrewvc
Contributor

After further discussion we'd like to add one more AC to this: for configs with no double testing config, that is to say ones where none is configured in the UI / Project monitors, we automatically perform a double test on transitions but do not index the initial failed runs. This would amount to a seamless update for users and would let us bring double testing forward backend-first, without needing UI support initially.

andrewvc self-assigned this on Jul 20, 2023
andrewvc added a commit to elastic/beats that referenced this issue Aug 31, 2023
Adds retries to Heartbeat monitors. Part of elastic/synthetics#792

This refactors a ton of code around summarizing events, and cleans up a lot of tech debt as well.
@paulb-elastic
Contributor

paulb-elastic commented Sep 1, 2023

Linking to elastic/beats#36147 for the Heartbeat change
UI work is being addressed via elastic/kibana#162399

@paulb-elastic
Contributor

@andrewvc my understanding is that the behaviour you added in the Addendum of the main description is no longer being done, is that right (cc @drewpost)?

@shahzad31
Contributor

shahzad31 commented Sep 4, 2023

We are going to add a new column to indicate if it's a retest.
In the monitor config there will be a switch with the label:

  • Enable retest on failure

@paulb-elastic
Contributor

This has been released and tested in production, ready for the upcoming 8.11 release
