Implement Automatic Double Testing On Test Run Failure #792
Comments
Have you considered triple testing? The reason I ask is that if one run fails and one succeeds, it could be that the site is unstable. Adding a third test and requiring 2/3 to pass seems more robust to me. Another question: do we count the failed tests toward any metrics if the check as a whole succeeds? I will say that doing that is a lot more work versus just throwing out the failing test if a subsequent run succeeds. I would say, at a minimum, we should add a new field, something like […]. That being said, if we do discard that previous test it will require re-engineering how Heartbeat talks to ES. Right now it streams data as it goes; if we decide to discard the initial failing run, we'll need to buffer full runs and only index them once we know there's no retest in store. Once we get the product reqs straight we should have a Zoom to discuss the details, as there are a number of tradeoffs to balance here.
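As a rough illustration of that buffering concern (hypothetical names and types only; Heartbeat's real pipeline is written in Go and streams documents as it goes), the change would mean holding a run's documents until we know whether a retest will supersede it:

```ts
// Hypothetical sketch of buffering a run's events until the retest decision
// is known. Types and names are illustrative, not Heartbeat's actual pipeline.
type MonitorEvent = { monitorId: string; runId: string; payload: unknown };

class RunBuffer {
  private events: MonitorEvent[] = [];

  add(event: MonitorEvent): void {
    this.events.push(event);
  }

  // Called once the run's fate is known: either the run stands (index its
  // events) or it was a failed first attempt superseded by a retest (discard).
  async flush(
    index: (events: MonitorEvent[]) => Promise<void>,
    discard: boolean
  ): Promise<void> {
    const buffered = this.events;
    this.events = [];
    if (!discard && buffered.length > 0) {
      await index(buffered);
    }
  }
}
```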
I'm not sure triple testing would be so useful. The frequency should be set to a level where the user is notified in a reasonable amount of time based on their sensitivity to an outage, but the double test at least protects against transient issues. I'm not sure we should throw away the result either; it's a useful data point, but I understand the idea is that the error itself won't be created until a second (retest) run fails, and the alert will be triggered by the error. I guess by metrics you are referring to availability. Ideally I'm expecting this to be based on the error trigger, but I appreciate there may be complexity here. Let's discuss this further.
Some points from a discussion earlier between @andrewvc, @drewpost and @paulb-elastic: […]
Added after […]
Agreed. It should only be the initial failed test that triggers the re-test.
I am not able to understand why we would need an increased timeout at the platform level. Asking because, if the idea of a retest is to identify flakiness and make the test more robust, an increased timeout is not going to solve the problem. Plus, we should be mindful of retesting as a whole; this could backfire in scenarios where the webapp is experiencing an increased surge of traffic and our synthetic monitoring retest feature makes it even worse.
@vigneshshanmugam the reason is that we'd potentially run multiple tests back to back in the same container. Heartbeat would still enforce the timeout normally, but if you had, say, a 10-minute test, you potentially couldn't fit a retest within a 15-minute limit.
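To make that constraint concrete, here is a rough back-of-the-envelope helper (illustrative names only, not a real platform setting): with back-to-back attempts in one container, the budget has to cover both attempts, so a 10-minute test cannot fit a retest inside a 15-minute limit.

```ts
// Illustrative arithmetic only; the function and parameter names below are
// not real platform settings.
function requiredContainerBudgetMinutes(
  perTestTimeoutMinutes: number,
  attempts: number,
  overheadMinutes = 1
): number {
  // Each attempt gets the full per-test timeout, plus shared setup/teardown.
  return attempts * perTestTimeoutMinutes + overheadMinutes;
}

// A 10-minute test with one automatic retest needs roughly 21 minutes of
// container time, which is why a 15-minute platform limit would be too tight.
console.log(requiredContainerBudgetMinutes(10, 2)); // 21
```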
Added the details to the main description.
@drewpost, as well as adding a […]
For example, when there is a single failure: [screenshot]
To when something is in permanent error: [screenshot]
I like this idea, maybe we should consider this as a stretch goal when implementing?
I like the intent of this suggestion, as well. What I'm struggling with a bit are the words used and placement.
At the risk of going too much into solution mode, I wonder if this should be attached to the date/time string instead of the status. |
Yeah, that's a fair point, Drew, and tbh we were looking at this from the perspective of not introducing (at this stage) a new column describing the test type. You're right though, maybe it's too disjointed from the status (which is the most important info there). Putting it alongside the date/time looks something like this (with a couple of different widths to see the impact of wrapping): [screenshots]
@paulb-elastic @drewpost the intention of the "will retest" label is to let you know that the failure will be ignored. Maybe it should be […]
I'm personally ok with (will retest), especially moving away from the status badge, but keen to hear @drewpost's thoughts.
I think this could be solved by only alerting if > n locations report failed checks.
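A tiny sketch of that suggestion (hypothetical names, not an existing alert rule): the alert condition becomes a threshold over failing locations rather than any single failed check.

```ts
// Hypothetical predicate: alert only when more than `minFailedLocations`
// locations report a down check for the same monitor.
type LocationStatus = { location: string; status: 'up' | 'down' };

function shouldAlert(statuses: LocationStatus[], minFailedLocations: number): boolean {
  const failed = statuses.filter((s) => s.status === 'down').length;
  return failed > minFailedLocations;
}

// With three locations and minFailedLocations = 1, a single flaky location
// stays quiet, while two or more failing locations trigger the alert.
```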
After further discussion we'd like to add one more AC to this: for configs with no double-testing setting, that is, ones where none is configured in the UI or Project monitors, we automatically perform a double test on status transitions but do not index the initial failed runs. This would amount to a seamless update for users and would let us bring double testing forward backend-first, without needing UI support initially.
Adds retries to Heartbeat monitors. Part of elastic/synthetics#792. This refactors a ton of code around summarizing events and cleans up a lot of tech debt as well.
Linking to elastic/beats#36147 for the Heartbeat change.
We are going to add a new column to indicate if it's a retest.
This has been released and tested in production, ready for the upcoming 8.11 release |
Synthetic Monitors are considered the source of truth for SREs and Ops Engineers. When an error is created and an alert is sent, what the product is saying is that there is an active issue that needs dealing with. Synthetic monitoring removes the variability of browser, hardware and connection and should only be highlighting genuine issues with an application.
The reality of the public internet, however, means that sometimes tests will fail for totally transient reasons. Our current implementation of creating an error on a single failed test means that users are experiencing a higher-than-acceptable level of false alerts. This is critical to resolve. In order to make our product more reliable, we want to implement automatic double testing. This involves: […]
More implementation details:
synthetics.config.ts
and per monitor)retest: true
)retest: true
if it's on a supported version of Kibana for this configuration, or if it will just be ignored by older Kibana versions (to ensure behaviour doesn't change for older versions)Ensure there is a follow up release note and/or blog that explains the new behaviour, how to disable it, and how retest, errors and alerts relate to each other
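For illustration, a sketch of how the proposed project-level option might look in `synthetics.config.ts`. It uses the `retest: true` flag name exactly as written in this issue; the option name, type, and placement in the shipped release may differ.

```ts
// synthetics.config.ts (sketch). The `retest` flag below is the name proposed
// in this issue, not necessarily the shipped option name, so no config type
// from @elastic/synthetics is applied here.
const config = {
  monitor: {
    schedule: 10,
    // Proposed global default: run one automatic retest before a failure
    // is surfaced as an error/alert. A per-monitor override would set the
    // same flag in that monitor's own configuration.
    retest: true,
  },
};

export default config;
```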
Addendum
For configs with no double-testing setting, that is, ones where none is configured in the UI or Project monitors, we automatically perform a double test on status transitions but do not index the initial failed runs. This amounts to a seamless update for users and lets us bring double testing forward backend-first, without needing UI support initially.
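A minimal sketch of that default behaviour, assuming hypothetical names and types (the real change lives in Heartbeat, elastic/beats#36147, and is written in Go): retest only on an up-to-down transition and index only the attempt that stands.

```ts
type Status = 'up' | 'down';

interface RunResult {
  status: Status;
  documents: unknown[];
}

// Sketch: when no retest setting is configured, automatically double test on
// an up -> down transition and index only the final attempt.
async function runWithDefaultRetest(
  previousStatus: Status,
  runOnce: () => Promise<RunResult>,
  index: (result: RunResult) => Promise<void>
): Promise<Status> {
  const first = await runOnce();

  // Only a transition from up to down triggers the automatic double test.
  if (previousStatus === 'up' && first.status === 'down') {
    const second = await runOnce();
    // The initial failed run is discarded (never indexed); only the retest
    // outcome is recorded, so a transient blip does not become an error.
    await index(second);
    return second.status;
  }

  await index(first);
  return first.status;
}
```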