Retry unit tests upon failure #9236
Conversation
Signed-off-by: Juraci Paixão Kröhling <[email protected]>
I am not sure this is a good approach. Instead I would prefer to disable flaky tests and ask owners to fix them.
It's rare to find a PR that isn't affected by a flaky test nowadays. This PR is just meant to alleviate that pain a bit. We can remove it once we feel confident with our unit tests again.
I actually started with that, based on the most common flaky tests, but I soon realized that it's not easy to determine when to disable a test (3 "it happened again" comments? older than a week?), and it's even harder to determine/remember when to enable it back. We have been asking code owners to fix them, but some aren't that easy. A few tests are failing due to circumstances outside of our control, like the cache reflector in the Kubernetes cluster receiver, or the 8k limit for goroutines, which will be removed in newer Go versions. The WAL tests in the Prometheus (remote write?) exporter also seemed tricky to me.
Perhaps as a very short-term, temporary measure it is OK, but I think we need to fix the underlying tests. As for "when to disable": in a previous project I automated escalation of priority, and any repeatedly failing test would quickly become a P1 that needs to either be fixed immediately or deleted if there is no known way to fix it. Note: a retry of an individual test may be a valid way to fix it if the resource it depends on is known to be unstable. However, retrying the entire test suite is a poor practice; it is going to mask tests that need fixing and will increase the build time.
+1, no question about it
Agree about that as well, but restarting the whole CI is basically what we are doing manually today, so build times might increase, but not as much as with the manual re-runs.
I agree with Tigran. I'd rather disable flaky tests than retry them.
OK, I don't mind merging this if other approvers are on board and we plan to fix flaky tests quickly and revert this change.
This doesn't seem like a good approach to me. I understand the intent, though; I've had my PR retried like 10 times by Juraci once. I still think we should aggressively disable flaky tests rather than run them multiple times until they succeed. I've started looking into the Prometheus flaky tests but, as Juraci pointed out, they're not trivial to fix (#9124 and #9119). Good to get some more eyes on them :)
gotestsum can re-run only failed tests and it should be quite easy to use it on this repository. I would suggest giving this a try instead of re-running every single test.
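For reference, a minimal sketch of what that could look like. The flag names come from gotestsum's documented options, but the specific values (retry count, failure cap) are assumptions, not something proposed in this PR:

```sh
# Re-run only the tests that failed, up to 2 extra attempts each,
# and skip retries entirely if more than 10 tests fail (likely a real breakage, not flakiness).
gotestsum --rerun-fails=2 --rerun-fails-max-failures=10 --packages="./..." -- -race -count=1
```

Everything after `--` is the usual pass-through to `go test`.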
I am against using tools that mask randomly failing tests. A randomly failing test can be an indication of a real bug. Using tools like this is going to hide the bug. I suggest that we adopt a policy like this:
Again, as a temporary measure I am OK with retrying the tests until we identify all flaky tests and fix/delete them. However, I do not think we should invest time in tooling around retrying and I think we should remove the retrying as soon as the codebase is returned to a stable state. Instead, I would prefer to invest time into tooling to surface flaky tests, automatically file a GitHub issue, auto-assign to the test owner, automatically escalate the priority of the issue on repeated failures, etc. cc @open-telemetry/collector-contrib-approvers
I think @tigrannajaryan's plan makes a lot of sense and it's something that can be done whether or not we merge this PR. I still want to make a case for retrying flaky tests, even if we adopt the approach that Tigran proposes. Flaky tests cause pain to all contributors on this repository, since one has to re-run tests every time a flaky test is encountered for a PR to be considered mergeable, and tests take a long time. I don't think there are enough people working on this repository today for the time from "flaky test appears" to "flaky test is fixed or removed" to be short enough not to disrupt everyone else: some flakes are hard to fix, and other times finding the right person to work on a flake takes a long time. Retrying failing unit tests is a cheap way to reduce the number of flakes and lets us focus on the worst offenders instead of causing pain to every contributor. We can potentially still detect flakes that were averted by a tool like gotestsum and start working on them before they become more troublesome.
Perhaps we can implement individual test retrying and use it permanently, provided that we have a few safeguards in place:
This is what bothers me the most about this PR: we'd lose the information about which tests failed and were retried. If we had a way to enforce that flaky tests get fixed before other work is done, I don't think we'd need this PR or gotestsum, but experience has shown that we don't typically give that much attention to fixing flaky tests (including myself). Just disabling the test without a concrete plan to enable it back isn't a solution either. I think this is where @tigrannajaryan's policy would work.
Who's going to be the one checking that?
Ideally, automated! :-)
This will be shown on
I think this is also doable with the
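On the visibility concern above: gotestsum can also write a report of the tests it had to rerun. A minimal sketch, assuming the report file would then be published as a CI artifact or echoed into the job log (the file name and the publishing step are hypothetical):

```sh
# Write the list of tests that were rerun to a file, so flakes stay visible
# even when the retried run eventually passes.
gotestsum --rerun-fails --packages="./..." --rerun-fails-report=rerun-report.txt -- -count=1
```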
This PR was marked stale due to lack of activity. It will be closed in 14 days.
I'm closing this, as I think the PR served its purpose. Right now, the codebase is more stable than when this was opened, and we have some great options listed here in case we need them again.
This change attempts to reduce build failures due to flaky tests by re-running the tests up to 5 times upon failure. This is not a real solution and it will increase build times for failures, but it automates, in a smarter way, what we are already doing manually anyway.
When the tests fail, this is the expected output:
When they work, this is the expected output (using the extension/jaegerremotesampling as example):
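The exact recipe lives in the diff, but the idea amounts to roughly the following shell loop (a sketch of the approach described above, not the literal change; the 5-attempt limit matches the description):

```sh
# Run the tests up to 5 times, stopping at the first successful run.
for attempt in 1 2 3 4 5; do
  if go test -race ./...; then
    exit 0
  fi
  echo "tests failed on attempt ${attempt} of 5, retrying..."
done
exit 1
```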
Signed-off-by: Juraci Paixão Kröhling [email protected]