
Retry unit tests upon failure #9236

Conversation

@jpkrohling commented Apr 12, 2022
Member

This change attempts to reduce build failures caused by flaky tests by re-running the tests up to 5 times upon failure. This is not a real solution, and it will increase build times when tests fail, but it automates, in a smarter way, what we are already doing manually anyway.
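For reference, the retry loop itself is plain shell; a rough sketch of how it could be embedded in a Makefile recipe is below (inside a Makefile the shell's $i has to be escaped as $$i, and the {1..5} brace expansion assumes bash rather than POSIX sh, so the make shell may need to be set accordingly — the actual target in this PR may differ in details):

.PHONY: test
test:
	for i in {1..5}; do \
		go test -race -v -timeout 300s --tags="" ./... && break || echo "Tests failed (attempt #$$i)"; \
	done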

When the tests fail, this is the expected output:

$ make test
for i in {1..5}; do \
	go test -race -v -timeout 300s --tags="" ./... && break || echo "Tests failed (attempt #$i)"; \
done
go: updates to go.mod needed; to update it:
	go mod tidy
Tests failed (attempt #1)
go: updates to go.mod needed; to update it:
	go mod tidy
Tests failed (attempt #2)
go: updates to go.mod needed; to update it:
	go mod tidy
Tests failed (attempt #3)
go: updates to go.mod needed; to update it:
	go mod tidy
Tests failed (attempt #4)
go: updates to go.mod needed; to update it:
	go mod tidy
Tests failed (attempt #5)

When they work, this is the expected output (using extension/jaegerremotesampling as an example):

$ make test
for i in {1..5}; do \
	go test -race -v -timeout 300s --tags="" ./... && break || echo "Tests failed (attempt #$i)"; \
done
=== RUN   TestLoadConfig
--- PASS: TestLoadConfig (0.00s)
=== RUN   TestValidate
=== RUN   TestValidate/no_receiving_protocols
=== RUN   TestValidate/no_sources
=== RUN   TestValidate/too_many_sources
--- PASS: TestValidate (0.00s)
    --- PASS: TestValidate/no_receiving_protocols (0.00s)
    --- PASS: TestValidate/no_sources (0.00s)
    --- PASS: TestValidate/too_many_sources (0.00s)
=== RUN   TestNewExtension
--- PASS: TestNewExtension (0.00s)
=== RUN   TestStartAndShutdownLocalFile
--- PASS: TestStartAndShutdownLocalFile (0.00s)
=== RUN   TestStartAndShutdownRemote
--- PASS: TestStartAndShutdownRemote (0.00s)
=== RUN   TestCreateDefaultConfig
--- PASS: TestCreateDefaultConfig (0.00s)
=== RUN   TestCreateExtension
--- PASS: TestCreateExtension (0.00s)
=== RUN   TestNewFactory
--- PASS: TestNewFactory (0.00s)
PASS
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/extension/jaegerremotesampling	0.042s
=== RUN   TestMissingClientConfigManager
--- PASS: TestMissingClientConfigManager (0.00s)
=== RUN   TestStartAndStop
--- PASS: TestStartAndStop (0.00s)
=== RUN   TestEndpointsAreWired
=== RUN   TestEndpointsAreWired/legacy
=== RUN   TestEndpointsAreWired/new
--- PASS: TestEndpointsAreWired (0.01s)
    --- PASS: TestEndpointsAreWired/legacy (0.00s)
    --- PASS: TestEndpointsAreWired/new (0.00s)
=== RUN   TestServiceNameIsRequired
--- PASS: TestServiceNameIsRequired (0.00s)
=== RUN   TestErrorFromClientConfigManager
--- PASS: TestErrorFromClientConfigManager (0.00s)
PASS
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/extension/jaegerremotesampling/internal	0.041s
?   	github.com/open-telemetry/opentelemetry-collector-contrib/extension/jaegerremotesampling/local	[no test files]

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
@jpkrohling requested review from a team and djaglowski April 12, 2022 17:46
@jpkrohling added the Skip Changelog label (PRs that do not require a CHANGELOG.md entry) Apr 12, 2022
@tigrannajaryan
Member

I am not sure this is a good approach. Instead I would prefer to disable flaky tests and ask owners to fix them.

@jpkrohling
Member Author

I am not sure this is a good approach.

It's rare to find a PR that isn't affected by a flaky test nowadays. This PR is just meant to alleviate that pain a bit. We can remove it once we feel confident in our unit tests again.

Instead I would prefer to disable flaky tests and ask owners to fix them.

I actually started with that, based on the most common flaky tests, but I soon realized that it's not easy to determine when to disable a test (3 "it happened again" comments? older than a week?), and it's even harder to determine when to re-enable it, or to remember to do so.

We have been asking code owners to fix them, but some aren't that easy. A few tests are failing due to circumstances outside of our control, like the cache reflector in the Kubernetes cluster receiver, or the 8k limit for goroutines, which will be removed in newer Go versions. The WAL tests in the Prometheus (remote write?) exporter also seemed tricky to me.

@tigrannajaryan
Member

Perhaps as a very short-term, temporary measure it is OK, but I think we need to fix the underlying tests.

As for "when to disable": in a previous project I automated the escalation of priority, and any repeatedly failing test would quickly become a P1 that needs to either be fixed immediately or be deleted if there is no known way to fix it.

Note: a retry of an individual test may be a valid way to fix it if the resource it depends on is known to be unstable. However, retrying the entire test suite is a poor practice and is going to mask tests that need fixing and will increase the build time.
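To make the distinction concrete, retrying a single known-flaky test could be as simple as the sketch below (TestStartAndStop is used only because it appears in the output above; in practice one might instead build the retry around the unstable resource inside the test itself):

for i in 1 2 3; do \
	go test -race -run '^TestStartAndStop$' ./extension/jaegerremotesampling/... && break || echo "Flaky test failed (attempt #$i)"; \
done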

@jpkrohling
Member Author

Perhaps as a very short-term, temporary measure it is OK, but I think we need to fix the underlying tests.

+1, no question about it

However, retrying the entire test suite is a poor practice and is going to mask tests that need fixing and will increase the build time

Agreed on that as well, but restarting the whole CI run is basically what we are doing manually today, so build time might increase, but not as much as with the manual re-runs.

@dmitryax
Member

I agree with Tigran. I'd rather disable flaky tests than retry them.

@tigrannajaryan
Member

OK, I don't mind merging this if other approvers are on board and we plan to fix flaky tests quickly and revert this change.

@gouthamve
Member

This doesn't seem like a good approach to me. I understand the intent, though; I've had my PR retried something like 10 times by Juraci once. I still think we should aggressively disable flaky tests rather than run them multiple times until they succeed.

I've started looking into the Prometheus flaky tests, but as Juraci pointed out, they're not trivial to fix (#9124 and #9119). It would be good to get some more eyes on them :)

@mx-psi left a comment
Member

gotestsum can re-run only failed tests and it should be quite easy to use it on this repository. I would suggest giving this a try instead of re-running every single test
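As a rough sketch of what that could look like (the exact flags should be double-checked against gotestsum's documentation):

$ go install gotest.tools/gotestsum@latest
$ gotestsum --rerun-fails=2 --packages="./..." -- -race -timeout 300s --tags=""

With --rerun-fails, gotestsum is meant to re-run only the tests that failed, up to the given number of attempts, instead of the whole suite.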

@tigrannajaryan
Member

gotestsum can re-run only failed tests and it should be quite easy to use it on this repository. I would suggest giving this a try instead of re-running every single test

I am against using tools that mask randomly failing tests. A randomly failing test can be an indication of a real bug. Using tools like this is going to hide the bug.

I suggest that we adopt a policy like this:

  • A failing test is a bug, whether it fails always or sporadically.
  • The priority of such bug is proportional to failure frequency.
  • A sporadically failing test must be fixed by the owner or must be deleted (deleting can be a valid approach if fixing is expensive while the value of the test is low).
  • The higher the priority of the bug the shorter the time period given to the owner to fix it (we can define precise ETAs for each priority if we want to be strict). Owners who do not meet the deadline will have their test deleted and/or the component will be subject to our usual "non-maintained" policy.
  • If the fix is not easy and the owner needs more time to work on a fix than we can afford to suffer from instability, we may temporarily skip the test until the owner fixes it. There should be a bug recorded as a github issue for each such skipped test (remember, the bug may be in the actual code, not in the test).

Again, as a temporary measure I am OK with retrying the tests until we identify all flaky tests and fix or delete them. However, I do not think we should invest time in tooling around retrying, and I think we should remove the retrying as soon as the codebase is back in a stable state.

Instead, I would prefer to invest time into tooling to surface flaky tests, automatically file a github issue, auto-assign to test owner, automatically escalate priority of the issue on repeated failures, etc.

cc @open-telemetry/collector-contrib-approvers

@mx-psi commented Apr 13, 2022
Member

I think @tigrannajaryan's plan makes a lot of sense and it's something that can be done whether or not we merge this PR. I want to make a case for retrying flaky tests, even if we adopt the approach that Tigran proposes, since I think it still makes sense to do something like this.

Flaky tests cause pain to all contributors on this repository: one has to re-run the tests every time a flaky test is encountered for a PR to be considered mergeable, and the tests take a long time. I don't think there are enough people working on this repository today for the time from "flaky test appears" to "flaky test is fixed or removed" to be short enough not to disrupt everyone else: some flakes are hard to fix, and other times finding the right person to work on a flake takes a long time. Retrying failing unit tests is a cheap solution that reduces the number of flakes and allows us to focus on the worst offenders, instead of causing pain to every contributor. With a tool like gotestsum we can still detect the flakes that were averted and start working on them before they become more troublesome.

@tigrannajaryan
Member

Perhaps we can implement individual test retrying and use it permanently, provided that we have a few safeguards in place:

  • There is a clear visibility that the test failed and was retried to pass.
  • A PR with a failed/retried/passed test must be explicitly labelled as "Skip-Failed-Tests" before it is allowed to be merged. An exception from this may be the known flaky tests for which there is already a github issue opened.
  • A failed/retried test on main or release branch is treated as a bug. A github issue is automatically created, assigned to the owner and the priority of the bug escalated with every additional failure.

@jpkrohling
Member Author

There is a clear visibility that the test failed and was retried to pass.

This is what bothers me the most about this PR: we'd lose the information about which tests failed and were retried. If we had a way to enforce that flaky tests are fixed before other work is done, I don't think we'd need this PR or gotestsum, but experience has shown that we don't typically give that much attention to fixing flaky tests (myself included). Just disabling a test without a concrete plan to re-enable it isn't a solution either. I think this is where @tigrannajaryan's policy would work.

Owners who do not meet the deadline will have their test deleted and/or the component will be subject to our usual "non-maintained" policy.

Who's going to be the one checking that?

@tigrannajaryan
Member

Owners who do not meet the deadline will have their test deleted and/or the component will be subject to our usual "non-maintained" policy.

Who's going to be the one checking that?

Ideally, automated! :-)

@mx-psi commented Apr 14, 2022
Member

Perhaps we can implement individual test retrying and use it permanently, provided that we have a few safeguards in place:

* There is a clear visibility that the test failed and was retried to pass.

This will be shown in gotestsum's output; would that be enough? (Together with the label below?)

* A PR with a failed/retried/passed test must be explicitly labelled as "Skip-Failed-Tests" before it is allowed to be merged. An exception from this may be the known flaky tests for which there is already a github issue opened.

gotestsum can use test2json under the hood to provide machine-readable test output. From this it should be easy to write a GitHub Action that does this. We have something like this at Datadog, which works well enough for our purposes.

* A failed/retried test on main or release branch is treated as a bug. A github issue is automatically created, assigned to the owner and the priority of the bug escalated with every additional failure.

I think this is also doable with the test2json output.
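As a rough sketch of how this could be wired up in CI (flag names are taken from gotestsum's interface as I remember it, so they should be verified; the label/issue automation itself would live in a separate workflow step):

gotestsum --rerun-fails=2 --packages="./..." \
	--jsonfile=test-output.json \
	--rerun-fails-report=rerun-report.txt \
	-- -race -timeout 300s

# Any test that needed a rerun ends up in the report; a follow-up step could
# require the "Skip-Failed-Tests" label or open an issue based on its contents.
if [ -s rerun-report.txt ]; then
	echo "Flaky tests detected:"
	cat rerun-report.txt
fi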

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions bot added the Stale label Apr 29, 2022
@jpkrohling
Member Author

I'm closing this, as I think the PR served its purpose. Right now, the code base is more stable than when this was opened, and we have some great options listed here in case we need them again.

@jpkrohling closed this Apr 29, 2022