Adding test-retry plugin #456

Merged · 6 commits · Mar 31, 2022

Conversation

amitgalitz
Member

Signed-off-by: Amit Galitzky [email protected]

Description

Testing out a new Gradle plugin that retries failed tests.

  • Added this to both unit tests and integ tests
  • maxFailures: "The maximum number of test failures that are allowed before retrying is disabled." (set to 10)
  • maxRetries: "The maximum number of times to retry an individual test." (set to 10)
  • getFailOnPassedAfterRetry: "Whether tests that initially fail and then pass on retry should fail the task." (set to false, since if a test passes once that's good enough to pass the build)

I will be re-running the workflows a few times on this PR to try this out and potentially adjust the maxFailures/maxRetries settings. https://blog.gradle.org/gradle-flaky-test-retry-plugin
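For reference, a minimal sketch of the kind of configuration described above. The plugin id comes from the test-retry documentation; the version, task name, and exact placement in build.gradle are assumptions for illustration, not taken verbatim from this PR's diff.

plugins {
    // Gradle test-retry plugin; the version shown here is only illustrative
    id 'org.gradle.test-retry' version '1.3.1'
}

test {
    retry {
        // Retry an individual failing test at most this many times.
        maxRetries = 10
        // Once this many tests have failed in a run, retrying is disabled.
        maxFailures = 10
        // A test that fails first and passes on retry does not fail the build.
        failOnPassedAfterRetry = false
    }
}

The same retry block can be added to the integration test task as well, as the description notes.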

Related Issues

#451

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #456 (1eb1781) into main (fa00555) will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #456      +/-   ##
============================================
- Coverage     78.18%   78.14%   -0.04%     
+ Complexity     4162     4159       -3     
============================================
  Files           296      296              
  Lines         17659    17659              
  Branches       1879     1879              
============================================
- Hits          13807    13800       -7     
- Misses         2958     2963       +5     
- Partials        894      896       +2     
Flag Coverage Δ
plugin 78.14% <ø> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rch/ad/transport/ForwardADTaskTransportAction.java 94.06% <0.00%> (-3.39%) ⬇️
...ain/java/org/opensearch/ad/model/ModelProfile.java 70.90% <0.00%> (-1.82%) ⬇️
...java/org/opensearch/ad/task/ADBatchTaskRunner.java 81.76% <0.00%> (-0.46%) ⬇️
...opensearch/ad/indices/AnomalyDetectionIndices.java 71.99% <0.00%> (-0.19%) ⬇️
.../main/java/org/opensearch/ad/ml/CheckpointDao.java 70.19% <0.00%> (+0.64%) ⬆️

@amitgalitz amitgalitz marked this pull request as ready for review March 30, 2022 17:22
@amitgalitz amitgalitz requested a review from a team March 30, 2022 17:22
@ohltyler ohltyler linked an issue Mar 30, 2022 that may be closed by this pull request
if (isCiServer) {
    failOnPassedAfterRetry = false
    maxRetries = 6
    maxFailures = 10
}
Collaborator

Haven't used test-retry, just asking: if maxRetries is 6, shouldn't the max failures not exceed 7? Why set maxFailures = 10?

Member
@ohltyler ohltyler Mar 30, 2022

From the plugin details, maxFailures is about how many tests fail per run. If that number is large, it's likely there are other issues causing the tests to fail (e.g. the cluster isn't coming up), rather than just flaky tests.
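To make the distinction concrete, a small hedged sketch with illustrative values: maxRetries caps retries per individual test, while maxFailures acts as a circuit breaker for the whole run.

retry {
    maxRetries = 6    // each failing test is re-run at most 6 times
    maxFailures = 10  // once 10 tests have failed, retrying is disabled for the
                      // rest of the run, since widespread failures usually point
                      // to a real problem (e.g. the test cluster never came up)
                      // rather than flakiness
}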

Member
@ohltyler ohltyler left a comment

LGTM! Just wondering: I feel this can cause flaky tests to stick around longer without being fixed. Is there still a way to get a report of the tests that failed and were retried? Ideally those are still brought to our attention so they can be fixed; that way, if the test suite is ever run in a way or in an environment where tests aren't automatically retried, the failures aren't a surprise.

@amitgalitz
Member Author

amitgalitz commented Mar 31, 2022

LGTM! Just wondering: I feel this can cause flaky tests to stick around longer without being fixed. Is there still a way to get a report of the tests that failed and were retried? Ideally those are still brought to our attention so they can be fixed; that way, if the test suite is ever run in a way or in an environment where tests aren't automatically retried, the failures aren't a surprise.

That's a valid concern. To mitigate it, the retry plugin will only be active when running through GitHub Actions, not when building or testing locally, so developers who build locally will still encounter the flakiness first. However, there are obviously times when we don't run the whole ./gradlew build process locally before making a PR. If tests fail and then pass on retry, the failures still show up in the Action logs if developers check them, but it also makes sense that a developer who sees all checks passing won't check the logs. OpenSearch has a bot that produces the reports and lets you see all failed tests in a nice format, like we can locally. OpenSearch core has also just implemented gradle test-retry and validated that the flaky tests are visible in their report (same as in our logs): opensearch-project/OpenSearch#2638 (comment) (if you click "report" it will download a zip that includes the nicely formatted test results, as seen in the screenshots in that PR).

We can potentially decide to do something like this even just for tests. I'll open an issue and look at this later.

@@ -148,8 +149,15 @@ def _numNodes = findProperty('numNodes') as Integer ?: 1

def opensearch_tmp_dir = rootProject.file('build/private/opensearch_tmp').absoluteFile
opensearch_tmp_dir.mkdirs()

boolean isCiServer = System.getenv().containsKey("CI")
Member

How did you come up with this value? Will it only work on GitHub CI runners, or will it work on Jenkins hosts too? Ideally it runs in both so that we don't get surprised by test failures during infra builds.

Member Author

This check was recommended by the https://github.com/gradle/test-retry-gradle-plugin documentation, and the Gradle documentation recommends this line for running something only in CI. I tested it and it works on GitHub Actions; I am not sure about Jenkins. I also saw if (BuildParams.isCi() == true) recommended on the OpenSearch core PR, but it wasn't used in the end. I was thinking it's maybe okay if we don't retry on Jenkins, as that runs on a larger instance and I didn't think the AD backend was as flaky there.
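For context, a sketch of the pattern under discussion. GitHub Actions sets CI=true in its runner environment, which is what the containsKey check detects; whether other CI systems export that variable is not assumed here, and the surrounding test block is an illustrative placement, not a copy of this PR's diff.

// True when the CI environment variable is present (GitHub Actions sets CI=true);
// false for a plain local ./gradlew run, so local builds still surface flaky tests.
boolean isCiServer = System.getenv().containsKey("CI")

test {
    retry {
        if (isCiServer) {
            failOnPassedAfterRetry = false
            maxRetries = 6
            maxFailures = 10
        }
    }
}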

Member

Fair enough. I think at least adding it for GitHub CI is helpful, since that usually runs on lower-provisioned hosts compared to a local machine or Jenkins, like you mentioned. And if a test is still flaky, it will be exposed in those places.

@ohltyler
Member

Got it - yeah, I think as long as this still fails locally and we can view the reports from CI, it should be ok. Can you help add the bot workflow that comments the zipped test output?

@ohltyler
Member

Got it - yeah, I think as long as this still fails locally and we can view the reports from CI, it should be ok. Can you help add the bot workflow that comments the zipped test output?

I see you've created #480 to track, sounds good

Member
@ohltyler ohltyler left a comment

LGTM!

@amitgalitz amitgalitz merged commit 2766a8a into opensearch-project:main Mar 31, 2022
@dblock
Member

dblock commented Apr 4, 2022

Nice to see plugins adopt this for flaky tests.

We had some discussions in opensearch-project/OpenSearch#2638 about default values. I would take a look and lower the number of retries to catch flaky tests, and consider removing the "is CI" check to match the CI and developer experience; we had problems with this in core.
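A hedged illustration of that suggestion; the retry count here is only an example, not a value agreed on in this thread.

test {
    retry {
        // Applied unconditionally so CI and local developer runs behave the same.
        failOnPassedAfterRetry = false
        maxRetries = 2    // a lower cap surfaces genuinely flaky tests sooner
        maxFailures = 10
    }
}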

kaituo pushed a commit to kaituo/anomaly-detection-1 that referenced this pull request Apr 8, 2022
@kaituo kaituo mentioned this pull request Apr 8, 2022
@amitgalitz amitgalitz added the infra Changes to infrastructure, testing, CI/CD, pipelines, etc. label Apr 20, 2022
kaituo added a commit that referenced this pull request Apr 22, 2022
* Fix restart HCAD detector bug (#460)

* Fix restart HCAD detector bug

* Adding test-retry plugin (#456)

* backport cve fix and improve restart IT

To prevent repeatedly cold starting a model due to sparse data, HCAD has a cache that remembers that we have done a cold start for a model. A second attempt to cold start has to wait for 60 detector intervals. Previously, when stopping a detector, I forgot to clean the cache, so the cache still remembered the model and cold start would not be retried until that time had passed. This PR fixes the bug by cleaning the cache when stopping a detector.

Testing done:
1. Added unit and integration tests.
2. Manually reproduced the issue and verified the fix.
Labels
infra Changes to infrastructure, testing, CI/CD, pipelines, etc.
Successfully merging this pull request may close this issue: Updating how we run integ tests.