Adding test-retry plugin #456

Merged · 6 commits · Mar 31, 2022

Conversation

amitgalitz
Member

Signed-off-by: Amit Galitzky [email protected]

Description

Testing out a new Gradle plugin that retries failed tests.

  • Added this to both unit tests and integ tests
  • maxFailures: "The maximum number of test failures that are allowed before retrying is disabled." (set to 10)
  • maxRetries: "The maximum number of times to retry an individual test." (set to 10)
  • getFailOnPassedAfterRetry: "Whether tests that initially fail and then pass on retry should fail the task." (set to false, since if a test passes once that's good enough to pass the build)

I will be re-running the workflows a few times on this PR to try this out and potentially adjust the maxFailures/maxRetries settings. https://blog.gradle.org/gradle-flaky-test-retry-plugin
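For reference, a minimal sketch of the kind of configuration described above. The plugin id comes from the test-retry documentation; the version, task name, and exact placement in build.gradle are assumptions for illustration, not taken verbatim from this PR's diff.

plugins {
    // Gradle test-retry plugin; the version shown here is only illustrative
    id 'org.gradle.test-retry' version '1.3.1'
}

test {
    retry {
        // Retry an individual failing test at most this many times.
        maxRetries = 10
        // Once this many tests have failed in a run, retrying is disabled.
        maxFailures = 10
        // A test that fails first and passes on retry does not fail the build.
        failOnPassedAfterRetry = false
    }
}

The same retry block can be added to the integration test task as well, as the description notes.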

Related Issues

#451

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #456 (1eb1781) into main (fa00555) will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #456      +/-   ##
============================================
- Coverage     78.18%   78.14%   -0.04%     
+ Complexity     4162     4159       -3     
============================================
  Files           296      296              
  Lines         17659    17659              
  Branches       1879     1879              
============================================
- Hits          13807    13800       -7     
- Misses         2958     2963       +5     
- Partials        894      896       +2     
Flag Coverage Δ
plugin 78.14% <ø> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rch/ad/transport/ForwardADTaskTransportAction.java 94.06% <0.00%> (-3.39%) ⬇️
...ain/java/org/opensearch/ad/model/ModelProfile.java 70.90% <0.00%> (-1.82%) ⬇️
...java/org/opensearch/ad/task/ADBatchTaskRunner.java 81.76% <0.00%> (-0.46%) ⬇️
...opensearch/ad/indices/AnomalyDetectionIndices.java 71.99% <0.00%> (-0.19%) ⬇️
.../main/java/org/opensearch/ad/ml/CheckpointDao.java 70.19% <0.00%> (+0.64%) ⬆️

@amitgalitz amitgalitz marked this pull request as ready for review March 30, 2022 17:22
@amitgalitz amitgalitz requested a review from a team March 30, 2022 17:22
@ohltyler ohltyler linked an issue Mar 30, 2022 that may be closed by this pull request
if (isCiServer) {
    failOnPassedAfterRetry = false
    maxRetries = 6
    maxFailures = 10
}
Collaborator

Haven't used test-retry, just asking: if maxRetries is 6, shouldn't the max failures not exceed 7? Why set maxFailures = 10?

Member
@ohltyler ohltyler Mar 30, 2022

From the plugin details, maxFailures is about how many tests fail per run. If that number is large, it's likely there are other issues causing the tests to fail (e.g. the cluster isn't coming up), rather than just flaky tests.
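To make the distinction concrete, a small hedged sketch with illustrative values: maxRetries caps retries per individual test, while maxFailures acts as a circuit breaker for the whole run.

retry {
    maxRetries = 6    // each failing test is re-run at most 6 times
    maxFailures = 10  // once 10 tests have failed, retrying is disabled for the
                      // rest of the run, since widespread failures usually point
                      // to a real problem (e.g. the test cluster never came up)
                      // rather than flakiness
}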

Member
@ohltyler ohltyler left a comment

LGTM! Just wondering: I feel this can cause flaky tests to stick around longer without being fixed. Is there still a way to get a report of the tests that failed and were retried? Ideally those are still brought to our attention so they can be fixed; that way, if the test suite is ever run in a way or in an environment where tests aren't automatically retried, the failures aren't a surprise.

@amitgalitz
Member Author

amitgalitz commented Mar 31, 2022

LGTM! Just wondering: I feel this can cause flaky tests to stick around longer without being fixed. Is there still a way to get a report of the tests that failed and were retried? Ideally those are still brought to our attention so they can be fixed; that way, if the test suite is ever run in a way or in an environment where tests aren't automatically retried, the failures aren't a surprise.

That's a valid concern. To mitigate it, the retry plugin will only be active when running through GitHub Actions, not when building or testing locally, so developers who build locally will still encounter the flakiness first. However, there are obviously times when we don't run the whole ./gradlew build process locally before making a PR. If tests fail and then pass on retry, the failures still show up in the Action logs if developers check them, but it also makes sense that a developer who sees all checks passing won't check the logs. OpenSearch has a bot that produces the reports and lets you see all failed tests in a nice format, like we can locally. OpenSearch core has also just implemented gradle test-retry and validated that the flaky tests are visible in their report (same as in our logs): opensearch-project/OpenSearch#2638 (comment) (if you click "report" it will download a zip that includes the nicely formatted test results, as seen in the screenshots in that PR).

We can potentially decide to do something like this even just for tests. I'll open an issue and look at this later.

@@ -148,8 +149,15 @@ def _numNodes = findProperty('numNodes') as Integer ?: 1

def opensearch_tmp_dir = rootProject.file('build/private/opensearch_tmp').absoluteFile
opensearch_tmp_dir.mkdirs()

boolean isCiServer = System.getenv().containsKey("CI")
Member

How did you come up with this value? Will it only work on GitHub CI runners, or will it work on Jenkins hosts too? Ideally it runs in both so that we don't get surprised by test failures during infra builds.

Member Author

This check was recommended by the https://github.com/gradle/test-retry-gradle-plugin documentation, and the Gradle documentation recommends this line for running something only in CI. I tested it and it works on GitHub Actions; I am not sure about Jenkins. I also saw if (BuildParams.isCi() == true) recommended on the OpenSearch core PR, but it wasn't used in the end. I was thinking it's maybe okay if we don't retry on Jenkins, as that runs on a larger instance and I didn't think the AD backend was as flaky there.
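For context, a sketch of the pattern under discussion. GitHub Actions sets CI=true in its runner environment, which is what the containsKey check detects; whether other CI systems export that variable is not assumed here, and the surrounding test block is an illustrative placement, not a copy of this PR's diff.

// True when the CI environment variable is present (GitHub Actions sets CI=true);
// false for a plain local ./gradlew run, so local builds still surface flaky tests.
boolean isCiServer = System.getenv().containsKey("CI")

test {
    retry {
        if (isCiServer) {
            failOnPassedAfterRetry = false
            maxRetries = 6
            maxFailures = 10
        }
    }
}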

Member

Fair enough. I think at least adding it for GitHub CI is helpful, since that usually runs on lower-provisioned hosts compared to a local machine or Jenkins, like you mentioned. And if a test is still flaky, it will be exposed in those places.

@ohltyler
Member

Got it - yeah, I think as long as this still fails locally and we can view the reports from CI, it should be ok. Can you help add the bot workflow that comments the zipped test output?

@ohltyler
Member

Got it - yeah, I think as long as this still fails locally and we can view the reports from CI, it should be ok. Can you help add the bot workflow that comments the zipped test output?

I see you've created #480 to track, sounds good

Member
@ohltyler ohltyler left a comment

LGTM!

@amitgalitz amitgalitz merged commit 2766a8a into opensearch-project:main Mar 31, 2022
@dblock
Member

dblock commented Apr 4, 2022

Nice to see plugins adopt this for flaky tests.

We had some discussions in opensearch-project/OpenSearch#2638 about default values. I would take a look and lower the number of retries to catch flaky tests, and consider removing the "is CI" check to match the CI and developer experience; we had problems with this in core.
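A hedged illustration of that suggestion; the retry count here is only an example, not a value agreed on in this thread.

test {
    retry {
        // Applied unconditionally so CI and local developer runs behave the same.
        failOnPassedAfterRetry = false
        maxRetries = 2    // a lower cap surfaces genuinely flaky tests sooner
        maxFailures = 10
    }
}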

kaituo pushed a commit to kaituo/anomaly-detection-1 that referenced this pull request Apr 8, 2022
@kaituo kaituo mentioned this pull request Apr 8, 2022
@amitgalitz amitgalitz added the infra Changes to infrastructure, testing, CI/CD, pipelines, etc. label Apr 20, 2022
kaituo added a commit that referenced this pull request Apr 22, 2022
* Fix restart HCAD detector bug (#460)

* Fix restart HCAD detector bug

* Adding test-retry plugin (#456)

* backport cve fix and improve restart IT

To prevent repeatedly cold starting a model due to sparse data, HCAD has a cache that remembers that we have done a cold start for a model. A second attempt to cold start has to wait for 60 detector intervals. Previously, when stopping a detector, I forgot to clean the cache, so the cache still remembered the model and cold start would not be retried until that time had passed. This PR fixes the bug by cleaning the cache when stopping a detector.

Testing done:
1. Added unit and integration tests.
2. Manually reproduced the issue and verified the fix.
Labels
infra Changes to infrastructure, testing, CI/CD, pipelines, etc.
Successfully merging this pull request may close this issue: Updating how we run integ tests.