Adding test-retry plugin #456
Conversation
Codecov Report
@@ Coverage Diff @@
## main #456 +/- ##
============================================
- Coverage 78.18% 78.14% -0.04%
+ Complexity 4162 4159 -3
============================================
Files 296 296
Lines 17659 17659
Branches 1879 1879
============================================
- Hits 13807 13800 -7
- Misses 2958 2963 +5
- Partials 894 896 +2
Flags with carried forward coverage won't be shown.
if (isCiServer) {
    failOnPassedAfterRetry = false
    maxRetries = 6
    maxFailures = 10
Haven't used test retry, just asking: if maxRetries is 6, shouldn't the max failures not exceed 7? Why set maxFailures = 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the plugin details, maxFailures is about how many tests fail per run. If that number is large, it's likely there are other issues causing the tests to fail (e.g. the cluster isn't coming up), beyond just flaky tests.
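To make the distinction concrete, here is a commented sketch using the settings from this PR (property names are from the test-retry plugin; treating maxFailures as a task-wide cut-off follows the plugin description quoted in the PR description below):

// Sketch only, assuming the org.gradle.test-retry plugin is applied to the build.
test {
    retry {
        // Per-test limit: each failing test may be re-run up to 6 times.
        maxRetries = 6
        // Task-wide cut-off: once 10 tests have failed in this run, retrying is disabled,
        // on the assumption that many failures indicate a real problem (e.g. cluster not up), not flakiness.
        maxFailures = 10
        // A test that fails and then passes on retry does not fail the build.
        failOnPassedAfterRetry = false
    }
}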
src/test/java/org/opensearch/ad/e2e/DetectionResultEvalutationIT.java
LGTM! Just wondering: I feel this can cause flaky tests to stick around longer without being fixed. Is there still a way to get a report of the tests that may have failed / been retried? Ideally those are still brought to our attention so they can be fixed. That way, if the test suite is run in a different way or in a different environment where tests won't automatically be retried, it's not a surprise.
That's a valid concern. To mitigate this, the retry behavior only kicks in when running through GitHub Actions, not when building or testing locally, so developers who build locally will still encounter flakiness first. However, there are obviously times when we don't run the whole ./gradlew :build process locally before making a PR. If tests fail and then pass on retry, the failures will show up in the action logs if developers check them, but it also makes sense that if a developer sees all checks passing they won't check the logs. OpenSearch has a bot that produces the reports and lets you see all failed tests in a nice format, like we can do locally. OpenSearch core has also just implemented gradle-retry and validated that they can see the flaky tests in their report, same as we can in our logs: opensearch-project/OpenSearch#2638 (comment) (if you click the report it will download a zip that includes the nicely formatted test results, as seen in the screenshots in that PR). We can potentially decide to do something like this even just for tests. I'll open an issue and look at this later.
@@ -148,8 +149,15 @@ def _numNodes = findProperty('numNodes') as Integer ?: 1

def opensearch_tmp_dir = rootProject.file('build/private/opensearch_tmp').absoluteFile
opensearch_tmp_dir.mkdirs()

boolean isCiServer = System.getenv().containsKey("CI")
How did you come up with this value? Will this only work on GitHub CI runners, or will it work on Jenkins hosts too? Ideally it's run in both so that we don't get surprises on test failures during infra builds.
This value was recommended by the https://github.com/gradle/test-retry-gradle-plugin documentation, and the Gradle documentation recommends this line when running something only in CI. I tested it and it works on GitHub Actions; I am not sure about Jenkins. I also saw if (BuildParams.isCi() == true) recommended on the opensearch-core PR, but it wasn't used in the end. I was thinking maybe it's okay if we don't retry on Jenkins, as that runs on a larger instance and I didn't think the AD backend was as flaky there.
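For context, a minimal sketch of the CI check being discussed (GitHub Actions exports CI=true on its runners by default; whether Jenkins hosts set the same variable would need to be verified, as noted above):

// True on GitHub Actions runners, which set CI=true; typically false for a plain local ./gradlew run.
boolean isCiServer = System.getenv().containsKey("CI")

// Alternative mentioned above but not used here; relies on OpenSearch build-tools:
// if (BuildParams.isCi() == true) { ... }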
Fair enough. I think at least adding it for GitHub CI is helpful because of how it's usually run on lower-provisioned hosts compared to a local machine or Jenkins, like you mentioned. And if it's still flaky, it will be exposed in those places.
Got it - yeah, I think as long as this still fails locally and we're able to view the reports from CI, it should be ok. Can you help add the bot workflow that comments the zipped test output?
I see you've created #480 to track, sounds good
LGTM!
Nice to see plugins adopt this for flaky tests. We had some discussions in opensearch-project/OpenSearch#2638 about default values. I would take a look and lower the number of retries to catch flaky tests, and consider removing the "is CI" check to match the CI and developer experience; we had problems with this in core.
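For illustration, a sketch of the alternative being suggested here, with a lower retry count and without the CI gate so that local and CI runs behave the same (the retry count shown is illustrative, not a value from this PR):

// Sketch only, assuming the org.gradle.test-retry plugin is applied.
test {
    retry {
        // Applied unconditionally so developers see the same retry behavior locally as on CI.
        maxRetries = 3        // illustrative lower value to surface flaky tests sooner
        maxFailures = 10
        failOnPassedAfterRetry = false
    }
}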
* Fix restart HCAD detector bug (#460)
* Adding test-retry plugin (#456)
* Backport CVE fix and improve restart IT

To prevent repeatedly cold starting a model due to sparse data, HCAD has a cache that remembers we have done cold start for a model. A second attempt to cold start will need to wait for 60 detector intervals. Previously, when stopping a detector, I forgot to clean the cache, so the cache remembers the model and won't retry cold start after some time. This PR fixes the bug by cleaning the cache when stopping a detector.

Testing done:
1. Added unit and integration tests.
2. Manually reproduced the issue and verified the fix.

Signed-off-by: Amit Galitzky [email protected]
Description
Testing out a new Gradle plugin that retries failed tests.
maxFailures: "The maximum number of test failures that are allowed before retrying is disabled." (set to 10)
maxRetries: "The maximum number of times to retry an individual test." (set to 10)
getFailOnPassedAfterRetry: "Whether tests that initially fail and then pass on retry should fail the task." (set to false, since if it passes once that's good enough to pass the build)

Will be re-running workflows a few times on this PR to try this out and potentially adjust the max failures/retries settings. https://blog.gradle.org/gradle-flaky-test-retry-plugin
Related Issues
#451
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.