Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky BWC test #296

Merged
merged 1 commit into from
Feb 8, 2022
Merged

Fix flaky BWC test #296

merged 1 commit into from
Feb 8, 2022

Conversation

qreshi
Copy link
Contributor

@qreshi qreshi commented Jan 31, 2022

Signed-off-by: Mohammad Qureshi [email protected]

Issue #, if available: #241

Description of changes:
The BWC tests will occasionally fail on the twoThirdsUpgradedClusterTask. It seems that there is some race condition that can lead to some test cluster changes to not go through due to the sleep being done in the mixed cluster scenario. For the time being, the sleep will be removed from MIXED_CLUSTER scenario as this was shown to remove flakiness.

The sleeps were initially put in to give time for a Monitor execution in between assertions but the current checks in mixed state will suffice. Once we have an easy way to artificially forward Monitor executions (which will likely be after integration with Job Scheduler since the next execution time isn't stored in the Monitor job currently) then we can add additional checks without relying on sleeps.

Some local verification:

Multiple test runs before change

(22-01-31 17:22:48) <0> [~/workspace/opensearch-alerting]
qreshi % for i in {1..10}; do ./gradlew bwcTestSuite &> /dev/null || echo "Run $i failed"; done
Run 1 failed
Run 4 failed
Run 6 failed
Run 7 failed
Run 9 failed

Multiple test runs after removal of sleep

(22-01-29 1:06:34) <0> [~/workspace/opensearch-alerting]
qreshi % for i in {1..10}; do ./gradlew bwcTestSuite &> /dev/null || echo "Run $i failed"; done

CheckList:
[x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@qreshi qreshi requested a review from a team January 31, 2022 18:29
Comment on lines +69 to 73
// TODO: Change the next execution time of the Monitor manually instead since this inflates
// the test execution by a lot (might have to wait for Job Scheduler plugin integration first)
// Waiting a minute to ensure the Monitor ran again at least once before checking if the job is running
// on time
Thread.sleep(60000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually a consistent way to avoid sleeps is by using assertBusy, given if we expect some change of state within a stipulated time threshold. It allows to runs the code block for the provided interval, waiting for no assertions to trip, and return as soon as all the assertions are met. So wondering of there exists some state related to monitor execution, such as count, that we can rely upon as a trigger here, and then verify the stats response.

Copy link
Contributor Author

@qreshi qreshi Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the sleep was that the minimum Monitor execution interval is 1 minute and we were going to ensure the Monitor executed again before checking the stats to make sure that any late running Monitor manifested itself. The test itself is not perfect and we'd normally want to just fast forward the job execution like we do for the Index Management test but Alerting uses an older job scheduling implementation where the next job execution time is not stored on the document but rather in some ConcurrentHashMap object that is not test friendly (is not easily mutable in written tests).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying @qreshi. So my suggestion was, is there is a way (API/Object) which can specify if monitor was run again (such as a running-counter). If yes, then we can do assertBusy, which provides an exponential backoff mechanism to wait until the condition is met.
If not, I understand having sleep is the only valid option.

@qreshi qreshi merged commit 92aa4bd into opensearch-project:main Feb 8, 2022
qreshi added a commit to qreshi/opensearch-alerting that referenced this pull request Feb 14, 2022
qreshi added a commit to qreshi/opensearch-alerting that referenced this pull request Feb 15, 2022
qreshi added a commit that referenced this pull request Feb 21, 2022
AWSHurneyt pushed a commit to AWSHurneyt/OpenSearch-Alerting that referenced this pull request Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants