Fix flaky RemoteIndexRecoveryIT testRerouteRecovery test #9580 #11918
…roject#9580 Signed-off-by: Ashish Singh <[email protected]>
Running testRerouteRecovery for both RemoteIndexRecoveryIT and IndexRecoveryIT for at least 1K iterations each. Without the fix, the test currently fails around the 20th iteration.
Compatibility status: Checks if related components are compatible with change ff8fc99
Incompatible components: [https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]
Skipped components:
Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/k-nn.git]
❌ Gradle check result for ff8fc99: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Flaky test - #9891
❕ Gradle check result for ff8fc99: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #11918 +/- ##
============================================
+ Coverage 71.33% 71.41% +0.08%
- Complexity 59300 59328 +28
============================================
Files 4921 4921
Lines 278989 278989
Branches 40543 40543
============================================
+ Hits 199014 199252 +238
+ Misses 63444 63120 -324
- Partials 16531 16617 +86
☔ View full report in Codecov by Sentry.
Flaky test - #9191
The test has now been run for more than 1K iterations without any failure.
Are you actually fixing the race condition here, or just making it much less likely? How can we make the test pass in a way that isn't dependent on the timing of the assertions?
There is no underlying problem here. The test tries to assert an intermediate state that exists only transiently, for a very short period. The same problem exists in the underlying document replication (doc rep) test as well. Once the recovery process completes, a shard-started action is triggered, which causes the index shard to be cleared from the source node.
test/framework/src/main/java/org/opensearch/test/OpenSearchTestCase.java
Signed-off-by: Ashish Singh <[email protected]> (cherry picked from commit c6cebc7) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
… (#12275) (cherry picked from commit c6cebc7) Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…roject#9580 (opensearch-project#11918) Signed-off-by: Ashish Singh <[email protected]>
…roject#9580 (opensearch-project#11918) Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Shivansh Arora <[email protected]>
Description
This PR fixes the flakiness in the testRerouteRecovery of RemoteIndexRecoveryIT.
The flakiness has two causes, as seen in the stack traces:
The first issue happens because cluster state publication cleans up the index shard on the old primary node in the interval between two assertion checks. The current assertBusy uses exponential backoff between checks. In the gap between two checks, the asserted condition becomes true and the index shard is then cleared from the old node. Because the gap is long, the check never runs while the condition holds. The fix is to run the assertion checks at a fixed interval.
The second issue has the same cause, but in this case the peer recovery has completed and changed the recovery state of the index shard.
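The difference between the two polling strategies can be sketched as follows. This is a minimal, self-contained illustration and not the code changed in this PR: the class `FixedIntervalAssert`, the interface `Check`, and the methods `doublingSleeps` and `assertBusyFixedInterval` are hypothetical names, and the backoff is assumed to start at 1 ms and double after every failed check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenSearchTestCase.
final class FixedIntervalAssert {

    /** A check that signals "not yet" by throwing AssertionError. */
    @FunctionalInterface
    public interface Check {
        void run() throws Exception;
    }

    /**
     * Sleep durations (in ms) of an assumed doubling backoff that starts at
     * 1 ms and stops once the next sleep would exceed the total wait budget.
     */
    static List<Long> doublingSleeps(long maxWaitMillis) {
        List<Long> sleeps = new ArrayList<>();
        long sleep = 1;
        long total = 0;
        while (total + sleep <= maxWaitMillis) {
            sleeps.add(sleep);
            total += sleep;
            sleep *= 2;
        }
        return sleeps;
    }

    /**
     * Runs the check repeatedly, sleeping a constant sleepMillis between
     * attempts, until it passes or maxWaitMillis has elapsed. On timeout the
     * last AssertionError is rethrown.
     */
    static void assertBusyFixedInterval(Check check, long maxWaitMillis, long sleepMillis) throws Exception {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        AssertionError last;
        while (true) {
            try {
                check.run();
                return; // condition observed
            } catch (AssertionError e) {
                last = e;
            }
            if (System.nanoTime() >= deadline) {
                break;
            }
            Thread.sleep(sleepMillis);
        }
        throw last;
    }
}
```

Under these assumptions, a 10-second budget produces 13 sleeps with the last one lasting 4096 ms, so a state that is visible for only a short time late in the wait can appear and disappear entirely between two checks. With a constant interval, the gap between checks never grows, which narrows that window. It still depends on the interval being shorter than the lifetime of the transient state.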
Related Issues
Resolves #9580
Check List
New functionality includes testing.
New functionality has been documented.
Commit changes are listed out in CHANGELOG.md file (See: Changelog)
Public documentation issue/PR created
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.