[BUG] Timeout on org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation #1957
Comments
Hey @jainankitk, can you provide more details on this issue? It's from #1445. Thanks.
I will prioritize this before EOW and update this thread with my findings!
Hey @jainankitk! I spent some time on the bug; it was mostly related to adding a timeout to the ensureGreen function so that it waits for nodes to become available. I raised a PR for the same and it's merged now.
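The essence of the fix is bounding the wait instead of letting the test hang. A rough stdlib-only sketch (the names `GreenWaiter` and `awaitGreen` are hypothetical stand-ins, not the actual OpenSearch test-framework API):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GreenWaiter {
    // Hypothetical stand-in for ensureGreen: wait until all expected nodes
    // have reported in, but give up after the supplied timeout instead of
    // blocking the test suite forever.
    static boolean awaitGreen(CountDownLatch nodesUp, long timeoutSeconds) {
        try {
            return nodesUp.await(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // treat interruption as a failed wait
        }
    }
}
```

With a bounded wait the test fails fast with a clear timeout rather than stalling CI.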
Thank you, appreciate that!
Another occurrence: https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2122/artifact/gradle_check_2122.log
Looks like the above-mentioned commit wasn't rebased onto the latest main.
Another failure after main is rebased: #2033
@jainankitk what do you think about increasing the timeout to 120 here? Looks like the test is still failing at 60.
Another failure here: #1917 (comment)
Taking a look
The issue is caused by one of the primary shards still being initialized while some replicas start in the meantime. Hence, the latch is counted down because half of the shards are already initialized. Making the check more robust by ensuring no primaries are initializing and no more than 20% of replicas have started on the new nodes.
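The race above can be sketched with plain Java (a hypothetical `Shard` model, not OpenSearch's actual routing classes): the naive check trips once half of all shards have finished initializing, which replicas alone can satisfy, while the stricter check requires that no primary is still initializing and that at most 20% of replicas have started on the newly added nodes.

```java
import java.util.List;

public class RelocationCheck {
    // Hypothetical shard model; OpenSearch's real routing table is richer.
    record Shard(boolean primary, boolean initializing, boolean startedOnNewNode) {}

    // Naive check: counts the latch down once half of all shards have
    // finished initializing, even if those are mostly replicas.
    static boolean naiveHalfInitialized(List<Shard> shards) {
        long done = shards.stream().filter(s -> !s.initializing()).count();
        return done * 2 >= shards.size();
    }

    // Stricter check: no primary may still be initializing, and no more
    // than 20% of replicas may have started on the newly added nodes.
    static boolean robustCheck(List<Shard> shards) {
        boolean primaryInitializing = shards.stream()
                .anyMatch(s -> s.primary() && s.initializing());
        long replicas = shards.stream().filter(s -> !s.primary()).count();
        long replicasOnNewNodes = shards.stream()
                .filter(s -> !s.primary() && s.startedOnNewNode()).count();
        return !primaryInitializing && replicasOnNewNodes * 5 <= replicas;
    }
}
```

With one primary still initializing and two started replicas, the naive check already passes while the robust check correctly holds back.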
Okay, I can see that none of the shards were unassigned; just 1 replica was remaining that would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! :)
Is the issue here that this test is using 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?
I don't think 100 shards is the issue here. I have been able to run the test on my machine several times without any issue. Though, thinking more on it, the test might be able to run with an even smaller number like 25 or 50 primary shards. I will wait to see if the issue is reported by anyone else.
Okay, I can see that some of the shards were initializing. Considering reducing the number of shards instead of increasing the timeout, so as not to increase overall test suite time.
The test was failing due to some replica shards initializing and completing before the last primary shard could finish initializing. Discussed the issue with @dblock this morning to make the test more predictable. The suggestion was to add a shards-per-node constraint so that exactly 50% of the shards relocate to the new nodes. Including the constraint made the test really lightweight (it completes in less than 5 seconds) and it ran locally more than 500 times without any failure.
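A shards-per-node constraint like this can be expressed with OpenSearch's index-level `index.routing.allocation.total_shards_per_node` setting; the shard counts and limit below are illustrative, not the exact values from the test:

```json
PUT /test-index
{
  "settings": {
    "index.number_of_shards": 100,
    "index.number_of_replicas": 1,
    "index.routing.allocation.total_shards_per_node": 50
  }
}
```

Capping how many copies each node may hold forces a predictable fraction of shards onto the newly added nodes, so the test no longer depends on relocation timing.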
All of the failures reported above occurred before the latest fix. No recent failures, so this issue can be resolved for good - @VachaShah
This issue can be closed. @VachaShah @dblock @andrross
Describe the bug
Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for test: #1445.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The cluster becomes green and the test does not time out.
Plugins
Core OpenSearch.
Host/Environment (please complete the following information):