
[BUG] Timeout on org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation #1957

Closed
Tracked by #1715
VachaShah opened this issue Jan 21, 2022 · 23 comments
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.)

Comments

@VachaShah (Collaborator)

Describe the bug
Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for test: #1445.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=AF1232B890DC88C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=he-IL -Dtests.timezone=Japan -Druntime.java=17

To Reproduce
Steps to reproduce the behavior:

  1. Run the above command in the repo. It's a flaky test.

Expected behavior
The cluster becomes green and the test does not time out.

Plugins
Core OpenSearch.

Host/Environment (please complete the following information):

@owaiskazi19 (Member)

Hey @jainankitk, can you provide more details on this issue? It's from #1445. Thanks.

@jainankitk (Collaborator)

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.

I will prioritize this before EOW and update this thread with my findings!

@owaiskazi19 (Member) commented Jan 27, 2022

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.
>
> I will prioritize this before EOW and update this thread with my findings!

Hey @jainankitk! I spent some time on the bug, and it was mostly a matter of adding a timeout to the ensureGreen call so the nodes have time to become available. I raised a PR for that and it is merged now.
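
For context, a minimal sketch of what this kind of change could look like, assuming the `ensureGreen(TimeValue, String...)` overload inherited from `OpenSearchIntegTestCase`; the actual merged PR may differ:

```java
import org.opensearch.common.unit.TimeValue;
import org.opensearch.test.OpenSearchIntegTestCase;

public class MovePrimaryFirstTests extends OpenSearchIntegTestCase {
    public void testClusterGreenAfterPartialRelocation() throws Exception {
        // ... start the nodes, create the "foo" index, trigger the relocation ...

        // Wait a bounded amount of time for the cluster to go green instead of
        // relying on the default, which can be too short on slow CI hosts.
        ensureGreen(TimeValue.timeValueSeconds(60), "foo");
    }
}
```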

@jainankitk (Collaborator)

> Hey @jainankitk! I spent some time on the bug, and it was mostly a matter of adding a timeout to the ensureGreen call so the nodes have time to become available. I raised a PR for that and it is merged now.

Thank you, appreciate that!

@owaiskazi19 self-assigned this on Jan 28, 2022
@dreamer-89 (Member)

Another occurrence: https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2122/artifact/gradle_check_2122.log

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=63FE4BB64BECDB9A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sl-SI -Dtests.timezone=America/Blanc-Sablon -Druntime.java=17

org.opensearch.cluster.routing.MovePrimaryFirstTests > testClusterGreenAfterPartialRelocation FAILED
    java.lang.AssertionError: timed out waiting for green state
        at __randomizedtesting.SeedInfo.seed([63FE4BB64BECDB9A:30342FD360A8A48E]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:985)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:924)
        at org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation(MovePrimaryFirstTests.java:116)

@owaiskazi19 (Member)

Looks like the above-mentioned commit wasn't rebased onto the latest main.

@saratvemulapalli (Member)

Another failure after rebasing onto the latest main: #2033

@owaiskazi19 (Member) commented Feb 2, 2022

@jainankitk, what do you think about increasing the timeout to 120 here? Looks like the test is still failing at 60.

@andrross (Member) commented Feb 2, 2022

Another failure here: #1917 (comment)

@jainankitk (Collaborator)

Taking a look

@jainankitk (Collaborator)

The issue is caused by one of the primary shards still being in initialization while some replicas start in the meantime. Hence, the latch is counted down because half of the shards are already initialized. I am making the check more robust by ensuring that no primaries are initializing and that no more than 20% of the replicas have started on the new nodes.
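
For illustration, a rough sketch of what such a cluster-state check could look like; the helper name, variable names, and structure are assumptions, not the actual test code:

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.service.ClusterService;

// Hypothetical helper: counts the latch down only once no primaries are
// initializing and at most 20% of the replicas have started on the new nodes.
private void addRelocationLatchListener(ClusterService clusterService, CountDownLatch latch,
                                        Set<String> newNodeIds, int replicaCount) {
    clusterService.addListener(event -> {
        ClusterState state = event.state();
        int primariesInitializing = 0;
        int replicasStartedOnNewNodes = 0;
        for (ShardRouting shard : state.routingTable().allShards("foo")) {
            if (shard.primary() && shard.initializing()) {
                primariesInitializing++;
            } else if (shard.primary() == false
                && shard.started()
                && newNodeIds.contains(shard.currentNodeId())) {
                replicasStartedOnNewNodes++;
            }
        }
        if (primariesInitializing == 0 && replicasStartedOnNewNodes <= replicaCount / 5) {
            latch.countDown();
        }
    });
}
```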

@dreamer-89 (Member) commented Feb 9, 2022

Another failure here: #2026 (comment)

@owaiskazi19 removed their assignment on Feb 9, 2022
@jainankitk (Collaborator)

Okay, I can see that none of the shards was unassigned; just one replica was remaining that would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! :)

  1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
  1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
  1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
  1> --------[foo][6], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=cnF4_Bs4QqCuxxj_HFAqgw]
  1> --------[foo][95], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pAGuKsjjR-G3RbvFpvYwQQ]
  1> --------[foo][20], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pEms1UhDSxW0LUngYrIsPQ]
  1> --------[foo][14], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=aqRw-O9sQ3i3odC_TtwSwg]
  1> --------[foo][76], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9pd7AEihRFO-_Cwoa_UAGA]
  1> --------[foo][7], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OUN6zdcYTFK5F3MfFQOdrg]
  1> --------[foo][81], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=H5BXr0llSS-pR3ai8VrySg]
  1> --------[foo][89], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-_-RKj8SyCfKIFnvmQL4w]
  1> --------[foo][24], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pq4U5n3tQlOGFmrxcNiJ3Q]
  1> --------[foo][19], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qs8kmVy1RAKaRKmCdWJsXQ]
  1> --------[foo][59], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=AllB9KA8ST-R6hiADkbp8Q]
  1> --------[foo][58], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=Cc110Ve-QAyxnMs8gwN5jQ]
  1> --------[foo][26], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6bx8XZ49QZCVLwMAHzFO9A]
  1> --------[foo][66], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=m-0y5bnxREajEeNHSDVlPQ]
  1> --------[foo][86], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=p0Q47DxfTleki8s42UIqhg]
  1> --------[foo][9], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-F_n75mR3KpWbHkMXqBVg]
  1> --------[foo][17], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qjWibVYFSFildJbykxn3rg]
  1> --------[foo][44], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bA-J1mDcR_u7CdCgneZxzQ]
  1> --------[foo][94], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qr73XHAKRKebttLJrfbVPw]
  1> --------[foo][11], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=CTR9cjS3QfSFG0qt7VRrbQ]
  1> --------[foo][28], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=izZjpNhoS-2AFvj4Y4Y2JQ]
  1> --------[foo][30], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fzERcAZ8TemBOeaUKWsMxw]
  1> --------[foo][53], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fiMnSwICTN--mEf8Fn3oqQ]
  1> --------[foo][71], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=7xbaRZI2QPKdvBrE5iT0JQ]
  1> --------[foo][32], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bFplE9pGTY-uTVHdl5RByQ]
  1> --------[foo][43], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wsdW3kXSRZakgjBEUoLNOw]
  1> --------[foo][52], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=zxDyzd0gTISJfR05fNkZSw]
  1> --------[foo][3], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=j_ePRsVJRm2DPaDHu-q-tg]
  1> --------[foo][92], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OBJ-9NKNTU6kZQ9R0a0s6A]
  1> --------[foo][82], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=5L3LOKOiR5mGoFwWoCxkHg]
  1> --------[foo][41], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wTks5SaaQ2a7i74Cy7WwXA]
  1> --------[foo][50], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=D2n8YmDQSbab6YaDEmNk-w]
  1> --------[foo][33], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=i0GmrglcSyWLeM7jBmtU7Q]
  1> --------[foo][57], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LngaeINzSQCPywze9NU2Rw]
  1> --------[foo][18], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=VxHWMVCUThy08tOoJ6VAww]
  1> --------[foo][61], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9EPpu_ThT8u1HXDIX-DVzQ]
  1> --------[foo][48], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LE_hapDHTdyGqdUm1xNk_Q]
  1> --------[foo][51], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=UWCrnlekRJ6kTKufo_cmUg]
  1> --------[foo][85], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=0OIL6unHQMyhG0MpuEXrdQ]
  1> --------[foo][38], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=ir4h6zK0SDGCzYEEu5QBUw]
  1> ---- unassigned

  1> pending tasks:
  1> tasks: (1):
  1> 933/URGENT/shard-started StartedShardEntry{shardId [[foo][98]], allocationId [AHHzR21gRwyH9m8Z2tDMoA], primary term [1], message [after peer recovery]}/36ms

@andrross (Member) commented Feb 9, 2022

Is the issue here that this test is using 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

@jainankitk (Collaborator)

> Is the issue here that this test is using 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

I don't think 100 shards is the issue here; I have been able to run the test on my machine several times without any issue. Thinking more about it, though, the test might be able to run with an even smaller number of primary shards, like 25 or 50. I will wait to see if the issue is reported by anyone else.

@dblock (Member) commented Feb 14, 2022

#2069 (comment)

@jainankitk (Collaborator)

> #2069 (comment)

Okay, I can see that some of the shards were still initializing. I am considering reducing the number of shards instead of increasing the timeout, so as not to increase the overall test suite time.

@dblock (Member) commented Feb 14, 2022

#2069 (comment)

@kartg (Member) commented Feb 15, 2022

#2096 (comment)

@jainankitk (Collaborator)

The test was failing because some replica shards initialized and completed before the last primary shard could finish initializing. I discussed the issue with @dblock this morning to make the test more predictable. The suggestion was to add a shards-per-node constraint so that exactly 50% of the shards relocate to the new nodes. Including the constraint made the test really lightweight (it completes in less than 5 seconds), and it ran locally more than 500 times without any failure.
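
For reference, one standard way to express such a constraint is the `index.routing.allocation.total_shards_per_node` setting, which caps how many shards of an index a single node may hold. The snippet below is only a sketch with an illustrative cap value, not necessarily how the merged fix is written; the right cap depends on the node and shard counts the test uses.

```java
import org.opensearch.common.settings.Settings;

// Cap how many shards of "foo" a single node may hold. With an equal number of
// old and new nodes, a tight enough cap forces roughly half of the shards to
// relocate onto the new nodes; 25 is an illustrative value, not the test's.
client().admin().indices().prepareUpdateSettings("foo")
    .setSettings(Settings.builder()
        .put("index.routing.allocation.total_shards_per_node", 25)
        .build())
    .get();
```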

@jainankitk (Collaborator) commented Mar 17, 2022

All of the failures reported above occurred before the latest fix.

No recent failures; this issue can be resolved for good. @VachaShah

@jainankitk (Collaborator)

This issue can be closed. @VachaShah @dblock @andrross
