
[BUG] Timeout on org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation #1957

Closed
Tracked by #1715
VachaShah opened this issue Jan 21, 2022 · 23 comments
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), >test-failure (Test failure from CI, local build, etc.)

Comments

@VachaShah (Collaborator)

Describe the bug
Caught on PR #1952. The test timed out while waiting for the cluster to become green. Related PR for test: #1445.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=AF1232B890DC88C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=he-IL -Dtests.timezone=Japan -Druntime.java=17

To Reproduce
Steps to reproduce the behavior:

  1. Run the above command in the repo. It's a flaky test.

Expected behavior
The cluster becomes green and the test does not time out.

Plugins
Core OpenSearch.

Host/Environment (please complete the following information):

@owaiskazi19 (Member)

Hey @jainankitk, can you provide more details on this issue? It's from #1445. Thanks.

@jainankitk (Collaborator)

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.

I will prioritize this before EOW and update this thread with my findings!

@owaiskazi19 (Member) commented Jan 27, 2022

> Hey @jainankitk, can you provide more details on this issue? It's from #1445.
>
> I will prioritize this before EOW and update this thread with my findings!

Hey @jainankitk! I spent some time on the bug, and it was mostly a matter of adding a timeout to the ensureGreen call so the nodes have time to become available. I raised a PR for that and it is merged now.
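
For context, a minimal sketch of what this kind of change could look like, assuming the `ensureGreen(TimeValue, String...)` overload inherited from `OpenSearchIntegTestCase`; the actual merged PR may differ:

```java
import org.opensearch.common.unit.TimeValue;
import org.opensearch.test.OpenSearchIntegTestCase;

public class MovePrimaryFirstTests extends OpenSearchIntegTestCase {
    public void testClusterGreenAfterPartialRelocation() throws Exception {
        // ... start the nodes, create the "foo" index, trigger the relocation ...

        // Wait a bounded amount of time for the cluster to go green instead of
        // relying on the default, which can be too short on slow CI hosts.
        ensureGreen(TimeValue.timeValueSeconds(60), "foo");
    }
}
```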

@jainankitk (Collaborator)

> Hey @jainankitk! I spent some time on the bug, and it was mostly a matter of adding a timeout to the ensureGreen call so the nodes have time to become available. I raised a PR for that and it is merged now.

Thank you, appreciate that!

@owaiskazi19 self-assigned this on Jan 28, 2022
@dreamer-89 (Member)

Another occurrence: https://fork-jenkins.searchservices.aws.dev/job/OpenSearch_CI/job/PR_Checks/job/Gradle_Check/2122/artifact/gradle_check_2122.log

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation" -Dtests.seed=63FE4BB64BECDB9A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sl-SI -Dtests.timezone=America/Blanc-Sablon -Druntime.java=17

org.opensearch.cluster.routing.MovePrimaryFirstTests > testClusterGreenAfterPartialRelocation FAILED
    java.lang.AssertionError: timed out waiting for green state
        at __randomizedtesting.SeedInfo.seed([63FE4BB64BECDB9A:30342FD360A8A48E]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:985)
        at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:924)
        at org.opensearch.cluster.routing.MovePrimaryFirstTests.testClusterGreenAfterPartialRelocation(MovePrimaryFirstTests.java:116)

@owaiskazi19 (Member)

Looks like the above-mentioned commit wasn't rebased onto the latest main.

@saratvemulapalli (Member)

Another failure after rebasing onto the latest main: #2033

@owaiskazi19 (Member) commented Feb 2, 2022

@jainankitk, what do you think about increasing the timeout to 120 here? Looks like the test is still failing at 60.

@andrross (Member) commented Feb 2, 2022

Another failure here: #1917 (comment)

@jainankitk (Collaborator)

Taking a look

@jainankitk (Collaborator)

The issue is caused by one of the primary shards still being in initialization while some replicas start in the meantime. Hence, the latch is counted down because half of the shards are already initialized. I am making the check more robust by ensuring that no primaries are initializing and that no more than 20% of the replicas have started on the new nodes.
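
For illustration, a rough sketch of what such a cluster-state check could look like; the helper name, variable names, and structure are assumptions, not the actual test code:

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.service.ClusterService;

// Hypothetical helper: counts the latch down only once no primaries are
// initializing and at most 20% of the replicas have started on the new nodes.
private void addRelocationLatchListener(ClusterService clusterService, CountDownLatch latch,
                                        Set<String> newNodeIds, int replicaCount) {
    clusterService.addListener(event -> {
        ClusterState state = event.state();
        int primariesInitializing = 0;
        int replicasStartedOnNewNodes = 0;
        for (ShardRouting shard : state.routingTable().allShards("foo")) {
            if (shard.primary() && shard.initializing()) {
                primariesInitializing++;
            } else if (shard.primary() == false
                && shard.started()
                && newNodeIds.contains(shard.currentNodeId())) {
                replicasStartedOnNewNodes++;
            }
        }
        if (primariesInitializing == 0 && replicasStartedOnNewNodes <= replicaCount / 5) {
            latch.countDown();
        }
    });
}
```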

@dreamer-89 (Member) commented Feb 9, 2022

Another failure here: #2026 (comment)

@owaiskazi19 removed their assignment on Feb 9, 2022
@jainankitk (Collaborator)

Okay, I can see that none of the shards was unassigned; just one replica was remaining that would have started given a few more seconds. @owaiskazi19 - I will increase the timeout to 60 seconds! :)

  1> --------[foo][98], node[gK0yMdlgQcq7XDvBVcHqHA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AHHzR21gRwyH9m8Z2tDMoA], unassigned_info[[reason=NODE_LEFT], at[2022-02-09T01:03:06.551Z], delayed=false, details[node_left [dDqzlwdpR5yo4Eu75qv02Q]], allocation_status[no_attempt]]
  1> --------[foo][15], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=YA_tO2W5SlGHclGkPmNptQ]
  1> --------[foo][40], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6B01KebASMCgo8EzO8YKZg]
  1> --------[foo][6], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=cnF4_Bs4QqCuxxj_HFAqgw]
  1> --------[foo][95], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pAGuKsjjR-G3RbvFpvYwQQ]
  1> --------[foo][20], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pEms1UhDSxW0LUngYrIsPQ]
  1> --------[foo][14], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=aqRw-O9sQ3i3odC_TtwSwg]
  1> --------[foo][76], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9pd7AEihRFO-_Cwoa_UAGA]
  1> --------[foo][7], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OUN6zdcYTFK5F3MfFQOdrg]
  1> --------[foo][81], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=H5BXr0llSS-pR3ai8VrySg]
  1> --------[foo][89], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-_-RKj8SyCfKIFnvmQL4w]
  1> --------[foo][24], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=pq4U5n3tQlOGFmrxcNiJ3Q]
  1> --------[foo][19], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qs8kmVy1RAKaRKmCdWJsXQ]
  1> --------[foo][59], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=AllB9KA8ST-R6hiADkbp8Q]
  1> --------[foo][58], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=Cc110Ve-QAyxnMs8gwN5jQ]
  1> --------[foo][26], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=6bx8XZ49QZCVLwMAHzFO9A]
  1> --------[foo][66], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=m-0y5bnxREajEeNHSDVlPQ]
  1> --------[foo][86], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=p0Q47DxfTleki8s42UIqhg]
  1> --------[foo][9], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=y-F_n75mR3KpWbHkMXqBVg]
  1> --------[foo][17], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qjWibVYFSFildJbykxn3rg]
  1> --------[foo][44], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bA-J1mDcR_u7CdCgneZxzQ]
  1> --------[foo][94], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=qr73XHAKRKebttLJrfbVPw]
  1> --------[foo][11], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=CTR9cjS3QfSFG0qt7VRrbQ]
  1> --------[foo][28], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=izZjpNhoS-2AFvj4Y4Y2JQ]
  1> --------[foo][30], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fzERcAZ8TemBOeaUKWsMxw]
  1> --------[foo][53], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=fiMnSwICTN--mEf8Fn3oqQ]
  1> --------[foo][71], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=7xbaRZI2QPKdvBrE5iT0JQ]
  1> --------[foo][32], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=bFplE9pGTY-uTVHdl5RByQ]
  1> --------[foo][43], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wsdW3kXSRZakgjBEUoLNOw]
  1> --------[foo][52], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=zxDyzd0gTISJfR05fNkZSw]
  1> --------[foo][3], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=j_ePRsVJRm2DPaDHu-q-tg]
  1> --------[foo][92], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=OBJ-9NKNTU6kZQ9R0a0s6A]
  1> --------[foo][82], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=5L3LOKOiR5mGoFwWoCxkHg]
  1> --------[foo][41], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=wTks5SaaQ2a7i74Cy7WwXA]
  1> --------[foo][50], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=D2n8YmDQSbab6YaDEmNk-w]
  1> --------[foo][33], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=i0GmrglcSyWLeM7jBmtU7Q]
  1> --------[foo][57], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LngaeINzSQCPywze9NU2Rw]
  1> --------[foo][18], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=VxHWMVCUThy08tOoJ6VAww]
  1> --------[foo][61], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=9EPpu_ThT8u1HXDIX-DVzQ]
  1> --------[foo][48], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=LE_hapDHTdyGqdUm1xNk_Q]
  1> --------[foo][51], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=UWCrnlekRJ6kTKufo_cmUg]
  1> --------[foo][85], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=0OIL6unHQMyhG0MpuEXrdQ]
  1> --------[foo][38], node[gK0yMdlgQcq7XDvBVcHqHA], [R], s[STARTED], a[id=ir4h6zK0SDGCzYEEu5QBUw]
  1> ---- unassigned

  1> pending tasks:
  1> tasks: (1):
  1> 933/URGENT/shard-started StartedShardEntry{shardId [[foo][98]], allocationId [AHHzR21gRwyH9m8Z2tDMoA], primary term [1], message [after peer recovery]}/36ms

@andrross (Member) commented Feb 9, 2022

Is the issue here that this test is using 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

@jainankitk (Collaborator)

> Is the issue here that this test is using 100 primary shards, which is more than most other tests use, and therefore it takes longer and requires more resources from the test machine to get everything started? My concern is that even if we get the timeouts set to work well for the hosts used by CI infrastructure, it might still be flaky when run on developer machines. Is there anything we can do to make this test more deterministic?

I don't think 100 shards is the issue here; I have been able to run the test on my machine several times without any issue. Thinking more about it, though, the test might be able to run with an even smaller number of primary shards, like 25 or 50. I will wait to see if the issue is reported by anyone else.

@dblock (Member) commented Feb 14, 2022

#2069 (comment)

@jainankitk (Collaborator)

> #2069 (comment)

Okay, I can see that some of the shards were still initializing. I am considering reducing the number of shards instead of increasing the timeout, so as not to increase the overall test suite time.

@dblock (Member) commented Feb 14, 2022

#2069 (comment)

@kartg (Member) commented Feb 15, 2022

#2096 (comment)

@jainankitk (Collaborator)

The test was failing because some replica shards initialized and completed before the last primary shard could finish initializing. I discussed the issue with @dblock this morning to make the test more predictable. The suggestion was to add a shards-per-node constraint so that exactly 50% of the shards relocate to the new nodes. Including the constraint made the test really lightweight (it completes in less than 5 seconds), and it ran locally more than 500 times without any failure.
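
For reference, one standard way to express such a constraint is the `index.routing.allocation.total_shards_per_node` setting, which caps how many shards of an index a single node may hold. The snippet below is only a sketch with an illustrative cap value, not necessarily how the merged fix is written; the right cap depends on the node and shard counts the test uses.

```java
import org.opensearch.common.settings.Settings;

// Cap how many shards of "foo" a single node may hold. With an equal number of
// old and new nodes, a tight enough cap forces roughly half of the shards to
// relocate onto the new nodes; 25 is an illustrative value, not the test's.
client().admin().indices().prepareUpdateSettings("foo")
    .setSettings(Settings.builder()
        .put("index.routing.allocation.total_shards_per_node", 25)
        .build())
    .get();
```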

@jainankitk (Collaborator) commented Mar 17, 2022

All of the failures reported above occurred before the latest fix.

No recent failures; this issue can be resolved for good. @VachaShah

@jainankitk (Collaborator)

This issue can be closed. @VachaShah @dblock @andrross
