Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep test failure #7679

Closed
gashutos opened this issue May 23, 2023 · 15 comments · Fixed by #7876, #9079 or #12043
Closed

[BUG] org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep test failure #7679

gashutos opened this issue May 23, 2023 · 15 comments · Fixed by #7876, #9079 or #12043
Assignees
Labels
bug Something isn't working distributed framework flaky-test Random test failure that succeeds on second run

Comments

@gashutos
Copy link
Contributor

Describe the bug
./gradlew check is failing for #7453

org.opensearch.client.ResponseException: method [POST], host [http://[::1]:45733], URI [/test-index-segrep/_search?rest_total_hits_as_int=true&filter_path=hits.total&preference=_shards%3A1%7C_only_nodes%3APHxrXxV9RPOX7nX4OCYXyw], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"no data nodes with criteria [PHxrXxV9RPOX7nX4OCYXyw] found for shard: [test-index-segrep][1]"}],"type":"illegal_argument_exception","reason":"no data nodes with criteria [PHxrXxV9RPOX7nX4OCYXyw] found for shard: [test-index-segrep][1]"},"status":400}
	at __randomizedtesting.SeedInfo.seed([7EFF7D9B1E55E8C2:F1866BD6EA0C1D03]:0)
	at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:384)
	at app//org.opensearch.client.RestClient.performRequest(RestClient.java:354)
	at app//org.opensearch.client.RestClient.performRequest(RestClient.java:329)
	at app//org.opensearch.upgrades.IndexingIT.lambda$waitForSearchableDocs$0(IndexingIT.java:135)
	at app//org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1072)
	at app//org.opensearch.upgrades.IndexingIT.waitForSearchableDocs(IndexingIT.java:130)
	at app//org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep(IndexingIT.java:295)
	at [email protected]/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at [email protected]/java.lang.reflect.Method.invoke(Method.java:578)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at [email protected]/java.lang.Thread.run(Thread.java:1589)
@gashutos
Copy link
Contributor Author

@dreamer-89 can you fix this ?

@dreamer-89
Copy link
Member

@dreamer-89 can you fix this ?

Thank you @gashutos for creating this issue. I will look into it.

@msfroh
Copy link
Collaborator

msfroh commented May 24, 2023

Not sure if it's the same underlying issue, but this test also fails with

rg.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:36429], URI [/test-index-segrep], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [test-index-segrep/uxvIXhHGRVyG43-f_8-Ozg] already exists","index":"test-index-segrep","index_uuid":"uxvIXhHGRVyG43-f_8-Ozg"}],"type":"resource_already_exists_exception","reason":"index [test-index-segrep/uxvIXhHGRVyG43-f_8-Ozg] already exists","index":"test-index-segrep","index_uuid":"uxvIXhHGRVyG43-f_8-Ozg"},"status":400}
	at __randomizedtesting.SeedInfo.seed([363D470782E0DDAE:B944514A76B9286F]:0)
	at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:384)
	at app//org.opensearch.client.RestClient.performRequest(RestClient.java:354)
	at app//org.opensearch.client.RestClient.performRequest(RestClient.java:329)
	at app//org.opensearch.test.rest.OpenSearchRestTestCase.createIndex(OpenSearchRestTestCase.java:988)
	at app//org.opensearch.test.rest.OpenSearchRestTestCase.createIndex(OpenSearchRestTestCase.java:971)
	at app//org.opensearch.test.rest.OpenSearchRestTestCase.createIndex(OpenSearchRestTestCase.java:967)
	at app//org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep(IndexingIT.java:262)
	at [email protected]/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at [email protected]/java.lang.reflect.Method.invoke(Method.java:578)

See e.g. https://build.ci.opensearch.org/job/gradle-check/16254/testReport/junit/org.opensearch.upgrades/IndexingIT/testIndexingWithSegRep/

@kotwanikunal
Copy link
Member

Re-opening this issue as I noticed this in #8198

@kotwanikunal kotwanikunal added flaky-test Random test failure that succeeds on second run and removed untriaged labels Jul 13, 2023
@dreamer-89
Copy link
Member

Tried to run the test locally with failing seed but tests ran without failure.

@reta
Copy link
Collaborator

reta commented Aug 14, 2023

Sadly, the issue is not gone, test is still faliling (as of today) but with a different reaso:

org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:34843], URI [/test-index-segrep], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, best_compression, zstd, zstd_no_dict] but was: zlib"}],"type":"illegal_argument_exception","reason":"unknown value for [index.codec] must be one of [default, best_compression, zstd, zstd_no_dict] but was: zlib"},"status":400}

@sarthakaggarwal97 could you please take a look? thank you

[1] https://build.ci.opensearch.org/job/gradle-check/22620/testReport/junit/org.opensearch.upgrades/IndexingIT/testIndexingWithSegRep/

@mch2
Copy link
Member

mch2 commented Aug 14, 2023

This is popping up quite a bit, suggest we mute again until fixed @dreamer-89

@dreamer-89
Copy link
Member

Upgrade failure

Assertion trips after two nodes are upgraded to 3.0.0

> Task :qa:rolling-upgrade:v2.10.0#twoThirdsUpgradedTest FAILED

Assertion trip

The replica shard sitting on the only non-upgraded node v2.10.0-2 does not catch up with primary and remains 1 checkpoints_behind stat. Even though, this verification is wrapped inside assertBusy and retried for 1 minute.

  1> [2023-08-11T19:01:23,157][INFO ][o.o.u.IndexingIT         ] [testIndexingWithSegRep] after test
  2> REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:v2.10.0#twoThirdsUpgradedTest' --tests "org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep" -Dtests.seed=DAE468033CE81DEE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=id -Dtests.timezone=America/Argentina/San_Juan -Druntime.java=20
  2> java.lang.AssertionError: Replica shard [test-index-segrep][1]not upto date with primary  expected:<0> but was:<1>
        at __randomizedtesting.SeedInfo.seed([DAE468033CE81DEE:559D7E4EC8B1E82F]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.upgrades.IndexingIT.lambda$verifySegmentStats$1(IndexingIT.java:166)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1084)
        at org.opensearch.upgrades.IndexingIT.verifySegmentStats(IndexingIT.java:160)
        at org.opensearch.upgrades.IndexingIT.waitForSearchableDocs(IndexingIT.java:97)
        at org.opensearch.upgrades.IndexingIT.testIndexingWithSegRep(IndexingIT.java:334)

The replica shard test-index-segrep[2] on node v2.10.0-2 also lags on overall document count which suggests a generic issue where replica is not able to catch up with primary even after 1 minute of overall wait time.

  1> test-index-segrep                       0     r      STARTED       7 23.7kb 127.0.0.1 v2.10.0-0
  1> test-index-segrep                       0     r      STARTED       7 16.3kb 127.0.0.1 v2.10.0-1
  1> test-index-segrep                       0     p      STARTED       7 23.6kb 127.0.0.1 v2.10.0-2
  1> test-index-segrep                       1     p      STARTED       6  8.4kb 127.0.0.1 v2.10.0-0
  1> test-index-segrep                       1     r      STARTED       6  8.4kb 127.0.0.1 v2.10.0-1
  1> test-index-segrep                       1     r      STARTED       5  8.3kb 127.0.0.1 v2.10.0-2
  1> test-index-segrep                       2     p      STARTED       3  4.2kb 127.0.0.1 v2.10.0-0
  1> test-index-segrep                       2     r      STARTED       3  4.2kb 127.0.0.1 v2.10.0-1
  1> test-index-segrep                       2     r      STARTED       1  4.1kb 127.0.0.1 v2.10.0-2

@sarthakaggarwal97
Copy link
Contributor

@reta this looks weird, in this PR #9170 the exception message, alongside the acceptable values have been modified in EngineConfig, ideally we should not be seeing this. I'll take a look and understand why is this happening.

@dreamer-89 dreamer-89 assigned andrross and unassigned dreamer-89 Sep 6, 2023
@andrross
Copy link
Member

andrross commented Sep 20, 2023

Some updates on progress here:

  • I am able to reproduce the failure semi-regularly on the latest code on main. It fails with the same errors as detailed by Suraj above.
  • The test configures 3 shards and 2 replicas for a 3 node cluster. If I reduce either of these counts, the test does not fail.
  • The test task ':qa:rolling-upgrade:v2.11.0#twoThirdsUpgradedTest' will run its dependencies oneThirdUpgradedTest, and oldClusterTest, and when it runs the dependencies it runs all tests even if you pass the parameter --tests "MyTest". If I remove the other tests so that they don't run, then I cannot reproduce the failure. This may suggest that some other test is interfering here. Update: RecoveryIT appears to be the test that causes problems. If I remove RecoveryIT, I cannot reproduce the failure. If I keep RecoveryIT but remove all the other tests, I can reproduce the failure.

@dreamer-89
Copy link
Member

RecoveryIT appears to be the test that causes problems. If I remove RecoveryIT, I cannot reproduce the failure. If I keep RecoveryIT but remove all the other tests, I can reproduce the failure.

This is interesting. Thanks for catching this @andrross

@andrross andrross assigned mch2 and unassigned andrross Jan 11, 2024
@andrross
Copy link
Member

@mch2 Has helped me out here, and it seems that in rare cases a replica is getting stuck somewhere in the segment replication process. The replica node appears to try to replicate from a node that is no longer the primary, and that process must not be failing correctly because replication gets stuck and it manifests as the replica not being able to catch up to the latest segments, leading to the test failure. I'm going to hand this issue off for @mch2 to continue to investigate/fix.

@andrross
Copy link
Member

andrross commented Jan 16, 2024

@mch2 found that the replica seems to be getting stuck retrying replication from a node that has gone down (presumably for upgrade). A new primary exists but the replica node is stuck retrying the connection error, with a default retry timeout of 1 minute. I ran an experiment where I hardcoded the retry timeout from 1 minute to 1 second (on both the 2.x version and the main version) and the test was able to run all weekend (>2400 iterations) with no failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment