[BUG] org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing flaky failure #9191

Closed
gbbafna opened this issue Aug 9, 2023 · 11 comments · Fixed by #9565
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Storage Issues and PRs relating to data and metadata storage v2.13.0 Issues and PRs related to version 2.13.0

Comments

@gbbafna
Collaborator

gbbafna commented Aug 9, 2023

Describe the bug

Four failures in org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing (21818, 21853, 22063, 22065)

Meta issue - #8279

@reta
Collaborator

reta commented Nov 28, 2023

Sadly the issue is not fixed:

java.lang.AssertionError: timed out waiting for relocation iteration [1] 
	at org.opensearch.indices.recovery.IndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing(IndexPrimaryRelocationIT.java:128)
	at org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing(RemoteIndexPrimaryRelocationIT.java:54)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

https://build.ci.opensearch.org/job/gradle-check/30517/testReport/junit/org.opensearch.remotestore/RemoteIndexPrimaryRelocationIT/testPrimaryRelocationWhileIndexing/

@peternied
Member

@andrross
Member

#12374 (comment)

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing" -Dtests.seed=5269994C327E7FD4 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-CO -Dtests.timezone=Asia/Jerusalem -Druntime.java=21

org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT > testPrimaryRelocationWhileIndexing FAILED
    java.lang.AssertionError: timed out waiting for relocation iteration [1] 
        at org.opensearch.indices.recovery.IndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing(IndexPrimaryRelocationIT.java:128)
        at org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing(RemoteIndexPrimaryRelocationIT.java:49)

    java.lang.AssertionError: local checkpoint [2] does not match checkpoint from primary context [PrimaryContext{clusterStateVersion=11, checkpoints={nGakQWX8R9KvpZ_0TRmH7A=LocalCheckpointState{localCheckpoint=2, globalCheckpoint=1, inSync=true, tracked=true, replicated=true}, n3ewmqN0QPie74GcmSR4QA=LocalCheckpointState{localCheckpoint=1, globalCheckpoint=1, inSync=true, tracked=true, replicated=true}}, routingTable=IndexShardRoutingTable([test][0]){[test][0], node[iXsoU28JQzqETiYo1eIRbQ], relocating [c-tpWdW7RsqF5g-oLUYCRg], [P], s[RELOCATING], a[id=nGakQWX8R9KvpZ_0TRmH7A, rId=n3ewmqN0QPie74GcmSR4QA]}}]
        at __randomizedtesting.SeedInfo.seed([5269994C327E7FD4]:0)
        at org.opensearch.index.shard.IndexShard.activateWithPrimaryContext(IndexShard.java:3464)
        at org.opensearch.indices.recovery.RecoveryTarget.handoffPrimaryContext(RecoveryTarget.java:305)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:431)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:425)
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
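
For readers unfamiliar with the second assertion, the following is a minimal sketch of the kind of consistency check that appears to be failing in the trace above. It is not the OpenSearch implementation: the class `PrimaryHandoffCheckSketch` and the method `handoffIsConsistent` are hypothetical names, and it assumes (from the `a[id=..., rId=...]` routing entry in the message) that the relocation target is the allocation listed as `rId` and that the bracketed `local checkpoint [2]` is the value reported by that target.

```java
import java.util.Map;

public class PrimaryHandoffCheckSketch {

    /**
     * Hypothetical helper: returns true when the relocation target's own local
     * checkpoint equals the local checkpoint that the source primary recorded
     * for that target's allocation ID in the handed-off primary context.
     */
    static boolean handoffIsConsistent(long targetLocalCheckpoint,
                                       String targetAllocationId,
                                       Map<String, Long> contextLocalCheckpoints) {
        Long recorded = contextLocalCheckpoints.get(targetAllocationId);
        return recorded != null && recorded == targetLocalCheckpoint;
    }

    public static void main(String[] args) {
        // Local checkpoints copied from the PrimaryContext in the failure message.
        Map<String, Long> contextLocalCheckpoints = Map.of(
            "nGakQWX8R9KvpZ_0TRmH7A", 2L,  // source primary (id)
            "n3ewmqN0QPie74GcmSR4QA", 1L); // assumed relocation target (rId)

        // The target reports local checkpoint 2, but the context recorded 1 for it.
        System.out.println(handoffIsConsistent(2L, "n3ewmqN0QPie74GcmSR4QA", contextLocalCheckpoints)); // false

        // The handoff would have been consistent had the target been at 1.
        System.out.println(handoffIsConsistent(1L, "n3ewmqN0QPie74GcmSR4QA", contextLocalCheckpoints)); // true
    }
}
```

Under these assumptions, the target had advanced to local checkpoint 2 while the primary context still recorded 1 for it, which matches the mismatch reported in the message.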

@sohami
Collaborator

sohami commented Feb 20, 2024

CI with failure: https://build.ci.opensearch.org/job/gradle-check/33937/testReport/junit/org.opensearch.remotestore/RemoteIndexPrimaryRelocationIT/testPrimaryRelocationWhileIndexing/

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing" -Dtests.seed=39EA872CCCBDE24A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-US -Dtests.timezone=America/Goose_Bay -Druntime.java=21

feb 19, 2024 8:47:37 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
ADVERTENCIA: Uncaught exception in thread: Thread[#2029,opensearch[node_t1][generic][T#2],5,TGRP-RemoteIndexPrimaryRelocationIT]
java.lang.AssertionError: local checkpoint [9] does not match checkpoint from primary context [PrimaryContext{clusterStateVersion=15, checkpoints={MYNdvCfISO-NnoEkwL95lw=LocalCheckpointState{localCheckpoint=9, globalCheckpoint=8, inSync=true, tracked=true, replicated=true}, i5IiBar_QwK4i2YqQkpIjg=LocalCheckpointState{localCheckpoint=8, globalCheckpoint=8, inSync=true, tracked=true, replicated=true}}, routingTable=IndexShardRoutingTable([test][0]){[test][0], node[3pqwwzGRS7qWaTAVaIrnqw], relocating [I_Z9NuQWSSyxkaQmz-M3eQ], [P], s[RELOCATING], a[id=MYNdvCfISO-NnoEkwL95lw, rId=i5IiBar_QwK4i2YqQkpIjg]}}]
	at __randomizedtesting.SeedInfo.seed([39EA872CCCBDE24A]:0)
	at org.opensearch.index.shard.IndexShard.activateWithPrimaryContext(IndexShard.java:3451)
	at org.opensearch.indices.recovery.RecoveryTarget.handoffPrimaryContext(RecoveryTarget.java:258)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:430)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:424)
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@peternied
Member

New failure: https://build.ci.opensearch.org/job/gradle-check/33963/testReport/

java.lang.AssertionError: Count is 2 hits but 1 was expected. Total shards: 1 Successful shards: 1 & 0 shard failures:

@peternied
Member

@sachinpkale This test seems to be heavily influencing flaky run failures. How can we help get this issue resolved?

sachinpkale added the v2.13.0 label and removed the v2.10.0 label on Feb 26, 2024
@sachinpkale
Member

@sachinpkale This test seems to be heavily influencing flaky run failures. How can we help get this issue resolved?

I will try to prioritize the fix.

@peternied
Member

@sachinpkale @rramachand21 In the past 30 days, this flaky test has impacted pull requests including [#12394, #12383, #12382 (repeated), #12376, #12375 (repeated), #12374 (repeated), #12372, #12368 (repeated), #12367, #12343, #12337 (repeated), #12326 (repeated), #12320 (repeated), #12316, #12301 (repeated), #12293 (repeated), #12290 (repeated), #12278, #12273 (repeated), #12260, #12196, #12193, #12183, #12180, #12168 (repeated), #12154, #12148 (repeated), #12136, #12133 (repeated), #12121, #12111 (repeated)].

Please prioritize fixing this test or disabling the test case until it can be fixed.

sachinpkale moved this from 🆕 New to 🏗 In progress in Storage Project Board on Mar 6, 2024
@sachinpkale
Member

The test is fixed as part of #12494

Projects
Status: ✅ Done
7 participants