[BUG] RemoteStoreStatsIT.testStatsResponseFromAllNodes is flaky #9828

Closed
shourya035 opened this issue Sep 6, 2023 · 4 comments
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)

@shourya035
Member

Describe the bug
RemoteStoreStatsIT.testStatsResponseFromAllNodes is failing
Gradle build: https://build.ci.opensearch.org/job/gradle-check/24535/

The failure seems to come from this assertion helper:

private void assertNonZeroTranslogUploadStatsNoFailures(RemoteTranslogTransferTracker.Stats stats) {
    assertTrue(stats.uploadBytesStarted > 0);
    assertTrue(stats.totalUploadsStarted > 0);

shourya035 added the bug and untriaged labels on Sep 6, 2023
@shourya035
Member Author

Tagging @BhumikaSaini-Amazon

@BhumikaSaini-Amazon
Contributor

BhumikaSaini-Amazon commented Sep 7, 2023

Thanks @shourya035

I tried to repro this locally:

Tests with failures:
 - org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromAllNodes {seed=[BF41DC8F5BDE48FB:EAF82628D3FF804A]}
 - org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromAllNodes {seed=[BF41DC8F5BDE48FB:E70285B0347709C0]}
 - org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromAllNodes {seed=[BF41DC8F5BDE48FB:5EF77F8498B1EB9F]}

200 tests completed, 3 failed

> Task :server:internalClusterTest FAILED

All 3 of the 200 runs that failed did so on the assertion that there are 0 failed segment uploads:

assertEquals(0, stats.totalUploadsFailed);
java.lang.AssertionError: expected:<0> but was:<1>
	at __randomizedtesting.SeedInfo.seed([BF41DC8F5BDE48FB:E70285B0347709C0]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.opensearch.remotestore.RemoteStoreStatsIT.validateSegmentUploadStats(RemoteStoreStatsIT.java:627)
	at org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromAllNodes(RemoteStoreStatsIT.java:81)
	at jdk.internal.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)

I will run more iterations and post an update if I can reproduce the translog (tlog) upload failures.
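
For readers following along, here is a minimal sketch of one way a strict zero-failure assertion like the one above can become flaky. It assumes (this is not confirmed anywhere in this thread) that the failure counter is cumulative and is not reset when a retried upload later succeeds. `Tracker`, `uploadWithRetry`, and `totalUploadsSucceeded` are hypothetical names for illustration only, not the OpenSearch classes.

```java
import java.util.concurrent.atomic.AtomicLong;

public class UploadStatsSketch {

    /** Hypothetical cumulative counters, loosely modeled on the stats fields named above. */
    static final class Tracker {
        final AtomicLong totalUploadsStarted = new AtomicLong();
        final AtomicLong totalUploadsSucceeded = new AtomicLong(); // assumed name
        final AtomicLong totalUploadsFailed = new AtomicLong();
    }

    /** Tries the upload up to maxAttempts times; failures stay counted even if a retry succeeds. */
    static boolean uploadWithRetry(Tracker tracker, int failuresBeforeSuccess, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            tracker.totalUploadsStarted.incrementAndGet();
            if (attempt < failuresBeforeSuccess) {
                tracker.totalUploadsFailed.incrementAndGet(); // simulated transient failure
            } else {
                tracker.totalUploadsSucceeded.incrementAndGet();
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        // One transient failure, then success on the retry.
        boolean uploaded = uploadWithRetry(tracker, 1, 3);
        System.out.println(uploaded + " failed=" + tracker.totalUploadsFailed.get());
    }
}
```

In this sketch the upload eventually succeeds, yet an `assertEquals(0, tracker.totalUploadsFailed.get())` check would still fail, because the single transient failure remains in the counter.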

@ashking94
Member

ashking94 commented Jan 8, 2024

I ran this test for around 1K iterations without a single failure. I then reran it and saw a failure this time.

Assertion failure stack trace:

java.lang.AssertionError: expected:<0> but was:<1>
	at __randomizedtesting.SeedInfo.seed([C2031211D4089003:63E80CD13F4F4C86]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.opensearch.remotestore.RemoteStoreStatsIT.validateSegmentUploadStats(RemoteStoreStatsIT.java:733)
	at org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromAllNodes(RemoteStoreStatsIT.java:91)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

It appears to be caused by the exception below:

[2024-01-09T04:09:11,958][WARN ][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [remote-store-test-idx-1][0] Exception: [java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null] while uploading segment files
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
	at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onSuccess(RemoteStoreRefreshListener.java:539) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$3(RemoteStoreRefreshListener.java:390) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:466) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2024-01-09T04:09:11,965][WARN ][o.o.i.s.RemoteSegmentStoreDirectory] [node_t1] [remote-store-test-idx-1][0] Exception while uploading file segments_3 to the remote segment store
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
	at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onFailure(RemoteStoreRefreshListener.java:547) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$5(RemoteStoreRefreshListener.java:397) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:466) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2024-01-09T04:09:11,966][WARN ][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [remote-store-test-idx-1][0] Exception: [java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null] while uploading segment files
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
	at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onFailure(RemoteStoreRefreshListener.java:547) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$5(RemoteStoreRefreshListener.java:397) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:466) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2024-01-09T04:09:11,968][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [remote-store-test-idx-1][0] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
	at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onFailure(RemoteStoreRefreshListener.java:547) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$5(RemoteStoreRefreshListener.java:397) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:470) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) ~[main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2024-01-09T04:09:11,969][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [remote-store-test-idx-1][0] Scheduled retry with didRefresh=true
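
For context on the message in the log above: `Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null` is what the JVM reports when a `Map<..., Long>` lookup misses and the result is auto-unboxed to a primitive `long`. The traces also show the exception first thrown in `onSuccess`, then routed to `onFailure`, which throws the same exception again. The sketch below reproduces that general pattern in isolation. All names in it (`UploadListener`, `notifySuccess`, `unsafeSize`, `safeSize`) are hypothetical, and the null-safe variant is only an illustration, not the fix tracked for the real code.

```java
import java.util.HashMap;
import java.util.Map;

public class UnboxingNpeSketch {

    /** Hypothetical stand-in for a listener with success and failure callbacks. */
    interface UploadListener {
        void onSuccess(String file);
        void onFailure(String file);
    }

    /** Mimics a wrapper that routes an exception thrown by onSuccess to onFailure. */
    static void notifySuccess(UploadListener listener, String file) {
        try {
            listener.onSuccess(file);
        } catch (RuntimeException e) {
            listener.onFailure(file);
        }
    }

    /** Unboxes the map value directly; throws NullPointerException if the key is absent. */
    static long unsafeSize(Map<String, Long> sizes, String file) {
        return sizes.get(file); // auto-unboxing calls Long.longValue() on null
    }

    /** Null-safe variant: treats a missing entry as size 0 instead of throwing. */
    static long safeSize(Map<String, Long> sizes, String file) {
        return sizes.getOrDefault(file, 0L);
    }

    /** Describes what happens when both callbacks read a key that is missing from the map. */
    static String demo() {
        Map<String, Long> sizes = new HashMap<>(); // no entry for "segments_3"
        UploadListener listener = new UploadListener() {
            public void onSuccess(String file) { unsafeSize(sizes, file); }
            public void onFailure(String file) { unsafeSize(sizes, file); }
        };
        try {
            notifySuccess(listener, "segments_3");
            return "no exception";
        } catch (NullPointerException e) {
            return "NPE escaped from onFailure";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
        System.out.println(safeSize(new HashMap<>(), "segments_3"));
    }
}
```

Whether a missing entry should be treated as 0, skipped, or prevented in the first place depends on why the entry is absent, which this sketch does not try to answer.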

@ashking94
Member

I have root-caused this issue: it is the same as #9774. @linuxpi is already working on a fix there, so I am closing this issue.
