Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flaky org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromLocalNode test #9774

Closed
ashking94 opened this issue Sep 5, 2023 · 6 comments
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Storage:Durability Issues and PRs related to the durability framework v2.11.0 Issues and PRs related to version 2.11.0

Comments

@ashking94
Copy link
Member

Describe the bug
org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromLocalNode is flaky.

To Reproduce

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.RemoteStoreStatsIT.testStatsResponseFromLocalNode" -Dtests.seed=9A2A91FCF5675AF6 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-NI -Dtests.timezone=Antarctica/McMurdo -Druntime.java=20

Note - I have not verified if this reproducible.

Expected behavior
Test should always pass.

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
CI - https://build.ci.opensearch.org/job/gradle-check/24470/

@ashking94 ashking94 added bug Something isn't working Storage:Durability Issues and PRs related to the durability framework v2.10.0 labels Sep 5, 2023
@kotwanikunal kotwanikunal added the flaky-test Random test failure that succeeds on second run label Sep 19, 2023
@sachinpkale sachinpkale added v2.11.0 Issues and PRs related to version 2.11.0 and removed v2.10.0 labels Sep 23, 2023
@ashking94
Copy link
Member Author

#11361 (comment)

@linuxpi
Copy link
Collaborator

linuxpi commented Jan 12, 2024

1> [2024-01-12T14:08:06,423][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [remote-store-test-idx-1][1] Scheduled retry with didRefresh=true
  1> [2024-01-12T14:08:06,424][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [remote-store-test-idx-1][2] Scheduled retry with didRefresh=true
  1> [2024-01-12T14:08:06,668][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA]
  1> [2024-01-12T14:08:06,676][INFO ][o.o.c.m.MetadataMappingService] [node_t0] [remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA] create_mapping
  1> [2024-01-12T14:08:06,684][INFO ][o.o.c.m.MetadataMappingService] [node_t0] [remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA] update_mapping [_doc]
  1> [2024-01-12T14:08:06,734][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA]
  1> [2024-01-12T14:08:06,751][INFO ][o.o.c.m.MetadataMappingService] [node_t0] [remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA] update_mapping [_doc]
  1> [2024-01-12T14:08:06,819][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA]
  1> [2024-01-12T14:08:06,840][INFO ][o.o.c.m.MetadataMappingService] [node_t0] [remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA] update_mapping [_doc]
  1> [2024-01-12T14:08:06,904][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA]
  1> [2024-01-12T14:08:06,931][INFO ][o.o.c.m.MetadataMappingService] [node_t0] [remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA] update_mapping [_doc]
  1> [2024-01-12T14:08:06,977][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[remote-store-test-idx-1/yVIhXsaZQOW265TzA7UerA]
  1> [2024-01-12T14:08:07,190][WARN ][o.o.i.s.RemoteStoreRefreshListener] [node_t0] [remote-store-test-idx-1][0] Exception: [java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValu
e()" because the return value of "java.util.Map.get(Object)" is null] while uploading segment files
  1> java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onSuccess(RemoteStoreRefreshListener.java:539) ~[main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$3(RemoteStoreRefreshListener.java:390) ~[main/:?]
  1>    at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1>    at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:466) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
  1>    at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
  1>    at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
  1>    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1>    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
  1> [2024-01-12T14:08:07,202][WARN ][o.o.i.s.RemoteSegmentStoreDirectory] [node_t0] [remote-store-test-idx-1][0] Exception while uploading file segments_3 to the remote segment store
  1> java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener$2.onFailure(RemoteStoreRefreshListener.java:547) ~[main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.lambda$uploadNewSegments$5(RemoteStoreRefreshListener.java:397) ~[main/:?]
  1>    at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1>    at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1>    at org.opensearch.index.store.RemoteSegmentStoreDirectory.copyFrom(RemoteSegmentStoreDirectory.java:466) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:401) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [main/:?]
  1>    at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [main/:?]
  1>    at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [main/:?]
  1>    at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:126) [main/:?]
  1>    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:852) [main/:?]
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1>    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

latest failure logs from my machine.

Looks like the local segment map is changing b/w beforeUpload and upload completion of the file

}
statsListener.onFailure(src);
batchUploadListener.onFailure(ex);
});
statsListener.beforeUpload(src);
remoteDirectory.copyFrom(storeDirectory, src, IOContext.DEFAULT, aggregatedListener);
}
}

will enable debug logs and repro again

@linuxpi
Copy link
Collaborator

linuxpi commented Jan 15, 2024

This bug was reported previously as well #11025

@linuxpi
Copy link
Collaborator

linuxpi commented Jan 15, 2024

important TRACE and DEBUG logs from local repro in integ tests

[2024-01-15T12:51:03,033][TRACE][o.o.i.e.E.IW             ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][clusterApplierService#updateTask][T#1] IW: commit: done writing segments file "segments_3"
[2024-01-15T12:51:03,039][TRACE][o.o.i.e.E.IFD            ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][clusterApplierService#updateTask][T#1] IFD: now delete 1 files: [segments_2]


[2024-01-15T12:51:03,928][TRACE][o.o.i.e.E.IW             ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][flush][T#1] IW: commit: done writing segments file "segments_4"


[2024-01-15T12:51:03,949][DEBUG][o.o.i.r.RemoteSegmentTransferTracker] [node_t2] [remote-store-test-idx-1][0] segmentFilesPostRefresh=[segments_3] latestLocalFileNamesBeforeMapUpdate=[]

[2024-01-15T12:51:03,953][TRACE][o.o.i.e.E.DW             ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][flush][T#1] DW: opensearch[node_t2][flush][T#1] finishFullFlush success=true
[2024-01-15T12:51:03,953][TRACE][o.o.i.e.E.IW             ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][flush][T#1] IW: getReader took 23 ms


[2024-01-15T12:51:03,958][DEBUG][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [remote-store-test-idx-1][0] Effective new segments files to upload [segments_3]

[2024-01-15T12:51:03,962][DEBUG][o.o.i.r.RemoteSegmentTransferTracker] [node_t2] [remote-store-test-idx-1][0] segmentFilesPostRefresh=[_0.cfe, _0.si, _0.cfs, segments_4] latestLocalFileNamesBeforeMapUpdate=[segments_3]

[2024-01-15T12:51:03,966][WARN ][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [remote-store-test-idx-1][0] Exception: [java.lang.NullPointerException: Cannot invoke "java.lang.Long.longValue()" because the return value of "java.util.Map.get(Object)" is null] while uploading segment files


[2024-01-15T12:51:03,993][TRACE][o.o.i.e.E.IFD            ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][flush][T#1] IFD: deleteCommits: now decRef commit "segments_3"
[2024-01-15T12:51:03,993][TRACE][o.o.i.e.E.IFD            ] [node_t2] [remote-store-test-idx-1][0] opensearch[node_t2][flush][T#1] IFD: now delete 1 files: [segments_3]

@linuxpi
Copy link
Collaborator

linuxpi commented Jan 15, 2024

Found the issue. Details listed here - #11025 (comment)

Will raise a PR for the fix

@linuxpi
Copy link
Collaborator

linuxpi commented Feb 5, 2024

Closing as fix is merged

@linuxpi linuxpi closed this as completed Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Storage:Durability Issues and PRs related to the durability framework v2.11.0 Issues and PRs related to version 2.11.0
Projects
None yet
Development

No branches or pull requests

5 participants