[Segment Replication] [BUG] No such file exception due to missing index file in get_checkpoint_info #4310

Closed
dreamer-89 opened this issue Aug 26, 2022 · 1 comment
Labels
bug (Something isn't working), distributed framework

Comments

dreamer-89 commented Aug 26, 2022

Describe the bug

On the primary, while building the CopyState object as part of the get_checkpoint_info transport response, a missing index file in the local store causes a NoSuchFileException. This is reproducible on main, which already contains the #4288 fix.

It looks like the indexShard object built from IndexService is outdated: it still references _9.cfe, but the latest in-memory SegmentInfos fetched using getLatestSegmentInfos() in InternalEngine no longer contains that file, and by the time the metadata is read the file is long gone from the shard store. The logs below (which print the SegmentInfos files from memory and from the on-disk store) show that the in-memory copy once held _9.cfe, but a later operation (index refresh/commit?) removed all of these files and built a new set of files. It would be interesting to know why indexShard (built from indexService) is not up to date with the in-memory state of SegmentInfos.
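The suspected failure mode can be illustrated outside OpenSearch with plain java.nio. The following is a minimal sketch, not the OpenSearch code path: the class name StaleFileListDemo and the helper firstMissingFile are hypothetical, and opening a byte channel stands in for the checksum read that Store.MetadataSnapshot performs in the stack trace. It assumes the problem is a file listing captured before the files were deleted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class: shows how a stale file listing leads to NoSuchFileException.
class StaleFileListDemo {

    /**
     * Returns the name of the first listed file that can no longer be opened,
     * or null if every file in the (possibly stale) listing is still present.
     */
    static String firstMissingFile(Path dir, List<String> listedFiles) throws IOException {
        for (String name : listedFiles) {
            try {
                // Stand-in for reading a checksum from the file; only opens and closes it.
                Files.newByteChannel(dir.resolve(name)).close();
            } catch (NoSuchFileException e) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stale-demo");
        Files.write(dir.resolve("_9.cfe"), new byte[] {1});
        Files.write(dir.resolve("_9.cfs"), new byte[] {2});

        // Step 1: capture a file listing (analogous to holding an old SegmentInfos).
        List<String> listing = List.of("_9.cfe", "_9.cfs");

        // Step 2: something else (e.g. a merge after a refresh) deletes an old file.
        Files.delete(dir.resolve("_9.cfe"));

        // Step 3: metadata is computed from the stale listing and hits the missing file.
        System.out.println("missing: " + firstMissingFile(dir, listing));
    }
}
```

If this is what happens on the primary, the listing and the file reads need to come from the same consistent view, or the listed files need to be protected from deletion while they are being read.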

Log traces

[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.cfs
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.cfe
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _1.si
[2022-08-26T15:47:00,116][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _0.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _3.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.cfe
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.cfs
[2022-08-26T15:47:00,117][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _4.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _5.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _6.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _7.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.cfs
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _9.si
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.cfe
[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.cfs
[2022-08-26T15:47:00,119][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _8.si
[2022-08-26T15:47:00,129][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos On-disk -------------------------
[2022-08-26T15:47:00,129][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,135][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,136][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,137][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,137][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,140][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos On-disk -------------------------
[2022-08-26T15:47:00,140][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,142][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,143][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,144][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,148][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,148][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ---------------------- printing files in SegmentInfos -------------------------
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> segments_2
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.si
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.pos
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tmd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.nvm
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvd
[2022-08-26T15:47:00,159][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fnm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdi
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.dvm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.doc
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdd
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdx
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tim
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b_Lucene90_0.tip
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.fdt
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _b.kdm
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.si
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfe
[2022-08-26T15:47:00,160][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> _2.cfs
[2022-08-26T15:47:00,160][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t6] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:251) [main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [main/:?]
	at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node_t1][127.0.0.1:57189][internal:index/shard/replication/get_checkpoint_info]
Caused by: java.nio.file.NoSuchFileException: /Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_15BF646F38F371F3-001/tempDir-009/node_t1-shared/mkDuAoQiFK/0/wWPbk5ETT5-ik2ZIwqmjLg/0/index/_9.cfe
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:204) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.apache.lucene.tests.mockfile.DisableFsyncFS.newFileChannel(DisableFsyncFS.java:44) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:204) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171) ~[lucene-test-framework-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
	at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
	at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166) ~[main/:?]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.4.0-snapshot-ddf0d0a.jar:9.4.0-snapshot-ddf0d0a ddf0d0acf4e4443ddea37bb855dead7bed5cc1a2 - runner - 2022-08-09 21:10:22]
	at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1092) ~[main/:?]
	at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1064) ~[main/:?]
	at org.opensearch.index.store.Store$MetadataSnapshot.<init>(Store.java:941) ~[main/:?]
	at org.opensearch.index.store.Store.getMetadata(Store.java:334) ~[main/:?]
	at org.opensearch.indices.replication.common.CopyState.<init>(CopyState.java:52) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:81) ~[main/:?]
	at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:140) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:107) ~[main/:?]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:88) ~[main/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[main/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
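To make the traces above easier to read, the debug output can be grouped into one file set per "printing files in SegmentInfos" header, and then searched for the last group that still lists a given file. This is a rough sketch that assumes the log format shown in this issue (file names after "--> "); the class SegmentLogSnapshots and its methods are hypothetical helpers, not part of OpenSearch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for reading the debug log lines from this issue.
class SegmentLogSnapshots {

    /** Splits log lines into one file-name set per "printing files in SegmentInfos" header. */
    static List<Set<String>> parse(List<String> logLines) {
        List<Set<String>> snapshots = new ArrayList<>();
        Set<String> current = null;
        for (String line : logLines) {
            int arrow = line.indexOf("--> ");
            if (arrow < 0) {
                continue; // not a debug line (for example a stack-trace frame)
            }
            String payload = line.substring(arrow + 4).trim();
            if (payload.contains("printing files in SegmentInfos")) {
                // A header line starts a new snapshot (in-memory or on-disk).
                current = new LinkedHashSet<>();
                snapshots.add(current);
            } else if (current != null) {
                current.add(payload);
            }
        }
        return snapshots;
    }

    /** Index of the last snapshot that lists the file, or -1 if none does. */
    static int lastSnapshotContaining(List<Set<String>> snapshots, String fileName) {
        for (int i = snapshots.size() - 1; i >= 0; i--) {
            if (snapshots.get(i).contains(fileName)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String p = "[2022-08-26T15:47:00,118][INFO ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] --> ";
        List<String> lines = List.of(
            p + "---------------------- printing files in SegmentInfos -------------------------",
            p + "segments_2",
            p + "_9.cfe",
            p + "---------------------- printing files in SegmentInfos On-disk -------------------------",
            p + "segments_2",
            p + "---------------------- printing files in SegmentInfos -------------------------",
            p + "segments_2",
            p + "_b.si"
        );
        List<Set<String>> snapshots = parse(lines);
        System.out.println("snapshots: " + snapshots.size());
        System.out.println("last snapshot listing _9.cfe: " + lastSnapshotContaining(snapshots, "_9.cfe"));
    }
}
```

Applied to the full trace, this would show that _9.cfe appears only in the first in-memory listing and in none of the later ones, which matches the description above.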

Reproduce

Run the test below repeatedly:

    public void testDropPrimaryDuringReplication() throws Exception {
        final Settings settings = Settings.builder()
            .put(indexSettings())
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
            .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
            .build();
        final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode();
        final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
        createIndex(INDEX_NAME, settings);
        internalCluster().startDataOnlyNodes(6);

        int initialDocCount = scaledRandomIntBetween(100, 200);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            // don't wait for replication to complete, stop the primary immediately.
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
            ensureYellow(INDEX_NAME);

            // start another replica.
            internalCluster().startDataOnlyNode();
            ensureGreen(INDEX_NAME);

            // index another doc and refresh - without this the new replica won't catch up.
            client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").get();

            flushAndRefresh(INDEX_NAME);
            waitForReplicaUpdate();
            assertSegmentStats(6);
        }
    }

Host/Environment (please complete the following information):

  • OS: macOS

Note: This is different from #4178, where a FileNotFoundException happens due to a missing segments_N file.

dreamer-89 added the bug and untriaged labels on Aug 26, 2022
dreamer-89 changed the title from "[Segment Replicatin] [BUG] No such file exception due to missing index file in get_checkpoint_info" to "[Segment Replication] [BUG] No such file exception due to missing index file in get_checkpoint_info" on Aug 26, 2022

mch2 commented Nov 8, 2022

Closing - this was fixed with #4366
