[BUG] [Segment Replication] Flaky engine AlreadyClosedException exception on closed index action #4530

Closed
dreamer-89 opened this issue Sep 16, 2022 · 0 comments · Fixed by #4743
Labels: bug (Something isn't working)

Comments

@dreamer-89 (Member)

Describe the bug
An AlreadyClosedException is thrown from the engine on the replica shard when the index is closed. The issue occurs while replication is being finalized on the replica: if the last received translog generation differs from the local one, the translog generation is rolled over. Rolling the generation requires the underlying engine to be open, which intermittently fails when the index has been closed concurrently.

[2022-09-15T17:40:14,911][WARN ][o.o.i.e.Engine           ] [node_t1] [test-idx-1][0] failed engine [translog roll generation failed]
org.apache.lucene.store.AlreadyClosedException: [test-idx-1][0] engine is closed
	at org.opensearch.index.engine.Engine.ensureOpen(Engine.java:821) ~[main/:?]
	at org.opensearch.index.engine.Engine.ensureOpen(Engine.java:830) ~[main/:?]
	at org.opensearch.index.translog.InternalTranslogManager.rollTranslogGeneration(InternalTranslogManager.java:85) [main/:?]
	at org.opensearch.index.engine.NRTReplicationEngine.updateSegments(NRTReplicationEngine.java:134) [main/:?]
	at org.opensearch.index.shard.IndexShard.finalizeReplication(IndexShard.java:1382) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$5(SegmentReplicationTarget.java:219) [main/:?]
	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:205) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:169) [main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) [main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
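
For context, the failing path can be sketched as below. Class and method names mirror the stack trace above, but the bodies and signatures are simplified assumptions for illustration, not the actual OpenSearch sources.

    import org.apache.lucene.store.AlreadyClosedException;

    // Simplified reconstruction of the failing path from the stack trace.
    class NRTReplicationEngineSketch {
        private volatile boolean closed = false; // flipped when the index is closed

        // Engine.ensureOpen: throws once the engine has been closed.
        void ensureOpen() {
            if (closed) {
                throw new AlreadyClosedException("engine is closed");
            }
        }

        // NRTReplicationEngine.updateSegments: called from
        // IndexShard.finalizeReplication when a replication event completes.
        void updateSegments(long lastReceivedGen, long localGen) {
            if (lastReceivedGen != localGen) {
                rollTranslogGeneration();
            }
        }

        // InternalTranslogManager.rollTranslogGeneration: requires an open
        // engine, so an index close racing with replication finalization
        // throws here, which in turn fails the engine.
        void rollTranslogGeneration() {
            ensureOpen();
            // ... roll the translog generation ...
        }
    }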

To Reproduce
Steps to reproduce the behavior:
Run the integration test below repeatedly; the failure reproduces fairly consistently.

    public void testIndexReopenClose() throws Exception {
        final String primary = internalCluster().startNode();
        createIndex(INDEX_NAME);
        ensureYellowAndNoInitializingShards(INDEX_NAME);
        final String replica = internalCluster().startNode();
        ensureGreen(INDEX_NAME);

        client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).get();
        refresh(INDEX_NAME);

        final int initialDocCount = scaledRandomIntBetween(10000, 200000);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            waitForReplicaUpdate();
        }

        flushAndRefresh(INDEX_NAME);
        waitForReplicaUpdate();

        logger.info("--> Closing the index ");
        client().admin().indices().prepareClose(INDEX_NAME).get();

        // Add another node to kick off TransportNodesListGatewayStartedShards which fetches latestReplicationCheckpoint for SegRep enabled indices
        final String replica2 = internalCluster().startNode();

        logger.info("--> Opening the index");
        client().admin().indices().prepareOpen(INDEX_NAME).get();
    }

Expected behavior
The NRTReplicationEngine should not be failed when a translog generation rollover is attempted against an engine that has already been closed.
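
One possible mitigation would be to treat an AlreadyClosedException thrown during the rollover as benign rather than failing the engine. The sketch below is illustrative only; it is not necessarily the approach taken in the linked fix (#4743), and TranslogManager here is a hypothetical stand-in for the manager in the stack trace.

    import org.apache.lucene.store.AlreadyClosedException;

    class TolerantRollover {
        // Hypothetical stand-in for InternalTranslogManager.
        interface TranslogManager {
            void rollTranslogGeneration();
        }

        // Rolls the translog generation, treating a concurrent engine close
        // as benign instead of letting it fail the engine.
        static void rollTranslogGenerationSafely(TranslogManager manager) {
            try {
                manager.rollTranslogGeneration();
            } catch (AlreadyClosedException e) {
                // The index was closed while replication was finalizing:
                // nothing is left to roll over, and the engine should not be
                // marked as failed.
            }
        }
    }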

Host/Environment (please complete the following information):

  • OS: iOS
  • Version: Latest changes on main