[BUG] RemoteDirectory copyFrom leading to File handler leak #7687

Closed
ankitkala opened this issue May 23, 2023 · 2 comments
Labels: bug, Storage:Durability, v2.8.0

Comments

@ankitkala
Member

Describe the bug
While running SegmentReplicationIT with remote store integration, I observed the following stacktrace:

SegmentReplicationUsingRemoteStoreIT#testIndexReopenClose]: cleaned up after test
  1> [2023-05-22T11:12:35,113][INFO ][o.o.r.SegmentReplicationUsingRemoteStoreIT] [testIndexReopenClose] after test
  2> java.lang.RuntimeException: file handle leaks: [InputStream(/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT_26D9A9B1B6E194FE-001/tempDir-005/repos/HkoVSnaMrz/u02fjWEgTN2tMRShFOP_5w/0/segments/data/segment_infos_snapshot_filename__15__EfInQ4gBDJnF5Qxb8N8J)]
        at __randomizedtesting.SeedInfo.seed([26D9A9B1B6E194FE]:0)
        at org.apache.lucene.tests.mockfile.LeakFS.onClose(LeakFS.java:63)
        at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:69)
        at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:70)
        at org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:223)
        at com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.****/java.lang.Thread.run(Thread.java:1589)

        Caused by:
        java.lang.Exception
            at org.apache.lucene.tests.mockfile.LeakFS.onOpen(LeakFS.java:46)
            at org.apache.lucene.tests.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:82)
            at org.apache.lucene.tests.mockfile.HandleTrackingFS.newInputStream(HandleTrackingFS.java:125)
            at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193)
            at org.apache.lucene.tests.mockfile.HandleTrackingFS.newInputStream(HandleTrackingFS.java:94)
            at java.****/java.nio.file.Files.newInputStream(Files.java:160)
            at org.opensearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:170)
            at org.opensearch.index.store.RemoteDirectory.openInput(RemoteDirectory.java:103)
            at org.opensearch.index.store.RemoteSegmentStoreDirectory.openInput(RemoteSegmentStoreDirectory.java:326)
            at org.apache.lucene.store.Directory.copyFrom(Directory.java:180)
            at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4516)
            at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:89)
            at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:212)
            at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170)
            at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
            at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341)
            at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120)
            at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82)
            at org.opensearch.action.StepListener.whenComplete(StepListener.java:93)
            at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:170)
            at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:372)
            at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:360)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747)
            at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            ... 1 more

Reference: https://build.ci.opensearch.org/job/gradle-check/15842/consoleFull

To Reproduce
Run this test:
./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT"

@ankitkala added the bug and untriaged labels on May 23, 2023
@ankitkala
Member Author

RemoteDirectory inherits the copyFrom method from Directory, which doesn't close the input and output streams after the copy is done. I suspect that this could be the culprit.

@sachinpkale added the Storage:Durability and v2.8.0 labels and removed the untriaged label on May 23, 2023
@sachinpkale self-assigned this on May 23, 2023
@ankitkala
Member Author

[2023-05-22T11:11:23,804][ERROR][o.o.i.r.PrimaryShardReplicationSource] [node_t2] Failed to sync segments
  1> org.opensearch.index.shard.IndexShardRecoveryException: Exception while copying segment files from remote segment store
  1> 	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4581) ~[main/:?]
  1> 	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getSegmentFiles(RemoteStoreReplicationSource.java:89) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:212) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:170) [main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [main/:?]
  1> 	at org.opensearch.action.StepListener.whenComplete(StepListener.java:93) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:170) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:372) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:360) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:1589) [?:?]
  1> Caused by: java.nio.file.NoSuchFileException: segment_infos_snapshot_filename__15__EfInQ4gBDJnF5Qxb8N8J
  1> 	at org.opensearch.index.store.RemoteDirectory.fileLength(RemoteDirectory.java:129) ~[main/:?]
  1> 	at org.opensearch.index.store.RemoteDirectory.openInput(RemoteDirectory.java:103) ~[main/:?]
  1> 	at org.opensearch.index.store.RemoteSegmentStoreDirectory.openInput(RemoteSegmentStoreDirectory.java:326) ~[main/:?]
  1> 	at org.apache.lucene.store.Directory.copyFrom(Directory.java:180) ~[lucene-core-9.7.0-snapshot-4d1ed9e.jar:9.7.0-snapshot-4d1ed9e 4d1ed9ef9f69ebd032538ff4324fe8f6c8356f9a - 2023-05-19 14:51:47]
  1> 	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4516) ~[main/:?]
  1> 	... 17 more

The issue happens while reading the segmentInfoSnapshot file from the remote store. Since the primary is continuously writing to the remote store, we run into a situation where we're able to acquire the InputStream from the blob container, but the fileLength method then fails with NoSuchFileException. As a result, RemoteIndexInput doesn't get instantiated and the InputStream is never closed. We'll need to add exception handling to ensure that the stream is always closed.
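
For illustration, here is a minimal, self-contained sketch of the leak pattern described above and one possible way to guard against it. It is not the actual RemoteDirectory code or the change made in the fix: BlobSource, WrappedInput, RacySource, TrackingStream and the file name "segments_1" are hypothetical stand-ins, and only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.NoSuchFileException;

class OpenInputSketch {

    /** Hypothetical stand-in for a blob container; NOT the OpenSearch API. */
    interface BlobSource {
        InputStream readBlob(String name) throws IOException;
        long fileLength(String name) throws IOException;
    }

    /** Stream that records whether close() was called, so a leak is observable. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;
        TrackingStream(byte[] data) { super(data); }
        @Override public void close() throws IOException { closed = true; super.close(); }
    }

    /** Hypothetical wrapper that takes ownership of the stream once constructed. */
    static final class WrappedInput implements AutoCloseable {
        final InputStream in;
        final long length;
        WrappedInput(InputStream in, long length) { this.in = in; this.length = length; }
        @Override public void close() throws IOException { in.close(); }
    }

    /** Leaky variant: if fileLength throws, the already-opened stream has no owner. */
    static WrappedInput openInputLeaky(BlobSource source, String name) throws IOException {
        InputStream in = source.readBlob(name);
        return new WrappedInput(in, source.fileLength(name));
    }

    /** Guarded variant: close the stream before propagating any failure. */
    static WrappedInput openInputSafe(BlobSource source, String name) throws IOException {
        InputStream in = source.readBlob(name);
        try {
            return new WrappedInput(in, source.fileLength(name));
        } catch (Exception e) {
            try {
                in.close();
            } catch (IOException closeError) {
                e.addSuppressed(closeError); // keep the original failure as the primary one
            }
            throw e;
        }
    }

    /** Simulates the race: the blob can be opened, but the length lookup fails. */
    static final class RacySource implements BlobSource {
        final TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        @Override public InputStream readBlob(String name) { return stream; }
        @Override public long fileLength(String name) throws IOException {
            throw new NoSuchFileException(name);
        }
    }

    /** Returns true if the stream was closed after the simulated failure. */
    static boolean closedAfterFailure(boolean safe) {
        RacySource source = new RacySource();
        try {
            if (safe) {
                openInputSafe(source, "segments_1");
            } else {
                openInputLeaky(source, "segments_1");
            }
        } catch (IOException expected) {
            // NoSuchFileException is the simulated failure; nothing to do here.
        }
        return source.stream.closed;
    }

    public static void main(String[] args) {
        System.out.println("leaky variant closed the stream: " + closedAfterFailure(false));
        System.out.println("safe variant closed the stream: " + closedAfterFailure(true));
    }
}
```

In this sketch the guarded variant closes the stream only on the failure path, because on success the wrapper owns the stream and is expected to close it later. Looking up the length before opening the stream would be another option, at the cost of a different race window.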

Fixing as part of #7653
