
[CI] FrozenSearchableSnapshotsIntegTests classMethod failing #77017

Closed
benwtrent opened this issue Aug 30, 2021 · 9 comments · Fixed by #77783
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

benwtrent (Member) commented Aug 30, 2021

Very similar to a previous test failure: #75686

Build scan:
https://gradle-enterprise.elastic.co/s/bgh2wyeox4s52/tests/:x-pack:plugin:searchable-snapshots:internalClusterTest/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests/classMethod

Reproduction line:
./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests" -Dtests.seed=43D8141D2BF9B196 -Dtests.locale=en-US -Dtests.timezone=UTC -Druntime.java=11

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests&tests.test=classMethod

Failure excerpt:

java.lang.RuntimeException: file handle leaks: [FileChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/oraclelinux-8&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_43D8141D2BF9B196-001/tempDir-002/node_s2/indices/ryYrOlKMSpeuTI_3x-NpyA/0/snapshot_cache/ApsFMVSNSfySlTusg6ZA9g/e0lQKKZZSxyPbPSv-8rUAQ), FileChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/oraclelinux-8&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_43D8141D2BF9B196-001/tempDir-002/node_s2/indices/ryYrOlKMSpeuTI_3x-NpyA/1/snapshot_cache/ApsFMVSNSfySlTusg6ZA9g/H_OJBgzhSKignz0VELdrcQ)]

  at __randomizedtesting.SeedInfo.seed([43D8141D2BF9B196]:0)
  at org.apache.lucene.mockfile.LeakFS.onClose(LeakFS.java:63)
  at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:77)
  at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:78)
  at org.apache.lucene.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:228)
  at com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

  Caused by: java.lang.Exception: (No message provided)

    at org.apache.lucene.mockfile.LeakFS.onOpen(LeakFS.java:46)
    at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81)
    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:197)
    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:166)
    at java.nio.channels.FileChannel.open(FileChannel.java:292)
    at java.nio.channels.FileChannel.open(FileChannel.java:345)
    at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile$FileChannelReference.<init>(CacheFile.java:119)
    at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.acquire(CacheFile.java:200)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput$CacheFileReference.get(MetadataCachingIndexInput.java:466)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readWithBlobCache(MetadataCachingIndexInput.java:124)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.doReadInternal(MetadataCachingIndexInput.java:106)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.BaseSearchableSnapshotIndexInput.readInternal(BaseSearchableSnapshotIndexInput.java:112)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:315)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:56)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:102)
    at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:173)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
    at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
    at org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:79)
    at org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:71)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:668)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:496)
    at org.elasticsearch.index.store.Store.checkIndex(Store.java:333)
    at org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:2647)
    at org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:2591)
    at org.elasticsearch.index.shard.IndexShard.maybeCheckIndex(IndexShard.java:2581)
    at org.elasticsearch.index.shard.IndexShard.recoverLocallyUpToGlobalCheckpoint(IndexShard.java:1488)
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:191)
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:552)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

benwtrent added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Aug 30, 2021
elasticmachine added the Team:Distributed (Obsolete) label Aug 30, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this Aug 31, 2021
original-brownbear (Member) commented:

This is the same problem as in #75686 but from a different angle; it seems we'll have to add another wait-for-released step somewhere in the node shutdown logic. On it.

tlrx (Member) commented Sep 14, 2021

We used to always avoid checking the index on startup in searchable snapshot tests, but I changed this in #74235, making this failure more likely to happen.

I suspect that the recovery, with its index check, is running on the generic snapshot thread pool while the data node is waiting for the shard to be unlocked after the integrity check so that it can close all resources (directory, store, IndexShard) and process pending shard deletes... and since the node does not wait for pending deletes to be processed, maybe it hard-closes the resources and detects file leaks (?)
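To show what turning on the startup check looks like in practice, here is a hedged sketch (illustrative only, not the actual change in #74235) of a test sometimes enabling `index.shard.check_on_startup`, the setting that makes recovery run Lucene's CheckIndex before the shard is opened:

```java
// Hedged sketch (illustrative, not the actual test code): sometimes enable the
// startup integrity check so that recovery verifies the snapshot-backed directory
// before the shard is opened.
Settings.Builder indexSettings = Settings.builder()
    .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
    .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0);
if (randomBoolean()) {
    // "checksum" verifies file checksums during recovery, before the engine is created
    indexSettings.put("index.shard.check_on_startup", "checksum");
}
// indexSettings would then be used when creating the test index
```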

original-brownbear (Member) commented:

@tlrx I think the check at startup may be part of the problem here, but it seems that we are somehow leaking on the recovery target end during shutdown (not correctly waiting for everything to be released on the target before closing the target service). There are a few obvious spots where that could be the case, but I couldn't reproduce this manually so far.
I also wonder if this hasn't already been fixed by #66834, which avoids a few spots of forking that could lead to a leak like this.

tlrx (Member) commented Sep 15, 2021

@original-brownbear I suspect the check on startup is executed on the Store instance, acquiring a lock, but before the engine is created and the reference to it is set in the IndexShard. So when the IndexShard is closed, because a recovery is cancelled or the shard is deleted or assigned to another node, there is still a store instance somewhere waiting for the check to finish before releasing the lock. And I don't see anything in IndexService & Co that waits for the check to finish.

(sorry - I'm not very clear, I'm still investigating)
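To make the suspected interaction concrete, here is a minimal sketch (assumed shapes, not the actual IndexShard/Store code) of why a pending startup check keeps files open: the checking thread holds a store reference, and nothing on the close path waits for it to finish.

```java
// Illustrative only; runStartupCheck is a hypothetical stand-in for the shard's
// check-index-on-startup step. Store is ref-counted in Elasticsearch.
store.incRef();                  // files stay open while this reference is held
try {
    runStartupCheck(store);      // may block waiting on the searchable snapshot cache thread pool
} finally {
    store.decRef();              // the directory and its cache files can only be fully
                                 // released after this, but shard close does not wait for it
}
```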

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 15, 2021
Don't fork when already shutting down, as the generic pool may silently fail to execute tasks queued up on it.
Also, when handling exceptions in the peer recovery target service, don't try to re-schedule recoveries when the node is already shutting down; fail right away no matter the exception.

closes elastic#77017
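A self-contained sketch of the pattern this commit describes (plain java.util.concurrent, illustrative names, not the actual PeerRecoveryTargetService code): on a retryable error, only re-schedule the recovery while the service is running, and fail it immediately during shutdown instead of handing it to a pool that may silently drop it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch, not Elasticsearch code.
final class RetryOnlyWhileRunning {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean shuttingDown = false;

    // retryTask re-attempts the recovery; failNow releases its resources and marks it failed.
    void onRecoveryFailure(Runnable retryTask, Runnable failNow, long delayMillis) {
        if (shuttingDown || scheduler.isShutdown()) {
            failNow.run();   // fail right away, no matter the exception
            return;
        }
        try {
            scheduler.schedule(retryTask, delayMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            failNow.run();   // raced with shutdown: still fail instead of leaving resources open
        }
    }

    void shutdown() {
        shuttingDown = true;
        scheduler.shutdown();
    }
}
```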
original-brownbear added a commit that referenced this issue Sep 15, 2021 (same commit message as above)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 15, 2021: …ic#77783) (same commit message as above)
original-brownbear added a commit that referenced this issue Sep 15, 2021: … (#77794) (same commit message as above)
pgomulka (Contributor) commented:

It looks like the problem still persists; it failed 4 times this week.

There was a failure in master:
https://gradle-enterprise.elastic.co/s/bfgpj2dxjrdwu

pgomulka reopened this Nov 19, 2021
tlrx self-assigned this Nov 24, 2021
martijnvg (Member) commented:

I also ran into this test failure when backporting a change to the 8.0 branch: https://gradle-enterprise.elastic.co/s/5khiqxr5jghvs#tests

tlrx (Member) commented Dec 17, 2021

I investigated this failure and #77178 and identified a situation where some threads (in this case, the Lucene 9 threads that verify the index during recovery) can be blocked waiting for tasks on the searchable snapshots cache fetch async thread pool to be executed, while that thread pool has already been shut down because of a node restart or the end of the test.

elasticsearchmachine pushed a commit that referenced this issue Jan 24, 2022
Today scaling thread pools never reject tasks but always add them to the queue of tasks to execute, even when the thread pool executor is shutting down or terminated. This behaviour does not work well when a task is blocked waiting for a task on another scaling thread pool to complete an I/O operation: that other task will never be executed if it was enqueued just before its thread pool shut down.

This situation is more likely to happen with searchable snapshots, in which multiple threads can be blocked waiting for parts of Lucene files to be fetched and made available in the cache. We saw test failures in CI where the concurrent threads Lucene 9 uses to asynchronously check indices were blocked waiting for cache files to become available and failed because of leaked file handles (see #77017, #77178).

This pull request changes the `ForceQueuePolicy` used by scaling thread pools so that it now accepts a `rejectAfterShutdown` flag, which can be set on a per-thread-pool basis to indicate that tasks should be rejected once the thread pool is shut down. Because we rely on many scaling thread pools to be black holes and never reject tasks, this flag is set to `false` on most of them to keep the current behaviour. In some cases where the rejection logic was already implemented correctly, this flag has been set to `true`.

This pull request also reimplements the interface `XRejectedExecutionHandler` as an abstract class `EsRejectedExecutionHandler`, which allows some rejection logic to be shared.
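A self-contained Java sketch of the behaviour described above (illustrative only; the real implementation lives in Elasticsearch's executor code and the class name here is made up): a rejection handler that force-queues tasks as before, except that, when configured to, it rejects them once the executor has been shut down so callers fail fast instead of waiting forever.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative sketch, not the actual ForceQueuePolicy.
final class ForceQueueUnlessShutdownPolicy implements RejectedExecutionHandler {
    private final boolean rejectAfterShutdown;

    ForceQueueUnlessShutdownPolicy(boolean rejectAfterShutdown) {
        this.rejectAfterShutdown = rejectAfterShutdown;
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        if (rejectAfterShutdown && executor.isShutdown()) {
            // Fail fast: anything blocked on this task can now error out or retry.
            throw new RejectedExecutionException("executor is shut down, rejecting " + task);
        }
        try {
            // Historical "black hole" behaviour: force the task onto the queue, never reject.
            executor.getQueue().put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("interrupted while force-queueing " + task, e);
        }
    }
}
```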
tlrx added a commit to tlrx/elasticsearch that referenced this issue Jan 24, 2022 (same commit message as above)
tlrx added a commit to tlrx/elasticsearch that referenced this issue Jan 24, 2022 (same commit message as above)
elasticsearchmachine pushed a commit that referenced this issue Jan 24, 2022 (same commit message as above)
tlrx (Member) commented Jan 24, 2022

After investigation I noticed that the thread pools used by searchable snapshots to download and cache data were not rejecting tasks after shutdown. With Lucene 9 running index checks on multiple concurrent threads, the integration tests were likely to have Lucene threads opening IndexInputs and blocking until cached data was fetched and written to disk, and that will never complete or fail if the thread pool is shutting down (after a node stop or restart, as is done in many of the failing tests). The consequence is leaked file handles that can fail the test, or cause further failures if the node restarts (the leaked cache file is still open, which blocks filling the persistent cache at startup).

To solve this issue I merged #81856, which allows the searchable snapshots thread pools to reject tasks after shutdown so that dependent tasks fail (or retry) instead of blocking. In the meantime Lucene has been configured to use a single thread for index checks (#82249), which should also prevent blocked Lucene threads in these tests.
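As a quick way to see the difference this kind of fix makes, here is a hedged, self-contained demo (plain JDK, illustrative setup, not Elasticsearch's thread pool configuration): with a reject-after-shutdown handler, a task submitted after shutdown fails immediately instead of being silently parked on the queue, which lets blocked callers error out rather than hold cache files open.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative demo only.
public class RejectAfterShutdownDemo {
    public static void main(String[] args) {
        RejectedExecutionHandler rejectAfterShutdown = (task, executor) -> {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("executor shut down");
            }
            executor.getQueue().add(task); // force-queue while still running
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), rejectAfterShutdown);
        pool.shutdown();
        try {
            pool.execute(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            // Under the old force-queue policy this task would sit on the queue instead.
            System.out.println("rejected as expected: " + e.getMessage());
        }
    }
}
```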

tlrx closed this as completed Jan 24, 2022
tlrx added a commit that referenced this issue Jan 24, 2022 (same commit message as above; backport of #81856)