
[CI] FrozenSearchableSnapshotsIntegTests classMethod failing #77017

Closed
benwtrent opened this issue Aug 30, 2021 · 9 comments · Fixed by #77783
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

benwtrent (Member) commented Aug 30, 2021

Very similar to a previous test failure: #75686

Build scan:
https://gradle-enterprise.elastic.co/s/bgh2wyeox4s52/tests/:x-pack:plugin:searchable-snapshots:internalClusterTest/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests/classMethod

Reproduction line:
./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests" -Dtests.seed=43D8141D2BF9B196 -Dtests.locale=en-US -Dtests.timezone=UTC -Druntime.java=11

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests&tests.test=classMethod

Failure excerpt:

java.lang.RuntimeException: file handle leaks: [FileChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/oraclelinux-8&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_43D8141D2BF9B196-001/tempDir-002/node_s2/indices/ryYrOlKMSpeuTI_3x-NpyA/0/snapshot_cache/ApsFMVSNSfySlTusg6ZA9g/e0lQKKZZSxyPbPSv-8rUAQ), FileChannel(/var/lib/jenkins/workspace/elastic+elasticsearch+master+multijob+platform-support-unix/os/oraclelinux-8&&immutable/x-pack/plugin/searchable-snapshots/build/testrun/internalClusterTest/temp/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests_43D8141D2BF9B196-001/tempDir-002/node_s2/indices/ryYrOlKMSpeuTI_3x-NpyA/1/snapshot_cache/ApsFMVSNSfySlTusg6ZA9g/H_OJBgzhSKignz0VELdrcQ)]

  at __randomizedtesting.SeedInfo.seed([43D8141D2BF9B196]:0)
  at org.apache.lucene.mockfile.LeakFS.onClose(LeakFS.java:63)
  at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:77)
  at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:78)
  at org.apache.lucene.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:228)
  at com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

  Caused by: java.lang.Exception: (No message provided)

    at org.apache.lucene.mockfile.LeakFS.onOpen(LeakFS.java:46)
    at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81)
    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:197)
    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:166)
    at java.nio.channels.FileChannel.open(FileChannel.java:292)
    at java.nio.channels.FileChannel.open(FileChannel.java:345)
    at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile$FileChannelReference.<init>(CacheFile.java:119)
    at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.acquire(CacheFile.java:200)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput$CacheFileReference.get(MetadataCachingIndexInput.java:466)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readWithBlobCache(MetadataCachingIndexInput.java:124)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.doReadInternal(MetadataCachingIndexInput.java:106)
    at org.elasticsearch.xpack.searchablesnapshots.store.input.BaseSearchableSnapshotIndexInput.readInternal(BaseSearchableSnapshotIndexInput.java:112)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:315)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:56)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:102)
    at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:173)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
    at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
    at org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:79)
    at org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:71)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:668)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:496)
    at org.elasticsearch.index.store.Store.checkIndex(Store.java:333)
    at org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:2647)
    at org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:2591)
    at org.elasticsearch.index.shard.IndexShard.maybeCheckIndex(IndexShard.java:2581)
    at org.elasticsearch.index.shard.IndexShard.recoverLocallyUpToGlobalCheckpoint(IndexShard.java:1488)
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:191)
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:552)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

benwtrent added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Aug 30, 2021
elasticmachine added the Team:Distributed (Obsolete) label Aug 30, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this Aug 31, 2021
original-brownbear (Member) commented:

This is the same problem as in #75686 but from a different angle; it seems we'll have to add another wait-for-released step somewhere in the node shutdown logic. On it.

tlrx (Member) commented Sep 14, 2021

We used to always avoid checking the index on startup in searchable snapshot tests, but I changed this in #74235, making this failure more likely to happen.

I suspect that the recovery, with its index check, is running on the generic snapshot thread pool while the data node is waiting for the shard to be unlocked after the integrity check so that it can close all resources (directory, store, IndexShard) and process pending shard deletes... and since the node does not wait for pending deletes to be processed, maybe it hard-closes the resources and detects file leaks (?)
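To show what turning on the startup check looks like in practice, here is a hedged sketch (illustrative only, not the actual change in #74235) of a test sometimes enabling `index.shard.check_on_startup`, the setting that makes recovery run Lucene's CheckIndex before the shard is opened:

```java
// Hedged sketch (illustrative, not the actual test code): sometimes enable the
// startup integrity check so that recovery verifies the snapshot-backed directory
// before the shard is opened.
Settings.Builder indexSettings = Settings.builder()
    .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
    .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0);
if (randomBoolean()) {
    // "checksum" verifies file checksums during recovery, before the engine is created
    indexSettings.put("index.shard.check_on_startup", "checksum");
}
// indexSettings would then be used when creating the test index
```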

original-brownbear (Member) commented:

@tlrx I think the check at startup may be part of the problem here, but it seems that we are somehow leaking on the recovery target end during shutdown (not correctly waiting for everything to be released on the target before closing the target service). There are a few obvious spots where that could be the case, but I couldn't reproduce this manually so far.
I also wonder if this hasn't already been fixed by #66834, which avoids a few spots of forking that could lead to a leak like this.

tlrx (Member) commented Sep 15, 2021

@original-brownbear I suspect the check on startup is executed on the Store instance, acquiring a lock, but before the engine is created and the reference to it is set in the IndexShard. So when the IndexShard is closed, because a recovery is cancelled or the shard is deleted or assigned to another node, there is still a store instance somewhere waiting for the check to finish before releasing the lock. And I don't see anything in IndexService & Co that waits for the check to finish.

(sorry - I'm not very clear, I'm still investigating)
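To make the suspected interaction concrete, here is a minimal sketch (assumed shapes, not the actual IndexShard/Store code) of why a pending startup check keeps files open: the checking thread holds a store reference, and nothing on the close path waits for it to finish.

```java
// Illustrative only; runStartupCheck is a hypothetical stand-in for the shard's
// check-index-on-startup step. Store is ref-counted in Elasticsearch.
store.incRef();                  // files stay open while this reference is held
try {
    runStartupCheck(store);      // may block waiting on the searchable snapshot cache thread pool
} finally {
    store.decRef();              // the directory and its cache files can only be fully
                                 // released after this, but shard close does not wait for it
}
```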

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 15, 2021
Don't fork when already shutting down, as the generic pool may silently fail to execute tasks queued up on it.
Also, when handling exceptions in the peer recovery target service, don't try to re-schedule recoveries when the node is already shutting down; fail right away no matter the exception.

closes elastic#77017
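A self-contained sketch of the pattern this commit describes (plain java.util.concurrent, illustrative names, not the actual PeerRecoveryTargetService code): on a retryable error, only re-schedule the recovery while the service is running, and fail it immediately during shutdown instead of handing it to a pool that may silently drop it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch, not Elasticsearch code.
final class RetryOnlyWhileRunning {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean shuttingDown = false;

    // retryTask re-attempts the recovery; failNow releases its resources and marks it failed.
    void onRecoveryFailure(Runnable retryTask, Runnable failNow, long delayMillis) {
        if (shuttingDown || scheduler.isShutdown()) {
            failNow.run();   // fail right away, no matter the exception
            return;
        }
        try {
            scheduler.schedule(retryTask, delayMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            failNow.run();   // raced with shutdown: still fail instead of leaving resources open
        }
    }

    void shutdown() {
        shuttingDown = true;
        scheduler.shutdown();
    }
}
```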
original-brownbear added a commit that referenced this issue Sep 15, 2021 (same commit message as above)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 15, 2021: …ic#77783) (same commit message as above)
original-brownbear added a commit that referenced this issue Sep 15, 2021: … (#77794) (same commit message as above)
pgomulka (Contributor) commented:

It looks like the problem still persists; it failed 4 times this week.

There was a failure in master:
https://gradle-enterprise.elastic.co/s/bfgpj2dxjrdwu

pgomulka reopened this Nov 19, 2021
tlrx self-assigned this Nov 24, 2021
martijnvg (Member) commented:

I also ran into this test failure when backporting a change to the 8.0 branch: https://gradle-enterprise.elastic.co/s/5khiqxr5jghvs#tests

tlrx (Member) commented Dec 17, 2021

I investigated this failure and #77178 and identified a situation where some threads (in this case, the Lucene 9 threads that verify the index during recovery) can be blocked waiting for tasks on the searchable snapshots cache fetch async thread pool to be executed, while that thread pool has already been shut down because of a node restart or the end of the test.

elasticsearchmachine pushed a commit that referenced this issue Jan 24, 2022
Today scaling thread pools never reject tasks but always add them to the queue of tasks to execute, even when the thread pool executor is shutting down or terminated. This behaviour does not work well when a task is blocked waiting for a task on another scaling thread pool to complete an I/O operation: that other task will never be executed if it was enqueued just before its thread pool shut down.

This situation is more likely to happen with searchable snapshots, in which multiple threads can be blocked waiting for parts of Lucene files to be fetched and made available in the cache. We saw test failures in CI where the concurrent threads Lucene 9 uses to asynchronously check indices were blocked waiting for cache files to become available and failed because of leaked file handles (see #77017, #77178).

This pull request changes the `ForceQueuePolicy` used by scaling thread pools so that it now accepts a `rejectAfterShutdown` flag, which can be set on a per-thread-pool basis to indicate that tasks should be rejected once the thread pool is shut down. Because we rely on many scaling thread pools to be black holes and never reject tasks, this flag is set to `false` on most of them to keep the current behaviour. In some cases where the rejection logic was already implemented correctly, this flag has been set to `true`.

This pull request also reimplements the interface `XRejectedExecutionHandler` as an abstract class `EsRejectedExecutionHandler`, which allows some rejection logic to be shared.
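A self-contained Java sketch of the behaviour described above (illustrative only; the real implementation lives in Elasticsearch's executor code and the class name here is made up): a rejection handler that force-queues tasks as before, except that, when configured to, it rejects them once the executor has been shut down so callers fail fast instead of waiting forever.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative sketch, not the actual ForceQueuePolicy.
final class ForceQueueUnlessShutdownPolicy implements RejectedExecutionHandler {
    private final boolean rejectAfterShutdown;

    ForceQueueUnlessShutdownPolicy(boolean rejectAfterShutdown) {
        this.rejectAfterShutdown = rejectAfterShutdown;
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        if (rejectAfterShutdown && executor.isShutdown()) {
            // Fail fast: anything blocked on this task can now error out or retry.
            throw new RejectedExecutionException("executor is shut down, rejecting " + task);
        }
        try {
            // Historical "black hole" behaviour: force the task onto the queue, never reject.
            executor.getQueue().put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException("interrupted while force-queueing " + task, e);
        }
    }
}
```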
tlrx added a commit to tlrx/elasticsearch that referenced this issue Jan 24, 2022 (same commit message as above)
tlrx added a commit to tlrx/elasticsearch that referenced this issue Jan 24, 2022 (same commit message as above)
elasticsearchmachine pushed a commit that referenced this issue Jan 24, 2022 (same commit message as above)
tlrx (Member) commented Jan 24, 2022

After investigation I noticed that the thread pools used by searchable snapshots to download and cache data were not rejecting tasks after shutdown. With Lucene 9 running index checks on multiple concurrent threads, the integration tests were likely to have Lucene threads opening IndexInputs and blocking until cached data was fetched and written to disk, and that will never complete or fail if the thread pool is shutting down (after a node stop or restart, as is done in many of the failing tests). The consequence is leaked file handles that can fail the test, or cause further failures if the node restarts (the leaked cache file is still open, which blocks filling the persistent cache at startup).

To solve this issue I merged #81856, which allows the searchable snapshots thread pools to reject tasks after shutdown so that dependent tasks fail (or retry) instead of blocking. In the meantime Lucene has been configured to use a single thread for index checks (#82249), which should also prevent blocked Lucene threads in these tests.
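As a quick way to see the difference this kind of fix makes, here is a hedged, self-contained demo (plain JDK, illustrative setup, not Elasticsearch's thread pool configuration): with a reject-after-shutdown handler, a task submitted after shutdown fails immediately instead of being silently parked on the queue, which lets blocked callers error out rather than hold cache files open.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative demo only.
public class RejectAfterShutdownDemo {
    public static void main(String[] args) {
        RejectedExecutionHandler rejectAfterShutdown = (task, executor) -> {
            if (executor.isShutdown()) {
                throw new RejectedExecutionException("executor shut down");
            }
            executor.getQueue().add(task); // force-queue while still running
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), rejectAfterShutdown);
        pool.shutdown();
        try {
            pool.execute(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            // Under the old force-queue policy this task would sit on the queue instead.
            System.out.println("rejected as expected: " + e.getMessage());
        }
    }
}
```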

tlrx closed this as completed Jan 24, 2022
tlrx added a commit that referenced this issue Jan 24, 2022 (same commit message as above; backport of #81856)