[CI] FrozenSearchableSnapshotsIntegTests classMethod failing #77017
Pinging @elastic/es-distributed (Team:Distributed)
This is the same problem as in #75686 but from a different angle; it seems we'll have to add another wait-for-released somewhere in the node shutdown logic. On it.
We used to always avoid checking the index on startup in searchable snapshot tests, but I changed this in #74235, making this failure more likely to happen. I suspect that the recovery with the check index is running in the …
@tlrx I think the check at startup may be part of the problem here, but it seems that we are somehow leaking on the recovery target end during shutdown (not correctly waiting for everything to be released on the target before closing the target service). There are a few obvious spots where that could be the case, but I couldn't manually reproduce this so far.
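For illustration only, here is a minimal sketch of the "wait for everything to be released before closing the target service" idea discussed in these comments. All names are invented for the example (this is not the Elasticsearch implementation); it uses a `Phaser` to track in-flight recovery refs and block `close()` until they are all released:

```java
import java.util.concurrent.Phaser;

class RecoveryTargetServiceSketch implements AutoCloseable {
    // One party is registered for the service itself; each in-flight
    // recovery registers an additional party while it holds a ref.
    private final Phaser inFlight = new Phaser(1);

    Runnable startRecovery() {
        inFlight.register();                  // track a new in-flight recovery
        return inFlight::arriveAndDeregister; // the caller runs this on release
    }

    @Override
    public void close() {
        // Arrive as the service's own party, then wait until every recovery
        // started before shutdown has arrived (i.e. released its ref) too.
        inFlight.arriveAndAwaitAdvance();
    }
}
```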
@original-brownbear I suspect the check on startup is executed on the … (sorry, I'm not very clear, I'm still investigating)
Don't fork when shutting down already, as the generic pool may silently fail to execute all tasks queued up on it. Also, when handling exceptions in the peer recovery target service, don't try to re-schedule recoveries when the node is already shutting down; fail right away no matter the exception. Closes #77017. (Backported in #77783 and #77794.)
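As a rough illustration of the two behaviours this commit message describes (run inline instead of forking once shutdown has started, and fail a recovery immediately instead of re-scheduling it), here is a hedged sketch. The class and field names are invented for the example; the real logic lives in the peer recovery target service:

```java
import java.util.concurrent.ExecutorService;

class RetryPolicySketch {
    private final ExecutorService generic; // stands in for the GENERIC thread pool
    private volatile boolean shuttingDown; // stands in for the node lifecycle state

    RetryPolicySketch(ExecutorService generic) {
        this.generic = generic;
    }

    void execute(Runnable task) {
        if (shuttingDown) {
            task.run();            // the generic pool may silently drop queued tasks now
        } else {
            generic.execute(task); // safe to fork while the node is running normally
        }
    }

    void onRecoveryFailure(Exception e, Runnable retry, Runnable failShard) {
        if (shuttingDown) {
            failShard.run();       // no point retrying; the node is going away
        } else {
            retry.run();
        }
    }
}
```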
It looks like the problem is still persisting; it failed 4 times this week. There was a failure in master.
I also ran into this test failure when backporting a change to the 8.0 branch: https://gradle-enterprise.elastic.co/s/5khiqxr5jghvs#tests
I investigated this failure and #77178, and I identified a situation where some threads (in this case, Lucene 9's threads that verify the index during recovery) can be blocked waiting for tasks in the searchable snapshots cache fetch async thread pool to be executed, while that thread pool has already been shut down because of a node restart or the end of the test.
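The hang is easy to reproduce in isolation. The following toy example (my own, under simplified assumptions: a plain `ThreadPoolExecutor` with a force-queue rejection handler standing in for an Elasticsearch scaling pool) shows a caller blocking on a task that was silently enqueued after shutdown and will never run:

```java
import java.util.concurrent.*;

public class ShutdownHangSketch {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            0, 1, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(),
            // Force-queue on rejection, mimicking a scaling pool's black-hole policy.
            (task, executor) -> executor.getQueue().offer(task));

        pool.shutdown();                      // node restart / end of test

        FutureTask<String> fetch = new FutureTask<>(() -> "cache file part");
        pool.execute(fetch);                  // silently enqueued, never executed

        // This get() would block forever; the timeout just makes the hang visible.
        try {
            fetch.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("task never ran: pool was already shut down");
        }
        pool.shutdownNow();
    }
}
```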
Today scaling thread pools never reject tasks but always add them to the queue of tasks to execute, even when the thread pool executor is shutting down or terminated. This behaviour does not work well when a task is blocked waiting for a task on another scaling thread pool to complete an I/O operation, since that task will never be executed if it was enqueued just before its thread pool shut down. This situation is more likely with searchable snapshots, where multiple threads can be blocked waiting for parts of Lucene files to be fetched and made available in the cache. We saw test failures in CI where Lucene 9's concurrent threads (which asynchronously check indices) were blocked waiting for cache files to become available and failed because of leaked file handles (see #77017, #77178). This pull request changes the `ForceQueuePolicy` used by scaling thread pools so that it now accepts a `rejectAfterShutdown` flag, which can be set on a per-thread-pool basis to indicate that tasks should be rejected once the thread pool is shut down. Because we rely on many scaling thread pools to be black holes and never reject tasks, this flag is set to `false` on most of them to keep the current behaviour. In some cases where the rejection logic was already implemented correctly, this flag has been set to `true`. This pull request also reimplements the interface `XRejectedExecutionHandler` as an abstract class `EsRejectedExecutionHandler`, which allows sharing some logic for rejections.
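A minimal sketch of what such a policy can look like, assuming a plain `RejectedExecutionHandler` (simplified for illustration; not the actual `ForceQueuePolicy` implementation):

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;

class ForceQueuePolicySketch implements RejectedExecutionHandler {
    private final boolean rejectAfterShutdown;

    ForceQueuePolicySketch(boolean rejectAfterShutdown) {
        this.rejectAfterShutdown = rejectAfterShutdown;
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        if (rejectAfterShutdown && executor.isShutdown()) {
            // Fail fast so callers blocked on this task can error out or retry,
            // instead of waiting forever for a task that will never run.
            throw new RejectedExecutionException("executor is shut down");
        }
        // Historical behaviour: force the task onto the (unbounded) queue.
        if (executor.getQueue().offer(task) == false) {
            throw new RejectedExecutionException("queue rejected task");
        }
    }
}
```

Failing fast here lets a thread that is blocked on the task's result observe an error and unwind, instead of hanging until the test suite times out.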
After investigation I noticed that the thread pools used by searchable snapshots to download and cache data were not rejecting tasks after shutdown. With Lucene 9 executing index checks with multiple concurrent threads, the integration tests were likely to have Lucene threads opening … To solve this issue I merged #81856, which allows searchable snapshots thread pools to reject tasks after shutdown, so that other tasks will fail (or retry) but won't be blocked. In the meantime, Lucene has been configured to use a single thread for index checks (#82249), which should also prevent blocked Lucene threads in these tests.
Backport of #81856.
Very similar to a previous test failure: #75686
Build scan:
https://gradle-enterprise.elastic.co/s/bgh2wyeox4s52/tests/:x-pack:plugin:searchable-snapshots:internalClusterTest/org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests/classMethod
Reproduction line:
./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests" -Dtests.seed=43D8141D2BF9B196 -Dtests.locale=en-US -Dtests.timezone=UTC -Druntime.java=11
Applicable branches:
master
Reproduces locally?:
No
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.searchablesnapshots.FrozenSearchableSnapshotsIntegTests&tests.test=classMethod
Failure excerpt: