Watcher: Fix race condition when reloading watches #33157
Conversation
The current watcher implementation had two issues on reload that could lead to existing watches not being properly cleared out.

The first fix ensures that when `TriggerService.start()` is called, the trigger engine implementations remove their current watches instead of adding to the existing ones in `TickerScheduleTriggerEngine.start()` (see the sketch after this description).

The second fix is more subtle, as the underlying issue is rooted in concurrent code. When `WatcherService.reload()` is called, it in turn calls `WatcherService.reloadInner()`, which is synchronized. The reload method cleared out existing watches and executions, but there was still a small window in which the clearing could happen in quick succession for two incoming cluster states, with the second clearing happening before the first start of the trigger engine due to timing. This could lead to `TriggerEngine.start(Collection<Watch> watches)` being called twice with different sets of watches, without the existing ones being cleared out, resulting in the execution of watches on this node that should not be executed.

There were also two minor fixes:

1. If the node is not a data node, we forgot to set the status to STARTING when watcher is being started. This should not be a big issue, because a non-data node does not spend a lot of time loading, as there are no watches that need loading.
2. If a new cluster state came in during a reload, we had two checks in place to abort loading the current one: the first before we load all the watches of the local node, and the second before watcher starts with those new watches. It turned out that the first check was not returning, which meant we always tried to load all the watches and then failed on the second check. This has been fixed here.
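A minimal sketch of the first fix, with simplified stand-ins for the real `Watch` and schedule classes (the actual Elasticsearch types carry much more state): `start()` builds a fresh map and swaps it in instead of calling `putAll` on the existing one, so watches from a previous `start()` cannot survive a reload.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the real Elasticsearch classes; they carry
// only what this example needs.
class Watch {
    private final String id;
    Watch(String id) { this.id = id; }
    String id() { return id; }
}

class ActiveSchedule {
    final Watch watch;
    ActiveSchedule(Watch watch) { this.watch = watch; }
}

class TickerScheduleTriggerEngineSketch {
    // volatile so readers always see the most recently swapped-in map
    private volatile Map<String, ActiveSchedule> schedules = new ConcurrentHashMap<>();

    void start(Collection<Watch> watches) {
        Map<String, ActiveSchedule> newSchedules = new ConcurrentHashMap<>();
        for (Watch watch : watches) {
            newSchedules.put(watch.id(), new ActiveSchedule(watch));
        }
        // Swap in the fresh map rather than calling schedules.putAll(...):
        // merging would keep watches from a previous start() alive on this
        // node, which is exactly the bug being fixed here.
        this.schedules = newSchedules;
    }
}
```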
Pinging @elastic/es-core-infra
`putAll` considered evil :P Another good find!
I removed the 6.4.1 label for this one, as the main issue cannot currently be triggered by the cluster state listener, due to checking the watcher state before and only calling …
This commit ensures that when `TriggerService.start()` is called, the trigger engine implementations remove their current watches instead of adding to the existing ones in `TickerScheduleTriggerEngine.start()`.

Two additional minor fixes, where the result remains the same but less code gets executed:

1. If the node is not a data node, we forgot to set the status to STARTING when watcher is being started. This should not be a big issue, because a non-data node does not spend a lot of time loading, as there are no watches that need loading.
2. If a new cluster state came in during a reload, we had two checks in place to abort loading the current one: the first before we load all the watches of the local node, and the second before watcher starts with those new watches. It turned out that the first check was not returning, which meant we always tried to load all the watches and then failed on the second check. This has been fixed here (see the sketch after this message).
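A hypothetical sketch of the early-return fix from point 2; `reloadGeneration` and the helper methods are illustrative names, not the actual `WatcherService` members:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: each reload bumps a generation counter; if a newer
// cluster state has started another reload, the current one aborts early.
class WatcherServiceSketch {
    private final AtomicLong reloadGeneration = new AtomicLong();

    void reload() {
        long generation = reloadGeneration.incrementAndGet();

        // First check: abort before loading any watches. The bug was that
        // this check detected the newer cluster state but did not return,
        // so all watches were loaded only to fail on the second check.
        if (generation != reloadGeneration.get()) {
            return;
        }
        loadLocalWatches();

        // Second check: abort before starting with the loaded watches.
        if (generation != reloadGeneration.get()) {
            return;
        }
        startTriggerService();
    }

    private void loadLocalWatches() { /* load the watches of the local node */ }
    private void startTriggerService() { /* hand the watches to the trigger service */ }
}
```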
* 6.x:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  TEST: Disable soft-deletes in ParentChildTestCase
  TEST: Disable randomized soft-deletes settings
  Integrates soft-deletes into Elasticsearch (#33222)
  drop `index.shard.check_on_startup: fix` (#32279)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  TEST: mute more SmokeTestWatcherWithSecurityIT tests
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  Enable forbiddenapis server java9 (#33245)
  [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  Don't be strict for 6.x
  Update serialization versions for custom IndexMetaData backport
  Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  Painless: Fix Bindings Bug (#33274)
  SQL: prevent duplicate generation for repeated aggs (#33252)
  TEST: Mute testMonitorClusterHealth
  Fix serialization of empty field capabilities response (#33263)
  Fix nested _source retrieval with includes/excludes (#33180)
  [DOCS] TLS file resources are reloadable (#33258)
  Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  Ignore module-info in jar hell checks (#33011)
  Fix docs build after #33241
  [DOC] Repository GCS ADC not supported (#33238)
  Upgrade to latest Gradle 4.10 (#32801)
  Fix/30904 cluster formation part2 (#32877)
  Move file-based discovery to core (#33241)
  HLRC: add client side RefreshPolicy (#33209)
  [Kerberos] Add unsupported languages for tests (#33253)
  Watcher: Reload properly on remote shard change (#33167)
  Fix classpath security checks for external tests. (#33066)
  [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  Added deprecation warning for rescore in scroll queries (#33070)
  Apply settings filter to get cluster settings API (#33247)
  [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability (#32743)
  HLRC: create base timed request class (#33216)
  HLRC: Use Optional in validation logic (#33104)
  Painless: Add Bindings (#33042)
This commit reverts most of elastic#33157, as it introduces another race condition and breaks a common watcher use case: the first watch being added to the system while the watches index does not exist yet. In that case the index gets created, which triggers a reload, but the put watch operation that triggered it is not yet indexed, so both processes finish at roughly the same time and should not overwrite each other but act complementarily.

This commit reverts the logic of cleaning out the ticker engine watches on startup, as this is already done when execution is paused - and execution also gets paused in the cluster state listener, where we can be sure that the watches index has not yet been created.

This also adds a new test that starts a one-node cluster and emulates the case of a non-existing watches index and a watch being added, which should result in proper execution.

Closes elastic#33320
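A hypothetical sketch of where the clearing lives after this revert, again with simplified stand-in types: schedules are dropped when execution is paused, and `start()` only adds, so a watch registered concurrently with the reload is not wiped out.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the real Elasticsearch classes.
class WatchStub {
    private final String id;
    WatchStub(String id) { this.id = id; }
    String id() { return id; }
}

class ActiveScheduleStub {
    final WatchStub watch;
    ActiveScheduleStub(WatchStub watch) { this.watch = watch; }
}

class TickerEngineAfterRevertSketch {
    private final Map<String, ActiveScheduleStub> schedules = new ConcurrentHashMap<>();

    // Clearing happens on pause: execution is paused (and the schedules
    // dropped) from the cluster state listener, where we know the watches
    // index has not yet been created.
    void pauseExecution() {
        schedules.clear();
    }

    // start() only adds the loaded watches; it no longer wipes the map,
    // so a watch registered concurrently (e.g. the first put-watch that
    // caused the watches index to be created) survives the reload.
    void start(Collection<WatchStub> watches) {
        for (WatchStub watch : watches) {
            schedules.put(watch.id(), new ActiveScheduleStub(watch));
        }
    }
}
```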