Adds plugin version sweep background job #434

downsrob
Contributor

Issue #, if available:
#207

Description of changes:
Index Management currently skips all job executions when two different versions of the Index Management plugin are present on the cluster. Whenever a node is added or a new cluster forms, the plugin sends a NodesInfoRequest to fetch and compare plugin versions, and sets a flag, SkipExecution, to true when it finds more than one plugin version. We have seen cases where the SkipExecution flag remains true even though the upgrade process (early ES 7.x to later ES 7.x) has finished and every node in the cluster runs the same, latest version of the IM plugin.

From analyzing the code, we can see race conditions that would allow multiple requests to overwrite each other in the wrong order. Although the cluster changed events arrive in order, the resulting NodesInfoRequests may complete out of order, so an older result can overwrite the flag after a newer one has already set it.
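The out-of-order overwrite described above can be illustrated with a small deterministic sketch. It is not the plugin's code: `SkipFlag`, `onVersionsResponse`, `finalFlag`, and the version strings are hypothetical names invented for this illustration, and the response handler is assumed to simply set the flag from whatever result it receives.

```kotlin
// Hypothetical stand-in for the plugin's SkipExecution flag holder (assumption, not the real class).
class SkipFlag {
    @Volatile
    var flag: Boolean = false
        private set

    // Assumed handler behaviour: more than one distinct plugin version means "skip".
    fun onVersionsResponse(pluginVersions: Set<String>) {
        flag = pluginVersions.size > 1
    }
}

// Applies the responses in the order they arrive and returns the final flag value.
fun finalFlag(responsesInArrivalOrder: List<Set<String>>): Boolean {
    val skip = SkipFlag()
    responsesInArrivalOrder.forEach(skip::onVersionsResponse)
    return skip.flag
}

fun main() {
    val duringUpgrade = setOf("1.3.0", "2.2.0") // result of request 1: mixed versions (made-up values)
    val afterUpgrade = setOf("2.2.0")           // result of request 2: all nodes upgraded

    // Responses arrive in request order: the flag ends up false and jobs resume.
    println(finalFlag(listOf(duringUpgrade, afterUpgrade)))

    // The response to request 1 arrives last: the stale result wins and the flag stays true.
    println(finalFlag(listOf(afterUpgrade, duringUpgrade)))
}
```

In the second ordering nothing ever resets the flag unless another cluster changed event happens to occur, which matches the stuck SkipExecution behaviour reported in the issue.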

To resolve this race condition, this PR adds a background job which will run every five minutes to poll the plugin versions if the flag is currently set to true.
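As a rough sketch of the polling idea (under stated assumptions, not the code in this PR), the job can be thought of as a fixed-delay task that re-checks versions only while the flag is true. `VersionSweeper`, `fetchPluginVersions`, and `periodMillis` are hypothetical names; a plain `ScheduledExecutorService` stands in for the cluster's thread pool, and a lambda stands in for the NodesInfoRequest.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.ScheduledFuture
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch; not the plugin's PluginVersionSweepCoordinator.
class VersionSweeper(
    private val scheduler: ScheduledExecutorService,
    private val fetchPluginVersions: () -> Set<String>, // stand-in for a NodesInfoRequest
    private val periodMillis: Long,
) {
    // Starts as true here to model a cluster that is in the middle of an upgrade.
    val skipFlag = AtomicBoolean(true)

    @Volatile
    private var job: ScheduledFuture<*>? = null

    // One polling pass: re-check versions only while the flag is currently set.
    fun sweepOnce() {
        if (skipFlag.get()) {
            skipFlag.set(fetchPluginVersions().size > 1)
        }
    }

    // Schedules the polling pass with a fixed delay between runs.
    fun start() {
        job = scheduler.scheduleWithFixedDelay(
            { sweepOnce() }, periodMillis, periodMillis, TimeUnit.MILLISECONDS
        )
    }

    fun stop() {
        job?.cancel(false)
    }
}

fun main() {
    var versions = setOf("1.3.0", "2.2.0") // made-up versions: upgrade still in progress
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val sweeper = VersionSweeper(scheduler, { versions }, periodMillis = 5 * 60 * 1000L)
    sweeper.start()

    // Drive the passes by hand so the example does not wait five minutes.
    sweeper.sweepOnce()
    println(sweeper.skipFlag.get()) // still mixed versions: true

    versions = setOf("2.2.0")
    sweeper.sweepOnce()
    println(sweeper.skipFlag.get()) // versions agree: false

    sweeper.stop()
    scheduler.shutdownNow()
}
```

Because each pass reads the current state instead of relying on the order in which event-triggered responses arrive, a stale result is corrected on a later pass at the latest.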

This is an alternative strategy to #423 and is entirely the work of Stevan Buzejic (@stevanbz); I am just raising the PR for an early review.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@downsrob downsrob requested a review from a team July 28, 2022 18:07
codecov-commenter commented Jul 28, 2022

Codecov Report

Merging #434 (47b7a24) into main (39be4e3) will increase coverage by 0.00%.
The diff coverage is 78.26%.

@@            Coverage Diff            @@
##               main     #434   +/-   ##
=========================================
  Coverage     75.94%   75.95%           
- Complexity     2480     2492   +12     
=========================================
  Files           315      316    +1     
  Lines         14500    14547   +47     
  Branches       2243     2248    +5     
=========================================
+ Hits          11012    11049   +37     
- Misses         2239     2246    +7     
- Partials       1249     1252    +3     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...exstatemanagement/PluginVersionSweepCoordinator.kt | 69.69% <69.69%> (ø) | |
| ...pensearch/indexmanagement/IndexManagementPlugin.kt | 90.00% <100.00%> (+0.11%) | ⬆️ |
| ...exmanagement/indexstatemanagement/SkipExecution.kt | 61.29% <100.00%> (-5.38%) | ⬇️ |
| ...exstatemanagement/settings/ManagedIndexSettings.kt | 98.49% <100.00%> (+0.05%) | ⬆️ |
| ...ment/indexstatemanagement/util/RestHandlerUtils.kt | 88.88% <0.00%> (-11.12%) | ⬇️ |
| ...arch/indexmanagement/rollup/RollupSearchService.kt | 57.40% <0.00%> (-3.71%) | ⬇️ |
| ...exstatemanagement/resthandler/RestExplainAction.kt | 100.00% <0.00%> (ø) | |
| .../opensearch/indexmanagement/rollup/model/Rollup.kt | 86.04% <0.00%> (+0.46%) | ⬆️ |
| ...management/rollup/interceptor/RollupInterceptor.kt | 80.15% <0.00%> (+0.79%) | ⬆️ |
| ...t/resthandler/RestRetryFailedManagedIndexAction.kt | 88.00% <0.00%> (+1.04%) | ⬆️ |
| ... and 3 more | | |

Member

@bowenlan-amzn bowenlan-amzn left a comment

A general question:

  1. Can we disable the trigger logic in SkipExecution, since we now have this background loop?
    The trigger logic I am referring to:
override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        sweepISMPluginVersion()
    }
}

in SkipExecution

Comment on lines 111 to 118
val SWEEP_SKIP_PERIOD: Setting<TimeValue> = Setting.timeSetting(
    "opendistro.index_state_management.coordinator.sweep_skip_period",
    TimeValue.timeValueMinutes(10),
    TimeValue.timeValueMinutes(5),
    Setting.Property.NodeScope,
    Setting.Property.Dynamic,
    Setting.Property.Deprecated
)

Member

No need to have this if we are adding a new setting

Contributor

Makes sense. Thanks!

Comment on lines +65 to +80
if (!skipExecution.flag) {
    logger.info("Canceling sweep ism plugin version job")
    scheduledSkipExecution?.cancel()
} else {
    skipExecution.sweepISMPluginVersion()
}
Member

Do we need to cancel this job or let it run forever?

…he case of version discrepancy

Signed-off-by: Stevan Buzejic <[email protected]>
…r scheduling the skip execution task. Annotated tests in order to prevent thread leak error during integrational tests

Signed-off-by: Stevan Buzejic <[email protected]>
@stevanbz stevanbz force-pushed the bugfix/207-skip-execution-not-properly-set-job-scheduler-solution branch 2 times, most recently from 027e78e to 151fec9 Compare September 21, 2022 15:10
private fun isIndexStateManagementEnabled(): Boolean = indexStateManagementEnabled == true

companion object {
    private const val RETRY_PERIOD_IN_MINUTES = 5L
Member

Is this the same as sweepSkipPeriod? If so, should we use sweepSkipPeriod instead?

Contributor

stevanbz commented Sep 22, 2022

A general question:

1. Can we disable the trigger logic in SkipExecution, since we now have this background loop?
   The trigger logic I am referring to:
override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        sweepISMPluginVersion()
    }
}

in SkipExecution

Good question, and you are right; I was thinking the same. The SkipExecution class should only perform sweepISMPluginVersion, while the caller class is responsible for triggering the request.

So, my proposal is:

The caller class, PluginVersionSweepCoordinator, will listen for cluster changed events and will be responsible for calling the sweepISMPluginVersion method. This class already has a scheduled job that can optionally be canceled (i.e., if the skip flag is being set to true).

i.e.

override fun clusterChanged(event: ClusterChangedEvent) {
    if (event.nodesChanged() || event.isNewCluster) {
        skipExecution.sweepISMPluginVersion()
        initBackgroundSweepISMPluginVersionExecution()
    }
}
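To make the proposed split of responsibilities concrete, here is a self-contained sketch under stated assumptions. It is not the merged implementation: `ClusterEvent`, `SkipExecutionSketch`, `SweepCoordinatorSketch`, `backgroundPass`, and `isJobActive` are hypothetical stand-ins, a `ScheduledExecutorService` replaces the OpenSearch thread pool, and the background pass follows the cancel-or-sweep branch quoted earlier in this review.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.ScheduledFuture
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for ClusterChangedEvent (not the OpenSearch type).
data class ClusterEvent(val nodesChanged: Boolean, val isNewCluster: Boolean)

// Only knows how to sweep; it no longer listens for cluster events itself.
class SkipExecutionSketch(private val fetchPluginVersions: () -> Set<String>) {
    @Volatile
    var flag: Boolean = false
        private set

    fun sweepISMPluginVersion() {
        flag = fetchPluginVersions().size > 1
    }
}

// The caller class: reacts to cluster changes and owns the scheduled job.
class SweepCoordinatorSketch(
    private val skipExecution: SkipExecutionSketch,
    private val scheduler: ScheduledExecutorService,
    private val periodMillis: Long,
) {
    private var scheduled: ScheduledFuture<*>? = null

    @Synchronized
    fun clusterChanged(event: ClusterEvent) {
        if (event.nodesChanged || event.isNewCluster) {
            skipExecution.sweepISMPluginVersion()
            initBackgroundSweep()
        }
    }

    @Synchronized
    private fun initBackgroundSweep() {
        scheduled?.cancel(false) // avoid stacking several jobs after repeated events
        scheduled = scheduler.scheduleWithFixedDelay(
            { backgroundPass() }, periodMillis, periodMillis, TimeUnit.MILLISECONDS
        )
    }

    // One background pass: cancel once the flag is cleared, otherwise sweep again.
    @Synchronized
    fun backgroundPass() {
        if (!skipExecution.flag) {
            scheduled?.cancel(false)
        } else {
            skipExecution.sweepISMPluginVersion()
        }
    }

    @Synchronized
    fun isJobActive(): Boolean = scheduled?.isCancelled == false
}

fun main() {
    var versions = setOf("1.3.0", "2.2.0") // made-up versions: mixed cluster
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val skip = SkipExecutionSketch { versions }
    val coordinator = SweepCoordinatorSketch(skip, scheduler, periodMillis = 5 * 60 * 1000L)

    coordinator.clusterChanged(ClusterEvent(nodesChanged = true, isNewCluster = false))
    println("flag=${skip.flag} jobActive=${coordinator.isJobActive()}")

    versions = setOf("2.2.0")    // upgrade finishes
    coordinator.backgroundPass() // this pass clears the flag
    coordinator.backgroundPass() // the next pass sees the cleared flag and cancels the job
    println("flag=${skip.flag} jobActive=${coordinator.isJobActive()}")

    scheduler.shutdownNow()
}
```

In this sketch the job cancels itself once the flag is false and is re-created on the next cluster changed event, which is one possible answer to the "cancel or run forever" question raised above.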

@stevanbz stevanbz force-pushed the bugfix/207-skip-execution-not-properly-set-job-scheduler-solution branch from 85cca3c to 47b7a24 Compare September 22, 2022 21:10
Signed-off-by: Stevan Buzejic <[email protected]>
@Angie-Zhang Angie-Zhang merged commit 4d844fa into opensearch-project:main Oct 4, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 4, 2022
* [207]: Added 5 min scheduled job for sweeping ISM plugin version in the case of version discrepancy

Signed-off-by: Stevan Buzejic <[email protected]>

* [207]: Created pluginVersionSweepCoordinator component responsible for scheduling the skip execution task. Annotated tests in order to prevent thread leak error during integrational tests

Signed-off-by: Stevan Buzejic <[email protected]>

* [207]: Increased retry period for background job that sets the skip flag up to 5 mins

Signed-off-by: Stevan Buzejic <[email protected]>

* Empty-Commit

Signed-off-by: Stevan Buzejic <[email protected]>

Signed-off-by: Stevan Buzejic <[email protected]>
Co-authored-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 4d844fa)
Angie-Zhang pushed a commit that referenced this pull request Oct 4, 2022

Co-authored-by: Clay Downs <[email protected]>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 6, 2022
Angie-Zhang added a commit that referenced this pull request Oct 14, 2022
* initial framework

Signed-off-by: Joanne Wang <[email protected]>

* Removed recursion from Explain Action to avoid stackoverflow in some situations (#419)

Signed-off-by: Petar Dzepina <[email protected]>
Signed-off-by: Joanne Wang <[email protected]>

* enabled by default integrated

Signed-off-by: Joanne Wang <[email protected]>

* cleaned up comments and logs, created unit test and updated previous integration tests

Signed-off-by: Joanne Wang <[email protected]>

* added delete validation logic

Signed-off-by: Joanne Wang <[email protected]>

* fixed rollover validation unit tests

Signed-off-by: Joanne Wang <[email protected]>

* added validation info field to ManagedIndexMetaData

Signed-off-by: Joanne Wang <[email protected]>

* removed step context as input

Signed-off-by: Joanne Wang <[email protected]>

* added validationmetadata class

Signed-off-by: Joanne Wang <[email protected]>

* restored old integration tests and changed validation service output

Signed-off-by: Joanne Wang <[email protected]>

* before integrated validation meta data into managed index meta data

Signed-off-by: Joanne Wang <[email protected]>

* integrated validation meta data

Signed-off-by: Joanne Wang <[email protected]>

* working version

Signed-off-by: Joanne Wang <[email protected]>

* added validation mapping

Signed-off-by: Joanne Wang <[email protected]>

* fixed integ tests

Signed-off-by: Joanne Wang <[email protected]>

* renamed some values

Signed-off-by: Joanne Wang <[email protected]>

* before removing from managed index meta data

Signed-off-by: Joanne Wang <[email protected]>

* created validation result object in explain

Signed-off-by: Joanne Wang <[email protected]>

* testing

Signed-off-by: Joanne Wang <[email protected]>

* run fails

Signed-off-by: Joanne Wang <[email protected]>

* integration test for delete + added framework for force merge

Signed-off-by: Joanne Wang <[email protected]>

* removed step validation metadata and still testing explain results

Signed-off-by: Joanne Wang <[email protected]>

* before removing from managed index runner

Signed-off-by: Joanne Wang <[email protected]>

* removed from managed index runner

Signed-off-by: Joanne Wang <[email protected]>

* clean up and tests

Signed-off-by: Joanne Wang <[email protected]>

* all validation tests pass

Signed-off-by: Joanne Wang <[email protected]>

* removed validation result from all managed index meta data

Signed-off-by: Joanne Wang <[email protected]>

* restored old IT tests

Signed-off-by: Joanne Wang <[email protected]>

* fixed it tests, set explain validation to false

Signed-off-by: Joanne Wang <[email protected]>

* clean up

Signed-off-by: Joanne Wang <[email protected]>

* Change test page size to avoid index/search TimeInMillis < 1 issue. (#460)

* Change test page size to avoid indexTimeInMillis < 1 issue.

Signed-off-by: Angie Zhang <[email protected]>

* Change test page size to avoid indexTimeInMillis < 1 issue.

Signed-off-by: Angie Zhang <[email protected]>

Signed-off-by: Angie Zhang <[email protected]>

* Transform maxclauses fix (#477)

* transform maxClauses fix

Signed-off-by: Petar Dzepina <[email protected]>

* added bucket log to track processed buckets

Signed-off-by: Petar Dzepina <[email protected]>

* various renames/changes

Signed-off-by: Petar Dzepina <[email protected]>

* fixed detekt issues

Signed-off-by: Petar Dzepina <[email protected]>

* added comments to test

Signed-off-by: Petar Dzepina <[email protected]>

* removed debug logging

Signed-off-by: Petar Dzepina <[email protected]>

* empty commit to trigger checks

Signed-off-by: Petar Dzepina <[email protected]>

* reduced pageSize to 1 in few ITs to avoid flaky tests; fixed bug where pagesProcessed was calculated incorrectly

Signed-off-by: Petar Dzepina <[email protected]>

* reverted pagesProcessed change; fixed few ITs

Signed-off-by: Petar Dzepina <[email protected]>

Signed-off-by: Petar Dzepina <[email protected]>

* 483: Updated detekt plugin and snakeyaml dependency. Updated a code t… (#485)

* 483: Updated detekt plugin and snakeyaml dependency. Updated a code to reduce the number of issues after static analysis

Signed-off-by: Stevan Buzejic <[email protected]>

* 483: Updated snakeyaml version to use the latest

Signed-off-by: Stevan Buzejic <[email protected]>

Signed-off-by: Stevan Buzejic <[email protected]>

* Remove HOST_DENY_LIST usage as Notification plugin will own it (#471)

(#107)

Signed-off-by: Xuesong Luo <[email protected]>

Signed-off-by: Xuesong Luo <[email protected]>

* Disable detekt because of the CVE (#497)

Signed-off-by: bowenlan-amzn <[email protected]>

Signed-off-by: bowenlan-amzn <[email protected]>

* Deprecate Master nonmenclature (#501)

Signed-off-by: bowenlan-amzn <[email protected]>

Signed-off-by: bowenlan-amzn <[email protected]>

* [AUTO] Increment version to 2.3.0-SNAPSHOT (#484) (#503)

* fix#921-README-forum-link-index_mgmnt (#499)

Signed-off-by: cwillum <[email protected]>

Signed-off-by: cwillum <[email protected]>

* 64: Added rounding when using aggreagate script for avg metric. Added… (#490)

* 64: Added rounding when using aggreagate script for avg metric. Added unit tests for checking average aggregations against the target rollup index

Signed-off-by: Stevan Buzejic <[email protected]>

* 64: Rollup job renamed

Signed-off-by: Stevan Buzejic <[email protected]>

* 64: Removed unrelevant metrics for the avg calculation test

Signed-off-by: Stevan Buzejic <[email protected]>

Signed-off-by: Stevan Buzejic <[email protected]>

* Revert Disable detekt and force choose snakeyml 1.32 (#528)

* Revert Disable detekt: 50ac1e9

Signed-off-by: Siddhant Deshmukh <[email protected]>

* Remove force choosing snakeyml 1.31

Signed-off-by: Siddhant Deshmukh <[email protected]>

* Force snakeyaml 1.32

Signed-off-by: Siddhant Deshmukh <[email protected]>

* Empty commit

Signed-off-by: Siddhant Deshmukh <[email protected]>

Signed-off-by: Siddhant Deshmukh <[email protected]>

* Added 2.3 release note (#507) (#515) (#517)

* Update 2.3 release note

Signed-off-by: Angie Zhang <[email protected]>

* Update 2.3 release note

Signed-off-by: Angie Zhang <[email protected]>

* Update 2.3 release note

Signed-off-by: Angie Zhang <[email protected]>

* Update 2.3 release note

Signed-off-by: Angie Zhang <[email protected]>

* Update 2.3 release note

Signed-off-by: Angie Zhang <[email protected]>

Signed-off-by: Angie Zhang <[email protected]>
(cherry picked from commit d9793ac)
Signed-off-by: Angie Zhang <[email protected]>

Signed-off-by: Angie Zhang <[email protected]>
(cherry picked from commit 7217b5b)

Co-authored-by: Angie Zhang <[email protected]>

* Add 2.2 release note (#450) (#452) (#516)

* Add 2.2 release note

Signed-off-by: Angie Zhang <[email protected]>

* Add 2.2 release note

Signed-off-by: Angie Zhang <[email protected]>

Co-authored-by: Angie Zhang <[email protected]>
(cherry picked from commit 8eb5da6)
Signed-off-by: Angie Zhang <[email protected]>

Signed-off-by: Angie Zhang <[email protected]>
Co-authored-by: Ashish Agrawal <[email protected]>

* Adds plugin version sweep background job (#434)

* [207]: Added 5 min scheduled job for sweeping ISM plugin version in the case of version discrepancy

Signed-off-by: Stevan Buzejic <[email protected]>

* [207]: Created pluginVersionSweepCoordinator component responsible for scheduling the skip execution task. Annotated tests in order to prevent thread leak error during integrational tests

Signed-off-by: Stevan Buzejic <[email protected]>

* [207]: Increased retry period for background job that sets the skip flag up to 5 mins

Signed-off-by: Stevan Buzejic <[email protected]>

* Empty-Commit

Signed-off-by: Stevan Buzejic <[email protected]>

Signed-off-by: Stevan Buzejic <[email protected]>
Co-authored-by: Stevan Buzejic <[email protected]>

* flaky transform test fix attempt (#542)

* flaky transform test fix attempt

Signed-off-by: Petar Dzepina <[email protected]>

* accidental paste fix

Signed-off-by: Petar Dzepina <[email protected]>

Signed-off-by: Petar Dzepina <[email protected]>
Co-authored-by: Petar Dzepina <[email protected]>

Signed-off-by: Joanne Wang <[email protected]>
Signed-off-by: Petar Dzepina <[email protected]>
Signed-off-by: Angie Zhang <[email protected]>
Signed-off-by: Stevan Buzejic <[email protected]>
Signed-off-by: Xuesong Luo <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: cwillum <[email protected]>
Signed-off-by: Siddhant Deshmukh <[email protected]>
Signed-off-by: Petar Dzepina <[email protected]>
Co-authored-by: Petar <[email protected]>
Co-authored-by: Angie Zhang <[email protected]>
Co-authored-by: Stevan Buzejic <[email protected]>
Co-authored-by: xluo-aws <[email protected]>
Co-authored-by: bowenlan-amzn <[email protected]>
Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Co-authored-by: Chris Moore <[email protected]>
Co-authored-by: Siddhant Deshmukh <[email protected]>
Co-authored-by: Angie Zhang <[email protected]>
Co-authored-by: Ashish Agrawal <[email protected]>
Co-authored-by: Clay Downs <[email protected]>
Co-authored-by: Stevan Buzejic <[email protected]>
Co-authored-by: Petar Dzepina <[email protected]>
wuychn pushed a commit to ochprince/index-management that referenced this pull request Mar 16, 2023
…ensearch-project#539)


Co-authored-by: Clay Downs <[email protected]>
ronnaksaxena pushed a commit to ronnaksaxena/index-management that referenced this pull request Jul 19, 2023
…ensearch-project#539)


Co-authored-by: Clay Downs <[email protected]>
Signed-off-by: Ronnak Saxena <[email protected]>