-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ILM] fix retry so it picks up latest policy and executes async action #35406
Conversation
Before, moving to a failed step would only change the step info to be that of the failed step. This means two things. 1. Async Steps would never be triggered to execute 2. If there are inherent problems with the action definition that can be fixed with a policy update, these changes were not being reflected by the new execution info. Changes now 1. Async steps are executed after the move to the failed step in cluster state 2. the lifecycle execution info's phase definition is updated from the current latest policy definition, even though the index isn't moving to a new phase. Closes elastic#35397.
Pinging @elastic/es-core-infra |
I attempted to solve #35397 with minimal changes, but I'm not too happy about leaking ErrorStep specific things within the policyDefinition update code. I'd like to think about this a little bit more. thought it would be good to put this up here sooner to see what others think |
@@ -369,7 +369,7 @@ private static LifecycleExecutionState moveExecutionStateToNextStep(LifecyclePol | |||
updatedState.setFailedStep(null); | |||
updatedState.setStepInfo(null); | |||
|
|||
if (currentStep.getPhase().equals(nextStep.getPhase()) == false) { | |||
if (currentStep.getPhase().equals(nextStep.getPhase()) == false || ErrorStep.NAME.equals(currentStep.getName())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the right pace to do this. We only want this functionality to occur in the retry API so I think it should only be that code path that gets this phase definition refresh. Maybe we can have a boolean forcePhaseDefinitionRefresh
parameter and use that to override the behaviour for the retry method. I can also see a case in the future where we might want to have an option on the moveToStep API to force the phase definition to be updated too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup. this was my issue with this same thing. I thought about creating the boolean flag, but I find flags like this that significantly change behavior to be confusing. I didn't think about move-to-step potentially wanting the same thing, though. Given this potential usage outside of just errors, I think the flag would be appropriate as to not duplicate much of the code. thanks. I will try it and see how it reads
lifecycleState.getStep() + "], skipping async action check"); | ||
return; | ||
} | ||
logger.error("TRYING TRYING"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this log statement got left in accidentally.
refreshed to leverage the flag since we concluded this was a reasonable approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the build passes
#35406) Before, moving to a failed step would only change the step info to be that of the failed step. This means two things. 1. Async Steps would never be triggered to execute 2. If there are inherent problems with the action definition that can be fixed with a policy update, these changes were not being reflected by the new execution info. Changes now 1. Async steps are executed after the move to the failed step in cluster state 2. the lifecycle execution info's phase definition is updated from the current latest policy definition, even though the index isn't moving to a new phase. Closes #35397.
* master: (22 commits) Introduce CCR getting started guide (elastic#35434) Correct typo in BuildPlugin exception message (elastic#35458) Correct implemented interface of ParsedReverseNested (elastic#35455) [ML] Fix find_file_structure NPE with should_trim_fields (elastic#35465) Handle OS pretty name on old OS without OS release (elastic#35453) Register remote cluster compress setting (elastic#35464) [DOCS] Clarify results_index_name description (elastic#35463) [TEST] add missing doc tests for HLRC GetRollupIndexCaps API [HLRC] Add GetRollupIndexCaps API (elastic#35102) Geo: enables coerce support in WKT polygon parser (elastic#35414) [ILM] fix retry so it picks up latest policy and executes async action (elastic#35406) Address handling of OS pretty name on some OS (elastic#35451) HLRC support for getTask (elastic#35166) upgrade to lucene-8.0.0-snapshot-6d9c714052 (elastic#35428) Add docs for CCR stats API (elastic#35439) Fix the names of CCR stats endpoints in usage API (elastic#35438) Switch to default to no build qualifier (elastic#35415) Clarify S3 repository storage class parameter (elastic#35400) SQL: Fix query translation for scripted queries (elastic#35408) [TEST] Instead of ignoring the ccr downgrade to basic license qa test avoid the assertions that check the log files, because that does not work on Windows. The rest of the test is still useful and should work on Windows CI. ...
Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.
be fixed with a policy update, these changes were not being reflected
by the new execution info.
Changes now
latest policy definition, even though the index isn't moving to a new phase.
Closes #35397.