[ILM] fix retry so it picks up latest policy and executes async action #35406

talevy · 2018-11-09T06:51:48Z

Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

Async Steps would never be triggered to execute
If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

Async steps are executed after the move to the failed step in cluster state
the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Before, moving to a failed step would only change the step info to be that of the failed step. This means two things. 1. Async Steps would never be triggered to execute 2. If there are inherent problems with the action definition that can be fixed with a policy update, these changes were not being reflected by the new execution info. Changes now 1. Async steps are executed after the move to the failed step in cluster state 2. the lifecycle execution info's phase definition is updated from the current latest policy definition, even though the index isn't moving to a new phase. Closes elastic#35397.

elasticmachine · 2018-11-09T06:51:50Z

Pinging @elastic/es-core-infra

talevy · 2018-11-09T06:54:52Z

I attempted to solve #35397 with minimal changes, but I'm not too happy about leaking ErrorStep specific things within the policyDefinition update code. I'd like to think about this a little bit more.

thought it would be good to put this up here sooner to see what others think

colings86 · 2018-11-09T09:29:51Z

...ck/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleRunner.java

@@ -369,7 +369,7 @@ private static LifecycleExecutionState moveExecutionStateToNextStep(LifecyclePol
        updatedState.setFailedStep(null);
        updatedState.setStepInfo(null);

-        if (currentStep.getPhase().equals(nextStep.getPhase()) == false) {
+        if (currentStep.getPhase().equals(nextStep.getPhase()) == false || ErrorStep.NAME.equals(currentStep.getName())) {


I'm not sure this is the right pace to do this. We only want this functionality to occur in the retry API so I think it should only be that code path that gets this phase definition refresh. Maybe we can have a boolean forcePhaseDefinitionRefresh parameter and use that to override the behaviour for the retry method. I can also see a case in the future where we might want to have an option on the moveToStep API to force the phase definition to be updated too.

yup. this was my issue with this same thing. I thought about creating the boolean flag, but I find flags like this that significantly change behavior to be confusing. I didn't think about move-to-step potentially wanting the same thing, though. Given this potential usage outside of just errors, I think the flag would be appropriate as to not duplicate much of the code. thanks. I will try it and see how it reads

gwbrown · 2018-11-09T22:44:49Z

...in/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/action/TransportRetryAction.java

+                                lifecycleState.getStep() + "], skipping async action check");
+                            return;
+                        }
+                        logger.error("TRYING TRYING");


I think this log statement got left in accidentally.

talevy · 2018-11-09T23:48:27Z

refreshed to leverage the flag since we concluded this was a reasonable approach.

colings86

LGTM once the build passes

#35406) Before, moving to a failed step would only change the step info to be that of the failed step. This means two things. 1. Async Steps would never be triggered to execute 2. If there are inherent problems with the action definition that can be fixed with a policy update, these changes were not being reflected by the new execution info. Changes now 1. Async steps are executed after the move to the failed step in cluster state 2. the lifecycle execution info's phase definition is updated from the current latest policy definition, even though the index isn't moving to a new phase. Closes #35397.

* master: (22 commits) Introduce CCR getting started guide (elastic#35434) Correct typo in BuildPlugin exception message (elastic#35458) Correct implemented interface of ParsedReverseNested (elastic#35455) [ML] Fix find_file_structure NPE with should_trim_fields (elastic#35465) Handle OS pretty name on old OS without OS release (elastic#35453) Register remote cluster compress setting (elastic#35464) [DOCS] Clarify results_index_name description (elastic#35463) [TEST] add missing doc tests for HLRC GetRollupIndexCaps API [HLRC] Add GetRollupIndexCaps API (elastic#35102) Geo: enables coerce support in WKT polygon parser (elastic#35414) [ILM] fix retry so it picks up latest policy and executes async action (elastic#35406) Address handling of OS pretty name on some OS (elastic#35451) HLRC support for getTask (elastic#35166) upgrade to lucene-8.0.0-snapshot-6d9c714052 (elastic#35428) Add docs for CCR stats API (elastic#35439) Fix the names of CCR stats endpoints in usage API (elastic#35438) Switch to default to no build qualifier (elastic#35415) Clarify S3 repository storage class parameter (elastic#35400) SQL: Fix query translation for scripted queries (elastic#35408) [TEST] Instead of ignoring the ccr downgrade to basic license qa test avoid the assertions that check the log files, because that does not work on Windows. The rest of the test is still useful and should work on Windows CI. ...

talevy added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 9, 2018

talevy requested review from dakrone, colings86 and gwbrown November 9, 2018 06:51

colings86 requested changes Nov 9, 2018

View reviewed changes

gwbrown reviewed Nov 9, 2018

View reviewed changes

talevy added 2 commits November 9, 2018 15:01

Merge remote-tracking branch 'upstream/master' into ilm-fix-35397

7432320

add flag for fetching latest policy definition

13f196b

talevy requested review from gwbrown and colings86 November 9, 2018 23:48

colings86 approved these changes Nov 12, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master' into ilm-fix-35397

9e5ba12

talevy merged commit 16cbbab into elastic:master Nov 12, 2018

talevy deleted the ilm-fix-35397 branch November 12, 2018 19:33

tomcallahan added the >non-issue label Nov 16, 2018

dakrone mentioned this pull request Nov 1, 2019

Re-read policy phase JSON when using ILM's move-to-step API #48827

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ILM] fix retry so it picks up latest policy and executes async action #35406

[ILM] fix retry so it picks up latest policy and executes async action #35406

talevy commented Nov 9, 2018

elasticmachine commented Nov 9, 2018

talevy commented Nov 9, 2018

colings86 Nov 9, 2018

talevy Nov 9, 2018

gwbrown Nov 9, 2018

talevy commented Nov 9, 2018

colings86 left a comment

[ILM] fix retry so it picks up latest policy and executes async action #35406

[ILM] fix retry so it picks up latest policy and executes async action #35406

Conversation

talevy commented Nov 9, 2018

elasticmachine commented Nov 9, 2018

talevy commented Nov 9, 2018

colings86 Nov 9, 2018

Choose a reason for hiding this comment

talevy Nov 9, 2018

Choose a reason for hiding this comment

gwbrown Nov 9, 2018

Choose a reason for hiding this comment

talevy commented Nov 9, 2018

colings86 left a comment

Choose a reason for hiding this comment