Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ILM] fix retry so it picks up latest policy and executes async action #35406

Merged
merged 4 commits into from
Nov 12, 2018

Conversation

talevy
Copy link
Contributor

@talevy talevy commented Nov 9, 2018

Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

  1. Async Steps would never be triggered to execute
  2. If there are inherent problems with the action definition that can
    be fixed with a policy update, these changes were not being reflected
    by the new execution info.

Changes now

  1. Async steps are executed after the move to the failed step in cluster state
  2. the lifecycle execution info's phase definition is updated from the current
    latest policy definition, even though the index isn't moving to a new phase.

Closes #35397.

Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Closes elastic#35397.
@talevy talevy added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 9, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@talevy
Copy link
Contributor Author

talevy commented Nov 9, 2018

I attempted to solve #35397 with minimal changes, but I'm not too happy about leaking ErrorStep specific things within the policyDefinition update code. I'd like to think about this a little bit more.

thought it would be good to put this up here sooner to see what others think

@@ -369,7 +369,7 @@ private static LifecycleExecutionState moveExecutionStateToNextStep(LifecyclePol
updatedState.setFailedStep(null);
updatedState.setStepInfo(null);

if (currentStep.getPhase().equals(nextStep.getPhase()) == false) {
if (currentStep.getPhase().equals(nextStep.getPhase()) == false || ErrorStep.NAME.equals(currentStep.getName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right pace to do this. We only want this functionality to occur in the retry API so I think it should only be that code path that gets this phase definition refresh. Maybe we can have a boolean forcePhaseDefinitionRefresh parameter and use that to override the behaviour for the retry method. I can also see a case in the future where we might want to have an option on the moveToStep API to force the phase definition to be updated too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. this was my issue with this same thing. I thought about creating the boolean flag, but I find flags like this that significantly change behavior to be confusing. I didn't think about move-to-step potentially wanting the same thing, though. Given this potential usage outside of just errors, I think the flag would be appropriate as to not duplicate much of the code. thanks. I will try it and see how it reads

lifecycleState.getStep() + "], skipping async action check");
return;
}
logger.error("TRYING TRYING");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this log statement got left in accidentally.

@talevy
Copy link
Contributor Author

talevy commented Nov 9, 2018

refreshed to leverage the flag since we concluded this was a reasonable approach.

@talevy talevy requested review from gwbrown and colings86 November 9, 2018 23:48
Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the build passes

@talevy talevy merged commit 16cbbab into elastic:master Nov 12, 2018
@talevy talevy deleted the ilm-fix-35397 branch November 12, 2018 19:33
talevy added a commit that referenced this pull request Nov 12, 2018
#35406)

Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Closes #35397.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 13, 2018
* master: (22 commits)
  Introduce CCR getting started guide (elastic#35434)
  Correct typo in BuildPlugin exception message (elastic#35458)
  Correct implemented interface of ParsedReverseNested (elastic#35455)
  [ML] Fix find_file_structure NPE with should_trim_fields (elastic#35465)
  Handle OS pretty name on old OS without OS release (elastic#35453)
  Register remote cluster compress setting (elastic#35464)
  [DOCS] Clarify results_index_name description (elastic#35463)
  [TEST] add missing doc tests for HLRC GetRollupIndexCaps API
  [HLRC] Add GetRollupIndexCaps API (elastic#35102)
  Geo: enables coerce support in WKT polygon parser (elastic#35414)
  [ILM] fix retry so it picks up latest policy and executes async action (elastic#35406)
  Address handling of OS pretty name on some OS (elastic#35451)
  HLRC support for getTask (elastic#35166)
  upgrade to lucene-8.0.0-snapshot-6d9c714052 (elastic#35428)
  Add docs for CCR stats API (elastic#35439)
  Fix the names of CCR stats endpoints in usage API (elastic#35438)
  Switch to default to no build qualifier (elastic#35415)
  Clarify S3 repository storage class parameter (elastic#35400)
  SQL: Fix query translation for scripted queries (elastic#35408)
  [TEST] Instead of ignoring the ccr downgrade to basic license qa test avoid the assertions that check the log files, because that does not work on Windows. The rest of the test is still useful and should work on Windows CI.
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management >non-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying an ILM action that is an AsyncActionStep does not work due to execution model changes
5 participants