-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ILM Make the check-rollover-ready
step retryable
#48256
Conversation
The open and close follower steps didn't check if the index is closed, open respectively, before executing the open/close request. This changes the steps to check the index state and only perform the open/close operation if the index is not already open/closed.
This adds the infrastructure to be able to retry the execution of retryable steps up to a configurable number of times (controlled via the setting `index.lifecycle.max_failed_step_retries_count`) and makes the `check-rollover-ready` retryable as an initial step to make the `rollover` action more resilient to transitive errors.
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM) |
Created this as a draft to get some feedback on the progress. What's left is adding the retry count and the flag to signal the transitive error to the |
What is the reasoning behind adding a setting for this? If we do add this, why is the default finite (currently in the implementation, it's 15). |
@jasontedor we have integrations tests that expect the Regarding being finite or not, I don't have a strong opinion on this, just genuinely scared of unbounded loops. Although the retries are reacting to cluster state changes so there is a bit of backoff involved there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Andrei, I have some comments about it.
With regard to the number of retries, I think if we drop the retries to be on the periodic interval (so every 10 minutes by default), we should change the setting to -1 as a default, meaning to try forever. This lets us keep the ability for a user to set retries not to run infinitely, but still have the default keep trying forever by default (albeit spaced out by 10 minutes between each retry).
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/CloseFollowerIndexStep.java
Outdated
Show resolved
Hide resolved
...lugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java
Outdated
Show resolved
Hide resolved
...lugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java
Outdated
Show resolved
Hide resolved
...lugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java
Outdated
Show resolved
Hide resolved
...lugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java
Outdated
Show resolved
Hide resolved
return moveClusterStateToFailedStep(currentState, index, true); | ||
} | ||
|
||
private ClusterState moveClusterStateToFailedStep(ClusterState currentState, String index, boolean isRetry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's getting confusing having moveClusterStateToErrorStep
and moveClusterStateToFailedStep
. The names are very similar and it's difficult to tell at a glance how the two differ.
How about we rename this to something like moveClusterStateOutOfErrorStep
so it makes it clearer that one is entering the error state and the other is exiting the error state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the isRetry
is an overloaded term since we have a "retry" API, maybe it should be isAutomaticRetry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this rename some more, it seems to me that moveClusterStateToFailedStep
is actually more descriptive than moveClusterStateOutOfErrorStep
(seeing a method like this, I'd immediately ask myself "moving where?" and I'd have to look at what it does to answer that question)
I appreciate the number of moveClusterStateToX
methods increased with this PR. The methods we currently have are moveClusterStateTo
:
FailedStep
ErrorStep
NextStep
Step
RetryFailedStep
- added in this PR but it's just an overload which I'll drop as there's no need for an extra term in the vocabulary
Unfortunately, I don't have a better suggestion to clean up the namings (probably because I've spent a lot of time recently in this code and it makes sense to me) so if moveClusterStateOutOfErrorStep
makes more sense to you I'll rename it, just wanted to express my concerns before doing so. @dakrone @gwbrown
x-pack/plugin/ilm/src/test/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunnerTests.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments - I tried to deduplicate my comments that overlapped with Lee's, but may have missed some.
Another couple thoughts:
- I'm not sure
transitive
is the right adjective here as it doesn't clearly communicate what the intent is in the code. I thinkautoretryable
or something may be a better choice. - While I'm with you on infinite retries being a little scary, it does limit the usefulness of auto-retry in some cases. For example, if an index has a misconfigured alias (a common issue in the field), it would be nice if ILM could figure itself out automatically once the configuration issue is resolved. But the automatic retries will run out in 2.5 hours, which is a fairly short time window to allow for an administrator to notice and correct the issue. IMO it's worth thinking about if we need to cap the retries, and even if we do in general, if some errors can ignore the cap.
if (indexMetaData.getState() == IndexMetaData.State.OPEN) { | ||
CloseIndexRequest closeIndexRequest = new CloseIndexRequest(followerIndex); | ||
getClient().admin().indices().close(closeIndexRequest, ActionListener.wrap( | ||
r -> { | ||
assert r.isAcknowledged() : "close index response is not acknowledged"; | ||
listener.onResponse(true); | ||
}, | ||
listener::onFailure) | ||
); | ||
} else { | ||
listener.onResponse(true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can (and should be IMO) be broken out into a separate PR that can be merged before this one.
@@ -104,6 +112,12 @@ static LifecycleExecutionState fromCustomMetadata(Map<String, String> customData | |||
if (customData.containsKey(FAILED_STEP)) { | |||
builder.setFailedStep(customData.get(FAILED_STEP)); | |||
} | |||
if (customData.containsKey(IS_TRANSITIVE_ERROR)) { | |||
builder.setIsTransitiveError(Boolean.parseBoolean(customData.get(IS_TRANSITIVE_ERROR))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the one below it) should probably be wrapped in a try/catch to clarify the exception if the value isn't parseable, like the other fields in this method:
Lines 152 to 156 in 016d1c9
try { | |
builder.setStepTime(Long.parseLong(customData.get(STEP_TIME))); | |
} catch (NumberFormatException e) { | |
throw new ElasticsearchException("Custom metadata field [{}] does not contain a valid long. Actual value: [{}]", | |
e, STEP_TIME, customData.get(STEP_TIME)); |
if (indexMetaData.getState() == IndexMetaData.State.CLOSE) { | ||
OpenIndexRequest request = new OpenIndexRequest(indexMetaData.getIndex().getName()); | ||
getClient().admin().indices().open(request, ActionListener.wrap( | ||
r -> { | ||
assert r.isAcknowledged() : "open index response is not acknowledged"; | ||
listener.onResponse(true); | ||
}, | ||
listener::onFailure | ||
)); | ||
} else { | ||
listener.onResponse(true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think this should be broken out into another PR along with the change to CloseFollowerIndexStep
.
public void testCloseFollowerIndexIsNoopForAlreadyClosedIndex() { | ||
IndexMetaData indexMetadata = IndexMetaData.builder("follower-index") | ||
.settings(settings(Version.CURRENT).put(LifecycleSettings.LIFECYCLE_INDEXING_COMPLETE, "true")) | ||
.putCustom(CCR_METADATA_KEY, Collections.emptyMap()) | ||
.state(IndexMetaData.State.CLOSE) | ||
.numberOfShards(1) | ||
.numberOfReplicas(0) | ||
.build(); | ||
Client client = Mockito.mock(Client.class); | ||
CloseFollowerIndexStep step = new CloseFollowerIndexStep(randomStepKey(), randomStepKey(), client); | ||
step.performAction(indexMetadata, null, null, new AsyncActionStep.Listener() { | ||
@Override | ||
public void onResponse(boolean complete) { | ||
assertThat(complete, is(true)); | ||
} | ||
|
||
@Override | ||
public void onFailure(Exception e) { | ||
} | ||
}); | ||
|
||
Mockito.verifyZeroInteractions(client); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same deal - break out into separate PR
@@ -52,6 +52,8 @@ private static IndexLifecycleExplainResponse randomManagedIndexExplainResponse() | |||
stepNull ? null : randomAlphaOfLength(10), | |||
stepNull ? null : randomAlphaOfLength(10), | |||
randomBoolean() ? null : randomAlphaOfLength(10), | |||
stepNull ? null : randomBoolean(), | |||
stepNull ? null : randomInt(15), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this isn't arbitrary (as opposed to, for example, the 10
in randomAlphaOfLength(10)
above) this should be made a constant instead of a magic number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is somewhat arbitrary (I set it up to the previous rather arbitrary retry count I configured). I'll change it to 10 to not stand out of the bunch (especially as the retry count default will be infinite)
Request updateLifecylePollSetting = new Request("PUT", "_cluster/settings"); | ||
updateLifecylePollSetting.setJsonEntity("{" + | ||
" \"transient\": {\n" + | ||
" \"indices.lifecycle.poll_interval\" : \"1s\" \n" + | ||
" }\n" + | ||
"}"); | ||
client().performRequest(updateLifecylePollSetting); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the later one) shouldn't be necessary as we set this via Gradle for this test suite:
setting 'indices.lifecycle.poll_interval', '1000ms' |
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java
Outdated
Show resolved
Hide resolved
clusterService.submitStateUpdateTask("ilm-retry-failed-step", new ClusterStateUpdateTask() { | ||
@Override | ||
public ClusterState execute(ClusterState currentState) { | ||
return moveClusterStateToRetryFailedStep(currentState, index); | ||
} | ||
|
||
@Override | ||
public void onFailure(String source, Exception e) { | ||
logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this update task need a clusterStateProcessed()
method that calls maybeRunAsyncAction
? Like here:
Lines 69 to 80 in 96b4f3d
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) { | |
for (String index : request.indices()) { | |
IndexMetaData idxMeta = newState.metaData().index(index); | |
LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(idxMeta); | |
StepKey retryStep = new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(), lifecycleState.getStep()); | |
if (idxMeta == null) { | |
// The index has somehow been deleted - there shouldn't be any opportunity for this to happen, but just in case. | |
logger.debug("index [" + index + "] has been deleted after moving to step [" + | |
lifecycleState.getStep() + "], skipping async action check"); | |
return; | |
} | |
indexLifecycleService.maybeRunAsyncAction(newState, idxMeta, retryStep); |
Or am I missing where we call maybeRunAsyncAction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll follow up on running async steps ( #48010 (comment) )
|
||
@Override | ||
public void onFailure(String source, Exception e) { | ||
logger.error("retry execution of step [{}] failed due to [{}]", failedStep.getKey().getName(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log message should specify the index in question.
Step failedStep = stepRegistry.getStep(indexMetaData, new StepKey(lifecycleState.getPhase(), lifecycleState.getAction(), | ||
lifecycleState.getFailedStep())); | ||
if (failedStep == null) { | ||
logger.warn("failed step [{}] is not part of policy [{}] anymore, or it is invalid. skipping execution", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log message should specify the index in question.
@gwbrown @dakrone thanks for the review. I'll move on to implement your suggestions. I'll create a separate PR for the open/close follower steps changes (made them in order to make the steps a bit more resilient as the api calls would fail when executed against a read-only index for example and we execute them regardless if the index is already in the desired state ie. open or closed) |
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine update branch |
@elasticmachine retest this please |
This fixes the version checks to use 7.6.0 instead of 8.0.0 since the changes in elastic#48256 were backported to the 7.x branch.
This fixes the version checks to use 7.6.0 instead of 8.0.0 since the changes in #48256 were backported to the 7.x branch.
This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with elastic#48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves elastic#48767
This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with #48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves #48767
This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with elastic#48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves elastic#48767
This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with #48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves #48767
The rollover action is now a retryable step (see elastic#48256) so ILM will keep retrying until it succeeds as opposed to stopping and moving the execution in the ERROR step. Fixes elastic#49073
The rollover action is now a retryable step (see elastic#48256) so ILM will keep retrying until it succeeds as opposed to stopping and moving the execution in the ERROR step. Fixes elastic#49073 (cherry picked from commit 3ae9089) Signed-off-by: Andrei Dan <[email protected]> # Conflicts: # x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
The rollover action is now a retryable step (see #48256) so ILM will keep retrying until it succeeds as opposed to stopping and moving the execution in the ERROR step. Fixes #49073 (cherry picked from commit 3ae9089) Signed-off-by: Andrei Dan <[email protected]> # Conflicts: # x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
Relates: #4341, elastic/elasticsearch#48256 This commit adds FailedStepRetryCount and IsAutoRetryableError properties to ILM explain, applicable if a step fails.
Relates: #4341, elastic/elasticsearch#48256 This commit adds FailedStepRetryCount and IsAutoRetryableError properties to ILM explain, applicable if a step fails.
Relates: #4341, elastic/elasticsearch#48256 This commit adds FailedStepRetryCount and IsAutoRetryableError properties to ILM explain, applicable if a step fails. (cherry picked from commit 4dc0e83)
Relates: #4341, elastic/elasticsearch#48256 This commit adds FailedStepRetryCount and IsAutoRetryableError properties to ILM explain, applicable if a step fails. (cherry picked from commit 4dc0e83)
This adds the infrastructure to be able to retry the execution of retryable
steps and makes the
check-rollover-ready
retryable as an initial step tomake the
rollover
action more resilient to transient errors.This is the initial effort to tackle #44135