ILM fix the init step to actually be retryable #52076
We marked the `init` ILM step as retryable, but our test used `waitUntil` without an assert, so we didn’t catch that we were not actually able to retry this step: our ILM state didn’t contain any information about the policy execution (we were still in the process of initialising it). This commit manually sets the current step to `init` when we’re moving the ILM policy into the ERROR step, which enables us to successfully move to the error step and later retry the step.
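To illustrate how a boolean-returning wait can hide a failure, here is a minimal sketch. `WaitUntilSketch` and its methods are hypothetical stand-ins written for this note, not the Elasticsearch test framework's `waitUntil`; the only assumption carried over from the description is that the helper reports a timeout through its return value rather than by throwing.

```java
import java.util.function.BooleanSupplier;

final class WaitUntilSketch {
    // Hypothetical stand-in for a waitUntil-style helper: polls the condition
    // up to maxAttempts times and reports the outcome as a boolean.
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
        }
        return false; // gave up, but nothing is thrown
    }

    // Shape of the old test: the result is dropped, so giving up is invisible.
    static boolean uncheckedWaitCompletesSilently() {
        waitUntil(() -> false, 3);
        return true; // reached even though the condition never held
    }

    // Shape of the fixed test: the result is asserted on.
    static void checkedWait(BooleanSupplier condition) {
        if (!waitUntil(condition, 3)) {
            throw new AssertionError("condition was not met before giving up");
        }
    }
}
```

With the unchecked form the test body continues as if the condition had been met, which matches how the non-retryable `init` step went unnoticed.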
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
@@ -1120,26 +1120,26 @@ public void testRolloverStepRetriesUntilRolledOverIndexIsDeleted() throws Except
// {@link org.elasticsearch.xpack.core.ilm.ErrorStep} in order to retry the failing step. As {@link #assertBusy}
// increases the wait time between calls exponentially, we might miss the window where the policy is on
// {@link WaitForRolloverReadyStep} and the move to `attempt-rollover` request will not be successful.
- waitUntil(() -> {
+ assertThat(waitUntil(() -> {
I'd consider using `assertTrue`: we have a long body here, and it's hard to tell what we are asserting. The same applies to all instances below.
It could also be changed to just `assertBusy(() -> {...}, 30, TimeUnit.SECONDS)`, using assertions to signal success/failure.
I've changed the `assertThat(waitUntil(), is(true))` to `assertTrue(waitUntil())`, but the reason to use `waitUntil` as opposed to `assertBusy` is to avoid having this test flake. We are asserting that the failed step and the retry count have particular values. Checking that the failed step is the one we expect means we have to catch ILM in the ERROR step, but with retryable steps ILM keeps moving back and forth between the failing step (on retry) and the ERROR step (when the step execution fails). `assertBusy` increases the wait time exponentially once it gets to 1 second, while `waitUntil` keeps it at 1 second. This means we're bound to catch ILM in the ERROR step using `waitUntil` (with `assertBusy` we probe the ILM state only at seconds 1, 2, 4, 8 and 16).
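The polling argument above can be made concrete with a simplified model. The sketch below is not the test framework's code: `PollScheduleSketch`, the 1 ms starting sleep, and the exact cap are assumptions chosen for illustration. It only compares a doubling sleep with no cap against a doubling sleep capped at one second, and counts how many probes fall into a late time window.

```java
import java.util.ArrayList;
import java.util.List;

final class PollScheduleSketch {
    // Cumulative times (ms) at which the condition is probed after each sleep.
    // The sleep starts at 1 ms and doubles, never exceeding capMillis.
    static List<Long> probeTimes(long maxTimeMillis, long capMillis) {
        List<Long> probes = new ArrayList<>();
        long sleep = 1;
        long elapsed = 0;
        while (elapsed + sleep <= maxTimeMillis) {
            elapsed += sleep;
            probes.add(elapsed);
            sleep = Math.min(capMillis, sleep * 2);
        }
        return probes;
    }

    // Number of probes that land inside the window [fromMillis, toMillis].
    static long probesInWindow(List<Long> probes, long fromMillis, long toMillis) {
        return probes.stream().filter(t -> t >= fromMillis && t <= toMillis).count();
    }
}
```

Under these assumptions, over a 30 second budget the uncapped schedule places no probe between seconds 20 and 30, while the schedule capped at one second places ten there. A state that only appears briefly and late is therefore much easier to observe with the capped schedule.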
Ahh okay, I see the difference between them now. In that case I agree that `assertTrue` would be better than `assertThat` here.
I left a couple of comments about this. In particular, I'm not sure we need to add a new exception type just for this?
        .build()
    );
} catch (Exception e) {
    throw new InitializePolicyException(e.getMessage(), e);
What was the reason for wrapping this in an `ElasticsearchException` type exception?
I'm not entirely happy with using an exception for flow control, but since we don't have any lifecycle state yet (we're just initialising the policy), I used this exception to signal that the current failing step is the "init policy step" when moving the policy into the `ERROR` step.
Another option would be to pass the current step key to `IndexLifecycleTransition.moveClusterStateToErrorStep`, but this seemed more intrusive.
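The trade-off described above can be sketched roughly as follows. Everything here is hypothetical and written for this note: `InitFailureSketch`, `InitFailedException` and `failedStepFor` are illustrative names, not the classes or signatures in the PR. The sketch only shows the idea of falling back to `init` when no lifecycle state has been recorded yet and the failure carries the marker exception type.

```java
final class InitFailureSketch {
    // Hypothetical marker exception raised while initialising a policy.
    static final class InitFailedException extends RuntimeException {
        InitFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Picks the step to record as failed when moving to the error step.
    // recordedCurrentStep is null when no lifecycle state exists yet.
    static String failedStepFor(String recordedCurrentStep, Exception cause) {
        if (recordedCurrentStep != null) {
            return recordedCurrentStep;
        }
        if (cause instanceof InitFailedException) {
            return "init"; // the exception type stands in for the missing state
        }
        throw new IllegalStateException("no lifecycle state to derive the failed step from", cause);
    }
}
```

Passing the step key explicitly would avoid the `instanceof` check, at the cost of changing the signature of the transition method and every caller.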
// Similar to above, using {@link #waitUntil} as we want to make sure the `attempt-rollover` step started failing and is being
// retried (which means ILM moves back and forth between the `attempt-rollover` step and the `error` step)
- waitUntil(() -> {
+ assertThat("ILM did not start retrying the attempt-rollover step", waitUntil(() -> {
I think the `assertThat` -> `waitUntil` -> `is(true)` chain is a little hard to follow. I think just a single `assertBusy()` would be better, since you can add the assertions in the body itself rather than returning a boolean?
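For comparison, the assertion-in-body style suggested here looks roughly like the sketch below. `AssertBusySketch.assertBusy` is a hypothetical, attempt-counted helper written for this note; it is not the framework's `assertBusy`, which is time-based. It only shows the pattern of retrying a block until its assertions stop throwing.

```java
final class AssertBusySketch {
    // Runs the assertions until they pass or maxAttempts is exhausted.
    // Returns the attempt number on which they first passed.
    static int assertBusy(Runnable assertions, int maxAttempts) {
        AssertionError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                assertions.run();
                return attempt;
            } catch (AssertionError e) {
                last = e; // remember the most recent failure and retry
            }
        }
        throw new AssertionError("assertions still failing after " + maxAttempts + " attempts", last);
    }
}
```

The failure message then comes from the assertion that actually failed, rather than from a generic "expected true" on a boolean.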
@elasticmachine update branch
I left a few more comments, thanks Andrei!
        .put(LifecycleSettings.LIFECYCLE_ORIGINATION_DATE, parsedOriginationDate)
        .build()
);
try {
I think we may want to move the `try` to surround more of the function (for example, the `fromIndexMetadata(...)` call).
 */
public class InitializePolicyException extends ElasticsearchException {

    public InitializePolicyException(String msg, Throwable cause, Object... args) {
Can we include the policy name in the exception somewhere? I think that might be helpful if the setting were to change while the error was still there.
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleTransition.java
@elasticmachine update branch
I left another comment about not passing the policy in and instead deriving it from the index metadata.
super(key, nextStepKey);
this.policy = policy;
We don't need to pass the policy in and store it in the step; we can get it directly out of the index metadata (the `index.lifecycle.name` setting).
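A minimal sketch of that suggestion, using a plain `Map` in place of the real index metadata and settings types: `PolicyNameSketch` and `policyNameOf` are hypothetical names for this note, and only the `index.lifecycle.name` key comes from the comment above.

```java
import java.util.Map;

final class PolicyNameSketch {
    // Setting key named in the review comment.
    static final String LIFECYCLE_NAME_SETTING = "index.lifecycle.name";

    // Reads the policy name from the index's own settings instead of
    // carrying it as a field on the step.
    static String policyNameOf(Map<String, String> indexSettings) {
        String name = indexSettings.get(LIFECYCLE_NAME_SETTING);
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("index has no " + LIFECYCLE_NAME_SETTING + " setting");
        }
        return name;
    }
}
```

Because the step no longer holds the policy as state, it also needs no extra constructor argument or `equals`/`hashCode` changes.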
Ah yes, great point!
}

@Override
public boolean equals(Object o) {
If we get the policy id from the index metadata we don't need to add this stuff either, which will be nice.
LGTM, thanks Andrei!
We marked the `init` ILM step as retryable but our test used `waitUntil` without an assert so we didn’t catch the fact that we were not actually able to retry this step as our ILM state didn’t contain any information about the policy execution (as we were in the process of initialising it). This commit manually sets the current step to `init` when we’re moving the ilm policy into the ERROR step (this enables us to successfully move to the error step and later retry the step) * ShrunkenIndexCheckStep: Use correct logger (cherry picked from commit f78d4b3) Signed-off-by: Andrei Dan <[email protected]>