ILM fix the init step to actually be retryable #52076

Merged: 12 commits, Feb 13, 2020

andreidan
Contributor

We marked the `init` ILM step as retryable, but our test used `waitUntil`
without an assert, so we didn't catch the fact that we could not actually
retry this step: our ILM state didn't contain any information about the
policy execution (as we were in the process of initialising it).

This commit manually sets the current step to `init` when we're moving
the ILM policy into the ERROR step. This lets us successfully move to
the error step and later retry the step.
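
To illustrate the testing pitfall described above, here is a minimal sketch of a polling helper that reports a timeout through its return value instead of throwing. The `waitUntil` method below is a hypothetical stand-in written for this example (its signature and timings are assumptions, not the Elasticsearch test framework's helper); the point is only that a dropped boolean result lets a test continue as if the condition had been met.

```java
import java.util.function.BooleanSupplier;

final class WaitUntilSketch {
    /**
     * Hypothetical stand-in for a waitUntil-style helper: polls the condition at a
     * fixed interval and returns false on timeout instead of throwing.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report the condition as not reached.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check once the deadline has passed.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        BooleanSupplier neverTrue = () -> false;

        // Result dropped: the timeout goes unnoticed and the "test" simply carries on.
        waitUntil(neverTrue, 50, 10);
        System.out.println("ignored result: nothing failed");

        // Result checked: the same timeout is now visible to the caller.
        boolean reached = waitUntil(neverTrue, 50, 10);
        System.out.println("checked result: reached=" + reached);
    }
}
```

Wrapping the call in an assertion, for example `assertTrue(waitUntil(...))`, is what turns the `false` return into a test failure.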

andreidan added the >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.0.0, v7.7.0 and v7.6.1 labels on Feb 7, 2020
andreidan requested a review from dakrone on February 7, 2020 19:37
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@@ -1120,26 +1120,26 @@ public void testRolloverStepRetriesUntilRolledOverIndexIsDeleted() throws Except
// {@link org.elasticsearch.xpack.core.ilm.ErrorStep} in order to retry the failing step. As {@link #assertBusy}
// increases the wait time between calls exponentially, we might miss the window where the policy is on
// {@link WaitForRolloverReadyStep} and the move to `attempt-rollover` request will not be successful.
- waitUntil(() -> {
+ assertThat(waitUntil(() -> {
Contributor

I'd consider using assertTrue. The body here is long, so it's hard to tell what we are asserting. The same applies to all instances below.

Member

It could also be changed to just assertBusy(() -> {...}, 30, TimeUnit.SECONDS), using assertions in the body to signal success/failure.

Contributor Author

I've changed the assertThat(waitUntil(), is(true)) to assertTrue(waitUntil()), but the reason to use waitUntil rather than assertBusy is to keep this test from flaking. We are asserting that the failed step and the retry count have particular values. Checking that the failed step is what we expect means we have to catch ILM in the ERROR step, but with retryable steps ILM keeps moving back and forth between the failing step (on retry) and the ERROR step (when the step execution fails).

assertBusy increases the wait time exponentially once it gets to 1 second, while waitUntil keeps it at 1 second. This means we're bound to catch ILM in the ERROR step using waitUntil, whereas with assertBusy we probe the ILM state only at seconds 1, 2, 4, 8 and 16.
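
As a rough illustration of the two polling patterns described in this comment, the sketch below lists the probe times for a fixed one-second interval and for a doubling schedule. It models only the simplified timings quoted above ("seconds 1, 2, 4, 8 and 16"), not the actual `assertBusy` or `waitUntil` implementations; the class name, method names, and the 20 to 25 second ERROR window are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ProbeSchedules {
    /** Probe times (seconds) for a fixed polling interval, as described for waitUntil. */
    static List<Long> fixedIntervalProbes(long intervalSeconds, long maxSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<Long> probes = new ArrayList<>();
        for (long t = intervalSeconds; t <= maxSeconds; t += intervalSeconds) {
            probes.add(t);
        }
        return probes;
    }

    /** Probe times (seconds) that double each round, as described for assertBusy. */
    static List<Long> doublingProbes(long firstSeconds, long maxSeconds) {
        if (firstSeconds <= 0) {
            throw new IllegalArgumentException("first probe time must be positive");
        }
        List<Long> probes = new ArrayList<>();
        for (long t = firstSeconds; t <= maxSeconds; t *= 2) {
            probes.add(t);
        }
        return probes;
    }

    /** True if at least one probe time falls inside the closed window [from, to]. */
    static boolean anyProbeWithin(List<Long> probes, long from, long to) {
        return probes.stream().anyMatch(t -> t >= from && t <= to);
    }

    public static void main(String[] args) {
        List<Long> fixed = fixedIntervalProbes(1, 30);
        List<Long> doubling = doublingProbes(1, 30);
        System.out.println("fixed probe count: " + fixed.size());
        System.out.println("doubling probes:   " + doubling);

        // Illustrative window only: suppose ILM sits in the ERROR step from second 20 to 25.
        System.out.println("fixed hits window:    " + anyProbeWithin(fixed, 20, 25));
        System.out.println("doubling hits window: " + anyProbeWithin(doubling, 20, 25));
    }
}
```

Under these assumptions the fixed schedule probes inside the window while the doubling schedule skips from second 16 past second 30, which is the gap the comment is worried about.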

Member

Ahh okay, I see the difference between them now. In that case I agree that assertTrue would be better than assertThat.

Member

dakrone left a comment

I left a couple of comments about this. In particular, I'm not sure we need to add a new exception type just for this.

.build()
);
} catch (Exception e) {
throw new InitializePolicyException(e.getMessage(), e);
Member

What was the reason for wrapping this in an ElasticsearchException type exception?

Contributor Author

I'm not entirely happy with using an exception for flow control but since we don't have any lifecycle state yet (as we're just initialising the policy) I used this exception to signal the current failing step is the "init policy step" when moving the policy into the ERROR step.

Another option would be to pass the current step key to IndexLifecycleTransition.moveClusterStateToErrorStep but this seemed more intrusive.
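
A minimal sketch of the pattern discussed in this comment, under stated assumptions: in the PR the exception extends `ElasticsearchException`, whereas here it extends `RuntimeException` so the example stays self-contained. The `initializePolicy` and `failedStepFor` helpers are hypothetical names invented for this illustration and do not mirror the actual ILM code paths.

```java
import java.util.Optional;

final class InitFailureSketch {
    /** Stand-in marker exception; the real class in the PR extends ElasticsearchException. */
    static final class InitializePolicyException extends RuntimeException {
        InitializePolicyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical helper: wraps any failure during policy initialisation in the marker exception. */
    static void initializePolicy(Runnable initialisation) {
        try {
            initialisation.run();
        } catch (Exception e) {
            throw new InitializePolicyException(e.getMessage(), e);
        }
    }

    /**
     * Hypothetical helper: picks the step to record as failed. While a policy is being
     * initialised there is no lifecycle state yet, so the marker exception is the only
     * signal that the failing step was "init".
     */
    static String failedStepFor(Exception cause, Optional<String> currentStep) {
        if (cause instanceof InitializePolicyException) {
            return "init";
        }
        return currentStep.orElseThrow(
            () -> new IllegalStateException("no lifecycle state to derive the failed step from"));
    }

    public static void main(String[] args) {
        try {
            initializePolicy(() -> {
                throw new IllegalArgumentException("could not parse origination date");
            });
        } catch (InitializePolicyException e) {
            System.out.println("failed step: " + failedStepFor(e, Optional.empty()));
        }
    }
}
```

The alternative mentioned in the comment, passing the current step key explicitly to the error transition, would avoid using an exception type as a signal at the cost of changing that method's signature.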


// Similar to above, using {@link #waitUntil} as we want to make sure the `attempt-rollover` step started failing and is being
// retried (which means ILM moves back and forth between the `attempt-rollover` step and the `error` step)
- waitUntil(() -> {
+ assertThat("ILM did not start retrying the attempt-rollover step", waitUntil(() -> {
Member

I think the assertThat -> waitUntil -> is(true) chain is a little hard to follow. A single assertBusy() would be better, since you can put the assertions in the body itself rather than returning a boolean.

@andreidan
Contributor Author

@elasticmachine update branch

Member

dakrone left a comment

I left a few more comments, thanks Andrei!

.put(LifecycleSettings.LIFECYCLE_ORIGINATION_DATE, parsedOriginationDate)
.build()
);
try {
Member

I think we may want to move the try to surround more of the function (for example, the fromIndexMetadata(...) call).

*/
public class InitializePolicyException extends ElasticsearchException {

public InitializePolicyException(String msg, Throwable cause, Object... args) {
Member

Can we include the policy name in the exception somewhere? I think that might be helpful if the setting were to change while the error was still there.

@andreidan
Contributor Author

@elasticmachine update branch

@andreidan andreidan requested a review from dakrone February 12, 2020 14:11
Member

dakrone left a comment

I left another comment about not passing the policy in and instead deriving it from the index metadata.

super(key, nextStepKey);
this.policy = policy;
Member

We don't need to pass the policy in and store it in the step, we can get it directly out of the index metadata (the index.lifecycle.name setting)

Contributor Author

Ah yes, great point!

}

@Override
public boolean equals(Object o) {
Member

If we get the policy id from the index metadata we don't need to add this stuff either, which will be nice.

@andreidan andreidan requested a review from dakrone February 12, 2020 16:42
Member

dakrone left a comment

LGTM, thanks Andrei!

@andreidan andreidan merged commit f78d4b3 into elastic:master Feb 13, 2020
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Feb 14, 2020
We marked the `init` ILM step as retryable but our test used `waitUntil`
without an assert so we didn’t catch the fact that we were not actually
able to retry this step as our ILM state didn’t contain any information
about the policy execution (as we were in the process of initialising
it).

This commit manually sets the current step to `init` when we’re moving
the ilm policy into the ERROR step (this enables us to successfully
move to the error step and later retry the step)

* ShrunkenIndexCheckStep: Use correct logger

(cherry picked from commit f78d4b3)
Signed-off-by: Andrei Dan <[email protected]>
andreidan added a commit that referenced this pull request Feb 15, 2020
We marked the `init` ILM step as retryable but our test used `waitUntil`
without an assert so we didn’t catch the fact that we were not actually
able to retry this step as our ILM state didn’t contain any information
about the policy execution (as we were in the process of initialising
it).

This commit manually sets the current step to `init` when we’re moving
the ilm policy into the ERROR step (this enables us to successfully
move to the error step and later retry the step)

* ShrunkenIndexCheckStep: Use correct logger

(cherry picked from commit f78d4b3)
Signed-off-by: Andrei Dan <[email protected]>