ILM fix the init step to actually be retryable #52076

andreidan · 2020-02-07T19:37:34Z

We marked the `init` ILM step as retryable, but our test used `waitUntil`
without asserting on its result, so we didn't catch that the step could
not actually be retried: our ILM state didn't contain any information
about the policy execution yet (we were still in the process of
initialising it).

This commit manually sets the current step to `init` when we're moving
the ILM policy into the ERROR step. This lets us move to the error step
successfully and retry the step later.

elasticmachine · 2020-02-07T19:37:36Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

probakowski · 2020-02-07T20:16:18Z

...lm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java

@@ -1120,26 +1120,26 @@ public void testRolloverStepRetriesUntilRolledOverIndexIsDeleted() throws Except
        // {@link org.elasticsearch.xpack.core.ilm.ErrorStep} in order to retry the failing step. As {@link #assertBusy}
        // increases the wait time between calls exponentially, we might miss the window where the policy is on
        // {@link WaitForRolloverReadyStep} and the move to `attempt-rollover` request will not be successful.
-        waitUntil(() -> {
+        assertThat(waitUntil(() -> {


I'd consider using assertTrue. We have a long body here, so it's hard to tell what we're asserting. The same applies to all instances below.

It could be changed to just assertBusy(() -> {...}, 30, TimeUnit.SECONDS) also and use assertions to signal success/failure

I've changed assertThat(waitUntil(), is(true)) to assertTrue(waitUntil()), but the reason to use waitUntil rather than assertBusy is to keep this test from flaking. We are asserting that the failed step and the retry count have particular values. To check that the failed step is the one we expect, we have to catch ILM in the ERROR step, but with retryable steps ILM keeps moving back and forth between the failing step (on retry) and the ERROR step (when the step execution fails).

assertBusy increases the wait time exponentially once it gets to 1 second, while waitUntil keeps it at 1 second. This means we're bound to catch ILM in the ERROR step using waitUntil, whereas with assertBusy we probe the ILM state only at seconds 1, 2, 4, 8 and 16.

Ahh okay, I see the difference between them now. In that case I agree that assertTrue would be better than assertThat here.
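
For readers following this thread, the polling difference described above can be modelled with a small standalone sketch. It is a simplified model and not the actual test-framework code: it assumes both helpers start with a 1 ms sleep and double it after every probe, and that only the `waitUntil`-style schedule caps the sleep at 1 second. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of two polling schedules; not the real ESTestCase code.
public class PollingSchedules {

    /** Probe times (ms since start) when the sleep doubles without any cap. */
    public static List<Long> doublingProbeTimes(long maxWaitMillis) {
        return probeTimes(maxWaitMillis, Long.MAX_VALUE);
    }

    /** Probe times (ms since start) when the doubling sleep is capped at capMillis. */
    public static List<Long> cappedProbeTimes(long maxWaitMillis, long capMillis) {
        return probeTimes(maxWaitMillis, capMillis);
    }

    private static List<Long> probeTimes(long maxWaitMillis, long capMillis) {
        List<Long> probes = new ArrayList<>();
        long sleep = 1;   // assumed initial sleep of 1 ms
        long elapsed = 0;
        while (elapsed + sleep <= maxWaitMillis) {
            elapsed += sleep;
            probes.add(elapsed);
            sleep = Math.min(capMillis, sleep * 2);
        }
        return probes;
    }

    /** Longest stretch of time during which the state is not observed at all. */
    public static long largestGap(List<Long> probes) {
        long previous = 0;
        long largest = 0;
        for (long probe : probes) {
            largest = Math.max(largest, probe - previous);
            previous = probe;
        }
        return largest;
    }

    public static void main(String[] args) {
        List<Long> doubling = doublingProbeTimes(30_000);
        List<Long> capped = cappedProbeTimes(30_000, 1_000);
        System.out.println("uncapped: " + doubling.size() + " probes, largest gap " + largestGap(doubling) + " ms");
        System.out.println("capped:   " + capped.size() + " probes, largest gap " + largestGap(capped) + " ms");
    }
}
```

Under these assumptions and a 30 second budget, the uncapped schedule leaves gaps of several seconds between its last probes, while the capped schedule never goes more than 1 second without looking at the state. That is why a transient ERROR step is much easier to miss with the uncapped schedule.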

dakrone

I left a couple of comments about this, in particular I'm not sure we need to add a new exception type just for this?

dakrone · 2020-02-07T21:21:18Z

.../plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/InitializePolicyContextStep.java

+                        .build()
+                    );
+            } catch (Exception e) {
+                throw new InitializePolicyException(e.getMessage(), e);


What was the reason for wrapping this in an ElasticsearchException type exception?

I'm not entirely happy with using an exception for flow control, but since we don't have any lifecycle state yet (we're just initialising the policy), I used this exception to signal that the current failing step is the "init policy step" when moving the policy into the ERROR step.

Another option would be to pass the current step key to IndexLifecycleTransition.moveClusterStateToErrorStep, but this seemed more intrusive.
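
To make the trade-off in this thread concrete, here is a minimal standalone sketch of the "marker exception" idea. It is not the code from this PR: the `StepKey` class, the `INIT_KEY` values and the `failedStepFor` helper are hypothetical stand-ins, and only the exception name is borrowed from the diff above.

```java
// Illustrative sketch only; all types here are simplified stand-ins.
public class InitFailureSketch {

    /** Hypothetical stand-in for a step key (phase / action / step name). */
    public static final class StepKey {
        public final String phase;
        public final String action;
        public final String name;

        public StepKey(String phase, String action, String name) {
            this.phase = phase;
            this.action = action;
            this.name = name;
        }

        @Override
        public String toString() {
            return phase + "/" + action + "/" + name;
        }
    }

    /** Placeholder key for the init step; the values are illustrative. */
    public static final StepKey INIT_KEY = new StepKey("new", "init", "init");

    /** Marker exception signalling that policy initialisation itself failed. */
    public static final class InitializePolicyException extends RuntimeException {
        public InitializePolicyException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Picks the step to record as failed. currentStep may be null when no
     * lifecycle state exists yet, which is the situation during init.
     */
    public static StepKey failedStepFor(StepKey currentStep, Exception cause) {
        if (cause instanceof InitializePolicyException) {
            // No lifecycle state to read from, so fall back to the init step.
            return INIT_KEY;
        }
        if (currentStep == null) {
            throw new IllegalStateException("no lifecycle state and not an init failure", cause);
        }
        return currentStep;
    }
}
```

The alternative mentioned above, passing the current step key explicitly, would remove the `instanceof` check at the cost of changing the signature of the method that moves the policy to the error step.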

dakrone · 2020-02-07T21:22:46Z

...lm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java

@@ -1120,26 +1120,26 @@ public void testRolloverStepRetriesUntilRolledOverIndexIsDeleted() throws Except
        // {@link org.elasticsearch.xpack.core.ilm.ErrorStep} in order to retry the failing step. As {@link #assertBusy}
        // increases the wait time between calls exponentially, we might miss the window where the policy is on
        // {@link WaitForRolloverReadyStep} and the move to `attempt-rollover` request will not be successful.
-        waitUntil(() -> {
+        assertThat(waitUntil(() -> {


It could be changed to just assertBusy(() -> {...}, 30, TimeUnit.SECONDS) also and use assertions to signal success/failure

dakrone · 2020-02-07T21:23:44Z

...lm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java


        // Similar to above, using {@link #waitUntil} as we want to make sure the `attempt-rollover` step started failing and is being
        // retried (which means ILM moves back and forth between the `attempt-rollover` step and the `error` step)
-        waitUntil(() -> {
+        assertThat("ILM did not start retrying the attempt-rollover step", waitUntil(() -> {


I think the assertThat -> waitUntil -> is(true) chain is a little hard to follow. A single assertBusy() would be better, since you can add the assertions in the body itself rather than returning a boolean.

andreidan · 2020-02-10T10:52:16Z

@elasticmachine update branch

dakrone

I left a few more comments, thanks Andrei!

dakrone · 2020-02-11T14:56:59Z

.../plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/InitializePolicyContextStep.java

-                    .put(LifecycleSettings.LIFECYCLE_ORIGINATION_DATE, parsedOriginationDate)
-                    .build()
-                );
+            try {


I think we may want to move the try to surround more of the function (for example, the fromIndexMetadata(...) call).

dakrone · 2020-02-11T14:58:06Z

...ck/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/InitializePolicyException.java

+ */
+public class InitializePolicyException extends ElasticsearchException {
+
+    public InitializePolicyException(String msg, Throwable cause, Object... args) {


Can we include the policy name in the exception somewhere? I think that might be helpful if the setting were to change but the error was still there

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleTransition.java

andreidan · 2020-02-12T13:10:57Z

@elasticmachine update branch

dakrone

I left another comment about not passing the policy in, just deriving it from the index metadata

dakrone · 2020-02-12T16:16:04Z

.../plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/InitializePolicyContextStep.java

        super(key, nextStepKey);
+        this.policy = policy;


We don't need to pass the policy in and store it in the step, we can get it directly out of the index metadata (the index.lifecycle.name setting)

Ah yes, great point!
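
As a rough illustration of this suggestion, the sketch below reads the policy name from the index settings instead of storing it on the step. It uses a plain `Map` as a stand-in for the real index metadata, and the class name, helper names and message format are hypothetical. Only the `index.lifecycle.name` setting key comes from the comment above.

```java
import java.util.Map;

// Illustrative sketch; a Map stands in for the real index settings.
public class PolicyNameLookup {

    public static final String LIFECYCLE_NAME_SETTING = "index.lifecycle.name";

    /** Returns the policy name configured on the index, or fails if none is set. */
    public static String policyNameOf(Map<String, String> indexSettings) {
        String name = indexSettings.get(LIFECYCLE_NAME_SETTING);
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("setting [" + LIFECYCLE_NAME_SETTING + "] is not set");
        }
        return name;
    }

    /** Builds an error message that names both the policy and the index. */
    public static String initFailureMessage(String index, Map<String, String> indexSettings) {
        return "unable to initialize policy [" + policyNameOf(indexSettings) + "] for index [" + index + "]";
    }
}
```

Because the name is derived on demand, the step needs no extra field, constructor argument or `equals`/`hashCode` changes.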

dakrone · 2020-02-12T16:17:45Z

.../plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/InitializePolicyContextStep.java

+    }
+
+    @Override
+    public boolean equals(Object o) {


If we get the policy id from the index metadata we don't need to add this stuff either, which will be nice.

dakrone

LGTM, thanks Andrei!

andreidan added 2 commits February 7, 2020 19:32

ILM assert on the results of waitUntil

af73584

andreidan added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 v7.7.0 v7.6.1 labels Feb 7, 2020

andreidan requested a review from dakrone February 7, 2020 19:37

Add license header

6e46ae3

probakowski reviewed Feb 7, 2020

dakrone reviewed Feb 7, 2020

Use assertTrue

26a18af

Merge branch 'master' into ilm-init-step-retryable

a180e61

dakrone requested changes Feb 11, 2020

andreidan added 4 commits February 12, 2020 12:24

Add policy and index in InitalizePolicyExeeption

27708dd

Document moving to error step from the init step

8523f21

ShrunkenIndexCheckStep: Use correct logger

b609366

break in case statement

3355825

Merge branch 'master' into ilm-init-step-retryable

d7f0828

andreidan requested a review from dakrone February 12, 2020 14:11

dakrone requested changes Feb 12, 2020

andreidan added 2 commits February 12, 2020 16:41

Derive the policy name from the index metadata

04c444c

Remove unused import

5695091

andreidan requested a review from dakrone February 12, 2020 16:42

dakrone approved these changes Feb 12, 2020

andreidan merged commit f78d4b3 into elastic:master Feb 13, 2020

andreidan added the backport pending label Feb 13, 2020

andreidan mentioned this pull request Feb 14, 2020

[7x] ILM fix the init step to actually be retryable (#52076) #52375

Merged

andreidan mentioned this pull request Feb 14, 2020

[7.6] ILM fix the init step to actually be retryable (#52076) #52376

Merged

andreidan removed the backport pending label Feb 15, 2020

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
