ILM: Changing a policy does not have desired effect. #48431
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
The discussion on #46357 is highly related.
This comment is extremely relevant to my ILM experience: #46357 (comment). It’s possible the design of ILM is incompatible with what I want to do: describe retention policy for my data.
Maybe the broader question is this: Is ILM even for my use case? ILM applies when the index is created, or something like that? Because we cannot "update" a policy, we are left with a very on-your-own "change the policy" scenario. From discussions today, we would need to:
"Find all affected indices" is basically "Query /*/_settings and select all indices with a specific lifecycle name" -- Then apply two steps (create a policy, apply to those indices). Now, if we look at what we're doing: For each item in /*/_settings, If the But looking at this specific scenario, "Delete older than 30 days". The result of /*/_settings gives us In these two scenarios, it's fewer steps to just implement ILM myself with curl than it is to create a new policy and apply it to all affected indices, then wait for ILM to execute and hope I did it right. This seems backwards. Compare this scenario to Curator:
With Curator, the "create a policy" and "change a policy" are identical steps:
With ILM, the "create a policy" and "change a policy" are very different steps.
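The "find all affected indices" step described above can be sketched in a few lines. This is a minimal illustration, assuming the response has already been fetched and parsed into a dictionary shaped like the nested output of `GET /*/_settings`; the function name `indices_using_policy` and the sample index names are hypothetical.

```python
from typing import Any, Dict, List


def indices_using_policy(settings_response: Dict[str, Any], policy_name: str) -> List[str]:
    """Return the names of indices whose index.lifecycle.name equals policy_name."""
    matches = []
    for index_name, body in settings_response.items():
        # Indices without any lifecycle settings simply have no "lifecycle" key.
        lifecycle = body.get("settings", {}).get("index", {}).get("lifecycle", {})
        if lifecycle.get("name") == policy_name:
            matches.append(index_name)
    return sorted(matches)


# Hypothetical sample data, shaped like a parsed GET /*/_settings response.
sample = {
    "beats-2019.09.01": {"settings": {"index": {"lifecycle": {"name": "infosec-beats"}}}},
    "beats-2019.10.01": {"settings": {"index": {"lifecycle": {"name": "infosec-beats"}}}},
    "metrics-2019.10.01": {"settings": {"index": {"lifecycle": {"name": "other-policy"}}}},
    "plain-index": {"settings": {"index": {"number_of_shards": "1"}}},
}

print(indices_using_policy(sample, "infosec-beats"))
```

The selection itself is simple; the complaint in this comment is about everything that has to happen after it.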
This is a blocker for us using ILM beyond "hot phase rollover"
I believe the problem stems from a difference in execution style between Curator and ILM. I have an illustration that might help think about the problem. In ILM, you can think of an ILM policy as a factory assembly line (pardon my poor ASCII art)
When an index is first assigned a policy, it begins at the "start" and starts executing actions (numbered 1, 2, and 3 above). These actions may be quick, or they may block for a long time (such as waiting for the
Curator, on the other hand, can be thought of as a list of
index.rolloverIfNeeded()
// 1
if (index.age() > "5d" && index.needForceMerge()) {
  index.forceMerge()
}
// 2
if (index.age() > "30d" && index.isNotFrozen()) {
  index.freeze()
}
// 3
if (index.age() > "60d") {
  index.moveToColdNodes()
}
Now, let's tackle the scenario where you want to add a fourth step, a "delete" step (but it could be any step). In Curator's case this looks like adding an additional
// 4
if (index.age() > "90d") {
  index.delete()
}
For ILM, however, it's a little different. If we change the policy to add the In our industrial analogy, if we add a machine at the end of the car assembly line to paint the car yellow, all of the previously assembled cars don't immediately turn yellow. Curator, on the other hand, is going to all the cars ever created, checking whether they meet the criteria to be painted yellow, and painting them yellow, regardless of what other steps they have previously run or not run. Okay, hopefully that clarifies the difference in execution model between the two. There are two main ways to tackle the cognitive mismatch this leads to.
There are probably other ways to tackle this too, so feel free to leave ideas. This is just the start of a discussion to hopefully frame the mismatch in execution models that we've seen to cause some incorrect assumptions.
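To make the two execution models concrete, here is a toy simulation. It is a deliberately simplified sketch of the behavior described in this comment, not how either Curator or ILM is implemented; the `Index` class, both function names, and the action names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A rule pairs a minimum age in days with an action name.
Rule = Tuple[int, str]


@dataclass
class Index:
    name: str
    age_days: int
    applied: List[str] = field(default_factory=list)
    # Position on the "assembly line": how many steps have been passed.
    position: int = 0
    finished: bool = False


def curator_style_run(index: Index, rules: List[Rule]) -> None:
    """Re-evaluate every rule on every run, regardless of history."""
    for min_age, action in rules:
        if index.age_days > min_age and action not in index.applied:
            index.applied.append(action)


def assembly_line_run(index: Index, steps: List[Rule]) -> None:
    """Advance from the stored position; a finished index is never revisited."""
    if index.finished:
        return
    while index.position < len(steps):
        min_age, action = steps[index.position]
        if index.age_days <= min_age:
            return  # Block at this step until the index is old enough.
        index.applied.append(action)
        index.position += 1
    index.finished = True


rules = [(5, "forcemerge"), (30, "freeze"), (60, "move_to_cold")]
a = Index("logs-a", age_days=100)
b = Index("logs-b", age_days=100)
curator_style_run(a, rules)
assembly_line_run(b, rules)

# Add a fourth "delete" rule after both indices have been fully processed.
rules_with_delete = rules + [(90, "delete")]
curator_style_run(a, rules_with_delete)
assembly_line_run(b, rules_with_delete)

print(a.applied)  # includes "delete"
print(b.applied)  # does not include "delete"
```

In this model the index handled Curator-style picks up the new rule on the next run, while the index that already left the assembly line is untouched.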
This factory/cars metaphor is a good one. I suppose in this model, we've "finished" making the cars, but there's a defect, and now we need to issue a recall to apply a correction of some kind. To do a recall with ILM, we're on our own (Find all affected indices ourselves, carefully apply a new policy, etc). To do a recall with Curator, we just update the curator config. The way I'm thinking about this is that I am not particularly curious about the implementation details. I want a way to describe my retention policy, "Delete data older than 30 days" and implementing this with ILM is only possible if we never make mistakes or never change our minds. Scenarios:
My illustration is to highlight that Curator allows me to describe my objectives and it implements my objectives. With ILM, unless I am perfect in describing my objectives the first time, then I have to do a bunch of work that might break depending on the phase ILM is in. The concept of "phases", a specific set of constants (hot, warm, delete, etc), is a state machine I must be completely aware of whenever changing ILM settings -- If I'm in the middle of the machine (between "hot" and "completed"?), changing ILM settings will depend on the current state. If I'm updating a phase configuration after ILM has executed that phase, I have to find-and-replace all instances of that ILM policy myself, based on what we've discussed so far, right?
The way this is italicized around "every time curator runs" makes me think there's some problem with that? Back to ILM and thinking out loud about how to make this repair, "I forgot to implement 30 days retention":
# Get a list of indices affected:
curl -s --cacert <(echo "$cert") --resolve foo-es:9200:127.0.0.1 -u "elastic:$pass" "https://foo-es:9200/*/_settings" | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name'
With this list, we need to update each index's lifecycle name, right? My list of indices is too long for a single request:
Ok, so we have to make request lines shorter. Let's make the index list 2000 chars long each.
With this, we replace the policy, right?
I haven't tested the actual "delete and apply new policy" step, but the "find affected indices" and "split requests to stay within ES's max request length boundary" parts work. Is this what you expect for the repair steps?
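The batching step described above can be sketched as follows. This is an illustration rather than the commands used in the comment: it splits a list of index names into groups whose comma-joined form stays within 2000 characters and builds one index settings update per group that sets `index.lifecycle.name`. The function names and the tuple layout are hypothetical, nothing is sent over the network, and any separate step for removing the old policy is not covered.

```python
import json
from typing import Iterator, List, Tuple


def batch_by_length(names: List[str], max_chars: int = 2000) -> Iterator[List[str]]:
    """Yield groups of names whose comma-joined form is at most max_chars long.

    A single name longer than max_chars is still yielded, alone in its own group.
    """
    batch: List[str] = []
    length = 0
    for name in names:
        # A name costs its own length plus one comma when the batch is non-empty.
        cost = len(name) + (1 if batch else 0)
        if batch and length + cost > max_chars:
            yield batch
            batch, length, cost = [], 0, len(name)
        batch.append(name)
        length += cost
    if batch:
        yield batch


def policy_update_requests(
    names: List[str], new_policy: str, max_chars: int = 2000
) -> List[Tuple[str, str, str]]:
    """Build (method, path, body) tuples for index settings updates."""
    body = json.dumps({"index": {"lifecycle": {"name": new_policy}}})
    return [
        ("PUT", "/" + ",".join(batch) + "/_settings", body)
        for batch in batch_by_length(names, max_chars)
    ]


# Hypothetical index names and policy name.
names = [f"idx-{i:04d}" for i in range(500)]
for method, path, body in policy_update_requests(names, "infosec-beats-30d"):
    print(method, len(path), body)
```

Each tuple could then be handed to curl or an HTTP client of choice.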
Any thoughts?
Not necessarily. In your case, you delete indices that are older than 30 days and change the policy so future indices get automatically deleted after 30 days.
No, for this you can change the policy, no need for manual discover or work.
No, for this you can change the policy from "30d" to "15d".
If you want to perform an action after ILM has executed that phase, then you'll have to do those actions manually. In your previous examples, however, changing the
None other than emphasizing the difference in execution models.
This is really complicating it I think. Here's how to get a list of indices to delete and delete them:
curl -u elastic:password -s "localhost:9200/_cat/indices?h=i,cd&format=json" | jq -r '.[] | select(1572362742828 - (.cd|tonumber) >= 2592000000) | .i'
Replace 1572362742828 above with the
One quick side note (not related to ILM), don't use
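The same age filter can be written so that the current time is computed rather than pasted in by hand. This is a small sketch, assuming the rows have already been parsed from `_cat/indices?h=i,cd&format=json` output, where `cd` is the creation date in epoch milliseconds as a string; the function name and sample rows are hypothetical.

```python
import time
from typing import Dict, List, Optional

THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000  # 2592000000, the constant used above


def indices_older_than(
    cat_rows: List[Dict[str, str]],
    max_age_ms: int = THIRTY_DAYS_MS,
    now_ms: Optional[int] = None,
) -> List[str]:
    """Return index names whose creation date is at least max_age_ms in the past."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [row["i"] for row in cat_rows if now_ms - int(row["cd"]) >= max_age_ms]


# Hypothetical rows; "now" is pinned to the timestamp used in the command above.
rows = [
    {"i": "old-index", "cd": "1569770742828"},
    {"i": "new-index", "cd": "1572000000000"},
]
print(indices_older_than(rows, now_ms=1572362742828))
```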
Okay, that got a little bit on a tangent focusing solely on deletions. There are other considerations we should address, specifically: how do we address the disconnect between the execution model of ILM and the model that users expect? I've been thinking about this recently and have a few half-formed ideas to brainstorm around.

One would be to store the previously executed steps somewhere in the index lifecycle state. This would allow us to "re-apply" a policy idempotently, skipping the steps that had already been executed. A user could then update a policy to add a delete phase and have the steps that had already been performed be skipped. There are some downsides to this, especially for policies without a delete: we wouldn't want to allocate a bunch of indices to a warm node, only to turn around and immediately move them to cold nodes just because a user happened to add both allocation steps to a policy that was already complete.

Another idea would be to execute the policy of an index in the "complete" phase backwards when the policy changes, allowing us to immediately execute a 'delete' action if one were added to a policy where indices had already finished execution.

Neither of these is fully fleshed out; they are just some ideas to get us started on brainstorming. Feel free to suggest other ideas for our discussion.
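The first idea, recording executed steps so that a policy can be re-applied idempotently, might look roughly like this. It is a brainstorming-level sketch and not ILM code; the `(phase, action)` tuples, the function name `reapply_policy`, and the sample policies are all hypothetical.

```python
from typing import List, Set, Tuple

Step = Tuple[str, str]  # (phase, action)


def reapply_policy(policy_steps: List[Step], executed: Set[Step]) -> List[Step]:
    """Return the steps to run now, skipping any recorded as already executed."""
    to_run = [step for step in policy_steps if step not in executed]
    executed.update(to_run)  # Record them so a later re-apply skips them too.
    return to_run


# An index that already finished a policy with two steps.
executed = {("hot", "rollover"), ("warm", "forcemerge")}
policy = [("hot", "rollover"), ("warm", "forcemerge"), ("delete", "delete")]
first = reapply_policy(policy, executed)
second = reapply_policy(policy, executed)
print(first, second)

# The downside mentioned above: two allocation steps added late run back to back.
done = {("hot", "rollover")}
late = [("hot", "rollover"), ("warm", "allocate"), ("cold", "allocate")]
back_to_back = reapply_policy(late, done)
print(back_to_back)
```

The second half shows why this approach needs more thought: nothing in the recorded history says that the warm allocation is pointless once the cold one is due.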
There are some cases when updating a policy does not change the structure in a significant way. In these cases, we can reread the policy definition for any indices using the updated policy. This commit adds this refreshing to the `TransportPutLifecycleAction` to allow this. It allows us to do things like change the configuration values for a particular step, even when on that step (for example, changing the rollover criteria while on the `check-rollover-ready` step). There are more cases where the phase definition can be reread than just the ones checked here (for example, removing an action that has already been passed), and those will be added in subsequent work. Relates to elastic#48431
Currently when an ILM policy finishes its execution, the index moves into the `TerminalPolicyStep`, denoted by a completed/completed/completed phase/action/step lifecycle execution state. This commit changes the behavior so that the index lifecycle execution state halts at the last configured phase's `PhaseCompleteStep`, so for instance, if an index were configured with a policy containing a `hot` and `cold` phase, the index would stop at the `cold/complete/complete` `PhaseCompleteStep`. This allows an ILM user to update the policy to add any later phases and have indices configured to use that policy pick up execution at the newly added "later" phase. For example, if a `delete` phase were added to the policy specified above, the index would then move from `cold/complete/complete` into the `delete` phase. Relates to elastic#48431
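The resume behavior this commit message describes can be modeled in a few lines. This is a simplified sketch, not the Elasticsearch implementation: the phase ordering in `PHASE_ORDER` and the function name `next_phase` are assumptions made for the example.

```python
from typing import List, Optional

# Phase ordering assumed for this sketch.
PHASE_ORDER = ["hot", "warm", "cold", "delete"]


def next_phase(current_phase: str, policy_phases: List[str]) -> Optional[str]:
    """Return the first configured phase that comes after current_phase, if any."""
    current_rank = PHASE_ORDER.index(current_phase)
    for phase in PHASE_ORDER[current_rank + 1:]:
        if phase in policy_phases:
            return phase
    return None  # Nothing later is configured; stay parked where we are.


# An index parked at cold/complete/complete under a hot + cold policy.
print(next_phase("cold", ["hot", "cold"]))
# After the policy is updated to add a delete phase, there is somewhere to go.
print(next_phase("cold", ["hot", "cold", "delete"]))
```

Because the index stays parked at the last configured phase instead of a terminal state, adding a later phase gives it a next step to move into.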
@JacksonHacker whoops yep, thanks for pointing that out, I edited my comment.
Tested on Elasticsearch 7.3.0
We ran out of disk today. We noticed that Curator for some reason was not deleting indices, possibly because it detects ILM and ignores indices with ILM enabled.
I updated our existing ILM policy to add:
This policy change never takes effect due to some behavior in ILM that I am not understanding.
If index `foo` has lifecycle policy `bar` and `bar` policy is updated to say "Delete when older than 30 days", then `foo` should be deleted when it is older than 30 days. If `foo` is already older than 30 days, then it should be deleted by ILM.

Without this, attempting to modify policy when there are hundreds of indices leaves us without much obvious recourse besides doing it ourselves with scripting or with curator.