Fix Stacktraces Taking Memory in ILM Error Step Serialization #84266
Conversation
Don't serialize stack-traces to ILM error steps. These take multiple KB per index in the cluster state (CS). In bad cases where an error might affect thousands of indices, this can mean hundreds of MB of heap memory used on every node in the cluster just to keep the serialized stack-traces around in the CS, which provide limited value anyway.
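For a rough sense of scale, here is a back-of-the-envelope sketch (the index count below is an illustrative assumption, not a number from this PR) showing that a single rendered Java stack trace is typically a few KB, which multiplies quickly across indices:

```java
public class StackTraceSizeSketch {
    public static void main(String[] args) {
        Exception e = new IllegalArgumentException(
            "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined");
        // Render the trace the same way it would appear in a serialized step_info.
        StringBuilder sb = new StringBuilder().append(e).append('\n');
        for (StackTraceElement el : e.getStackTrace()) {
            sb.append("\tat ").append(el).append('\n');
        }
        int perTraceChars = sb.length();   // typically a few thousand characters
        int failingIndices = 20_000;       // assumption: a widespread failure, e.g. mass rollover errors
        long totalBytes = (long) perTraceChars * failingIndices;
        System.out.printf("~%d chars per trace -> ~%.1f MB of trace text across %d indices%n",
            perTraceChars, totalBytes / 1e6, failingIndices);
    }
}
```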
Pinging @elastic/es-data-management (Team:Data Management)

Hi @original-brownbear, I've created a changelog YAML for you.
- stepInfo.toXContent(infoXContentBuilder, ToXContent.EMPTY_PARAMS);
- stepInfoString = BytesReference.bytes(infoXContentBuilder).utf8ToString();
- }
+ final String stepInfoString = Strings.toString(stepInfo);
Admittedly slightly unrelated but I figured I'd fix it here since I fixed the other spot.
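Separately from that cleanup, here is a minimal plain-Java sketch of what the main change in this PR amounts to (illustrative only, not the actual Elasticsearch code; `toStepInfo` is a hypothetical helper): the serialized error-step info keeps only the exception type and reason, while the full stack-trace still goes to the server logs.

```java
import java.util.Map;

public class TrimmedStepInfoSketch {
    // Hypothetical helper mirroring the trimmed step_info shape:
    // type + reason only, deliberately no "stack_trace" entry.
    static Map<String, String> toStepInfo(Throwable t) {
        return Map.of(
            "type", t.getClass().getSimpleName(),
            "reason", String.valueOf(t.getMessage())
        );
    }

    public static void main(String[] args) {
        Throwable t = new IllegalArgumentException(
            "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined");
        // Prints only the type and reason -- a few hundred bytes instead of several KB.
        System.out.println(toStepInfo(t));
    }
}
```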
This is going to make debugging ILM a lot harder. Adding a big red ❌ so that this can get wider discussion with the rest of @elastic/es-data-management.
Is it though? We get the same errors in the logs anyway? I think it's not even a discussion whether or not we should store this level of detail in the CS. => either we store this somewhere else or are happy with logs. Logs seem good enough to me? :)

Unfortunately logs are often not good enough on Cloud for diagnosis, as they are difficult to retrieve. We need to weigh the cost for this and make sure we can still find the root cause of an error.
If there's trouble finding appropriate cloud logs then that's a topic to address with Cloud IMO. Currently, we are in a situation where running into a large number of ILM failures can escalate into a serious increase in heap usage on every cluster node. (In the case that motivated this, it was a bunch of rollovers failing in parallel because of shard-count limits in a cluster.)

Actually, we already record this outside the logs, in the ILM history index. So I think that is reasonable for retrieving the error without looking at the logs. @joegallo what do you think?

Could we do that in one go such that [...]? Or, coming at this from a different tack, I think as it is this PR deserves a [...].

I wonder if that's even wise. I can see the point of not wanting to break the API, but the fact that we can run into hundreds of MB added to the CS also means that this response can be hundreds of MB, doesn't it? :) Obviously, this only negatively affects a small minority of deployments, but I think this is exactly the kind of API that we don't want to have going forward, because there's a very clear bound on how many errors it takes until it either breaks or has a destabilising effect on the cluster when used. => I'd vote for just ripping off the bandaid here + not bothering with keeping the stack-traces around, because that just doesn't scale without adjustments (pagination in some form, I guess).

I think that's a great brainstorming idea, but a little out of scope for this particular PR. I don't necessarily think this is a [...]. For the brainstorming about debugging, I think we should open a separate issue where we can talk about it. I think we can do some things like construct a timeline from the ILM history, which may be interesting (but again, out of scope for this).

/cc @cjcenizal

@joegallo @dakrone anything left to do here? I think the discussion on another channel about this resolved itself. I tried looking into adding a line about "check the logs" as suggested by Jake, but I couldn't find a neat way that wouldn't break the formatting.

I'd love a 👍 from somebody on the Kibana side that this isn't going to break the UI -- assuming not, I still imagine they'll need to do a little bit of followup to remove any UI bits that would have shown the now-removed stacktrace.

Can someone please provide a "Before" and "After" example of how the API response will change in the PR description? This type of information helps consumers (like Kibana) understand how proposed changes will affect the consuming code. Thanks!
@cjcenizal here's an example before:
{
"indices" : {
"myindex" : {
"index" : "myindex",
"managed" : true,
"policy" : "logs",
"index_creation_date" : "2022-03-08T17:02:15.806Z",
"index_creation_date_millis" : 1646758935806,
"time_since_index_creation" : "24.11s",
"lifecycle_date" : "2022-03-08T17:02:15.806Z",
"lifecycle_date_millis" : 1646758935806,
"age" : "24.11s",
"phase" : "hot",
"phase_time" : "2022-03-08T17:02:15.828Z",
"phase_time_millis" : 1646758935828,
"action" : "rollover",
"action_time" : "2022-03-08T17:02:15.828Z",
"action_time_millis" : 1646758935828,
"step" : "ERROR",
"step_time" : "2022-03-08T17:02:39.236Z",
"step_time_millis" : 1646758959236,
"failed_step" : "check-rollover-ready",
"is_auto_retryable_error" : true,
"step_info" : {
"type" : "illegal_argument_exception",
"reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined",
"stack_trace" : "java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined\n\tat org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:92)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:225)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:421)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:352)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:186)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:220)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"
},
"phase_execution" : {
"policy" : "logs",
"phase_definition" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_primary_shard_size" : "50gb",
"max_age" : "30d"
}
}
},
"version" : 1,
"modified_date" : "2022-03-08T17:01:39.584Z",
"modified_date_in_millis" : 1646758899584
}
}
}
}

And here's an after:
{
"indices" : {
"myindex" : {
"index" : "myindex",
"managed" : true,
"policy" : "logs",
"index_creation_date" : "2022-03-08T17:05:38.119Z",
"index_creation_date_millis" : 1646759138119,
"time_since_index_creation" : "33.72s",
"lifecycle_date" : "2022-03-08T17:05:38.119Z",
"lifecycle_date_millis" : 1646759138119,
"age" : "33.72s",
"phase" : "hot",
"phase_time" : "2022-03-08T17:06:01.634Z",
"phase_time_millis" : 1646759161634,
"action" : "rollover",
"action_time" : "2022-03-08T17:05:38.227Z",
"action_time_millis" : 1646759138227,
"step" : "ERROR",
"step_time" : "2022-03-08T17:06:11.638Z",
"step_time_millis" : 1646759171638,
"failed_step" : "check-rollover-ready",
"is_auto_retryable_error" : true,
"failed_step_retry_count" : 1,
"step_info" : {
"type" : "illegal_argument_exception",
"reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined"
},
"phase_execution" : {
"policy" : "logs",
"phase_definition" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_primary_shard_size" : "50gb",
"max_age" : "30d"
}
}
},
"version" : 1,
"modified_date" : "2022-03-08T17:05:24.406Z",
"modified_date_in_millis" : 1646759124406
}
}
}
}
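For consumers such as Kibana, the practical upshot is that `stack_trace` inside `step_info` must now be treated as optional. A hedged sketch of consumer-side handling (hypothetical code operating on an already-parsed response, not Kibana's actual implementation):

```java
import java.util.Map;

public class StepInfoConsumerSketch {
    // Hypothetical formatter that works against both the "before" and "after"
    // response shapes by treating stack_trace as optional.
    static String describeError(Map<String, Object> stepInfo) {
        String type = (String) stepInfo.getOrDefault("type", "unknown");
        String reason = (String) stepInfo.getOrDefault("reason", "no reason given");
        Object trace = stepInfo.get("stack_trace"); // absent in responses after this change
        return trace == null ? type + ": " + reason : type + ": " + reason + "\n" + trace;
    }

    public static void main(String[] args) {
        Map<String, Object> after = Map.of(
            "type", "illegal_argument_exception",
            "reason", "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined"
        );
        System.out.println(describeError(after));
    }
}
```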
Given the invasiveness of this change, I suggest we limit it to 8.2.0 so we're not racing against time for 8.1.1.

Thanks everyone, merging to 8.2 only then!
Don't serialize stack-traces to ILM error steps. These take multiple KB per index in the CS. In bad cases where an error might affect thousands of indices, this can mean hundreds of MB of heap memory used on every node in the cluster just to keep the serialized stack-traces around in the CS, which provide limited value anyway.
relates #77466