
Fix Stacktraces Taking Memory in ILM Error Step Serialization #84266

Conversation

original-brownbear (Member)
Don't serialize stack traces to ILM error steps. Serialized stack traces take
multiple kB per index in the cluster state (CS). In bad cases, where an error
affects thousands of indices, this can mean hundreds of MB of heap used on
every node in the cluster just to keep the serialized stack traces around in
the CS, and they provide limited value anyway.

relates #77466
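
For a rough sense of scale (both figures assumed purely for illustration, in line with the "multiple kB" and "thousands of indices" above):

~10 kB of serialized stack trace per index × 20,000 failing indices ≈ 200 MB
of heap on every node, since every node holds a full copy of the CS.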

original-brownbear added the >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.2.0, and v8.1.1 labels on Feb 23, 2022
elasticmachine added the Team:Data Management (Meta label for data/management team) label on Feb 23, 2022
elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine (Collaborator)

Hi @original-brownbear, I've created a changelog YAML for you.

-        stepInfo.toXContent(infoXContentBuilder, ToXContent.EMPTY_PARAMS);
-        stepInfoString = BytesReference.bytes(infoXContentBuilder).utf8ToString();
-    }
+    final String stepInfoString = Strings.toString(stepInfo);
original-brownbear (Member, Author)

Admittedly slightly unrelated, but I figured I'd fix it here since I fixed the other spot.
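
Zooming back out to the main change: here's a minimal sketch of how step info can be rendered without the stack trace. This is an illustration, not the exact PR diff; compactStepInfo is a hypothetical helper, while generateThrowableXContent and the REST_EXCEPTION_SKIP_STACK_TRACE param are existing ElasticsearchException machinery.

import java.io.IOException;
import java.util.Map;

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.xcontent.ToXContent;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;

class StepInfoSketch {
    // Render an ILM step failure as JSON without the stack trace, so the
    // string stored per index in the cluster state stays small.
    static String compactStepInfo(Exception failure) throws IOException {
        try (XContentBuilder builder = XContentFactory.jsonBuilder()) {
            builder.startObject();
            // The skip-stack-trace param tells generateThrowableXContent to
            // emit only "type" and "reason" (plus nested causes), not the
            // multi-kB "stack_trace" field.
            ElasticsearchException.generateThrowableXContent(
                builder,
                new ToXContent.MapParams(
                    Map.of(ElasticsearchException.REST_EXCEPTION_SKIP_STACK_TRACE, "true")),
                failure);
            builder.endObject();
            return BytesReference.bytes(builder).utf8ToString();
        }
    }
}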

joegallo (Contributor) left a comment
This is going to make debugging ILM a lot harder. Adding a big red ❌ so that this can get wider discussion with the rest of @elastic/es-data-management.

original-brownbear (Member, Author)

> This is going to make debugging ILM a lot harder.

Is it though? We get the same errors in the logs anyway. I don't think it's even a question whether or not we should store this level of detail in the CS => either we store this somewhere else or we're happy with logs. Logs seem good enough to me? :)

dakrone (Member) commented Feb 23, 2022

> Logs seem good enough to me? :)

Unfortunately logs are often not good enough on Cloud for diagnosis, as they are difficult to retrieve. We need to weigh the cost for this and make sure we can still find the root cause of an error.

elasticsearchmachine (Collaborator)

Hi @original-brownbear, I've created a changelog YAML for you.

original-brownbear (Member, Author)

> Unfortunately logs are often not good enough on Cloud for diagnosis, as they are difficult to retrieve. We need to weigh the cost for this and make sure we can still find the root cause of an error.

If there's trouble finding appropriate cloud logs then that's a topic to address with Cloud IMO.

Currently, running into a large number of ILM failures can escalate into a serious increase in heap usage on every cluster node (in the case that motivated this, it was a bunch of rollovers failing in parallel because of shard-count limits in a cluster).
If we want to store this information outside the logs, it must go somewhere else; the CS is not an option when thinking about the possible impact in large clusters.

dakrone (Member) commented Feb 23, 2022

> If we want to store this information outside the logs, it must go somewhere else.

Actually, we already store this outside the logs, in the ILM history index. So I think that's a reasonable way to retrieve the error without looking at the logs. @joegallo, what do you think?
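
For example, something along these lines should pull the full error (stack trace included) back out of the history store. A sketch: the ilm-history-* pattern and the index/success field names are my assumptions about the shape of the history documents, and the history indices are hidden, hence expand_wildcards=all.

GET ilm-history-*/_search?expand_wildcards=all
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "index": "myindex" } },
        { "term": { "success": false } }
      ]
    }
  },
  "sort": [ { "@timestamp": "desc" } ]
}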

dakrone removed their request for review on Feb 23, 2022 16:53
joegallo (Contributor)

> Actually, we already store this outside the logs, in the ILM history index. So I think that's a reasonable way to retrieve the error without looking at the logs. @joegallo, what do you think?

Could we do that in one go such that _ilm/explain doesn't change formats? That is, -- on this PR as it is right this very second, but ++ to a more complicated PR where the stacktrace is provided on-demand to the APIs that still want it.

Or, coming at this from a different tack: as it is, I think this PR deserves a >breaking tag because it changes the payload, but it'd be neat if it didn't need that tag because it didn't change the payload.

original-brownbear (Member, Author)

> Could we do that in one go such that _ilm/explain doesn't change formats?

I wonder if that's even wise. I can see the point of not wanting to break the API, but the fact that we can run into hundreds of MB added to the CS also means that this response can be hundreds of MB, doesn't it? :) Obviously this only negatively affects a small minority of deployments, but I think this is exactly the kind of API we don't want going forward, because there's a clear bound on how many errors it can carry before it either breaks or has a destabilising effect on the cluster when used.
Also, the API doesn't really change in terms of fields; what we always had just gets truncated. That seems fine to me, particularly since the specific stack traces could change at any time anyway, so we won't break anything for machines hitting this API.

=> I'd vote for just ripping off the bandaid here + not bothering with keeping the stack-traces around because that just doesn't scale without adjustments (pagination in some form I guess).

dakrone (Member) commented Feb 23, 2022

I think that's a great brainstorming idea, but a little out of scope for this particular PR. I don't necessarily think this is a >breaking PR since stack traces have always been an optional part of the explain response, and the step info object doesn't really have a defined schema.

For the brainstorming about the debugging part, I think we should open a separate issue where we can talk about it. I think we can do some things like construct a timeline from the ILM history, which may be interesting (but again, out of scope for this).

joegallo (Contributor)

/cc @cjcenizal

original-brownbear (Member, Author)

@joegallo @dakrone anything left to do here? I think the discussion on another channel about this resolved itself. I tried looking into adding a line about "check the logs" as suggested by Jake, but I couldn't find a neat way that wouldn't break the formatting.
-> IMO we're good to go here as per our discussion?

joegallo (Contributor) commented Mar 8, 2022

I'd love a 👍 from somebody on the Kibana side that this isn't going to break the UI -- assuming not, I still imagine they'll need to do a little bit of follow-up to remove any UI bits that would have shown the now-removed stacktrace.

cjcenizal (Contributor)

Can someone please provide a "Before" and "After" example of how the API response will change in the PR description? This type of information helps consumers (like Kibana) understand how proposed changes will affect the consuming code. Thanks!

dakrone (Member) commented Mar 8, 2022

@cjcenizal here's an example before:

{
  "indices" : {
    "myindex" : {
      "index" : "myindex",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date" : "2022-03-08T17:02:15.806Z",
      "index_creation_date_millis" : 1646758935806,
      "time_since_index_creation" : "24.11s",
      "lifecycle_date" : "2022-03-08T17:02:15.806Z",
      "lifecycle_date_millis" : 1646758935806,
      "age" : "24.11s",
      "phase" : "hot",
      "phase_time" : "2022-03-08T17:02:15.828Z",
      "phase_time_millis" : 1646758935828,
      "action" : "rollover",
      "action_time" : "2022-03-08T17:02:15.828Z",
      "action_time_millis" : 1646758935828,
      "step" : "ERROR",
      "step_time" : "2022-03-08T17:02:39.236Z",
      "step_time_millis" : 1646758959236,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined",
        "stack_trace" : "java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined\n\tat org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:92)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:225)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:421)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:352)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:186)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:220)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"
      },
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "version" : 1,
        "modified_date" : "2022-03-08T17:01:39.584Z",
        "modified_date_in_millis" : 1646758899584
      }
    }
  }
}

And here's an after:

{
  "indices" : {
    "myindex" : {
      "index" : "myindex",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date" : "2022-03-08T17:05:38.119Z",
      "index_creation_date_millis" : 1646759138119,
      "time_since_index_creation" : "33.72s",
      "lifecycle_date" : "2022-03-08T17:05:38.119Z",
      "lifecycle_date_millis" : 1646759138119,
      "age" : "33.72s",
      "phase" : "hot",
      "phase_time" : "2022-03-08T17:06:01.634Z",
      "phase_time_millis" : 1646759161634,
      "action" : "rollover",
      "action_time" : "2022-03-08T17:05:38.227Z",
      "action_time_millis" : 1646759138227,
      "step" : "ERROR",
      "step_time" : "2022-03-08T17:06:11.638Z",
      "step_time_millis" : 1646759171638,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined"
      },
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "version" : 1,
        "modified_date" : "2022-03-08T17:05:24.406Z",
        "modified_date_in_millis" : 1646759124406
      }
    }
  }
}

cjcenizal (Contributor)

Thanks @dakrone. @joegallo Yes, this change will break the UI. We'll need to remove the debugging functionality that depends upon stack_trace.

cjcenizal (Contributor) commented Mar 8, 2022

Given the invasiveness of this change, I suggest we limit it to 8.2.0 so we're not racing against time for 8.1.1.

joegallo removed the v8.1.1 label on Mar 8, 2022
original-brownbear (Member, Author)

Thanks everyone, merging to 8.2 only then!
