
Fix Stacktraces Taking Memory in ILM Error Step Serialization #84266

Conversation

original-brownbear (Member)
Don't serialize stack traces to ILM error steps. Serialized stack traces take
multiple kB per index in the cluster state (CS). In bad cases, where an error
affects thousands of indices, this can mean hundreds of MB of heap used on
every node in the cluster just to keep the serialized stack traces around in
the CS, and they provide limited value anyway.

relates #77466
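
For a rough sense of scale (both figures assumed purely for illustration, in line with the "multiple kB" and "thousands of indices" above):

~10 kB of serialized stack trace per index × 20,000 failing indices ≈ 200 MB
of heap on every node, since every node holds a full copy of the CS.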

original-brownbear added the >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), v8.2.0, and v8.1.1 labels on Feb 23, 2022
elasticmachine added the Team:Data Management (Meta label for data/management team) label on Feb 23, 2022
elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine (Collaborator)

Hi @original-brownbear, I've created a changelog YAML for you.

-        stepInfo.toXContent(infoXContentBuilder, ToXContent.EMPTY_PARAMS);
-        stepInfoString = BytesReference.bytes(infoXContentBuilder).utf8ToString();
-    }
+    final String stepInfoString = Strings.toString(stepInfo);
original-brownbear (Member, Author)

Admittedly slightly unrelated, but I figured I'd fix it here since I fixed the other spot.
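
Zooming back out to the main change: here's a minimal sketch of how step info can be rendered without the stack trace. This is an illustration, not the exact PR diff; compactStepInfo is a hypothetical helper, while generateThrowableXContent and the REST_EXCEPTION_SKIP_STACK_TRACE param are existing ElasticsearchException machinery.

import java.io.IOException;
import java.util.Map;

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.xcontent.ToXContent;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;

class StepInfoSketch {
    // Render an ILM step failure as JSON without the stack trace, so the
    // string stored per index in the cluster state stays small.
    static String compactStepInfo(Exception failure) throws IOException {
        try (XContentBuilder builder = XContentFactory.jsonBuilder()) {
            builder.startObject();
            // The skip-stack-trace param tells generateThrowableXContent to
            // emit only "type" and "reason" (plus nested causes), not the
            // multi-kB "stack_trace" field.
            ElasticsearchException.generateThrowableXContent(
                builder,
                new ToXContent.MapParams(
                    Map.of(ElasticsearchException.REST_EXCEPTION_SKIP_STACK_TRACE, "true")),
                failure);
            builder.endObject();
            return BytesReference.bytes(builder).utf8ToString();
        }
    }
}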

joegallo (Contributor) left a comment
This is going to make debugging ILM a lot harder. Adding a big red ❌ so that this can get wider discussion with the rest of @elastic/es-data-management.

original-brownbear (Member, Author)

> This is going to make debugging ILM a lot harder.

Is it though? We get the same errors in the logs anyway. I don't think it's even a question whether or not we should store this level of detail in the CS => either we store this somewhere else or we're happy with logs. Logs seem good enough to me? :)

dakrone (Member) commented Feb 23, 2022

> Logs seem good enough to me? :)

Unfortunately logs are often not good enough on Cloud for diagnosis, as they are difficult to retrieve. We need to weigh the cost for this and make sure we can still find the root cause of an error.

elasticsearchmachine (Collaborator)

Hi @original-brownbear, I've created a changelog YAML for you.

original-brownbear (Member, Author)

> Unfortunately logs are often not good enough on Cloud for diagnosis, as they are difficult to retrieve. We need to weigh the cost for this and make sure we can still find the root cause of an error.

If there's trouble finding appropriate cloud logs then that's a topic to address with Cloud IMO.

Currently, running into a large number of ILM failures can escalate into a serious increase in heap usage on every cluster node (in the case that motivated this, it was a bunch of rollovers failing in parallel because of shard-count limits in a cluster).
If we want to store this information outside the logs, it must go somewhere else; the CS is not an option when thinking about the possible impact in large clusters.

dakrone (Member) commented Feb 23, 2022

> If we want to store this information outside the logs, it must go somewhere else.

Actually, we already store this outside the logs, in the ILM history index. So I think that's a reasonable way to retrieve the error without looking at the logs. @joegallo, what do you think?
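
For example, something along these lines should pull the full error (stack trace included) back out of the history store. A sketch: the ilm-history-* pattern and the index/success field names are my assumptions about the shape of the history documents, and the history indices are hidden, hence expand_wildcards=all.

GET ilm-history-*/_search?expand_wildcards=all
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "index": "myindex" } },
        { "term": { "success": false } }
      ]
    }
  },
  "sort": [ { "@timestamp": "desc" } ]
}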

dakrone removed their request for review on Feb 23, 2022 16:53
joegallo (Contributor)

> Actually, we already store this outside the logs, in the ILM history index. So I think that's a reasonable way to retrieve the error without looking at the logs. @joegallo, what do you think?

Could we do that in one go such that _ilm/explain doesn't change formats? That is, -- on this PR as it is right this very second, but ++ to a more complicated PR where the stacktrace is provided on-demand to the APIs that still want it.

Or, coming at this from a different tack: as it is, I think this PR deserves a >breaking tag because it changes the payload, but it'd be neat if it didn't need that tag because it didn't change the payload.

original-brownbear (Member, Author)

> Could we do that in one go such that _ilm/explain doesn't change formats?

I wonder if that's even wise. I can see the point of not wanting to break the API, but the fact that we can run into hundreds of MB added to the CS also means that this response can be hundreds of MB, doesn't it? :) Obviously this only negatively affects a small minority of deployments, but I think this is exactly the kind of API we don't want going forward, because there's a clear bound on how many errors it can carry before it either breaks or has a destabilising effect on the cluster when used.
Also, the API doesn't really change in terms of fields; what we always had just gets truncated. That seems fine to me, particularly since the specific stack traces could change at any time anyway, so we won't break anything for machines hitting this API.

=> I'd vote for just ripping off the bandaid here + not bothering with keeping the stack-traces around because that just doesn't scale without adjustments (pagination in some form I guess).

dakrone (Member) commented Feb 23, 2022

I think that's a great brainstorming idea, but a little out of scope for this particular PR. I don't necessarily think this is a >breaking PR since stack traces have always been an optional part of the explain response, and the step info object doesn't really have a defined schema.

For the brainstorming about the debugging part, I think we should open a separate issue where we can talk about it. I think we can do some things like construct a timeline from the ILM history, which may be interesting (but again, out of scope for this).

joegallo (Contributor)

/cc @cjcenizal

original-brownbear (Member, Author)

@joegallo @dakrone anything left to do here? I think the discussion on another channel about this resolved itself. I tried looking into adding a line about "check the logs" as suggested by Jake, but I couldn't find a neat way that wouldn't break the formatting.
-> IMO we're good to go here as per our discussion?

joegallo (Contributor) commented Mar 8, 2022

I'd love a 👍 from somebody on the Kibana side that this isn't going to break the UI -- assuming not, I still imagine they'll need to do a little bit of follow-up to remove any UI bits that would have shown the now-removed stacktrace.

cjcenizal (Contributor)

Can someone please provide a "Before" and "After" example of how the API response will change in the PR description? This type of information helps consumers (like Kibana) understand how proposed changes will affect the consuming code. Thanks!

dakrone (Member) commented Mar 8, 2022

@cjcenizal here's an example before:

{
  "indices" : {
    "myindex" : {
      "index" : "myindex",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date" : "2022-03-08T17:02:15.806Z",
      "index_creation_date_millis" : 1646758935806,
      "time_since_index_creation" : "24.11s",
      "lifecycle_date" : "2022-03-08T17:02:15.806Z",
      "lifecycle_date_millis" : 1646758935806,
      "age" : "24.11s",
      "phase" : "hot",
      "phase_time" : "2022-03-08T17:02:15.828Z",
      "phase_time_millis" : 1646758935828,
      "action" : "rollover",
      "action_time" : "2022-03-08T17:02:15.828Z",
      "action_time_millis" : 1646758935828,
      "step" : "ERROR",
      "step_time" : "2022-03-08T17:02:39.236Z",
      "step_time_millis" : 1646758959236,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined",
        "stack_trace" : "java.lang.IllegalArgumentException: setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined\n\tat org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:92)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:225)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:421)\n\tat org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:352)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:186)\n\tat org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:220)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"
      },
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "version" : 1,
        "modified_date" : "2022-03-08T17:01:39.584Z",
        "modified_date_in_millis" : 1646758899584
      }
    }
  }
}

And here's an after:

{
  "indices" : {
    "myindex" : {
      "index" : "myindex",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date" : "2022-03-08T17:05:38.119Z",
      "index_creation_date_millis" : 1646759138119,
      "time_since_index_creation" : "33.72s",
      "lifecycle_date" : "2022-03-08T17:05:38.119Z",
      "lifecycle_date_millis" : 1646759138119,
      "age" : "33.72s",
      "phase" : "hot",
      "phase_time" : "2022-03-08T17:06:01.634Z",
      "phase_time_millis" : 1646759161634,
      "action" : "rollover",
      "action_time" : "2022-03-08T17:05:38.227Z",
      "action_time_millis" : 1646759138227,
      "step" : "ERROR",
      "step_time" : "2022-03-08T17:06:11.638Z",
      "step_time_millis" : 1646759171638,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 1,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "setting [index.lifecycle.rollover_alias] for index [myindex] is empty or not defined"
      },
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "version" : 1,
        "modified_date" : "2022-03-08T17:05:24.406Z",
        "modified_date_in_millis" : 1646759124406
      }
    }
  }
}

cjcenizal (Contributor)

Thanks @dakrone. @joegallo Yes, this change will break the UI. We'll need to remove the debugging functionality that depends upon stack_trace.

cjcenizal (Contributor) commented Mar 8, 2022

Given the invasiveness of this change, I suggest we limit it to 8.2.0 so we're not racing against time for 8.1.1.

joegallo removed the v8.1.1 label on Mar 8, 2022
original-brownbear (Member, Author)

Thanks everyone, merging to 8.2 only then!
