
Improve Kibana logging for OOM crashes #109602

Closed

mshustov opened this issue Aug 23, 2021 · 7 comments
Assignees: @rudolf
Labels:
impact:low - Addressing this issue will have a low level of impact on the quality/strength of our product.
loe:small - Small Level of Effort
project:ResilientSavedObjectMigrations - Reduce Kibana upgrade failures by making saved object migrations more resilient
Team:Core - Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.

Comments

@mshustov
Contributor

Summary

Given: an ESS instance with 1GB of RAM migrates from v7.13.2 to v7.14.0.
The Docker container crashes without any logs:

2021-08-22 11:16:44.319 | *** setuser exited with status 0.
2021-08-22 11:16:44.319 | *** Killing all processes...

According to the logs, Kibana started the SO migration and then suddenly bootstrapped again:

{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana] REINDEX_SOURCE_TO_TEMP_OPEN_PIT -> REINDEX_SOURCE_TO_TEMP_READ. took: 6ms."}
{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana_task_manager] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> DONE. took: 307ms."}
{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana_task_manager] Migration completed after 396ms"}
<---- NO LOGS HERE --->
{"type":"log","@timestamp":"2021-08-22T11:16:54+00:00","tags":["info","plugins-service"],"pid":52,"message":"Plugin \"metricsEntities\" is disabled."}

The deployment was re-configured to run the Kibana instance with 2GB of RAM. After that, the migration failed with an [undefined]: Response Error message. This might be an indicator of the 413 Payload Too Large error we investigated in #107288.

The migration has been fixed by reducing savedObjects.batchSize from 1000 to 200.

According to the proxy logs, some of the responses from the ES /_search endpoint during the migration were around 400MB. This might have caused Kibana (or the Docker container) with 1GB of RAM to fail with an OOM error. Note that max_old_space_size for the Node.js process running on an ESS instance with 1GB of RAM is set to 800MB.
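For context, the 800MB figure is the V8 old-space limit (set via the Node.js --max-old-space-size flag); once the heap approaches it, V8 aborts the process. A minimal Node.js sketch (not part of Kibana) to confirm the effective limit and current usage of a process:

// Sketch (not part of Kibana): print the effective V8 heap limit and current
// usage; getHeapStatistics() reports both values in bytes.
import { getHeapStatistics } from 'v8';

const { heap_size_limit, used_heap_size } = getHeapStatistics();
console.log(`heap limit: ${Math.round(heap_size_limit / 1048576)} MB`);
console.log(`heap used:  ${Math.round(used_heap_size / 1048576)} MB`);

With ~400MB /_search responses plus the deserialized documents and the in-memory migration state, an 800MB old-space limit can plausibly be exhausted by a single batch.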

Impact and Concerns

We need to investigate how Kibana behaves in the case of an OOM error in the ESS environment to improve error logging. Otherwise, it's extremely hard for users to investigate such problems without any actionable logs.

Acceptance criteria

Users can get feedback from Kibana whenever it fails in the ESS environment due to OOM problems.
Users can find recommendations (in the logs or documentation) on how to alleviate OOM problems.

mshustov added the Team:Core and project:ResilientSavedObjectMigrations labels on Aug 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor

rudolf commented Aug 23, 2021

In this specific case it seems like the first batch is already causing the OOM, but when there are many batches, memory could also keep growing because we add transformedDocs to the executionLog after every state transition.

#109540 will reduce memory consumption by redacting the transformedDocs that are kept in the executionLog after every state transition.
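To illustrate the idea only (a sketch of the approach, not the actual #109540 change, and the field names are hypothetical): instead of keeping every transformed document in the execution log entry, only a summary is retained, so memory stays bounded across state transitions.

// Sketch only, not the actual #109540 implementation: redact the bulky
// transformedDocs payload before storing a migration log entry, keeping a
// document count instead of the documents themselves.
interface ExecutionLogEntry {
  controlState: string;        // hypothetical field names, for illustration
  transformedDocs?: unknown[];
}

function redactTransformedDocs(entry: ExecutionLogEntry) {
  const { transformedDocs, ...rest } = entry;
  return {
    ...rest,
    transformedDocs: `<redacted ${transformedDocs?.length ?? 0} documents>`,
  };
}

Without something like this, the execution log grows with the number of batches, which is why memory can keep climbing even when a single batch fits.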

@rudolf
Contributor

rudolf commented Sep 8, 2021

Confirmed that the Node.js OOM logs are present in the Docker logs, but they are not being indexed on ESS.

@pgayvallet
Contributor

I don't think there's any trivial way to catch an OOM, as those are non-recoverable errors that don't simply bubble up.

Looking at the implementation of https://github.com/blueconic/node-oom-heapdump, it requires either C bindings or monitoring the GC.

Either way, it may be very tedious to plug it into our logging system.
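A rough sketch of the heap-monitoring alternative mentioned above (an assumption about how it could look, not a concrete proposal): a periodic watchdog that emits a warning while the process is still alive, since the OOM abort itself cannot be intercepted from JavaScript.

// Sketch: warn when heap usage approaches the V8 limit, since the OOM
// itself cannot be caught. The 90% threshold and 5s interval are arbitrary.
import { getHeapStatistics } from 'v8';

const WARN_RATIO = 0.9;

setInterval(() => {
  const { used_heap_size, heap_size_limit } = getHeapStatistics();
  if (used_heap_size > heap_size_limit * WARN_RATIO) {
    // In Kibana this would go through the core logging service instead.
    console.warn(
      `Heap usage ${Math.round(used_heap_size / 1048576)}MB exceeds ` +
      `${WARN_RATIO * 100}% of the ${Math.round(heap_size_limit / 1048576)}MB limit`
    );
  }
}, 5000).unref();

Even a coarse signal like this would leave an actionable line in the logs right before the crash, at the cost of a small amount of periodic work.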

@rudolf
Contributor

rudolf commented Sep 10, 2021

Yeah, I don't think we can intercept it and log it, but these errors are written to stderr, so the Docker container should be able to index this.

@mshustov
Contributor Author

I assigned @rudolf as discussed yesterday, since he is in contact with Cloud about ingesting these logs.
TIL: Node.js exits with code 134 in the event of an OOM (nodejs/node#12271).
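For reference, 134 is 128 + 6 (SIGABRT): on a fatal heap-out-of-memory error V8 writes the error to stderr and then calls abort(). A hypothetical supervisor sketch (not Kibana or ESS code; './kibana.js' is a placeholder entry point) showing how that exit code could be surfaced as an explicit message:

// Hypothetical sketch: spawn a Node.js child process and translate the
// OOM exit code (134 = 128 + SIGABRT) into an actionable log message.
import { spawn } from 'child_process';

const child = spawn(process.execPath, ['--max-old-space-size=800', './kibana.js'], {
  stdio: 'inherit',
});

child.on('exit', (code, signal) => {
  if (code === 134 || signal === 'SIGABRT') {
    console.error('Child process was aborted, most likely by a V8 out-of-memory error');
  }
});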

The exalate-issue-sync bot added the impact:low and loe:small labels on Nov 4, 2021
@rudolf
Contributor

rudolf commented Oct 13, 2023

This has been fixed upstream (downstream?) in https://github.com/elastic/cloud/issues/88114

rudolf closed this as completed on Oct 13, 2023