
Improve Kibana logging for OOM crashes #109602

Closed

mshustov opened this issue Aug 23, 2021 · 7 comments
Assignees: @rudolf
Labels:
impact:low - Addressing this issue will have a low level of impact on the quality/strength of our product.
loe:small - Small Level of Effort
project:ResilientSavedObjectMigrations - Reduce Kibana upgrade failures by making saved object migrations more resilient
Team:Core - Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.

Comments

@mshustov
Contributor

Summary

Given: an ESS instance with 1GB of RAM migrates from v7.13.2 to v7.14.0.
The Docker container crashes without any logs:

2021-08-22 11:16:44.319 | *** setuser exited with status 0.
2021-08-22 11:16:44.319 | *** Killing all processes...

According to the logs, Kibana started the SO migration and then suddenly bootstrapped again:

{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana] REINDEX_SOURCE_TO_TEMP_OPEN_PIT -> REINDEX_SOURCE_TO_TEMP_READ. took: 6ms."}
{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana_task_manager] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> DONE. took: 307ms."}
{"type":"log","@timestamp":"2021-08-22T11:16:31+00:00","tags":["info","savedobjects-service"],"pid":53,"message":"[.kibana_task_manager] Migration completed after 396ms"}
<---- NO LOGS HERE --->
{"type":"log","@timestamp":"2021-08-22T11:16:54+00:00","tags":["info","plugins-service"],"pid":52,"message":"Plugin \"metricsEntities\" is disabled."}

The deployment was re-configured to run the Kibana instance with 2GB of RAM. After that, the migration failed with an [undefined]: Response Error message. This might be an indicator of the 413 Payload Too Large error we investigated in #107288.

The migration has been fixed by reducing savedObjects.batchSize from 1000 to 200.

According to the proxy logs, some of the responses from the ES /_search endpoint during the migration were around 400MB. This might have caused Kibana (or the Docker container) with 1GB of RAM to fail with an OOM error. Note that max_old_space_size for the Node.js process running on an ESS instance with 1GB of RAM is set to 800MB.
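For context, the 800MB figure is the V8 old-space limit (set via the Node.js --max-old-space-size flag); once the heap approaches it, V8 aborts the process. A minimal Node.js sketch (not part of Kibana) to confirm the effective limit and current usage of a process:

// Sketch (not part of Kibana): print the effective V8 heap limit and current
// usage; getHeapStatistics() reports both values in bytes.
import { getHeapStatistics } from 'v8';

const { heap_size_limit, used_heap_size } = getHeapStatistics();
console.log(`heap limit: ${Math.round(heap_size_limit / 1048576)} MB`);
console.log(`heap used:  ${Math.round(used_heap_size / 1048576)} MB`);

With ~400MB /_search responses plus the deserialized documents and the in-memory migration state, an 800MB old-space limit can plausibly be exhausted by a single batch.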

Impact and Concerns

We need to investigate how Kibana behaves in the case of an OOM error in the ESS environment to improve error logging. Otherwise, it's extremely hard for users to investigate such problems without any actionable logs.

Acceptance criteria

Users can get feedback from Kibana whenever it fails in the ESS environment due to OOM problems.
Users can find recommendations (in the logs or documentation) on how to alleviate OOM problems.

mshustov added the Team:Core and project:ResilientSavedObjectMigrations labels on Aug 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor

rudolf commented Aug 23, 2021

In this specific case it seems like the first batch is already causing the OOM, but when there are many batches, memory could also keep growing because we add transformedDocs to the executionLog after every state transition.

#109540 will reduce memory consumption by redacting the transformedDocs that are kept in the executionLog after every state transition.
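To illustrate the idea only (a sketch of the approach, not the actual #109540 change, and the field names are hypothetical): instead of keeping every transformed document in the execution log entry, only a summary is retained, so memory stays bounded across state transitions.

// Sketch only, not the actual #109540 implementation: redact the bulky
// transformedDocs payload before storing a migration log entry, keeping a
// document count instead of the documents themselves.
interface ExecutionLogEntry {
  controlState: string;        // hypothetical field names, for illustration
  transformedDocs?: unknown[];
}

function redactTransformedDocs(entry: ExecutionLogEntry) {
  const { transformedDocs, ...rest } = entry;
  return {
    ...rest,
    transformedDocs: `<redacted ${transformedDocs?.length ?? 0} documents>`,
  };
}

Without something like this, the execution log grows with the number of batches, which is why memory can keep climbing even when a single batch fits.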

@rudolf
Contributor

rudolf commented Sep 8, 2021

Confirmed that the Node.js OOM logs are present in the Docker logs, but they are not being indexed on ESS.

@pgayvallet
Contributor

I don't think there's any trivial way to catch an OOM, as those are non-recoverable errors that don't simply bubble up.

Looking at the implementation of https://github.com/blueconic/node-oom-heapdump, it requires either C bindings or monitoring the GC.

Either way, it may be very tedious to plug it into our logging system.
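A rough sketch of the heap-monitoring alternative mentioned above (an assumption about how it could look, not a concrete proposal): a periodic watchdog that emits a warning while the process is still alive, since the OOM abort itself cannot be intercepted from JavaScript.

// Sketch: warn when heap usage approaches the V8 limit, since the OOM
// itself cannot be caught. The 90% threshold and 5s interval are arbitrary.
import { getHeapStatistics } from 'v8';

const WARN_RATIO = 0.9;

setInterval(() => {
  const { used_heap_size, heap_size_limit } = getHeapStatistics();
  if (used_heap_size > heap_size_limit * WARN_RATIO) {
    // In Kibana this would go through the core logging service instead.
    console.warn(
      `Heap usage ${Math.round(used_heap_size / 1048576)}MB exceeds ` +
      `${WARN_RATIO * 100}% of the ${Math.round(heap_size_limit / 1048576)}MB limit`
    );
  }
}, 5000).unref();

Even a coarse signal like this would leave an actionable line in the logs right before the crash, at the cost of a small amount of periodic work.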

@rudolf
Contributor

rudolf commented Sep 10, 2021

Yeah, I don't think we can intercept it and log it, but these errors are written to stderr, so the Docker container should be able to index this.

@mshustov
Contributor Author

I assigned @rudolf as discussed yesterday, since he is in contact with Cloud about ingesting these logs.
TIL: Node.js exits with code 134 in the event of an OOM (nodejs/node#12271).
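For reference, 134 is 128 + 6 (SIGABRT): on a fatal heap-out-of-memory error V8 writes the error to stderr and then calls abort(). A hypothetical supervisor sketch (not Kibana or ESS code; './kibana.js' is a placeholder entry point) showing how that exit code could be surfaced as an explicit message:

// Hypothetical sketch: spawn a Node.js child process and translate the
// OOM exit code (134 = 128 + SIGABRT) into an actionable log message.
import { spawn } from 'child_process';

const child = spawn(process.execPath, ['--max-old-space-size=800', './kibana.js'], {
  stdio: 'inherit',
});

child.on('exit', (code, signal) => {
  if (code === 134 || signal === 'SIGABRT') {
    console.error('Child process was aborted, most likely by a V8 out-of-memory error');
  }
});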

The exalate-issue-sync bot added the impact:low and loe:small labels on Nov 4, 2021
@rudolf
Contributor

rudolf commented Oct 13, 2023

This has been fixed upstream (downstream?) in https://github.com/elastic/cloud/issues/88114

rudolf closed this as completed on Oct 13, 2023