
Task Manager ignores the loss of connectivity with ES and can potentially prevent Kibana from loading #75501

Closed
gmmorris opened this issue Aug 19, 2020 · 4 comments · Fixed by #81779
Labels: bug (Fixes for quality problems that affect the customer experience), discuss, Feature:Task Manager, R&D (Research and development ticket; not meant to produce code, but to make a decision), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris (Contributor)

If ES goes down but Kibana keeps running, Task Manager continues to poll, ignoring the downtime.

This introduces three issues:

  1. The log fills with error messages from Task Manager saying: [error][plugins][taskManager][taskManager] Failed to poll for work: Error: No Living connections
  2. If a fresh ES is booted up, Kibana tries to do some recovery so that, for example, new Saved Object mappings are created. As Task Manager has been running throughout, it begins running queries the moment it can, resulting in the error No field found for [task.retryAt] in mapping being thrown, as the SO mapping hasn't yet been defined (a more detailed error is below).
[error][plugins][taskManager][taskManager] Failed to poll for work: [script_exception] runtime error, with { script_stack={ 0="org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)" & 1="org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)" & 2="if (doc['task.retryAt'].size()!=0) {\n  " & 3="        ^---- HERE" } & script=" ..." & lang="painless" & position={ offset=9 & start=1 & end=40 } } :: {"path":"/.kibana_task_manager/_update_by_query","query":{"ignore_unavailable":true,"refresh":true,"max_docs":10,"conflicts":"proceed"},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"bool\":{\"must\":[{\"term\":{\"task.status\":\"idle\"}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}}]}},{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"term\":{\"task.status\":\"running\"}},{\"term\":{\"task.status\":\"claiming\"}}]}},{\"range\":{\"task.retryAt\":{\"lte\":\"now\"}}}]}}]}},{\"bool\":{\"should\":[{\"exists\":{\"field\":\"task.schedule\"}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"vis_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"session_cleanup\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"lens_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.email\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.index\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.pagerduty\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.server-log\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.slack\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.webhook\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.servicenow\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.jira\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.resilient\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:.index-threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:example.always-firing\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:example.people-in-space\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:siem.signals\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:siem.notifications\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"endpoint:user-artifact-packager\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.
taskType\":\"alerting:metrics.alert.threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:metrics.alert.inventory.threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:logs.alert.document.count\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_cluster_health\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_license_expiration\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_cpu_usage\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_nodes_changed\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_logstash_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_kibana_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_elasticsearch_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:apm.transaction_duration\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:apm.error_rate\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"apm-telemetry-task\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.monitorStatus\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.tls\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.durationAnomaly\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}}]}}]}}],\"filter\":[{\"bool\":{\"must_not\":[{\"bool\":{\"should\":[{\"term\":{\"task.status\":\"running\"}},{\"term\":{\"task.status\":\"claiming\"}}],\"must\":{\"range\":{\"task.retryAt\":{\"gt\":\"now\"}}}}}]}}]}}]}},\"sort\":[\"_score\",{\"_script\":{\"type\":\"number\",\"order\":\"asc\",\"script\":{\"lang\":\"painless\",\"source\":\"\\nif (doc['task.retryAt'].size()!=0) {\\n  return doc['task.retryAt'].value.toInstant().toEpochMilli();\\n}\\nif (doc['task.runAt'].size()!=0) {\\n  return doc['task.runAt'].value.toInstant().toEpochMilli();\\n}\\n    \"}}}],\"seq_no_primary_term\":true,\"script\":{\"source\":\"ctx._source.task.ownerId=params.ownerId; ctx._source.task.status=params.status; ctx._source.task.retryAt=params.retryAt;\",\"lang\":\"painless\",\"params\":{\"ownerId\":\"kibana:5b2de169-2785-441b-ae8c-186a1936b17d\",\"status\":\"claiming\",\"retryAt\":\"2020-08-19T21:12:55.732Z\"}}}","statusCode":400,"response":"{\"error\":{\"root_cause\":[{\"type\":\"script_exception\",\"reason\":\"runtime error\",\"script_stack\":[\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)\",\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)\",\"if (doc['task.retryAt'].size()!=0) {\\n  \",\"        ^---- HERE\"],\"script\":\" 
...\",\"lang\":\"painless\",\"position\":{\"offset\":9,\"start\":1,\"end\":40}}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\".kibana_task_manager\",\"node\":\"PrxvKU5LTtSP7mR-pqeoLw\",\"reason\":{\"type\":\"script_exception\",\"reason\":\"runtime error\",\"script_stack\":[\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)\",\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)\",\"if (doc['task.retryAt'].size()!=0) {\\n  \",\"        ^---- HERE\"],\"script\":\" ...\",\"lang\":\"painless\",\"position\":{\"offset\":9,\"start\":1,\"end\":40},\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"No field found for [task.retryAt] in mapping\"}}}]},\"status\":400}"}
  3. Once ES is back, even though the SO mappings haven't yet been restored, Task Manager can still respond to scheduling requests. This means it's possible to schedule a task before the mappings are set, and Elasticsearch then gets stuck when Kibana tries to create the mapping, because dynamic mappings have already been created from the scheduled task.

Suggested approach:
Task Manager should stop polling the moment ES becomes unavailable and listen for the online$: Observable<OnlineStatusRetryScheduler> that Platform exposes, so that it can start polling again when ES becomes available. We must also await the creation of the Saved Object mappings. As the start event happens before the mappings are created, new requests could come into Task Manager before it's ready, so it should buffer these and respond to them when it can.
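
As a rough illustration, the wiring could look something like the sketch below. This is not actual Task Manager code: the simplified OnlineStatusRetryScheduler shape and the poller's start/stop handle are assumptions.

```ts
import { Observable } from 'rxjs';
import { first } from 'rxjs/operators';

// Simplified stand-in for the scheduler emitted by core's online$ stream.
interface OnlineStatusRetryScheduler {
  scheduleRetry: (retry: () => void) => void;
}

// Sketch: stop claiming work the moment an ES outage is detected, and resume
// the next time core reports that ES is reachable again.
export function pausePollingUntilEsReturns(
  online$: Observable<OnlineStatusRetryScheduler>,
  poller: { start(): void; stop(): void }
): void {
  poller.stop();
  online$.pipe(first()).subscribe(() => poller.start());
}
```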

@gmmorris gmmorris changed the title Task Manager ignores the loss of connectivity with ES Task Manager ignores the loss of connectivity with ES and can potentially prevent Kibana from loading Aug 19, 2020
@gmmorris gmmorris added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels Aug 19, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Aug 19, 2020

The suggested approach sounds right.

But I'm curious about:

> new requests could come into Task Manager before it's ready

Do you mean via the plugin APIs? We already have a waitUntilStarted() gate; should we queue them, or could we instead block them until things are ready again? We'd probably want that "blocker" to be smart enough to log an error message every few minutes.
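
A rough sketch of such a blocker, purely for illustration (the Logger interface and ReadinessGate class below are assumptions, not existing Task Manager code):

```ts
// A gate that holds callers until Task Manager is ready again, and logs an
// error periodically while requests are waiting so the outage stays visible.
interface Logger {
  error(message: string): void;
}

class ReadinessGate {
  private ready = false;
  private waiters: Array<() => void> = [];

  constructor(
    private readonly logger: Logger,
    private readonly logIntervalMs = 5 * 60 * 1000 // every five minutes
  ) {}

  markUnavailable() {
    this.ready = false;
  }

  markAvailable() {
    this.ready = true;
    this.waiters.forEach((resolve) => resolve());
    this.waiters = [];
  }

  // Callers await this before scheduling or claiming work.
  async waitUntilReady(): Promise<void> {
    if (this.ready) return;
    const timer = setInterval(
      () => this.logger.error('Task Manager is waiting for Elasticsearch to become available'),
      this.logIntervalMs
    );
    try {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } finally {
      clearInterval(timer);
    }
  }
}
```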

What about tasks that are running when the connection is lost? I guess they'll get retried if we just drop them, but it seems like some are highly likely to work and you wouldn't want to retry them (e.g. Slack, email), and some are highly likely not to work and you would want to retry them (anything touching ES). "It's complicated".

@azasypkin (Member)

> We must also await the creation of the Saved Object mappings.

I just tried waiting for coreStatus.savedObjects.level === ServiceStatusLevels.available before scheduling a task in the Security plugin, but surprisingly it didn't help. Either the issue is somewhere else, or core's SO status doesn't reflect the fact that mappings are being recovered (assuming that's actually handled by core).
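
For reference, a minimal sketch of that kind of wait, with the status shape simplified for illustration (the string comparison stands in for checking against ServiceStatusLevels.available):

```ts
import { Observable } from 'rxjs';
import { filter, first } from 'rxjs/operators';

// Simplified view of the core status stream; only the Saved Objects level matters here.
interface SimplifiedCoreStatus {
  savedObjects: { level: { toString(): string } };
}

// Resolve once Saved Objects report "available", then run the scheduling call.
async function scheduleWhenSavedObjectsAvailable(
  core$: Observable<SimplifiedCoreStatus>,
  scheduleTask: () => Promise<void>
): Promise<void> {
  await core$
    .pipe(
      filter((status) => status.savedObjects.level.toString() === 'available'),
      first()
    )
    .toPromise();
  await scheduleTask();
}
```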

@mikecote mikecote added the bug, discuss, and R&D labels and removed the loe:needs-research (This issue requires some research before it can be worked on or estimated) label Aug 26, 2020
@gmmorris (Contributor, Author) commented Sep 2, 2020

> The suggested approach sounds right.
>
> But I'm curious about:
>
> > new requests could come into Task Manager before it's ready
>
> Do you mean via the plugin APIs? We already have a waitUntilStarted() gate; should we queue them, or could we instead block them until things are ready again? We'd probably want that "blocker" to be smart enough to log an error message every few minutes.

The gate only applies when Kibana is starting up from scratch; it doesn't apply when ES connectivity is lost and a new Kibana index is being set up, which is what broke things for @azasypkin.

We can probably block the requests until ES is back 🤔 but as most operations are requests to schedule things for future execution, I'm not sure we have to.
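
To illustrate, buffering could look roughly like this (a sketch only; TaskInstance, esAvailable$, and scheduleNow are hypothetical names, not existing Task Manager APIs):

```ts
import { Observable } from 'rxjs';

// Minimal shape for a scheduling request; the real TaskInstance has more fields.
interface TaskInstance {
  taskType: string;
  params: Record<string, unknown>;
}

// Buffer incoming schedule requests while ES is down and flush them once
// connectivity is restored. Scheduling only asks for a future run, so
// delaying the write is safe.
class BufferedScheduler {
  private buffer: TaskInstance[] = [];
  private esIsUp = false;

  constructor(
    esAvailable$: Observable<boolean>,
    private readonly scheduleNow: (task: TaskInstance) => Promise<void>
  ) {
    esAvailable$.subscribe((isUp) => {
      this.esIsUp = isUp;
      if (isUp) this.flush();
    });
  }

  async schedule(task: TaskInstance): Promise<void> {
    if (this.esIsUp) {
      return this.scheduleNow(task);
    }
    this.buffer.push(task);
  }

  private flush() {
    const pending = this.buffer.splice(0);
    pending.forEach((task) => void this.scheduleNow(task));
  }
}
```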

@mikecote mikecote mentioned this issue Sep 16, 2020
@gmmorris gmmorris self-assigned this Oct 26, 2020
@kobelb kobelb added the needs-team (Issues missing a team label) label Jan 31, 2022
@botelastic botelastic bot removed the needs-team (Issues missing a team label) label Jan 31, 2022