
Task Manager ignores the loss of connectivity with ES and can potentially prevent Kibana from loading #75501

Closed
gmmorris opened this issue Aug 19, 2020 · 4 comments · Fixed by #81779
Labels: bug (Fixes for quality problems that affect the customer experience), discuss, Feature:Task Manager, R&D (Research and development ticket; not meant to produce code, but to make a decision), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris (Contributor)

If ES goes down but Kibana keeps running, Task Manager continues to poll, ignoring the downtime.

This introduces three issues:

  1. The log fills with error messages from Task Manager saying: [error][plugins][taskManager][taskManager] Failed to poll for work: Error: No Living connections
  2. If a fresh ES is booted up, Kibana tries to do some recovery so that, for example, new Saved Object mappings are created. As Task Manager has been running throughout, it begins running queries the moment it can, resulting in the error No field found for [task.retryAt] in mapping being thrown, as the SO mapping hasn't yet been defined (a more detailed error is below).
[error][plugins][taskManager][taskManager] Failed to poll for work: [script_exception] runtime error, with { script_stack={ 0="org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)" & 1="org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)" & 2="if (doc['task.retryAt'].size()!=0) {\n  " & 3="        ^---- HERE" } & script=" ..." & lang="painless" & position={ offset=9 & start=1 & end=40 } } :: {"path":"/.kibana_task_manager/_update_by_query","query":{"ignore_unavailable":true,"refresh":true,"max_docs":10,"conflicts":"proceed"},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"bool\":{\"must\":[{\"term\":{\"task.status\":\"idle\"}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}}]}},{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"term\":{\"task.status\":\"running\"}},{\"term\":{\"task.status\":\"claiming\"}}]}},{\"range\":{\"task.retryAt\":{\"lte\":\"now\"}}}]}}]}},{\"bool\":{\"should\":[{\"exists\":{\"field\":\"task.schedule\"}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"vis_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"session_cleanup\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"lens_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.email\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.index\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.pagerduty\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.server-log\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.slack\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.webhook\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.servicenow\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.jira\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions:.resilient\"}},{\"range\":{\"task.attempts\":{\"lt\":1}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"actions_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:.index-threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:example.always-firing\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:example.people-in-space\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:siem.signals\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:siem.notifications\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"endpoint:user-artifact-packager\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.
taskType\":\"alerting:metrics.alert.threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:metrics.alert.inventory.threshold\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:logs.alert.document.count\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_cluster_health\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_license_expiration\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_cpu_usage\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_nodes_changed\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_logstash_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_kibana_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:monitoring_alert_elasticsearch_version_mismatch\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:apm.transaction_duration\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:apm.error_rate\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"apm-telemetry-task\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.monitorStatus\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.tls\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"alerting:xpack.uptime.alerts.durationAnomaly\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}}]}}]}}],\"filter\":[{\"bool\":{\"must_not\":[{\"bool\":{\"should\":[{\"term\":{\"task.status\":\"running\"}},{\"term\":{\"task.status\":\"claiming\"}}],\"must\":{\"range\":{\"task.retryAt\":{\"gt\":\"now\"}}}}}]}}]}}]}},\"sort\":[\"_score\",{\"_script\":{\"type\":\"number\",\"order\":\"asc\",\"script\":{\"lang\":\"painless\",\"source\":\"\\nif (doc['task.retryAt'].size()!=0) {\\n  return doc['task.retryAt'].value.toInstant().toEpochMilli();\\n}\\nif (doc['task.runAt'].size()!=0) {\\n  return doc['task.runAt'].value.toInstant().toEpochMilli();\\n}\\n    \"}}}],\"seq_no_primary_term\":true,\"script\":{\"source\":\"ctx._source.task.ownerId=params.ownerId; ctx._source.task.status=params.status; ctx._source.task.retryAt=params.retryAt;\",\"lang\":\"painless\",\"params\":{\"ownerId\":\"kibana:5b2de169-2785-441b-ae8c-186a1936b17d\",\"status\":\"claiming\",\"retryAt\":\"2020-08-19T21:12:55.732Z\"}}}","statusCode":400,"response":"{\"error\":{\"root_cause\":[{\"type\":\"script_exception\",\"reason\":\"runtime error\",\"script_stack\":[\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)\",\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)\",\"if (doc['task.retryAt'].size()!=0) {\\n  \",\"        ^---- HERE\"],\"script\":\" 
...\",\"lang\":\"painless\",\"position\":{\"offset\":9,\"start\":1,\"end\":40}}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\".kibana_task_manager\",\"node\":\"PrxvKU5LTtSP7mR-pqeoLw\",\"reason\":{\"type\":\"script_exception\",\"reason\":\"runtime error\",\"script_stack\":[\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:71)\",\"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:37)\",\"if (doc['task.retryAt'].size()!=0) {\\n  \",\"        ^---- HERE\"],\"script\":\" ...\",\"lang\":\"painless\",\"position\":{\"offset\":9,\"start\":1,\"end\":40},\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"No field found for [task.retryAt] in mapping\"}}}]},\"status\":400}"}
  3. Once ES is back, even though the SO mappings haven't yet been restored, Task Manager can still respond to scheduling requests. This means it's possible to schedule a task before the mappings are set, and Elasticsearch then gets stuck when Kibana tries to create the mapping, because dynamic mappings have already been created from the scheduled task.

Suggested approach:
Task Manager should stop polling the moment ES becomes unavailable and listen for the online$: Observable<OnlineStatusRetryScheduler> that Platform exposes, so that it can start polling again when ES becomes available. We must also await the creation of the Saved Object mappings. As the start event happens before the mappings are created, new requests could come into Task Manager before it's ready, so it should buffer these and respond to them when it can.
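
As a rough illustration, the wiring could look something like the sketch below. This is not actual Task Manager code: the simplified OnlineStatusRetryScheduler shape and the poller's start/stop handle are assumptions.

```ts
import { Observable } from 'rxjs';
import { first } from 'rxjs/operators';

// Simplified stand-in for the scheduler emitted by core's online$ stream.
interface OnlineStatusRetryScheduler {
  scheduleRetry: (retry: () => void) => void;
}

// Sketch: stop claiming work the moment an ES outage is detected, and resume
// the next time core reports that ES is reachable again.
export function pausePollingUntilEsReturns(
  online$: Observable<OnlineStatusRetryScheduler>,
  poller: { start(): void; stop(): void }
): void {
  poller.stop();
  online$.pipe(first()).subscribe(() => poller.start());
}
```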

@gmmorris gmmorris changed the title Task Manager ignores the loss of connectivity with ES Task Manager ignores the loss of connectivity with ES and can potentially prevent Kibana from loading Aug 19, 2020
@gmmorris gmmorris added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels Aug 19, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Aug 19, 2020

The suggested approach sounds right.

But I'm curious about:

> new requests could come into Task Manager before it's ready

Do you mean via the plugin APIs? We already have a waitUntilStarted() gate; should we queue them, or could we instead block them until things are ready again? We'd probably want that "blocker" to be smart enough to log an error message every few minutes.
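
A rough sketch of such a blocker, purely for illustration (the Logger interface and ReadinessGate class below are assumptions, not existing Task Manager code):

```ts
// A gate that holds callers until Task Manager is ready again, and logs an
// error periodically while requests are waiting so the outage stays visible.
interface Logger {
  error(message: string): void;
}

class ReadinessGate {
  private ready = false;
  private waiters: Array<() => void> = [];

  constructor(
    private readonly logger: Logger,
    private readonly logIntervalMs = 5 * 60 * 1000 // every five minutes
  ) {}

  markUnavailable() {
    this.ready = false;
  }

  markAvailable() {
    this.ready = true;
    this.waiters.forEach((resolve) => resolve());
    this.waiters = [];
  }

  // Callers await this before scheduling or claiming work.
  async waitUntilReady(): Promise<void> {
    if (this.ready) return;
    const timer = setInterval(
      () => this.logger.error('Task Manager is waiting for Elasticsearch to become available'),
      this.logIntervalMs
    );
    try {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } finally {
      clearInterval(timer);
    }
  }
}
```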

What about tasks that are running when the connection is lost? I guess they'll get retried if we just drop them, but it seems like some are highly likely to work and you wouldn't want to retry them (e.g. Slack, email), and some are highly likely not to work and you would want to retry them (anything touching ES). "It's complicated".

@azasypkin (Member)

> We must also await the creation of the Saved Object mappings.

I just tried waiting for coreStatus.savedObjects.level === ServiceStatusLevels.available before scheduling a task in the Security plugin, but surprisingly it didn't help. Either the issue is somewhere else, or core's SO status doesn't reflect the fact that mappings are being recovered (assuming that's actually handled by core).
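
For reference, a minimal sketch of that kind of wait, with the status shape simplified for illustration (the string comparison stands in for checking against ServiceStatusLevels.available):

```ts
import { Observable } from 'rxjs';
import { filter, first } from 'rxjs/operators';

// Simplified view of the core status stream; only the Saved Objects level matters here.
interface SimplifiedCoreStatus {
  savedObjects: { level: { toString(): string } };
}

// Resolve once Saved Objects report "available", then run the scheduling call.
async function scheduleWhenSavedObjectsAvailable(
  core$: Observable<SimplifiedCoreStatus>,
  scheduleTask: () => Promise<void>
): Promise<void> {
  await core$
    .pipe(
      filter((status) => status.savedObjects.level.toString() === 'available'),
      first()
    )
    .toPromise();
  await scheduleTask();
}
```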

@mikecote mikecote added the bug, discuss, and R&D labels and removed the loe:needs-research (This issue requires some research before it can be worked on or estimated) label Aug 26, 2020
@gmmorris (Contributor, Author) commented Sep 2, 2020

> The suggested approach sounds right.
>
> But I'm curious about:
>
> > new requests could come into Task Manager before it's ready
>
> Do you mean via the plugin APIs? We already have a waitUntilStarted() gate; should we queue them, or could we instead block them until things are ready again? We'd probably want that "blocker" to be smart enough to log an error message every few minutes.

The gate only applies when Kibana is starting up from scratch; it doesn't apply when ES connectivity is lost and a new Kibana index is being set up, which is what broke things for @azasypkin.

We can probably block the requests until ES is back 🤔 but as most operations are requests to schedule things for future execution, I'm not sure we have to.
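
To illustrate, buffering could look roughly like this (a sketch only; TaskInstance, esAvailable$, and scheduleNow are hypothetical names, not existing Task Manager APIs):

```ts
import { Observable } from 'rxjs';

// Minimal shape for a scheduling request; the real TaskInstance has more fields.
interface TaskInstance {
  taskType: string;
  params: Record<string, unknown>;
}

// Buffer incoming schedule requests while ES is down and flush them once
// connectivity is restored. Scheduling only asks for a future run, so
// delaying the write is safe.
class BufferedScheduler {
  private buffer: TaskInstance[] = [];
  private esIsUp = false;

  constructor(
    esAvailable$: Observable<boolean>,
    private readonly scheduleNow: (task: TaskInstance) => Promise<void>
  ) {
    esAvailable$.subscribe((isUp) => {
      this.esIsUp = isUp;
      if (isUp) this.flush();
    });
  }

  async schedule(task: TaskInstance): Promise<void> {
    if (this.esIsUp) {
      return this.scheduleNow(task);
    }
    this.buffer.push(task);
  }

  private flush() {
    const pending = this.buffer.splice(0);
    pending.forEach((task) => void this.scheduleNow(task));
  }
}
```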

@mikecote mikecote mentioned this issue Sep 16, 2020
@gmmorris gmmorris self-assigned this Oct 26, 2020
@kobelb kobelb added the needs-team (Issues missing a team label) label Jan 31, 2022
@botelastic botelastic bot removed the needs-team (Issues missing a team label) label Jan 31, 2022