-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Task Manager] old idle task not run when there's a large backlog of idle tasks #80371
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
Gidi suggested capturing the task manager update-by-query params we're using, so that's here: query but with some task-type match clauses elidedcallCluster('updateByQuery' {
"index": ".kibana_task_manager",
"ignoreUnavailable": true,
"refresh": true,
"max_docs": 10,
"conflicts": "proceed",
"body": {
"query": {
"bool": {
"must": [
{
"term": {
"type": "task"
}
},
{
"bool": {
"must": [
{
"bool": {
"must": [
{
"bool": {
"should": [
{
"bool": {
"must": [
{
"term": {
"task.status": "idle"
}
},
{
"range": {
"task.runAt": {
"lte": "now"
}
}
}
]
}
},
{
"bool": {
"must": [
{
"bool": {
"should": [
{
"term": {
"task.status": "running"
}
},
{
"term": {
"task.status": "claiming"
}
}
]
}
},
{
"range": {
"task.retryAt": {
"lte": "now"
}
}
}
]
}
}
]
}
},
{
"bool": {
"should": [
{
"exists": {
"field": "task.schedule"
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "session_cleanup"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
// block below repeated for every task type
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.server-log"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
}
// block above repeated for every task type
]
}
}
]
}
}
],
"filter": [
{
"bool": {
"must_not": [
{
"bool": {
"should": [
{
"term": {
"task.status": "running"
}
},
{
"term": {
"task.status": "claiming"
}
}
],
"must": {
"range": {
"task.retryAt": {
"gt": "now"
}
}
}
}
}
]
}
}
]
}
}
]
}
},
"sort": [
"_score",
{
"_script": {
"type": "number",
"order": "asc",
"script": {
"lang": "painless",
"source": "\nif (doc['task.retryAt'].size()!=0) {\n return doc['task.retryAt'].value.toInstant().toEpochMilli();\n}\nif (doc['task.runAt'].size()!=0) {\n return doc['task.runAt'].value.toInstant().toEpochMilli();\n}\n "
}
}
}
],
"seq_no_primary_term": true,
"script": {
"source": "ctx._source.task.ownerId=params.ownerId; ctx._source.task.status=params.status; ctx._source.task.retryAt=params.retryAt;",
"lang": "painless",
"params": {
"ownerId": "kibana:5b2de169-2785-441b-ae8c-186a1936b17d",
"status": "claiming",
"retryAt": "2020-10-13T17:11:21.892Z"
}
}
}
} The sort script, near the bottom, reformatted, is: if (doc['task.retryAt'].size()!=0) {
return doc['task.retryAt'].value.toInstant().toEpochMilli();
}
if (doc['task.runAt'].size()!=0) {
return doc['task.runAt'].value.toInstant().toEpochMilli();
} the elided task-type match clauses removed from above {
"bool": {
"must": [
{
"term": {
"task.taskType": "lens_telemetry"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.email"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.index"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.pagerduty"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.server-log"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.slack"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.webhook"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.servicenow"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.jira"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions:.resilient"
}
},
{
"range": {
"task.attempts": {
"lt": 1
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "actions_telemetry"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting_telemetry"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:.index-threshold"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:.geo-threshold"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:example.always-firing"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:example.people-in-space"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:siem.signals"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:siem.notifications"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "endpoint:user-artifact-packager"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:metrics.alert.threshold"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:metrics.alert.inventory.threshold"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:logs.alert.document.count"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_cluster_health"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_license_expiration"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_cpu_usage"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_missing_monitoring_data"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_disk_usage"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_jvm_memory_usage"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_nodes_changed"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_logstash_version_mismatch"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_kibana_version_mismatch"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:monitoring_alert_elasticsearch_version_mismatch"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:apm.transaction_duration"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:apm.transaction_duration_anomaly"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:apm.error_rate"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:apm.transaction_error_rate"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "apm-telemetry-task"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:xpack.uptime.alerts.monitorStatus"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:xpack.uptime.alerts.tls"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"task.taskType": "alerting:xpack.uptime.alerts.durationAnomaly"
}
},
{
"range": {
"task.attempts": {
"lt": 3
}
}
}
]
}
} |
I see a Looking at the action and alert tasks above, and seeing how that sort runs, looks like it's coded the way you want. The I should note that I checked several times during my testing to see if any errors were being logged, never found anything. |
under Description from the update-by-query docs
And docs for
Given that, seems like quite possibly UBQ find 10 updateable docs, and then says "done!", missing older ones |
Some more interesting settings:
Opened an issue to get the docs fixed: elastic/elasticsearch#63637
|
If the missing docs are in fact missing because of At this point, I think I've exhausted by ES foo on this, will need to pull in some other experts ... |
Task manager also uses This is there for Kibana instances to each claim 10 tasks at the same time and skip through updates that return 409 errors. |
hmmm ... I was running multiple Kibana instances sometimes, and I did happen to look at some ES stats that indicated some conflicts had occurred ... |
So, I was dying to try deleting some alerts to see if TM would then "see" the action docs. But don't want to touch the environment. So created a new deployment with a snapshot of the one I've been playing with. Came right up, looks exactly the same, noice! First tried deleting one alert - nothing; well, it did delete one task doc (the alert one). Then tried 50, then tried the final 49. Nothing. Actions still not running. Also tried restarting Kibana, same. So it seems unlikely it has anything to do with |
Playing around some more yesterday, I found the smoothest way to "wake up" some of the old tasks was to disable a significant # of the 100 enabled alerts. Started with 50. After a few minutes, started getting a few of them to run. |
Here's a version of the query used, that you can paste into the dev tools console. It's not exactly the same; I used dev tools query
Here's the response with hits elided, and then the first hit: {
"took" : 81,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : []
}
} {
"_index" : ".kibana_task_manager_1",
"_type" : "_doc",
"_id" : "task:2831c731-0d14-11eb-ae66-f1f66f186dec",
"_seq_no" : 1507447,
"_primary_term" : 1,
"_score" : 3.6364787,
"_source" : {
"migrationVersion" : {
"task" : "7.6.0"
},
"task" : {
"taskType" : "alerting:.index-threshold",
"retryAt" : null,
"runAt" : "2020-10-14T15:56:59.636Z",
"scope" : [
"alerting"
],
"startedAt" : null,
"state" : """{"alertInstances":{},"previousStartedAt":"2020-10-14T15:56:58.636Z"}""",
"params" : """{"alertId":"793b2a84-2399-4917-8670-95a7b25b29fc","spaceId":"default"}""",
"ownerId" : null,
"scheduledAt" : "2020-10-13T05:22:58.979Z",
"attempts" : 0,
"status" : "idle"
},
"references" : [ ],
"updated_at" : "2020-10-14T15:56:59.860Z",
"type" : "task"
},
"sort" : [
3.6364787,
1.602691019636E12
]
} The Looking at the 10 results, they all had Which makes that score even more suspicious. Smelling like the sort script to me ... |
doc on scripting a sort: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html "sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "doc['theatre'].value.length() * params.factor",
"params": {
"factor": 1.1
}
},
"order": "asc"
} So, yeah, it looks like we're sorting on our current sort: "sort": [
"_score",
{
"_script": {
"type": "number",
"order": "asc",
"script": {
"lang": "painless",
"source": "..."
}
}
}
] maybe it should be more like this? "sort": {
"_script": {
"type": "number",
"order": "asc",
"script": {
"lang": "painless",
"source": "..."
}
}
} I'll see if I can repro the problem locally (I've just been doing it on cloud), and then see if changing the sort gets things moving again ... |
Here's how I'm trying to repro.
This should start producing a lot of action requests. You can check by doing a cat of the task manager index - there's only a handful of non-alerting tasks, and then there should be 100 alert tasks and the rest are action execution tasks. At some point, the actions will start running, but you can get them to start again by killing That's the basic gist - we should come up with a script we can use that just touches task manager. Maybe add an example task in examples, that we can queue up via http requests, then figure out how to tell if "old" ones get stuck. |
well, dang, here's the commentary on why kibana/x-pack/plugins/task_manager/server/task_store.ts Lines 297 to 302 in 00a9a46
The "pinned" tasks came in with PR #54765 . I vaguely remember this, it was about having to replace using |
I happened to think to try the devtools query, but remove the |
Yup, you're right - We can change the ubq to only include |
Ya, that seems right. I thought I noticed a path where we didn't add the score, so thought maybe we were already optionally using it - but I wasn't sure what those paths were. In this case, there shouldn't be any |
Not in Task Store - the |
My attempted repro ^^^ wasn't perfect, because it didn't result in any "zombie" tasks, but I did notice it was running a 60/40 mix of (old actions)/ (alerts), when I think it should have been 100/0, given the actions were all older than the runAt's of the alerts. So, the change should be noticeable with a fix. BTW, I've developed some viz's and dashboards to try to visualize some of this stuff, have it in a gist here which you should should be able to import into Kibana as saved objects: https://gist.github.com/pmuellr/b19420bd52124c50fc3cdb116f56644e |
It might be worth looking into whether we can nudge the |
Ya, you could imagine somehow changing the score to reflect the "age" of the task, but then it's not clear how you then boost pinned tasks so they always "beat" those. I'm hesitant to add even more complexity here! :-) |
I have a draft PR here #80692 that only adds the score sort when there are pinned tasks. When I re-ran my non-perfect repro ^^^ with this, I was unable to get more than ~500 tasks in the tm index. I think that's a good sign - it means that some of the "older" actions that were queueing up previously are now being run, keeping things a bit more "stable" in terms of running "old" things. OTOH, this will be a probable systemic change - I'd guess that alerts will be running less often than they did before, because these older actions will now be running where they weren't before, consuming the task slots that alerts were using. I'd guess this is only problematic for cases where there are A LOT of actions queued up, and I hope that would be an uncommon case. Makes me think this may not be critical to ship as in 7.10, we can probably wait till 7.11. A workaround if we notice this in the field would be to disable a number of alerts for a short while, then re-enable them. This seemed to be enough to get older actions running again in cases where I noticed 10's of 1000's of action task zombies. |
I tend to agree regarding 7.10. |
…tasks resolves: elastic#80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any.
Just kind of a drive by question by someone totally unfamiliar with the process: do y'all have randomized exponential backoff for the task grabbing failures? I feel like tcp style backoff could make the process tolerant of pretty wonky issues. It could change the problem from "we can't run tasks" to "running tasks is slower than I'd like". |
…tasks (#80692) resolves: #80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any.
…tasks (elastic#80692) resolves: elastic#80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any.
…tasks (elastic#80692) resolves: elastic#80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any. # Conflicts: # x-pack/plugins/task_manager/server/task_store.test.ts
…tasks (#80692) (#81495) resolves: #80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any.
…pinned tasks (#80692) (#81497) resolves: #80371 Previously, when claiming tasks, we were always sorting the tasks to claim by the score and then by the time they should be run. We sort by score to capture `runNow()` tasks, also referred to internally as "pinned" tasks in the update by query. The change in this PR is to only sort by score if there are pinned tasks, and to not sort by score at all if there aren't any. Also had to fix type check error after fixing a. merge conflict during the backport:
7.10.0-snapshot on staging cloud (post 7.10 FF)
I've been running some load testing on staging cloud with 7.10, and managed to get to a point where I have 96K tasks queued (for action execution), which appear to not being picked up to run. While this is going on, alerts are running, so tasks are running, it's just that these old tasks are not running.
Here's an example task which is not being run, but should, as it's almost 12 hours old at this point:
Here's a task doc for a currently running alert
Not exactly sure what I did to get in this state, exactly, but here are some of the things I was playing with.
KBN_URLBASE
andES_URL
as appropriate for the following scriptsES_URL
as appropriateALERTS=100 whole-lotta-alerts.sh
, where that script is from this gistes-apm-sys-sim 1 4 -k
- that allows you to change the cpu threshold with keystrokes; bump them all to 100% by holding down keys1
, then2
, etcThings will back up fairly quickly. You can get the number of tasks queued (basically actions queued to run) via
You can kill
es-apm-sys-sim
so that the alerts will stop firing, and see if the actions will eventually get run, and then deleted from the tm index. You can also easily "upgrade" Kibana, from say 1 1GB RAM instance to 4 8GB RAM instances, and that has worked at times to get the queued actions resolved.To get to this state, I let ~100K tasks get queued, then killed
es-apm-sys-sim
, so that actions should STOP being queued, and hopefully resolved. No more tasks got queued (as expected), but these old ones also never got run and so the tm index was left with ~100K action tasks. Left it running overnight with 1 1GB Kibana instance, and they never got run.The text was updated successfully, but these errors were encountered: