
[Task Manager] Log at different levels based on the state #101751

Merged

Conversation

chrisronline
Contributor

@chrisronline chrisronline commented Jun 9, 2021

Relates to #101505

This PR introduces logic that will change how we log the monitoring stats to the Kibana server log:

// Current behavior: every monitored-health event that makes it through the
// (throttled) stream is logged at debug level, regardless of its status.
.subscribe(([monitoredHealth, serviceStatus]) => {
  serviceStatus$.next(serviceStatus);
  logger.debug(`Latest Monitored Stats: ${JSON.stringify(monitoredHealth)}`);
});

Currently, we write a debug log entry every time an event is pushed into the stream (not every event, since we utilize throttling), which is helpful if verbose logging is configured by the user. More commonly, users do not have this configured (as it produces a lot of noise), so this logging paradigm has limited use: a user would need to know there was a problem, restart Kibana with the config change, and then observe the metrics, assuming the problem happens regularly.

This PR changes that by writing to a different log level based on a few things:

  • The status of each "bucket" of the API response (currently: runtime, configuration, and workload). If the status is Warning, then we log as a warning. If the status is Error, we log as an error.
  • The worst-case calculated drift (across all task types), represented by stats.runtime.value.drift.p99, is above a configurable threshold (which defaults to 1m for no particular reason, so any insight here would be great). If this happens, we log as a warning.

This will help ensure these metrics are written to the logs when task manager is underperforming, and it will give valuable insight into why.
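
To make the described behavior concrete, here is a minimal sketch of the level-selection idea. The names (logHealthMetrics, MonitoredHealth, warnDriftInSeconds) are illustrative assumptions rather than the PR's exact identifiers, and drift is assumed to be reported in milliseconds.

// Sketch only: identifiers, shapes, and units here are assumptions, not the PR's exact code.
interface Logger {
  debug(msg: string): void;
  warn(msg: string): void;
  error(msg: string): void;
}

type Status = 'OK' | 'warn' | 'error';

interface MonitoredHealth {
  status: Status;
  stats: {
    configuration?: { status: Status };
    workload?: { status: Status };
    runtime?: { status: Status; value: { drift: { p99: number | null } } };
  };
}

function logHealthMetrics(
  monitoredHealth: MonitoredHealth,
  logger: Logger,
  warnDriftInSeconds: number // hypothetical config parameter; the PR's default threshold is 60s
) {
  const message = `Latest Monitored Stats: ${JSON.stringify(monitoredHealth)}`;
  const statuses = [
    monitoredHealth.stats.configuration?.status,
    monitoredHealth.stats.workload?.status,
    monitoredHealth.stats.runtime?.status,
  ];
  // Assumes drift.p99 is reported in milliseconds.
  const driftP99 = monitoredHealth.stats.runtime?.value.drift.p99 ?? 0;

  if (statuses.includes('error')) {
    logger.error(message);
  } else if (statuses.includes('warn') || driftP99 >= warnDriftInSeconds * 1000) {
    logger.warn(message);
  } else {
    logger.debug(message);
  }
}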

@chrisronline chrisronline marked this pull request as ready for review June 11, 2021 01:59
@chrisronline chrisronline requested a review from a team as a code owner June 11, 2021 01:59
@chrisronline chrisronline changed the title [Alerting] Log at different levels based on the state [Task Manager] Log at different levels based on the state Jun 11, 2021
@chrisronline chrisronline added the Feature:Task Manager, Team:ResponseOps, v7.14.0, and v8.0.0 labels Jun 11, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@chrisronline
Contributor Author

@ymao1
Regarding this comment, thank you so much for taking the time to look into that. I wonder if putting too much implementation detail into the mock makes it hard to maintain long-term. Would it be obvious that someone needs to change the mock when they change the original code? I'd imagine the tests would still pass in that case.

@ymao1
Contributor

ymao1 commented Jun 14, 2021

@ymao1
Regarding this comment, thank you so much for taking the time to look into that. I wonder if putting too much implementation detail into the mock makes it hard to maintain long-term. Would it be obvious that someone needs to change the mock when they change the original code? I'd imagine the tests would still pass in that case.

I don't disagree with any of this :) I'm also not a fan of injecting a custom status just for testing but if that's the only way to make it work, then that is fine as well.

@chrisronline
Contributor Author

@ymao1 Thanks for the suggestion to move the logic to a separate function. That allowed me to easily mock the function to craft appropriate unit tests without needing the customStatus!
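
For illustration, here is a sketch of that refactor under assumed names (a log_health_metrics module exporting logHealthMetrics; none of these file or function names are confirmed by the PR). Once the level-selection logic lives in its own module, tests for the caller can mock that module with Jest instead of injecting a custom status.

// log_health_metrics.ts (hypothetical module)
export interface Logger {
  debug(msg: string): void;
  warn(msg: string): void;
  error(msg: string): void;
}

export function logHealthMetrics(monitoredHealth: object, logger: Logger): void {
  // The level-selection logic (see the earlier sketch) lives here, in one mockable place.
  logger.debug(`Latest Monitored Stats: ${JSON.stringify(monitoredHealth)}`);
}

// monitoring.ts (hypothetical caller)
import { logHealthMetrics, Logger } from './log_health_metrics';

export function handleHealthEvent(monitoredHealth: object, logger: Logger): void {
  logHealthMetrics(monitoredHealth, logger);
}

// monitoring.test.ts
import { handleHealthEvent } from './monitoring';
import { logHealthMetrics } from './log_health_metrics';

jest.mock('./log_health_metrics', () => ({ logHealthMetrics: jest.fn() }));

test('delegates logging to logHealthMetrics', () => {
  const logger = { debug: jest.fn(), warn: jest.fn(), error: jest.fn() };
  handleHealthEvent({ status: 'OK' }, logger);
  expect(logHealthMetrics).toHaveBeenCalledWith({ status: 'OK' }, logger);
});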

Contributor

@ymao1 ymao1 left a comment

LGTM after removing the console logs!

x-pack/plugins/task_manager/server/routes/health.ts (two outdated review threads, resolved)
@mikecote mikecote self-requested a review June 15, 2021 18:24
Comment on lines 36 to 40
logger.warn(
`Latest Monitored Stats (${contextMessage ?? `warning status`}): ${JSON.stringify(
monitoredHealth
)}`
);
Contributor

@mikecote mikecote Jun 15, 2021

When testing locally, I started a fresh Kibana and got this "warning status" log message right away. After investigating, I found that the capacity estimation caused the health status to be "warning" (see below).

It sounds like a bug in the capacity estimation API, one that this PR exposes more severely?

The second log I saw was "error status" while all the sub-elements were "OK" 🤔

These don't seem related to this PR but definitely expose a couple of potential bugs. cc @gmmorris

Health output on start-up
{"id":"5b2de169-2785-441b-ae8c-186a1936b17d","timestamp":"2021-06-15T18:46:24.663Z","status":"warn","last_update":"2021-06-15T18:46:24.663Z","stats":{"configuration":{"timestamp":"2021-06-15T18:46:19.871Z","value":{"request_capacity":1000,"max_poll_inactivity_cycles":10,"monitored_aggregated_stats_refresh_rate":60000,"monitored_stats_running_average_window":50,"monitored_task_execution_thresholds":{"default":{"error_threshold":90,"warn_threshold":80},"custom":{}},"poll_interval":3000,"max_workers":10},"status":"OK"},"runtime":{"timestamp":"2021-06-15T18:46:19.872Z","value":{"polling":{"claim_duration":{"p50":null,"p90":null,"p95":null,"p99":null},"duration":{"p50":null,"p90":null,"p95":null,"p99":null},"claim_conflicts":{"p50":null,"p90":null,"p95":null,"p99":null},"claim_mismatches":{"p50":null,"p90":null,"p95":null,"p99":null},"result_frequency_percent_as_number":{"Failed":0,"NoAvailableWorkers":0,"NoTasksClaimed":0,"RanOutOfCapacity":0,"RunningAtCapacity":0,"PoolFilled":0}},"drift":{"p50":null,"p90":null,"p95":null,"p99":null},"drift_by_type":{},"load":{"p50":null,"p90":null,"p95":null,"p99":null},"execution":{"duration":{},"duration_by_persistence":{"recurring":{"p50":null,"p90":null,"p95":null,"p99":null},"non_recurring":{"p50":null,"p90":null,"p95":null,"p99":null},"ephemeral":{"p50":null,"p90":null,"p95":null,"p99":null}},"persistence":{"recurring":0,"non_recurring":0,"ephemeral":0},"result_frequency_percent_as_number":{}}},"status":"OK"},"workload":{"timestamp":"2021-06-15T18:46:24.663Z","value":{"count":0,"task_types":{},"non_recurring":0,"owner_ids":0,"schedule":[],"overdue":0,"overdue_non_recurring":0,"estimated_schedule_density":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"capacity_requirments":{"per_minute":0,"per_hour":0,"per_day":0}},"status":"OK"},"capacity_estimation":{"status":"warn","timestamp":"2021-06-15T18:46:24.664Z","value":{"observed":{"observed_kibana_instances":1,"max_throughput_per_minute_per_kibana":200,"max_throughput_per_minute":200,"minutes_to_drain_overdue":null,"avg_recurring_required_throughput_per_minute":0,"avg_recurring_required_throughput_per_minute_per_kibana":0,"avg_required_throughput_per_minute":null,"avg_required_throughput_per_minute_per_kibana":null},"proposed":{"provisioned_kibana":null,"min_required_kibana":0,"avg_recurring_required_throughput_per_minute_per_kibana":null,"avg_required_throughput_per_minute_per_kibana":null}}}}}
Second health log
{"id":"5b2de169-2785-441b-ae8c-186a1936b17d","timestamp":"2021-06-15T18:46:33.787Z","status":"error","last_update":"2021-06-15T18:46:33.787Z","stats":{"configuration":{"timestamp":"2021-06-15T18:46:19.871Z","value":{"request_capacity":1000,"max_poll_inactivity_cycles":10,"monitored_aggregated_stats_refresh_rate":60000,"monitored_stats_running_average_window":50,"monitored_task_execution_thresholds":{"default":{"error_threshold":90,"warn_threshold":80},"custom":{}},"poll_interval":3000,"max_workers":10},"status":"OK"},"runtime":{"timestamp":"2021-06-15T18:46:33.787Z","value":{"polling":{"last_successful_poll":"2021-06-15T18:46:29.725Z","last_polling_delay":"2021-06-15T18:46:23.697Z","claim_duration":{"p50":9,"p90":1087,"p95":1087,"p99":1087},"duration":{"p50":54,"p90":91,"p95":91,"p99":91},"claim_conflicts":{"p50":0,"p90":0,"p95":0,"p99":0},"claim_mismatches":{"p50":0,"p90":0,"p95":0,"p99":0},"result_frequency_percent_as_number":{"Failed":0,"NoAvailableWorkers":0,"NoTasksClaimed":100,"RanOutOfCapacity":0,"RunningAtCapacity":0,"PoolFilled":0}},"drift":{"p50":null,"p90":null,"p95":null,"p99":null},"drift_by_type":{},"load":{"p50":0,"p90":0,"p95":0,"p99":0},"execution":{"duration":{},"duration_by_persistence":{"recurring":{"p50":null,"p90":null,"p95":null,"p99":null},"non_recurring":{"p50":null,"p90":null,"p95":null,"p99":null},"ephemeral":{"p50":null,"p90":null,"p95":null,"p99":null}},"persistence":{"recurring":0,"non_recurring":0,"ephemeral":0},"result_frequency_percent_as_number":{}}},"status":"OK"},"workload":{"timestamp":"2021-06-15T18:46:24.663Z","value":{"count":0,"task_types":{},"non_recurring":0,"owner_ids":0,"schedule":[],"overdue":0,"overdue_non_recurring":0,"estimated_schedule_density":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"capacity_requirments":{"per_minute":0,"per_hour":0,"per_day":0}},"status":"OK"},"capacity_estimation":{"status":"OK","timestamp":"2021-06-15T18:46:33.787Z","value":{"observed":{"observed_kibana_instances":1,"max_throughput_per_minute_per_kibana":200,"max_throughput_per_minute":200,"minutes_to_drain_overdue":null,"avg_recurring_required_throughput_per_minute":0,"avg_recurring_required_throughput_per_minute_per_kibana":0,"avg_required_throughput_per_minute":0,"avg_required_throughput_per_minute_per_kibana":0},"proposed":{"provisioned_kibana":0,"min_required_kibana":0,"avg_recurring_required_throughput_per_minute_per_kibana":null,"avg_required_throughput_per_minute_per_kibana":null}}}}}

Contributor Author

@mikecote I'm not sure. Perhaps we don't consider capacity estimation as a factor for the logging here? I'm assuming capacity estimation is more of a planning reference rather than "I need to know these metrics when something isn't working right"?

Member

On a fresh ES/Kibana launch, I saw this on startup:

server    log   [15:27:30.000] [error][plugins][taskManager] Latest Monitored Stats (error status): \
{
  "id": "5b2de169-2785-441b-ae8c-186a1936b17d",
  "timestamp": "2021-06-15T19:27:29.995Z",
  "status": "error",
  "last_update": "2021-06-15T19:27:29.994Z",
  "stats": {}
}

After adding an alert with actions and getting everything firing, then killing Kibana and restarting, I saw the exact same message. Nothing like Mike's output above, though.

Contributor Author

The second log I saw was "error status" while all the sub-elements were "OK" 🤔

My guess here is that the stats are stale, which triggers a warning/error state.
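
For illustration only, a guard of the kind being discussed might look like the following; the function name and parameters are hypothetical, not the PR's actual code.

// Hypothetical guard: skip escalated logging when the stats are empty
// (e.g. right after startup) or older than the monitoring refresh rate.
function shouldEscalateLogging(
  health: { last_update: string; stats: Record<string, unknown> },
  refreshRateMs: number
): boolean {
  const isEmpty = Object.keys(health.stats).length === 0;
  const ageMs = Date.now() - Date.parse(health.last_update);
  return !isEmpty && ageMs <= refreshRateMs;
}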

Contributor

Perhaps we don't consider capacity estimation as a factor for the logging here? I'm assuming capacity estimation is more of a planning reference rather than "I need to know these metrics when something isn't working right"?

You're totally right, and like this PR states, it should only log when runtime, configuration, and workload are not "ok". But it seems to still log when everything is "ok" except capacity estimation. It may be because this code here looks at the overall status, when it should look within runtime, configuration, and workload individually?

Contributor Author

I agree, makes sense. I think, in a future PR, we should consider changing this logic and this logic so they are represented in the overall status for the workload and runtime buckets respectively.

If I change this code to just look at the overall status for each bucket, then we'll need to move the freshness checks around to ensure they are properly accounted for when determining the overall status.
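
As a small illustrative sketch of the distinction being discussed (hypothetical names, not the PR's code), the log level could be derived from the individual configuration/runtime/workload statuses, so that a capacity_estimation warning alone never escalates the level:

type Status = 'OK' | 'warn' | 'error';
type BucketKey = 'configuration' | 'runtime' | 'workload' | 'capacity_estimation';

// Only these buckets influence the log level; capacity_estimation is treated
// as planning guidance and deliberately excluded.
const LOGGED_BUCKETS: BucketKey[] = ['configuration', 'runtime', 'workload'];

function logLevelFor(stats: Partial<Record<BucketKey, { status: Status }>>): 'debug' | 'warn' | 'error' {
  const statuses = LOGGED_BUCKETS.map((key) => stats[key]?.status);
  if (statuses.includes('error')) return 'error';
  if (statuses.includes('warn')) return 'warn';
  return 'debug';
}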

Contributor Author

@mikecote I just pushed up some changes around this. LMK

Contributor

Thanks, I ran it locally and the startup error logs no longer showed. 👍

Member

@pmuellr pmuellr left a comment

code LGTM, but made some comments

  • do we need more doc?
  • need cloud and docker whitelist for this (I think)
  • some copypasta in the test descriptions
  • wondering about the word "drift" in the message - understandable?
  • spurious message printed at startup

@@ -18,6 +18,7 @@ export const DEFAULT_VERSION_CONFLICT_THRESHOLD = 80;
// Refresh aggregated monitored stats at a default rate of once a minute
export const DEFAULT_MONITORING_REFRESH_RATE = 60 * 1000;
export const DEFAULT_MONITORING_STATS_RUNNING_AVERGAE_WINDOW = 50;
export const DEFAULT_MONITORING_STATS_WARN_DRIFT_IN_SECONDS = 60;
Member

I don't have a great sense for a good value here. Presumably 60 seconds means it missed 20 (potential) poll requests at the default 3s poll interval, which seems like plenty to warrant a warning. At the same time, it feels kinda short, like this could happen every now and again. I think 60 seconds is workable if it's a warning and not an error: it isn't super scary if you see it every now and again, and it's obvious that something's wrong if you see it a lot.

Contributor Author

Neither do I. I feel we should attempt to be less conservative here, as the point of this change is to ensure these logs are available when there is an issue with TM/Alerting. As long as the messaging is clear on how to avoid the logs, we should be all set.

@@ -109,6 +110,9 @@ export const configSchema = schema.object(
defaultValue: {},
}),
}),
monitored_stats_warn_drift_in_seconds: schema.number({
Member

Do we want to doc this, here? https://www.elastic.co/guide/en/kibana/current/task-manager-settings-kb.html

I don't think we doc all the config options, but maybe this is one that we do?

Also implies that we need to add this to the cloud whitelist, and to the docker whitelist.

Also wondering if we want to add anything to the health monitoring doc - https://www.elastic.co/guide/en/kibana/current/task-manager-health-monitoring.html - specifically, I'm wondering whether users will be able to search the docs for the message this produces, to try to understand it. I feel like we need to define "drift", because we don't seem to.

Contributor Author

I added it there and also renamed it to avoid needing to define a new term for users! I'm not sure how to add things to the docker/cloud whitelist - is that a file in the Kibana repo?

Member

the docker whitelist is in this repo, and is listed in the PR template (at least when I create new PRs) - https://github.com/elastic/kibana/blob/master/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker

the cloud whitelist is in a private repo; I'll DM you the info ...

@chrisronline chrisronline requested a review from a team as a code owner June 15, 2021 20:42
Member

@jbudz jbudz left a comment

Nit on the kibana-docker bin, otherwise LGTM

Contributor

@mikecote mikecote left a comment

LGTM!

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]: ✅ unchanged

@chrisronline chrisronline merged commit ab2a80f into elastic:master Jun 16, 2021
@chrisronline chrisronline deleted the alerting/tm_health_api_logging branch June 16, 2021 19:55
chrisronline added a commit that referenced this pull request Jun 16, 2021
…102402)

* Log at different levels based on the state

* Fix types and add tests

* Remove unnecessary code

* Add more descriptive message

* Partially fix failing tests

* Move into separate function

* Get rid of customStatus in favor of moving the logging logic to a separate, mockable function

* Remove debug logging

* Do not log as an error if the stats are empty

* PR feedback

* Add docker whitelist

* alpha order

* English is hard

* Removing extra newline

* PR feedback around ignoring capacity estimation

* Move json utils
jloleysens added a commit to jloleysens/kibana that referenced this pull request Jun 17, 2021
…egrations-to-global-search

* 'master' of github.com:elastic/kibana: (46 commits)
  [Lens] Add some more documentation for dynamic coloring (elastic#101369)
  hide not searchable results when no term (elastic#102401)
  [Lens] Fix Formula functional test with multiple suggestions (elastic#102378)
  Fix trusted apps modified by field displayed as a date field (elastic#102377)
  [Lens] Docs for time shift (elastic#102048)
  update readme of logs-metrics-ui (elastic#101968)
  Refactor observability plugin breadcrumbs (elastic#102290)
  [Index Patterns] Move rollup config to index pattern management v2 (elastic#102285)
  [Security Solution][Endpoint] Isolate Action should only be available to Platinum+ licenses (elastic#102374)
  [build] Updates Ironbank templates (elastic#102407)
  Update security best practices document (elastic#100814)
  [Enterprise Search] Set up initial KibanaPageTemplate  (elastic#102170)
  [Reporting/Docs] Add section to troubleshooting guide to explain the StatusCodeError logs (elastic#102278)
  [DOCS] Updating Elastic Security Overview topic  (elastic#101922)
  [Uptime] refactor Synthetics Integration package UI (elastic#102080)
  [Task Manager] Log at different levels based on the state (elastic#101751)
  [APM] Fixing time comparison types (elastic#101423)
  [RAC] Update alert documents in lifecycle rule type helper (elastic#101598)
  [ML] Functional tests - fix and re-activate alerting flyout test (elastic#102368)
  [Reporting] remove unused reference to path.data config (elastic#102267)
  ...

# Conflicts:
#	x-pack/plugins/fleet/kibana.json
Labels
Feature:Task Manager, release_note:enhancement, Team:ResponseOps, v7.14.0, v8.0.0
Development

Successfully merging this pull request may close these issues.

[Alerting] [o11y] Gain insight into task manager health apis when a problem occurs
7 participants