More alerting services telemetry #60315

mikecote · 2020-03-16T21:13:05Z

What is this issue about?

We have identified that there are some gaps in our telemetry and we would like to close those gaps.

The proposed list below is partial and most likely missing some important metrics, so please use them as inspiration rather than clear requirements.

To do:

review the proposed telemetry and identify missing metrics
break down the work needed into smaller issues if needed (in case some additional telemetry can be postponed to 8.x)
add actual telemetry for anything we wish to target for 7.16

Prioritisation:
There is probably a lot of telemetry we could add, so to keep this issue focused, here are the areas to address in order of priority:

Telemetry needed as a baseline for our OKRs in 8.x. If we plan on improving a certain area in early 8.x, we need a baseline to compare against as a measurement for improvement. Please consult the team OKRs (or ask @gmmorris / @mikecote if unsure).
- average task execution drift (ideally we'd be able to break this down by task type, or at least have clusters for "rules", "actions" etc.)
- count how often task duration exceeds the timeout (and by how much if possible)
- count how often task duration exceeds the interval (and by how much if possible), e.g. a task set to repeat every 1s that takes 10s to run
- is the cluster underprovisioned? Use TM Health stats for that, but we don't have a clear definition of what "underprovisioned" means, so best to make a proposal to the team on this one
Telemetry needed to answer questions we have about system resilience and performance
Telemetry needed to answer questions we have about the usage of alerting / connectors / task manager
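
To make the baseline metrics above concrete, here is a minimal sketch of how drift and overrun counts could be aggregated per task type. The `TaskRun` shape, its field names, and `summarizeRuns` are hypothetical and are not Task Manager's actual data model; drift is assumed to mean "actual start minus scheduled start".

```typescript
// Hypothetical record of one completed task run; all field names are assumptions.
interface TaskRun {
  taskType: string;
  scheduledAt: number; // epoch ms when the task was due to run
  startedAt: number; // epoch ms when the task actually started
  durationMs: number;
  timeoutMs: number;
  intervalMs?: number; // set only for recurring tasks
}

interface TaskTypeStats {
  runs: number;
  avgDriftMs: number;
  timeoutExceeded: number;
  intervalExceeded: number;
}

function summarizeRuns(runs: TaskRun[]): Record<string, TaskTypeStats> {
  const driftSums: Record<string, number> = {};
  const stats: Record<string, TaskTypeStats> = {};
  for (const run of runs) {
    if (!stats[run.taskType]) {
      stats[run.taskType] = { runs: 0, avgDriftMs: 0, timeoutExceeded: 0, intervalExceeded: 0 };
      driftSums[run.taskType] = 0;
    }
    const s = stats[run.taskType];
    s.runs += 1;
    // Drift: how late the task started relative to its schedule (clamped at zero).
    driftSums[run.taskType] += Math.max(0, run.startedAt - run.scheduledAt);
    s.avgDriftMs = driftSums[run.taskType] / s.runs;
    // Count runs that took longer than the timeout.
    if (run.durationMs > run.timeoutMs) s.timeoutExceeded += 1;
    // Count recurring runs that took longer than their own interval.
    if (run.intervalMs !== undefined && run.durationMs > run.intervalMs) s.intervalExceeded += 1;
  }
  return stats;
}
```

Keeping the counts keyed by task type would allow the "rules" versus "actions" breakdown mentioned above, assuming the task type can be mapped to one of those groups.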

Proposed missing telemetry

Below is a list of missing telemetry that we have identified.
We might not want to add them all, and there might be critical telemetry missing in this list - please use as inspiration for detailed research, rather than hard requirements.

For potential guardrails

Required by 8.0	GitHub issue	Plugin	Telemetry questions	Why
Yes	#122535	Alerting	What is the maximum number of alerts a single rule execution created? What is the maximum number of actions a rule has scheduled during a single execution?	Knowing these values will indicate what scale of alerts and actions we are dealing with and decide if we need to guardrail against such large values.
Yes	#122535	Alerting	What is the average rule execution time by rule type?	Help understand which rule types need to be optimized, which default execution timeout values would be appropriate, etc.
Yes		Alerting	What is the average connector execution time by connector type?	Help understand which connector types need to be optimized, which default execution timeout values would be appropriate, etc.
Yes	#113465	Alerting	How often does task duration exceed the timeout?	Helps us know whether specific rule types tend to run too long and require additional guardrails
Yes	#113465	Alerting	How often does task duration exceed the interval?	Helps us know whether specific rule types tend to run longer than the customer expects and require additional guardrails
Yes		Alerting	How many clusters are underprovisioned according to TM health stats?	Helps us know whether customers need more guidance on scaling their Kibana deployments
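
As a rough illustration of the guardrail questions in the table above, the sketch below derives the per-execution maximums and the average execution time by rule type from a list of execution records. The `RuleExecution` shape and the `guardrailStats` helper are assumptions made for this example, not the alerting framework's actual types.

```typescript
// Hypothetical summary of one rule execution; field names are assumptions.
interface RuleExecution {
  ruleTypeId: string;
  durationMs: number;
  alertsCreated: number;
  actionsScheduled: number;
}

interface GuardrailStats {
  maxAlertsPerExecution: number;
  maxActionsPerExecution: number;
  avgDurationMsByRuleType: Record<string, number>;
}

function guardrailStats(executions: RuleExecution[]): GuardrailStats {
  let maxAlerts = 0;
  let maxActions = 0;
  const sums: Record<string, { total: number; count: number }> = {};
  for (const e of executions) {
    // Track the largest number of alerts and actions seen in any single execution.
    maxAlerts = Math.max(maxAlerts, e.alertsCreated);
    maxActions = Math.max(maxActions, e.actionsScheduled);
    const sum = sums[e.ruleTypeId] || (sums[e.ruleTypeId] = { total: 0, count: 0 });
    sum.total += e.durationMs;
    sum.count += 1;
  }
  const avgDurationMsByRuleType: Record<string, number> = {};
  for (const id of Object.keys(sums)) {
    avgDurationMsByRuleType[id] = sums[id].total / sums[id].count;
  }
  return { maxAlertsPerExecution: maxAlerts, maxActionsPerExecution: maxActions, avgDurationMsByRuleType };
}
```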

Stability

Required by 8.0	Plugin	Telemetry questions	Why
No	Alerting	What is the total count of rule executions? How many times did rules encounter execution failures for `read`? `decrypt`? `unknown`? `license`? What is the count by rule type of execution failures with an error reason of `execute`?	Helps us know whether there are areas in the framework or rule types that need to improve to reduce certain types of failures. Knowing the total executions and a count of certain failures helps calculate ratios.
Yes	Actions	What is the total count of action executions? What is the count of action execution failures by connector type?	Helps us know whether certain connector types need to improve stability. Knowing the total executions and a count of failures helps calculate ratios.
Yes	Task Manager	What is the average drift by cluster?	Helps us know what average drift we are dealing with today and how much we need to improve.
Yes	Task Manager	What is the average task execution drift by task type?	Helps us know what average drift we are dealing with today and how much we need to improve.
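
The ratios mentioned in the "Why" column could be computed along these lines. This is only a sketch: the `ExecutionCounts` shape and the `failureRatios` helper are hypothetical, and the failure reasons are the ones named in the table above.

```typescript
// Failure reasons as listed in the table above.
type FailureReason = 'read' | 'decrypt' | 'license' | 'unknown' | 'execute';

// Hypothetical aggregated counts for a reporting period.
interface ExecutionCounts {
  total: number;
  failuresByReason: Partial<Record<FailureReason, number>>;
}

function failureRatios(counts: ExecutionCounts): Record<FailureReason, number> {
  const reasons: FailureReason[] = ['read', 'decrypt', 'license', 'unknown', 'execute'];
  const ratios = {} as Record<FailureReason, number>;
  for (const reason of reasons) {
    const failures = counts.failuresByReason[reason] ?? 0;
    // Guard against division by zero when nothing has executed yet.
    ratios[reason] = counts.total > 0 ? failures / counts.total : 0;
  }
  return ratios;
}
```

Reporting raw totals and failure counts, and leaving the division to whoever queries the telemetry, would keep the collected data additive across clusters.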

7.x changes

Required by 7.16	GitHub issue	Description	Telemetry questions	Why
Yes	✅ #93466	Email connector	What is the most popular serverType? What is the least popular serverType? What is the count by serverType per cluster?	It is currently impossible to assess the impact of Microsoft deprecating basic auth for Exchange. With this data point, we could have assessed the impact and prioritized accordingly. Having this metric will be important the next time we need to assess impact by serverType.
Yes	✅ #100067	Rules per space	What percentage of clusters are using rules in different spaces? Across how many spaces?	Knowing how popular it is to use rules in different spaces can justify making rule saved-objects sharable.
Yes	✅ #100067	Connectors per space	What percentage of clusters are using connectors in different spaces? Across how many spaces?	Knowing how popular it is to use connectors in different spaces can justify making connector saved-objects sharable.
Yes	✅ #55340	Failed tasks	How many clusters have tasks with `status:failed`? How many per cluster?	Knowing these metrics will help us prioritize how important it is to move away from leaving failed tasks.

General usage of alerting / connectors / task manager

Required by 8.0	Plugin	Description
No	actions	Count of 3rd party action types
No	actions	Actions telemetry should be extended with the total executions count and the executions count per action type. The current logic is supposed to use event log data. This is blocked until the event log supports writing to an index.
No	actions	Connectors telemetry should be extended to report where Connectors are created (in the Connectors UI or in the Alert flyout)
No	actions	Trial license and usage of connectors. As per #60315 (comment): "I am interested in knowing what connectors (by count) are being used per Tracking Alert."
No	actions	Configuration changes; who's changing it, what values are being used, etc
No	alerting	Implement telemetry for Min / max / avg - alert instances
No	alerting	Count of 3rd party rule types
No	alerting	Alerting telemetry should be extended with the total executions count and the executions count per alert type. The current logic is supposed to use event log data. This is blocked until the event log supports writing to an index.
No	alerting	Trial license and usage of rules. As per #60315 (comment): "I am interested in knowing when customers upgrade their license to use the Tracking Alert."
No	alerting	Configuration changes; who's changing it, what values are being used, etc
No	alerting	Telemetry for long running rules; we should keep a count of the rules that are exceeding the task manager timeout, and what rule types they are.
No	alerting	What is the breakdown of runWhen? How many rules are using runWhen of X (ex: status change, active, custom interval)?
No	event_log	Telemetry for Event log doc count
No	task_manager	Min / max / avg - tasks per second
No	task_manager	Total task count?
No	task_manager	Total count active (in use) - should be clarified as a requirement.
No	task_manager	Total executions?
No	task_manager	Recurring tasks whose duration exceeds their interval (e.g. a rule set to run every 1 minute that takes 5 minutes to run). Reference #111259 for additional related metrics to track.
No	task_manager	Configuration changes; who's changing it, what values are being used, etc


elasticmachine · 2020-03-16T21:13:06Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote · 2020-03-18T19:17:35Z

Could also have some telemetry about how many users upgraded or started a trial because of alerting gold+ features

alexfrancoeur · 2020-03-19T17:37:58Z

Yes, this would be a great metric to track. +1 ^^

mikecote · 2020-06-10T16:54:19Z

@arisonl can you provide a filtered list of the above that we would like to start tracking?

arisonl · 2020-07-02T12:29:26Z

A couple of thoughts on what we will want to measure from a product perspective (non-exhaustive and not final):

Adoption:

Number of alerts and number of connectors
- Per type
- Per license
- Per platform
Trend of % of clusters using the alerting functionality

Retention:

Churn: how many clusters stop using alerts

Usage:

Number of views (management, alert details and log pages)
Number of pre-configured connectors and alerts
How much time do users spend editing existing alerts?

Higher level questions:

Are customers upgrading due to alerting?
- I think that it is hard to measure causality, but we could measure indicators like the first (second and third) commercial functionalities they adopt after the upgrade
Should alerting be placed on a higher level in navigation? Or should solutions expose management views in their context?
Usage patterns of management views:
- Time spent in management views, time to get from solutions to management views etc.

There are a number of lower level items in the description of this issue and I would like to understand them better. If their purpose is to optimise the lower level facilities of the framework, I think that prioritisation is up to engineering.

alexfrancoeur · 2020-08-03T17:47:43Z

Hey all, quick bump on this thread. I know the team is heads down on GA but do we have an idea of when telemetry on connector type might land?

mikecote · 2020-08-04T12:52:27Z

@alexfrancoeur what depth of telemetry are you thinking about for connector types? I took a quick look and we should already be tracking the following today:

count_total
count_active_total
count_by_type
count_active_by_type
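
For readers unfamiliar with these fields, the sketch below shows one way the four counts could be derived from a list of connectors. Only the four field names come from the list above; the `Connector` shape, the `inUse` flag (assumed here to mean "referenced by at least one rule"), and `summarizeConnectors` are hypothetical.

```typescript
// Hypothetical connector record; `inUse` is an assumed meaning of "active".
interface Connector {
  actionTypeId: string;
  inUse: boolean;
}

interface ConnectorUsage {
  count_total: number;
  count_active_total: number;
  count_by_type: Record<string, number>;
  count_active_by_type: Record<string, number>;
}

function summarizeConnectors(connectors: Connector[]): ConnectorUsage {
  const usage: ConnectorUsage = {
    count_total: 0,
    count_active_total: 0,
    count_by_type: {},
    count_active_by_type: {},
  };
  for (const c of connectors) {
    usage.count_total += 1;
    usage.count_by_type[c.actionTypeId] = (usage.count_by_type[c.actionTypeId] || 0) + 1;
    if (c.inUse) {
      usage.count_active_total += 1;
      usage.count_active_by_type[c.actionTypeId] =
        (usage.count_active_by_type[c.actionTypeId] || 0) + 1;
    }
  }
  return usage;
}
```

Note that the two `*_by_type` objects are keyed by connector type, so their keys vary from cluster to cluster.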

alexfrancoeur · 2020-08-04T13:47:47Z

Hey Mike, thanks for clarifying. I think this level of granularity will work for now. I see the fields in our doc now, but they aren't mapped at the moment. I'll create an issue to do so. Appreciate the quick follow up!

mikecote · 2020-08-04T13:50:38Z

@alexfrancoeur ah that may be why, the telemetry team may be having issues mapping these because they contain dynamic keys. If they don't have a workaround for this and need changes on the alerting side, let me know!

cc @Bamieh

kmartastic · 2020-11-06T19:40:55Z

Hey @mikecote -- Just wanted to voice the need for more telemetry related to "trial license" and connectors.

(As you know ;)) We'll be enabling the first location-based alert in 7.11* and it will be Gold+ license. I am interested in knowing when customers upgrade their license to use the Tracking Alert. I am interested in knowing what connectors (by count) are being used per Tracking Alert.

*Tracking Alert is included in 7.10 as experimental; and requires turning on a feature flag to enable.

mikecote · 2020-11-06T19:47:56Z

Thanks @kmartastic, I've added an item to the issue description to capture your scenario. We will be reviewing these requirements next week 🙏

Bamieh · 2021-01-29T10:12:39Z

We do not encourage dynamically mapped fields since they are not supported on our cluster. Dynamic fields come with a lot of drawbacks. Please reach out to the telemetry team to discuss possible solutions before working on the requirements, to avoid sending more dynamic fields.
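
One possible way to avoid dynamic keys, sketched here purely as an illustration and not as the telemetry team's recommended solution, is to reshape a per-type count object into an array of entries with fixed field names. The `TypeCount` shape and `toStaticEntries` are hypothetical.

```typescript
// Fixed-shape entry: the field names never change, only the values do.
interface TypeCount {
  type: string;
  count: number;
}

// Convert an object keyed by type (dynamic keys) into a sorted array of entries.
function toStaticEntries(countByType: Record<string, number>): TypeCount[] {
  return Object.keys(countByType)
    .sort()
    .map((type) => ({ type, count: countByType[type] }));
}

// Example values only.
const entries = toStaticEntries({ '.slack': 2, '.email': 5 });
```

With this shape a mapping only needs to describe `type` and `count`, regardless of how many connector or rule types exist.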

gmmorris · 2021-09-13T10:17:50Z

I've updated the description to add some clarity for whoever picks this issue up. :)

ymao1 · 2021-09-13T17:12:12Z

Updated the description to include an item for keeping track of rules that exceed their timeout.

mikecote · 2021-12-13T19:18:10Z

@YulNaumenko, thanks for driving this issue through 7.16 / 8.0 ❤️! After today's planning session and given we're past 8.0, we've done as much as we can regarding additional telemetry points, and we can now move this issue back into the backlog for now.

mikecote · 2022-06-24T15:49:55Z

Closing as 8.0 has sufficient telemetry.

mikecote added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels Mar 16, 2020

mikecote added the Meta label Mar 16, 2020

gmmorris mentioned this issue Sep 9, 2021 in [Alerting] Investigate resilience / side effects of excessively long running rules #111259 (Open, 3 tasks)

gmmorris removed the Meta label Sep 13, 2021

gmmorris added the estimate:needs-research label and removed the estimate:medium label Sep 20, 2021

YulNaumenko self-assigned this Sep 23, 2021

gmmorris added the telemetry label Oct 7, 2021

YulNaumenko mentioned this issue Oct 13, 2021 in [Alerting] Telemetry required for 7.16 #114690 (Merged)

YulNaumenko mentioned this issue Oct 18, 2021 in [Alerting] More telemetry for 8.0 based on Event Log data #115318 (Merged)

YulNaumenko closed this as completed in #115318 Nov 2, 2021

YulNaumenko reopened this Nov 2, 2021

mikecote unassigned YulNaumenko Dec 13, 2021

XavierM added this to AppEx: ResponseOps - Execution & Connectors and AppEx: ResponseOps - Rules & Alerts Management Jan 6, 2022

mikecote mentioned this issue Jan 10, 2022 in [Alerting] Telemetry for potential rule execution guardrails #122535 (Closed)

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022

mikecote closed this as completed Jun 24, 2022

mikecote moved this to Done in AppEx: ResponseOps - Rules & Alerts Management Jun 24, 2022

mikecote moved this to Done in AppEx: ResponseOps - Execution & Connectors Jun 24, 2022

mikecote removed this from AppEx: ResponseOps - Execution & Connectors Jun 24, 2022
