More alerting services telemetry #60315

Closed
3 tasks
mikecote opened this issue Mar 16, 2020 · 17 comments · Fixed by #115318
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) telemetry Issues related to the addition of telemetry to a feature

Comments

@mikecote
Contributor

mikecote commented Mar 16, 2020

What is this issue about?

We have identified that there are some gaps in our telemetry and we would like to close those gaps.

The proposed list below is partial and most likely missing some important metrics, so please use them as inspiration rather than clear requirements.

To do:

  • review the proposed telemetry and identify missing metrics
  • break down the work needed into smaller issues if needed (in case some additional telemetry can be postponed to 8.x)
  • add actual telemetry for anything we wish to target for 7.16

Prioritisation:
There is probably a lot of telemetry we could add, so to keep this issue focused, here are the areas to address in order of priority:

  1. Telemetry needed as a baseline for our OKRs in 8.x. If we plan on improving a certain area in early 8.x, we need a baseline to compare against as a measurement for improvement. Please consult the team OKRs (or ask @gmmorris / @mikecote if unsure)
    • average task execution drift (ideally we'd be able to break this down by task type, or at least have clusters for "rules", "actions" etc.)
    • count how often task duration exceeds the timeout (and by how much if possible)
    • count how often task duration exceeds the interval (and by how much if possible), e.g. a task set to repeat every 1s that takes 10s to run
    • is the cluster underprovisioned? Use TM Health stats for that, but we don't have a clear definition of what "underprovisioned" means, so best to make a proposal to the team on this one
  2. Telemetry needed to answer questions we have about system resilience and performance
  3. Telemetry needed to answer questions we have about the usage of alerting / connectors / task manager
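
The baseline metrics under priority 1 could be computed along these lines. This is a minimal sketch, not Task Manager's implementation: the `TaskRun` shape and all of its field names are hypothetical, and drift is assumed to mean the delay between when a task was due to run and when it actually started.

```typescript
// Hypothetical record of one task execution. These field names are
// assumptions for illustration, not Task Manager's real schema.
interface TaskRun {
  taskType: string;
  scheduledAt: number; // epoch ms when the task was due to run
  startedAt: number; // epoch ms when the task actually started
  durationMs: number;
  timeoutMs: number;
  intervalMs?: number; // only set for recurring tasks
}

interface TaskTypeBaseline {
  runs: number;
  avgDriftMs: number;
  timeoutExceededCount: number;
  maxTimeoutOverrunMs: number;
  intervalExceededCount: number;
  maxIntervalOverrunMs: number;
}

function summarizeByTaskType(runs: TaskRun[]): Map<string, TaskTypeBaseline> {
  const driftSums = new Map<string, number>();
  const result = new Map<string, TaskTypeBaseline>();

  for (const run of runs) {
    const summary: TaskTypeBaseline = result.get(run.taskType) ?? {
      runs: 0,
      avgDriftMs: 0,
      timeoutExceededCount: 0,
      maxTimeoutOverrunMs: 0,
      intervalExceededCount: 0,
      maxIntervalOverrunMs: 0,
    };

    // Drift: how late the task started. Clamped at 0 so that an early
    // start cannot cancel out a late one in the average.
    const driftMs = Math.max(0, run.startedAt - run.scheduledAt);
    const driftSum = (driftSums.get(run.taskType) ?? 0) + driftMs;
    driftSums.set(run.taskType, driftSum);

    summary.runs += 1;
    summary.avgDriftMs = driftSum / summary.runs;

    // Count timeout overruns and remember the largest overrun seen.
    if (run.durationMs > run.timeoutMs) {
      summary.timeoutExceededCount += 1;
      summary.maxTimeoutOverrunMs = Math.max(
        summary.maxTimeoutOverrunMs,
        run.durationMs - run.timeoutMs
      );
    }

    // Same for recurring tasks whose duration exceeds their interval.
    if (run.intervalMs !== undefined && run.durationMs > run.intervalMs) {
      summary.intervalExceededCount += 1;
      summary.maxIntervalOverrunMs = Math.max(
        summary.maxIntervalOverrunMs,
        run.durationMs - run.intervalMs
      );
    }

    result.set(run.taskType, summary);
  }
  return result;
}
```

Grouping by `taskType` gives the per-type breakdown; coarser clusters such as "rules" or "actions" could be obtained by mapping each task type to a cluster name before grouping.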

Proposed missing telemetry

Below is a list of missing telemetry that we have identified.
We might not want to add them all, and there might be critical telemetry missing in this list - please use as inspiration for detailed research, rather than hard requirements.

For potential guardrails

| Required by 8.0 | GitHub issue | Plugin | Telemetry questions | Why |
| --- | --- | --- | --- | --- |
| Yes | #122535 | Alerting | What is the maximum number of alerts a single rule execution created? What is the maximum number of actions a rule has scheduled during a single execution? | Knowing these values will indicate what scale of alerts and actions we are dealing with and help decide if we need to guardrail against such large values. |
| Yes | #122535 | Alerting | What is the average rule execution time by rule type? | Helps us understand which rule types need to be optimized, which default execution timeout values would be appropriate, etc. |
| Yes | | Alerting | What is the average connector execution time by connector type? | Helps us understand which connector types need to be optimized, which default execution timeout values would be appropriate, etc. |
| Yes | #113465 | Alerting | How often does task duration exceed the timeout? | Helps us know whether specific rule types tend to run too long and require additional guardrails. |
| Yes | #113465 | Alerting | How often does task duration exceed the interval? | Helps us know whether specific rule types tend to run longer than the customer expects and require additional guardrails. |
| Yes | | Alerting | How many clusters are underprovisioned according to TM health stats? | Helps us know whether customers need more guidance on scaling their Kibana deployments. |
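
As a rough illustration of the first two guardrail questions, the sketch below derives the maximum alerts and scheduled actions seen in a single execution, plus the average duration per rule type. The `RuleExecution` shape and its field names are hypothetical and are not taken from the alerting framework.

```typescript
// Hypothetical record of one rule execution (assumed fields, for illustration).
interface RuleExecution {
  ruleTypeId: string;
  durationMs: number;
  alertCount: number;
  scheduledActionCount: number;
}

interface GuardrailStats {
  maxAlertsPerExecution: number;
  maxScheduledActionsPerExecution: number;
  // An array of fixed-shape entries rather than an object keyed by rule type.
  avgDurationByRuleType: Array<{ ruleTypeId: string; avgDurationMs: number }>;
}

function computeGuardrailStats(executions: RuleExecution[]): GuardrailStats {
  let maxAlerts = 0;
  let maxActions = 0;
  const totals = new Map<string, { totalMs: number; count: number }>();

  for (const execution of executions) {
    maxAlerts = Math.max(maxAlerts, execution.alertCount);
    maxActions = Math.max(maxActions, execution.scheduledActionCount);

    const total = totals.get(execution.ruleTypeId) ?? { totalMs: 0, count: 0 };
    total.totalMs += execution.durationMs;
    total.count += 1;
    totals.set(execution.ruleTypeId, total);
  }

  const avgDurationByRuleType: GuardrailStats['avgDurationByRuleType'] = [];
  totals.forEach((total, ruleTypeId) => {
    avgDurationByRuleType.push({ ruleTypeId, avgDurationMs: total.totalMs / total.count });
  });

  return {
    maxAlertsPerExecution: maxAlerts,
    maxScheduledActionsPerExecution: maxActions,
    avgDurationByRuleType,
  };
}
```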

Stability

| Required by 8.0 | Plugin | Telemetry questions | Why |
| --- | --- | --- | --- |
| No | Alerting | What is the total count of rule executions? How many times did rules encounter execution failures for read? decrypt? unknown? license? What is the count by rule type of execution failures with an error reason of execute? | Help know if there are areas in the framework or rule type that need to improve and reduce certain types of failures. Knowing the total executions and a count of certain failures help calculate ratios. |
| Yes | Actions | What is the total count of action executions? What is the count of action execution failures by connector type? | Help know if there are certain connector types that need to improve stability. Knowing the total executions and a count of failures helps calculate ratios. |
| Yes | Task Manager | What is the average drift by cluster? | Helps to know what average drift we are dealing with today and know how much we need to improve. |
| Yes | Task Manager | What is the average task execution drift by task type? | Helps to know what average drift we are dealing with today and know how much we need to improve. |
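
The ratios mentioned in the "Why" column follow directly from a total and a failure count per type. A minimal sketch for the Actions row, assuming a hypothetical `ActionExecution` record (the real counts would presumably come from the event log, which is not modelled here):

```typescript
// Hypothetical shape of one action execution record (an assumption).
interface ActionExecution {
  connectorTypeId: string;
  outcome: 'success' | 'failure';
}

interface ConnectorTypeFailureStats {
  connectorTypeId: string;
  executions: number;
  failures: number;
  failureRatio: number; // failures / executions, between 0 and 1
}

function failureStatsByConnectorType(
  executions: ActionExecution[]
): ConnectorTypeFailureStats[] {
  const byType = new Map<string, { executions: number; failures: number }>();

  for (const execution of executions) {
    const counts = byType.get(execution.connectorTypeId) ?? { executions: 0, failures: 0 };
    counts.executions += 1;
    if (execution.outcome === 'failure') {
      counts.failures += 1;
    }
    byType.set(execution.connectorTypeId, counts);
  }

  const stats: ConnectorTypeFailureStats[] = [];
  byType.forEach((counts, connectorTypeId) => {
    stats.push({
      connectorTypeId,
      executions: counts.executions,
      failures: counts.failures,
      // Every entry has at least one execution, so this never divides by zero.
      failureRatio: counts.failures / counts.executions,
    });
  });
  return stats;
}
```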

7.x changes

| Required by 7.16 | GitHub issue | Description | Telemetry questions | Why |
| --- | --- | --- | --- | --- |
| Yes | #93466 | Email connector | What is the most popular serverType? What is the least popular serverType? What is the count by serverType per cluster? | It is currently impossible to assess the impact of Microsoft deprecating basic auth for Exchange. With this data point, we could have assessed the impact and prioritized accordingly. Having this metric will be important for the next time we need to assess impact by serverType. |
| Yes | #100067 | Rules per space | What percentage of clusters are using rules in different spaces? Across how many spaces? | Knowing how popular it is to use rules in different spaces can justify making rule saved-objects sharable. |
| Yes | #100067 | Connectors per space | What percentage of clusters are using connectors in different spaces? Across how many spaces? | Knowing how popular it is to use connectors in different spaces can justify making connector saved-objects sharable. |
| Yes | #55340 | Failed tasks | How many clusters have tasks with status:failed? How many per cluster? | Knowing these metrics will help us prioritize how important it is to move away from leaving failed tasks. |
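
For the email connector row, the per-cluster count by `serverType` and the most and least popular values could be derived roughly as follows. The `EmailConnector` shape is a hypothetical simplification; only the `serverType` name comes from the table above.

```typescript
// Hypothetical, simplified view of an email connector (an assumption).
interface EmailConnector {
  serverType: string;
}

interface ServerTypeUsage {
  countByServerType: Array<{ serverType: string; count: number }>;
  mostPopular?: string; // undefined when the cluster has no email connectors
  leastPopular?: string;
}

function summarizeServerTypes(connectors: EmailConnector[]): ServerTypeUsage {
  const counts = new Map<string, number>();
  for (const connector of connectors) {
    counts.set(connector.serverType, (counts.get(connector.serverType) ?? 0) + 1);
  }

  const countByServerType: ServerTypeUsage['countByServerType'] = [];
  counts.forEach((count, serverType) => {
    countByServerType.push({ serverType, count });
  });

  // Sort by count, highest first; ties are broken alphabetically so the
  // output is deterministic.
  countByServerType.sort(
    (a, b) => b.count - a.count || a.serverType.localeCompare(b.serverType)
  );

  return {
    countByServerType,
    mostPopular: countByServerType[0]?.serverType,
    leastPopular: countByServerType[countByServerType.length - 1]?.serverType,
  };
}
```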

General usage of alerting / connectors / task manager

| Required by 8.0 | Plugin | Description |
| --- | --- | --- |
| No | actions | Count of 3rd party action types |
| No | actions | Actions telemetry should be extended with data for total executions count and executions count per action type. Current logic is supposed to use the event log data. Issue is blocked until event log write to index is supported. |
| No | actions | Connectors telemetry should be extended to report where Connectors are created (in the Connectors UI or in the Alert flyout) |
| No | actions | Trial license and usage of connectors. As per #60315 (comment): "I am interested in knowing what connectors (by count) are being used per Tracking Alert." |
| No | actions | Configuration changes; who's changing it, what values are being used, etc |
| No | alerting | Implement telemetry for Min / max / avg - alert instances |
| No | alerting | Count of 3rd party rule types |
| No | alerting | Alerting telemetry should be extended with data for total executions count and executions count per alert type. Current logic is supposed to use the event log data. Issue is blocked until event log write to index is supported. |
| No | alerting | Trial license and usage of rules. As per #60315 (comment): "I am interested in knowing when customers upgrade their license to use the Tracking Alert." |
| No | alerting | Configuration changes; who's changing it, what values are being used, etc |
| No | alerting | Telemetry for long running rules; we should keep a count of the rules that are exceeding the task manager timeout, and what rule types they are. |
| No | alerting | What is the breakdown of runWhen? How many rules are using runWhen of X (ex: status change, active, custom interval)? |
| No | event_log | Telemetry for Event log doc count |
| No | task_manager | Min / max / avg - tasks per second |
| No | task_manager | Tasks Total count ? |
| No | task_manager | Total count active (in use) - should be clarified as a requirement. |
| No | task_manager | Total executions ? |
| No | task_manager | Recurring tasks whose duration exceeds their interval (rule set to run every 1 minute, yet takes 5 minutes to run). Reference #111259 for additional related metrics to track |
| No | task_manager | Configuration changes; who's changing it, what values are being used, etc |
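
As one possible reading of the "Min / max / avg - tasks per second" item, the sketch below buckets task completion timestamps into one-second windows. Both the input (a plain list of completion times) and the choice to count idle seconds between the first and last completion as zero are assumptions, not an agreed definition.

```typescript
interface ThroughputStats {
  min: number;
  max: number;
  avg: number;
}

// completedAtMs: epoch-millisecond timestamps of completed task runs
// (a hypothetical input; the source of this data is not specified).
function tasksPerSecond(completedAtMs: number[]): ThroughputStats | undefined {
  if (completedAtMs.length === 0) {
    return undefined;
  }

  const countsBySecond = new Map<number, number>();
  let firstSecond = Infinity;
  let lastSecond = -Infinity;

  for (const timestamp of completedAtMs) {
    const second = Math.floor(timestamp / 1000);
    countsBySecond.set(second, (countsBySecond.get(second) ?? 0) + 1);
    firstSecond = Math.min(firstSecond, second);
    lastSecond = Math.max(lastSecond, second);
  }

  // Walk every second in the observed window so idle seconds count as 0.
  let min = Infinity;
  let max = 0;
  for (let second = firstSecond; second <= lastSecond; second++) {
    const count = countsBySecond.get(second) ?? 0;
    min = Math.min(min, count);
    max = Math.max(max, count);
  }

  const windowSeconds = lastSecond - firstSecond + 1;
  return { min, max, avg: completedAtMs.length / windowSeconds };
}
```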
@mikecote mikecote added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 16, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

Could also have some telemetry about how many users upgraded or started a trial because of alerting gold+ features

@alexfrancoeur

Yes, this would be a great metric to track. +1 ^^

@mikecote
Contributor Author

@arisonl can you provide a filtered list of the above that we would like to start tracking?

@arisonl
Contributor

arisonl commented Jul 2, 2020

A couple of thoughts on what we will want to measure from a product perspective (non-exhaustive and not final):

Adoption:

  • Number of alerts and number of connectors
    • Per type
    • Per license
    • Per platform
  • Trend of % of clusters using the alerting functionality

Retention:

  • Churn: how many clusters stop using alerts

Usage:

  • Number of views (management, alert details and log pages)
  • Number of pre-configured connectors and alerts
  • How much time do users spend editing existing alerts?

Higher level questions:

  • Are customers upgrading due to alerting?
    • I think that it is hard to measure causality, but we could measure indicators such as the first (second and third) commercial functionalities they adopt after the upgrade
  • Should alerting be placed on a higher level in navigation? Or should solutions expose management views in their context?
    Usage patterns of management views:
    • Time spent in management views, time to get from solutions to management views etc.

There are a number of lower level items in the description of this issue and I would like to understand them better. If their purpose is to optimise the lower level facilities of the framework, I think that prioritisation is up to engineering.

@alexfrancoeur

Hey all, quick bump on this thread. I know the team is heads down on GA but do we have an idea of when telemetry on connector type might land?

@mikecote
Contributor Author

mikecote commented Aug 4, 2020

@alexfrancoeur what depth of telemetry are you thinking about for connector types? I took a quick look and we should already be tracking the following today:

  • count_total
  • count_active_total
  • count_by_type
  • count_active_by_type

@alexfrancoeur

Hey Mike, thanks for clarifying. I think this level of granularity will work for now. I see the fields in our doc now, but they aren't mapped at the moment. I'll create an issue to do so. Appreciate the quick follow up!

@mikecote
Contributor Author

mikecote commented Aug 4, 2020

@alexfrancoeur ah that may be why, the telemetry team may be having issues mapping these because they contain dynamic keys. If they don't have a work around for this and need changes on the alerting side, let me know!

cc @Bamieh

@kmartastic
Contributor

Hey @mikecote -- Just wanted to voice the need for more telemetry related to "trial license" and connectors.

(As you know ;)) We'll be enabling the first location-based alert in 7.11* and it will be Gold+ license. I am interested in knowing when customers upgrade their license to use the Tracking Alert. I am interested in knowing what connectors (by count) are being used per Tracking Alert.

*Tracking Alert is included in 7.10 as experimental; and requires turning on a feature flag to enable.

@mikecote
Contributor Author

mikecote commented Nov 6, 2020

Thanks @kmartastic, I've added an item to the issue description to capture your scenario. We will be reviewing these requirements next week 🙏

@Bamieh
Member

Bamieh commented Jan 29, 2021

We do not encourage dynamically mapped fields since they are not supported on our cluster. Dynamic fields come with a lot of drawbacks. Please reach out to the telemetry team to discuss possible solutions around this before working on the requirements, to avoid sending more dynamic fields.
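
To illustrate the concern: a field like `count_by_type` is an object whose keys are connector type IDs, so every new type introduces a new field name. One possible alternative, shown here purely as an assumption and not as a solution agreed with the telemetry team, is to report an array of entries with a fixed shape. The `TypeCount` interface and `toFixedShape` helper are hypothetical names.

```typescript
// Fixed-shape entry: the field names never change, only the values do.
interface TypeCount {
  type: string;
  count: number;
}

// Converts a dynamic-key object such as { '.email': 5, '.slack': 2 } into
// an array of fixed-shape entries, sorted by type for stable output.
function toFixedShape(countByType: Record<string, number>): TypeCount[] {
  return Object.keys(countByType)
    .sort()
    .map((type) => ({ type, count: countByType[type] }));
}
```

With this shape, only the `type` and `count` fields would need a mapping, regardless of how many connector types exist.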

@gmmorris
Contributor

I've updated the description to add some clarity for whoever picks this issue up. :)

@gmmorris gmmorris removed the Meta label Sep 13, 2021
@ymao1
Contributor

ymao1 commented Sep 13, 2021

Updated the description to include an item for keeping track of rules that exceed their timeout.

@gmmorris gmmorris added estimate:needs-research Estimated as too large and requires research to break down into workable issues and removed estimate:medium Medium Estimated Level of Effort labels Sep 20, 2021
@YulNaumenko YulNaumenko self-assigned this Sep 23, 2021
@gmmorris gmmorris added the telemetry Issues related to the addition of telemetry to a feature label Oct 7, 2021
@YulNaumenko YulNaumenko reopened this Nov 2, 2021
@mikecote
Contributor Author

mikecote commented Dec 13, 2021

@YulNaumenko, thanks for driving this issue through 7.16 / 8.0 ❤️! After today's planning session and given we're past 8.0, we've done as much as we can regarding additional telemetry points, and we can now move this issue back into the backlog for now.

@mikecote
Contributor Author

Closing as 8.0 has sufficient telemetry.

10 participants