Leveraging event log data to provide better insights of the alerting framework #111452

Closed
mikecote opened this issue Sep 7, 2021 · 18 comments
Labels
  • estimate:needs-research (Estimated as too large; requires research to break down into workable issues)
  • Feature:Actions/Framework (Issues related to the Actions Framework)
  • Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework)
  • impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product)
  • insight (Issues related to user insight into platform operations and resilience)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote (Contributor) commented Sep 7, 2021

Relates to #51548
Relates to #61018

We should research what UX and vision would give users better insight into the alerting framework, and break that work down into deliverables. The goal is to empower users and reduce the time it takes to debug the alerting framework when it's not behaving in expected ways.

Problem Statements

  • It is currently impossible to view in the UI a history of rule executions (including time, duration, outcome, error message, etc.) to gain confidence that rules are running on time and successfully
  • It is currently impossible to view in the UI a history of connector executions (including time, duration, outcome, error message, etc.) to gain confidence that connectors are running successfully
  • It is currently impossible to view in the UI a count of alerts fired during a rule execution and the related action event log documents
  • It is currently impossible to view in the UI a history of the underlying rule task (including timeout, outcome, picked up due to retryAt, error message, drift, etc.)
  • It is currently impossible to view in the UI a breakdown of events within a rule's execution (loading/decrypting SOs, calling the executor function, scheduling actions, cleanup, etc.), which makes it hard to identify which part of the execution is slow
  • It is currently impossible to "drilldown" in the UI from rule -> rule execution history -> action execution history and vice versa
  • ...
@mikecote added the Team:ResponseOps, Feature:Alerting/RulesFramework, Feature:Actions/Framework, and estimate:needs-research labels Sep 7, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Sep 7, 2021

In my event log dashboard, I have some graphs that were easy to build from the event log and that I think are useful; at the very least they give some more ideas of what we could do. The dashboard queries events across the entire system, but you could imagine that in the product it would show events for all the rules/connectors in the current space. (A rough query sketch for the first of these graphs follows the list below.)

graphs over time:

  • top N of average rule duration
  • top N of rule schedule delay (drift)
  • success/failure counts
  • error messages / counts during rule/connector executions
  • task type / state - this one is kinda interesting; it shows each task, its current state, and the number of tasks in that state. Probably more interesting to think about as a "task manager UI", and it's problematic that it's "global" in terms of security. Seems like we'd only want someone with a "management role" to be able to view it.
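For illustration, here's a minimal sketch of the kind of aggregation behind the "top N of average rule duration" graph. The index pattern (.kibana-event-log-*) and field names (event.provider, event.action, event.duration in nanoseconds, rule.id) are assumptions about the event log mapping, and the request uses the 8.x @elastic/elasticsearch client shape; this is not the dashboard's actual query.

```ts
// Hypothetical sketch of a "top N of average rule duration" query against the
// Kibana event log. Field names and index pattern are assumptions -- verify
// them against your stack version.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

async function topRulesByAvgDuration(n = 10) {
  const resp = await es.search({
    index: '.kibana-event-log-*',
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { term: { 'event.action': 'execute' } },
          { range: { '@timestamp': { gte: 'now-7d' } } },
        ],
      },
    },
    aggs: {
      by_rule: {
        // N rules with the highest average execution duration
        terms: { field: 'rule.id', size: n, order: { avg_duration: 'desc' } },
        aggs: {
          avg_duration: { avg: { field: 'event.duration' } }, // nanoseconds
        },
      },
    },
  });
  return resp.aggregations;
}

topRulesByAvgDuration().then((aggs) => console.log(JSON.stringify(aggs, null, 2)));
```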

@ymao1 self-assigned this Sep 14, 2021
@mdefazio (Contributor)

Trying to bring this into the rule monitoring work we're doing for RAC. With that in mind, @pmuellr, are

top N of average rule duration
top N of rule schedule delay (drift)

similar to security's query time or index time?

@ymao1 (Contributor) commented Sep 15, 2021

@mdefazio I believe security keeps track of more granular durations during rule execution, so query time is the time the ES query took and index time is the time it took to index into siem.signals. Right now the event log keeps track of total duration to run the rule, so we don't quite have that level of information.

Looks like there was just an issue opened to store some of that information with the rule though: #112193

@pmuellr (Member) commented Sep 15, 2021

are these similar to security's query time or index time?

top N of average rule duration
top N of rule schedule delay (drift)

Related, but not similar. Security's query/index time are durations of time spent making ES calls. Average rule duration will include those, since the rule duration implicitly covers all calls made during the rule execution, including any querying or indexing being done. The rule schedule delay is a completely different thing - it's the difference between when a rule (or connector) was supposed to run and when it actually ran. If it was supposed to run at 12:00:00 but ran at 12:00:01, the schedule delay would be 1 second.
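To make that concrete, a hypothetical drift calculation (purely illustrative, not the framework's actual implementation):

```ts
// Hypothetical illustration of schedule delay (drift): the gap between when an
// execution was supposed to start and when it actually started.
function scheduleDelayMs(scheduledAt: Date, startedAt: Date): number {
  return startedAt.getTime() - scheduledAt.getTime();
}

// Supposed to run at 12:00:00 but ran at 12:00:01 -> 1000 ms (1 second) of drift.
const delay = scheduleDelayMs(
  new Date('2021-09-07T12:00:00Z'),
  new Date('2021-09-07T12:00:01Z')
);
console.log(`schedule delay: ${delay} ms`); // 1000
```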

Looks like there was just an issue opened to store some of that information with the rule though: #112193

Yup, #112193 is tracking what security wants added - I don't believe this includes any thoughts from observability yet though.

Presumably these interesting values will also be added as rule- or solution-specific event log records, though we'll probably want to "standardize" so that whatever we add to the execution_status for #112193 is also available in the event log.

@gmmorris added the impact:high and insight labels Sep 16, 2021
@ymao1 (Contributor) commented Sep 23, 2021

Link to POC: #112982

Event log data in Rules Management view

When viewing the rules management page, we would like to answer the following questions as quickly as possible:

  • Are there any rules or rule types that are taking a long time to execute?
  • Are there any rules or rule types that are taking a long time to start running after being scheduled?
  • Are there any rules or rule types that are generating a lot (too many?) of alerts
  • Are there any errors happening during rule execution?
    • If there are errors, what are the top reasons for them?
    • If there are errors, what are the most common error messages?
    • If there are errors, which rules or rule types are erroring most frequently?

Included in the POC is the type of information, currently available in the Event Log, that we can use to answer the above questions. Note that while we can and should filter this data by time (last week, last day, etc.), this filtering is not included in the POC; it defaults to the last 7 days.
[Screenshot: POC mockup of the event log panels in the Rules Management view]

  1. Breakdown of rules by rule type
  2. Alert count, average execution duration, and average delay are broken down by rule type and then by rule ID (a sketch of this kind of aggregation follows this list). Not included in the POC, but for the rule type breakdown we can imagine clicking on a rule type and then filtering the breakdown by rule ID to show only rules with that rule type. So if rule type A has a long average duration, clicking into rule type A would show the rules of that type sorted by longest average duration.
  3. Each breakdown by rule can include a link to the rule details page for 1-click drilldown into the rule.
  4. Error messages are aggregated by message and error reason. Grouping by error message might be tricky in real life because the error message often includes the rule ID. In Patrick's dashboard the error messages are just listed out in a table without grouping, so that is another view we could show.
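As a rough illustration of the kind of aggregation behind breakdowns 1 and 2 (not the POC's actual query), the request body could look something like the following; rule.category, event.duration, and kibana.task.schedule_delay (the latter two in nanoseconds) are assumed field names:

```ts
// Hypothetical aggregation body for the "breakdown by rule type" panels.
// Field names are assumptions, not confirmed against the POC.
export const ruleTypeBreakdownQuery = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'event.provider': 'alerting' } },
        { term: { 'event.action': 'execute' } },
        { range: { '@timestamp': { gte: 'now-7d' } } }, // POC default: last 7 days
      ],
    },
  },
  aggs: {
    by_rule_type: {
      terms: { field: 'rule.category', size: 20 },
      aggs: {
        avg_duration: { avg: { field: 'event.duration' } },
        avg_schedule_delay: { avg: { field: 'kibana.task.schedule_delay' } },
      },
    },
  },
};
```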

Other questions that we currently do not have the event log data to support:

  • Are there any timeouts happening during rule execution?
    • If there are timeouts, which rules or rule types are timing out most frequently?
    • If there are timeouts, what type of schedule interval are the rules running at?

Event log data in Rules Details view

When viewing the rule details page, we would like to answer the following questions as quickly as possible:

  • Last time this rule successfully ran
  • Last time this rule failed to run
  • Number of times this rule has run successfully or failed
  • Execution durations of the rule executions
  • Schedule delay of the rule executions
  • Number of alerts this rule generates with each execution
  • Last time an action successfully ran for this rule
  • Last time an action failed to run for this rule
  • Number of times an action for this rule successfully ran or failed
  • Execution duration of the action executions
  • Schedule delay of the action executions

Included in the POC is the type of information, currently available in the Event Log, that we can use to answer the above questions. Note that while we can and should filter this data by time (last week, last day, etc.), this filtering is not included in the POC; it defaults to the last 7 days.
[Screenshot: POC mockup of the event log panels in the Rule Details view]

Failure outcomes are listed in the execution table with the associated error message. Not included in this POC is the ability to filter the execution tables by success/failure outcome.
[Screenshot: execution history table showing failure outcomes and error messages]
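For context, a hedged sketch of the sort of per-rule event log query that could back an execution table like this; the kibana.saved_objects.* fields, the 'alert' saved object type, and the other field names are assumptions rather than the POC's actual implementation:

```ts
// Hypothetical query for a single rule's execution history, most recent first.
export function ruleExecutionHistoryQuery(ruleId: string, size = 50) {
  return {
    size,
    sort: [{ '@timestamp': { order: 'desc' } }],
    _source: ['@timestamp', 'event.outcome', 'event.duration', 'error.message'],
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { term: { 'event.action': 'execute' } },
          { term: { 'kibana.saved_objects.type': 'alert' } },
          { term: { 'kibana.saved_objects.id': ruleId } },
        ],
      },
    },
  };
}
```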

Other questions that we currently do not have the event log data to support:

  • Further breaking down execution duration into decrypt time, query time, index time.
  • What is the state at the end of each execution? This is stored in task manager but overwritten with each execution. Would this be useful to know for each execution?
  • When the API key tied to the rule changed
  • History of updates (including enable/disable and mute/unmute) to the rule and the users who made the update.
  • History of scheduled task ids associated with the rule (useful for debugging if we end up with multiple tasks for a rule)

@ymao1 (Contributor) commented Sep 23, 2021

On the Rules Management page, the current POC aggregates a lot of information from the Event Log in order to show a bird's-eye view of the data. It would also be useful to show a time-based view of the average execution duration and delay (probably by rule type; I think not necessarily by rule) in order to see if there are trends in the time of day when the execution duration/delay increases. (A sketch of such a query follows.)
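A minimal sketch of what that time-based view could query, under the same assumed field names as the earlier sketches (not an actual implementation):

```ts
// Hypothetical trend aggregation: average execution duration and schedule delay
// per rule type, bucketed by hour, to spot time-of-day patterns.
export const durationTrendQuery = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { 'event.provider': 'alerting' } },
        { term: { 'event.action': 'execute' } },
        { range: { '@timestamp': { gte: 'now-7d' } } },
      ],
    },
  },
  aggs: {
    over_time: {
      date_histogram: { field: '@timestamp', fixed_interval: '1h' },
      aggs: {
        by_rule_type: {
          terms: { field: 'rule.category', size: 10 },
          aggs: {
            avg_duration: { avg: { field: 'event.duration' } },
            avg_schedule_delay: { avg: { field: 'kibana.task.schedule_delay' } },
          },
        },
      },
    },
  },
};
```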

@ymao1 (Contributor) commented Sep 27, 2021

After offline discussion, there was a question of which items from the POC would be best/most straightforward to try to push out as an experimental feature for 7.16.

1. Event log data in Rules Details view

  • Limited to a single rule ID so we can use the same RBAC model we use to get the current rule summary
  • No changes to the event log client; we're using the existing find function to retrieve documents

2. Event log data in Rules Management view

  • Requires more thought about RBAC. In the POC, I used the rules client .find() to retrieve the rule IDs the user has access to, and then filtered the event log documents by those rule IDs (sketched below). In practice, this is a hacky way to achieve RBAC.
  • The POC uses the aggregation function from this PR for most of the aggregations, so it would be nice to resolve that PR before moving forward with this view.
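For clarity, here is an outline of that RBAC workaround. The interfaces are simplified placeholders, not the real Kibana rules client / event log client signatures:

```ts
// Hypothetical outline of the POC's RBAC workaround: resolve the rule IDs the
// current user can access via the rules client, then restrict the event log
// query to those IDs.
interface RulesClientLike {
  find(opts: { perPage: number }): Promise<{ data: Array<{ id: string }> }>;
}

interface EventLogClientLike {
  findEventsBySavedObjectIds(
    type: string,
    ids: string[],
    opts: { start: string; end: string }
  ): Promise<unknown>;
}

export async function findAuthorizedRuleEvents(
  rulesClient: RulesClientLike,
  eventLogClient: EventLogClientLike
) {
  // 1. Which rules can this user see? Authorization is enforced by the rules client.
  const { data: rules } = await rulesClient.find({ perPage: 10000 });
  const ruleIds = rules.map((r) => r.id);

  // 2. Only fetch event log documents that reference those rules.
  return eventLogClient.findEventsBySavedObjectIds('alert', ruleIds, {
    start: 'now-7d',
    end: 'now',
  });
}
```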

IMO, we have all the information to implement 1 (event log data in the Rule Details view) right away, whereas we need some more pieces to fully and correctly implement 2. My suggestion would be to push forward with the Rule Details view in the near term.

cc @mikecote @kobelb @gmmorris and @elastic/kibana-alerting-services for your thoughts

@YulNaumenko (Contributor)

Used the aggregation function from this PR to achieve most of the aggregations, so would be nice to resolve that PR first before moving forward with this view

I can update it to get the CI green. @ymao1, do you need any specific changes to the query structure, or is the current one OK?

@mikecote (Contributor, Author)

cc @mikecote @kobelb @gmmorris and @elastic/kibana-alerting-services for your thoughts

Ranking the questions by section and by overall priority, below is how I would sort them. If I had to pick a cut-off for my recommendation, I'd choose after 3 for the rules management view and after 8 for the rule details view, but it would be a bonus if we get everything included!

Event log data in Rules Management view

  1. Are there any errors happening during rule execution?
  2. Are there any timeouts happening during rule execution?
  3. Are there any rules or rule types that are taking a long time to execute?
  4. Are there any rules or rule types that are generating a lot (too many?) of alerts
  5. Are there any rules or rule types that are taking a long time to start running after being scheduled?

Some more questions that could be useful:

  • What is the schedule_delay over time? for rules, for actions..
  • How many rules and actions are executed over time?

Event log data in Rules Details view

  1. Last time this rule successfully ran
  2. Execution durations of the rule executions
  3. Schedule delay of the rule executions
  4. Last time an action failed to run for this rule
  5. Number of alerts this rule generates with each execution
  6. Last time this rule failed to run
  7. Execution duration of the action executions
  8. Schedule delay of the action executions
  9. Last time an action successfully ran for this rule
  10. Number of times this rule has run successfully or failed
  11. Number of times an action for this rule successfully ran or failed

@kobelb (Contributor) commented Sep 27, 2021

While I still think this is super cool, the closer I've looked at the details of the data being presented, the less I think it's something we should rush into 7.16.

To address the immediate needs, I think that we're going to need data to help our users determine whether they have enough capacity to run their alerting rules/actions and whether there are errors or slow executing rules that need attention. The proof-of-concept does help here, but I think it'll be somewhat difficult for them to do so. I fully acknowledge that your proof-of-concept wasn't created to solve the immediate needs that have recently come to our attention, and you can't predict the future.

I don't say any of this to disparage the work that you've done @ymao1, it truly is exceptional. I'd just hate for us to rush this in, expand the scope, or stress people out if we don't need to.

@pmuellr (Member) commented Sep 27, 2021

To address the immediate needs, I think that we're going to need data to help our users determine whether they have enough capacity to run their alerting rules/actions and whether there are errors or slow executing rules that need attention.

That's certainly A need, but I think there are bigger concerns with that, since the implication is you'd want to allow someone to look across all the rules in the cluster, and we don't have a way to do that right now. Presumably such a cluster-wide alerting admin UX would only be available for superuser, or probably we'd want to invent some new role of alerting-admin or such. And it's going to be a radically different UX, though it would presumably be showing some of the same stuff, and using the same APIs.

I suspect that it's also true that we don't know exactly what we want / don't want in UX's such as this or the cluster-wide alerting admin UX, and we won't until we start seeing some of this data and figure out how useful it really is. In that way, this work is a stepping stone to get us to that set of useful metrics/charts.

There are still plenty of good potential reasons to not ship this (too much work to finish, not able to add cleanly so we won't be able to change/remove it later, etc), but I just don't feel like that is a good one.

@kobelb (Contributor) commented Sep 28, 2021

That's certainly A need, but I think there are bigger concerns with that, since the implication is you'd want to allow someone to look across all the rules in the cluster, and we don't have a way to do that right now. Presumably such a cluster-wide alerting admin UX would only be available for superuser, or probably we'd want to invent some new role of alerting-admin or such. And it's going to be a radically different UX, though it would presumably be showing some of the same stuff, and using the same APIs.

Agreed, there are likely a lot of complications here.

I suspect that it's also true that we don't know exactly what we want / don't want in UX's such as this or the cluster-wide alerting admin UX, and we won't until we start seeing some of this data and figure out how useful it really is. In that way, this work is a stepping stone to get us to that set of useful metrics/charts.

There are still plenty of good potential reasons to not ship this (too much work to finish, not able to add cleanly so we won't be able to change/remove it later, etc), but I just don't feel like that is a good one.

I agree that what I've articulated isn't a reason to not ship this UI. However, during a sync meeting today I was pushing for us to figure out if we could ship this in 7.16 because of the immediate needs that have recently been discovered. I was attempting to retract my advocacy for pushing hard to include this in 7.16. If the team and others feel comfortable shipping part or all of this for 7.16, I'm all for it.

@gmmorris (Contributor)

I love this POC @ymao1 !

with regards to the discussion about shipping in 7.16 - I'm good either way, as I trust the team to evaluate whether code is ready to ship or not.

The two things I'd like us to consider are:

  1. Is this code ready to ship from a quality perspective (@ymao1 can make this assessment with reviewers as always :) )
  2. Will this experimental UX reduce the support workload we have, or might rushing this in 7.x actually cause us to deliver something that creates more support workload instead?

In order to assess (2), I'd like it if we could ask a few questions:

  1. What questions would this UX actually answer? (we can start with @mikecote 's list, though @chrisronline has a list from his previous O11y of alerting work)
  2. What questions might exposing this UX actually force that would otherwise not come up? (For example, if customers suddenly see a constant drift of 2-3 seconds this might lead to SDHs as customers don't realise that this is expected).
  3. What questions that we usually get in SDHs would this UX not actually answer?

If these questions reveal that a limited experimental UX will actually help reduce the support workload, then I'm all for it and we can iterate on it in 8.1.
If it raises any concerns around this, or @ymao1 feels any pressure to rush this, then I'm more than happy to defer this to 8.1+.

Moving fast is great, but rushing only costs us more in the long term, and given we are relying on data that is already available by querying the Event Log, I don't see a reason to rush this.

@gmmorris (Contributor)

For example, if customers suddenly see a constant drift of 2-3 seconds this might lead to SDHs as customers don't realise that this is expected

This could actually be mitigated through UX (a green indicator with a help label for when drift <= polling interval), but until we figure that out, it might be safer not to expose data that could be misinterpreted.
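As an illustration only, that mitigation could be as simple as something like the sketch below; the thresholds, labels, and the 3000 ms default polling interval are illustrative assumptions, not an agreed design:

```ts
// Hypothetical UX helper: treat drift at or below the task polling interval as
// healthy, so expected delay isn't surfaced as a problem.
type DriftIndicator = { color: 'green' | 'yellow' | 'red'; help?: string };

function driftIndicator(driftMs: number, pollingIntervalMs = 3000): DriftIndicator {
  if (driftMs <= pollingIntervalMs) {
    return {
      color: 'green',
      help: 'A delay of up to the task polling interval is expected.',
    };
  }
  return { color: driftMs <= 10 * pollingIntervalMs ? 'yellow' : 'red' };
}

// Example: 2.5 s of drift with a 3 s polling interval reads as healthy.
console.log(driftIndicator(2500)); // { color: 'green', help: '...' }
```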

@ymao1 (Contributor) commented Sep 28, 2021

I think for this first iteration, we should narrow the focus, maybe on these two:

  • failures, either during rule execution or action execution
  • task durations, either during rule execution or action execution

I think if we want to ship anything in 7.16, it would have to be a view inside the Rule Details page. As I mentioned above, I think RBAC considerations make shipping anything in the Rule Management page difficult in the short term.

If we are surfacing a view in the Rule Details page, we need a starting point for users/support to know to go to the rule. Do we have that?

  • for failures, if there is an error during rule execution, the error logs have the rule ID and the rule management UI shows an error status for the rule. This seems like it would be sufficient to direct people to the Rule Details view
  • for failures, if there is an error during action execution, these errors don't show up in the UI and the error logs don't link back to the rule, so there would have to be additional digging to figure out which rule the failed action belongs to in order to make use of any info in the Rule Details view
  • for task duration, if there is a rule running for a long time, we have this 7.16 issue to show something in the rule management UI. I don't believe we have sufficient error logging for this
  • for task duration, if there is an action running for a long time, we don't show this in the UI or have logging, so it would take work to determine which actions are failing and which rules they belong to before being able to use anything in the Rule Details page (by which time maybe the view isn't even useful?)

So it seems to me that adding info to the Rule Details view would currently only be useful if a known rule (not rule type) is either explicitly being monitored (i.e. some rule admin keeps going to the details page for this rule and checking out the monitoring tab) or is failing/taking a long time during rule execution.

@chrisronline (Contributor)

Unfortunately, my focus on the o11y efforts started with integration with Stack Monitoring specifically - I didn't have much insight into what questions users had when trying to understand and diagnose the performance of alerting, so there isn't much there to help.

@gmmorris (Contributor) commented Oct 1, 2021

@ymao1 @mikecote and I held a sync call on this issue and decided not to rush into implementing this UX based on the POC.

The main blockers for us are questions around whether the UX will actually answer more questions than it raises, and how it fits into our long-term vision for Observability of Alerting.
We've learned A LOT from this effort, and the next steps are for @arisonl, @mikecote, and me to explore this long-term vision and put together a plan for 8.x.

Thanks for the amazing effort @ymao1 !

@ymao1 closed this as completed Oct 1, 2021
@kobelb added the needs-team (Issues missing a team label) label Jan 31, 2022
@botelastic bot removed the needs-team (Issues missing a team label) label Jan 31, 2022