Leveraging event log data to provide better insights of the alerting framework #111452
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
In my event log dashboard, I have some graphs which were easy to build from the event log, and I think useful. At least it provides some more thought into what we could do. The dashboard queries events across the entire system, but you could imagine that in the product it would show events for all the rules/connectors in the current space. Graphs over time:
Trying to bring this into the rule monitoring work we're doing for RAC. With that in mind, @pmuellr, is
@mdefazio I believe security keeps track of more granular durations during rule execution, so query time is the time the ES query took and index time is the time it took to index. Looks like there was just an issue opened to store some of that information with the rule though: #112193
Related, but not similar. Security's query/index time are durations of time spent making ES calls. Average rule duration will include those, as the rule duration implicitly includes all calls made during the rule execution, so it includes any querying or indexing being done. The rule schedule delay is a completely different thing - it's the difference between when a rule (or connector) was supposed to run and when it actually ran. If it was supposed to run at 12:00:00, but ran at 12:00:01, the schedule delay would be 1 second.
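The schedule delay described above is just a timestamp difference. A minimal illustration (not framework code - the alerting plugin computes this internally from task manager timestamps):

```python
from datetime import datetime, timedelta

def schedule_delay(scheduled: datetime, started: datetime) -> timedelta:
    """Difference between when a rule was supposed to run and when it ran."""
    return started - scheduled

# The example from the comment: supposed to run at 12:00:00, ran at 12:00:01.
delay = schedule_delay(
    datetime(2021, 9, 1, 12, 0, 0),
    datetime(2021, 9, 1, 12, 0, 1),
)
print(delay.total_seconds())  # 1.0
```

A consistently growing delay across many rules would suggest a capacity problem rather than a problem with any single rule.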
Yup, #112193 is tracking what security wants added - I don't believe this includes any thoughts from observability yet though. Presumably these interesting values will also be added as rule- or solution-specific event log records, though we'll probably want to "standardize" so that whatever we add to the execution_status for #112193 is also available in the event log.
Link to POC: #112982

Event log data in Rules Management view

When viewing the rules management page, we would like to answer the following questions as quickly as possible:
Included in the POC is the type of information we can use to answer the above questions that is currently available in the Event Log. Note that while we can and should filter this data by time (last week, last day, etc), this filtering is not included in the POC and it defaults to the last 7 days.
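The time filtering described above (defaulting to the last 7 days) could be expressed as an Elasticsearch query body along these lines. This is a hedged sketch: the `event.action` and `@timestamp` field names are assumed from the ECS-style event log schema, and the structure is illustrative rather than the POC's actual query:

```python
def event_log_query(days: int = 7) -> dict:
    """Build a query body for rule execution events in the last `days` days."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Only rule/connector execution events.
                    {"term": {"event.action": "execute"}},
                    # Time filter; the POC defaults to the last 7 days.
                    {"range": {"@timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        }
    }

query = event_log_query()  # defaults to the last 7 days
```

Making `days` a parameter is the natural place to hang the "last week, last day, etc." filter the POC doesn't yet expose.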
Other questions that we currently do not have the event log data to support:
Event log data in Rules Details view

When viewing the rule details page, we would like to answer the following questions as quickly as possible:
Included in the POC is the type of information we can use to answer the above questions that is currently available in the Event Log. Note that while we can and should filter this data by time (last week, last day, etc.), this filtering is not included in the POC and it defaults to the last 7 days.

Failure outcomes are listed in the execution table with the associated error message. Not included in this POC is a filter on the execution tables to filter by success/failure outcome.

Other questions that we currently do not have the event log data to support:
On the Rules Management page, the current POC aggregates a lot of information from the Event Log in order to show a bird's eye view of the data, but it would also be useful to show a time-based view of the average execution duration and delay (probably by rule type, I think not necessarily by rule) in order to see if there are trends in the times of day when the execution duration/delay increase.
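The time-based, per-rule-type view suggested above maps naturally onto a `date_histogram` aggregation with a `terms` sub-aggregation. A sketch of the aggregation body, with the caveat that the field names (`event.duration` in nanoseconds, `rule.category` for the rule type) are assumptions about the event log mapping, not confirmed here:

```python
def duration_trend_agg(interval: str = "1h") -> dict:
    """Average execution duration over time, broken down by rule type.

    Assumed fields: event.duration (nanoseconds) and rule.category
    (rule type identifier) on event log execute documents.
    """
    return {
        "size": 0,  # aggregations only, no hits
        "aggs": {
            "per_interval": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {
                    "per_rule_type": {
                        "terms": {"field": "rule.category"},
                        "aggs": {
                            "avg_duration": {
                                "avg": {"field": "event.duration"}
                            }
                        },
                    }
                },
            }
        },
    }

agg = duration_trend_agg()
```

Bucketing by rule type rather than by rule keeps the cardinality manageable while still surfacing time-of-day trends.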
After offline discussion, there was a question of which items from the POC would be best/most straightforward to try to push out as an experimental feature for 7.16:

1. Event log data in Rules Details view
2. Event log data in Rules Management view
IMO, we have all the information to implement this. cc @mikecote @kobelb @gmmorris and @elastic/kibana-alerting-services for your thoughts
I can update it to get a green CI. @ymao1 do you need any specific changes to the query structure, or is the current one OK?
Ranking the questions by section and by overall priority, below is how I can see them sorted. If I had to pick a cut-off on my recommendation, I'd choose after 3 for the rules management and after 8 for the rule details, but a bonus if we get everything included!

Event log data in Rules Management view
Some more questions that could be useful:
Event log data in Rules Details view
While I still think this is super cool, the closer I've looked at the details of the data being presented, the less I think it's something we should rush into 7.16.

To address the immediate needs, I think we're going to need data to help our users determine whether they have enough capacity to run their alerting rules/actions and whether there are errors or slow-executing rules that need attention. The proof-of-concept does help here, but I think it'll be somewhat difficult for them to do so. I fully acknowledge that your proof-of-concept wasn't created to solve the immediate needs that have recently come to our attention, and you can't predict the future.

I don't say any of this to disparage the work that you've done @ymao1, it truly is exceptional. I'd just hate for us to rush this in, expand the scope, or stress people out if we don't need to.
That's certainly A need, but I think there are bigger concerns with that, since the implication is you'd want to allow someone to look across all the rules in the cluster, and we don't have a way to do that right now. Presumably such a cluster-wide alerting admin UX would only be available to the superuser, or probably we'd want to invent some new role of alerting-admin or such. And it's going to be a radically different UX, though it would presumably be showing some of the same stuff, and using the same APIs.

I suspect it's also true that we don't know exactly what we want / don't want in UXs such as this or the cluster-wide alerting admin UX, and we won't until we start seeing some of this data and figure out how useful it really is. In that way, this work is a stepping stone to get us to that set of useful metrics/charts.

There are still plenty of good potential reasons not to ship this (too much work to finish, not able to add it cleanly so we won't be able to change/remove it later, etc.), but I just don't feel like that is a good one.
Agreed, there are likely a lot of complications here.
I agree that what I've articulated isn't a reason not to ship this UI. However, during a sync meeting today I was pushing for us to figure out if we could ship this in 7.16 because of the immediate needs that have recently been discovered. I was attempting to retract my advocacy for us to push hard for this to be included in 7.16. If the team and others feel comfortable shipping part or all of this for 7.16, I'm all for it.
I love this POC, @ymao1! With regard to the discussion about shipping in 7.16 - I'm good either way, as I trust the team to evaluate whether code is ready to ship or not. The two things I'd like us to consider are:
In order to assess (2), I'd like it if we could ask a few questions:
If these questions reveal that a limited experimental UX will actually help reduce the support workload, then I'm all for it and we can iterate on it in 8.1. Moving fast is great, but rushing only costs us more in the long term, and given we are relying on data that is already available by querying the Event Log, I don't see a reason to rush this. |
This could actually be mitigated through UX (a green indicator with a help label, for example)
I think for this first iteration, we should narrow the focus, maybe on these two:
I think if we want to ship anything in 7.16, it would have to be a view inside the Rule Details page. As I mentioned above, I think RBAC considerations make shipping anything in the Rule Management page difficult in the short term. If we are surfacing a view in the Rule Details page, we need a starting point for users/support to know to go to the rule. Do we have that?
So it seems to me that adding info to the Rule Details view would currently only be useful if a known rule (not rule type) is either being explicitly monitored (i.e. some rule admin keeps going to the details for this rule and checking out the monitoring tab) or is failing/taking a long time during rule execution.
Unfortunately, my focus on the o11y efforts started with integration with Stack Monitoring specifically - I didn't have much insight into what questions and answers users had when trying to understand and diagnose the performance of alerting, so there isn't much there to help.
@ymao1 @mikecote and I held a sync call on this issue and decided not to rush into implementing this UX based on the POC. The main blockers for us are questions around whether the UX will actually answer more questions than it raises, and how it fits into our long term visions for Observability of Alerting. Thanks for the amazing effort @ymao1!
Relates to #51548
Relates to #61018
We should research what UX and vision would give users better insight into the alerting framework, and break it down into deliverables. It should benefit users by empowering them to debug the alerting framework, in less time, when it's not behaving in expected ways.
Problem Statements