RuleTypes can't provide an AlertContext on recovery #87048
Comments
Couple of thoughts.
|
I wonder if we need to be more general - maybe every action group could have different context variables? |
One of our use-cases why the state in the recovered action would be very helpful for us is due to the following: We're monitoring about 64 different monitors with one |
We use |
@tamland this is a good hack! Thank you for the suggestion |
@gmmorris can we please prioritise this? Without it, the recovery state message is pretty much useless. We are constantly receiving feedback about this on the discuss forum from users, and also from internal users. |
After seeing another user request for this, specifically mentioning |
I think we should at least solve this confusion from a UI point of view: until we can resolve the issue, we shouldn't display the list of state variables for the resolved action in the alert flyout. |
@pmuellr We do have access to them when firing a recovered action but it's the state/context from the last execution. So some of the information captured (ex: current CPU usage) could be false.
This should be implemented already 🤔 |
@mikecote it will work for most Uptime alert use cases though, since it's mostly metadata the user is interested in, |
Taking a deeper look at the list for the monitor status alert instance, only the error message can change. The rest will remain the same. |
Random thought: I'm wondering if the solution to the problem should provide a split for this. Have access to some metadata when the alert is created, and that metadata is assumed static for the life of the alert. This would be separate from the state/context, which would remain inaccessible for recovered actions. This may overlap with parts of the alerts as data story where we could provide the alert data object for all actions. 🤔 hmm |
Sorry for the delay @shahzad31 , as you're likely aware our team's entire capacity is currently diverted to the RAC initiative and unified alerting. As a consequence we've had to delay every other priority. Let @mikecote and myself know if that's an option from your perspective. |
This issue came up in Slack today and I provided some historical context which, I figure, is worth copy-pasting in here as well. To add some historical context around this:
As Recovery is caused by the omission of an alert by a rule, there is no To get around this we’ve proposed a change where Rules could explicitly cause an alert to recover by providing the The product concern around this possible randomness caused us to hold off on the proposed solution and in the meantime other priorities put this on ice. @mikecote recently proposed another idea which could address this: Rules can provide a static context that remains true for the lifetime of the alert. We could pass this static data into every action, including the Recovery action. This would mean that some context fields are still unavailable on recovery, but at least you would have some details that are not specific to an evaluation and they can be used at any point in the alert’s lifecycle. |
How "static" would this be? Taking the index threshold rule type as an example, I'd think that there are cases where you'd like to see the last value of the calculated value in the recovered message. Or maybe it would make more sense to see the first one, or the maximal one, or ... In any case, this isn't really "static" data in my thinking, but more "persistent" data. And presumably this data will be in "alerts as data" documents, so perhaps there is some way of making use of that data for this specific case. Perhaps this is even a case where alerting shouldn't be dealing with this at all, and it should be more a "RAC" capability / responsibility. |
I have updated the description above with the next steps we've discussed synchronously: to explore the two complementary options discussed and propose concrete follow up issues. |
I'm also running into an issue with this. When creating an Uptime alert with integration into ServiceNow (SNOW) we would like to set the Message Key going into the SNOW Event Management Platform to {{state.url}} so that all alerts for a specific URL would roll up to a single alert. Using {{alert.id}} is not a good solution because the AlertID also contains the name of the checkpoint that performed the check. This would result in an alert per URL per checkpoint which creates confusion and alert fatigue. We also see this in other areas such as Metric Thresholds where we have similar requirements to key off a value in Context. |
Thanks for the feedback @mholttech we'll explore ways in which this can be remediated (cc @elastic/uptime @elastic/logs-metrics-ui ) |
This still sounds like the best idea to me, but the second comment above is an interesting twist. Sorta feels like a rule could opt in to providing its own recover processing, in which case alerting would NOT provide the automatic recover processing. I think the scariest thing about this is thinking about all the edge cases. For instance, with our current rule disable processing, we "recover" rules since we drop the state, and have no hope of recovering them later since we did in fact lose the state. Seems like this is going to force us to come up with a different actionGroup or such, for these - Hmmm ... or is this a case for subgroups? I can't remember how those work ... but something like:
|
With this POC, we are attempting to keep a consistent set of context variables available for a rule type across all of its action groups. We do this by letting rule executors retrieve the list of recovered alert IDs during an execution and explicitly specify context variables for the recovery action. To provide a consistent user experience, we want the recovery variables to be typed the same as the existing action variables. Just providing these service functions to the rule executor is insufficient, as rule executors will need to be updated to use them, so we have two options for providing default context for recovered alerts:
While the POC focuses on |
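To make the POC idea above easier to picture, here is a rough sketch of an executor asking for the recovered alert IDs and attaching context to each of them. This is not the POC's actual code: the names `getRecoveredAlertIds`, `setRecoveryContext`, `createRecoveryServices` and `contextFor`, the monitor IDs and the example URL are all made up for illustration, and defaulting to an empty context is just an assumption.

```typescript
// Hypothetical sketch of the POC idea; none of these names are the real Kibana API.
type Context = { [key: string]: unknown };

interface RecoveryServices {
  getRecoveredAlertIds(): string[];
  setRecoveryContext(alertId: string, context: Context): void;
}

function createRecoveryServices(previouslyActive: string[], activeNow: string[]) {
  const recoveryContexts: { [alertId: string]: Context } = {};

  // An alert recovers when it was active before and is not active now.
  const recovered = previouslyActive.filter((id) => activeNow.indexOf(id) === -1);

  const services: RecoveryServices = {
    getRecoveredAlertIds: () => recovered.slice(),
    setRecoveryContext: (alertId, context) => {
      if (recovered.indexOf(alertId) === -1) {
        throw new Error(`${alertId} did not recover in this execution`);
      }
      recoveryContexts[alertId] = context;
    },
  };

  // Context handed to the recovery action. Assumption: empty unless the executor set one.
  const contextFor = (alertId: string): Context => recoveryContexts[alertId] || {};

  return { services, contextFor };
}

// A rule executor that opts in to providing recovery context.
const { services, contextFor } = createRecoveryServices(
  ["monitor-1", "monitor-2"],
  ["monitor-1"]
);
services.getRecoveredAlertIds().forEach((id) => {
  services.setRecoveryContext(id, { url: `https://example.test/${id}`, status: "up" });
});
```

In this sketch a rule executor that never calls `setRecoveryContext` keeps today's behaviour (an empty context on recovery), which is why some default would still be needed for rule types that have not been updated.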
POC for static context approach With this POC, we are identifying a third set of action variables ( After working through the POC, it seems to me that having state, context and static context variables is kind of redundant. If static context is being persisted in task manager the same way that state is, is it that different from putting these variables inside state, other than organizationally? And having a third prefix in the action variable dropdown might cause even more confusion. |
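A rough sketch of the static context idea discussed above, only to make the redundancy concern concrete. The shapes `PersistedAlert` and `ActionVariables`, the function `buildActionVariables`, the `staticContext` prefix and the sample values are all invented here and are assumptions, not the POC's code.

```typescript
// Hypothetical model of three sets of action variables; not the real Kibana types.
type Vars = { [key: string]: unknown };

interface PersistedAlert {
  state: Vars; // persisted between executions
  staticContext: Vars; // assumed to be set once when the alert is created
}

interface ActionVariables {
  state: Vars;
  context: Vars;
  staticContext: Vars;
}

// Per-execution context only exists when the rule reported the alert in this run.
function buildActionVariables(persisted: PersistedAlert, context?: Vars): ActionVariables {
  return {
    state: persisted.state,
    context: context || {},
    staticContext: persisted.staticContext,
  };
}

const persisted: PersistedAlert = {
  state: { lastCheckedAt: "2021-01-01T00:00:00Z" },
  staticContext: { monitorUrl: "https://example.test/health" },
};

// Active action: all three sets of variables are populated.
const activeVars = buildActionVariables(persisted, { errorMessage: "timeout" });

// Recovery action: no per-execution context, but the static context survives.
const recoveredVars = buildActionVariables(persisted);
```

Note that in this model `state` and `staticContext` are stored and passed along in exactly the same way, which is the redundancy the comment points out.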
Verified that this is actually not implemented. We are hiding |
After discussion with @mikecote, I will write up RFCs for the following to get feedback from rule authors:
|
While the RFC is in review, the engineering work is still yet to be done. |
An RFC has been submitted and approved for this. We've created two followup issues to handle implementation:
Closing this research issue as done. Please track the implementation issues for current status. |
This is a follow-up to #49405 (comment), identified as a problem as part of the work on #86761 and resulting in a bug identified by @arisonl.
The Problem
When an AlertType wishes to activate an AlertInstance it schedules actions under an actionGroup and can optionally specify an AlertInstanceContext which is provided to the actions in that actionGroup.
The Alerting Framework automatically schedules actions in the recovered actionGroup when an active AlertInstance is omitted on the next execution.
As it is the omission of an AlertInstance that causes the scheduling of instances in the recovered actionGroup, we have no access to a context in these actions.
This is by design as we're trying to encourage detection of certain active states as part of a search, rather than querying of all data and then manually identifying this state in-memory. For that reason we do not want AlertType implementors to explicitly schedule actions in the recovered actionGroup, and to that effect we've made it impossible to do so without bypassing the type system in this PR.
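The recovery-by-omission behaviour described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the Alerting Framework's code: `resolveActions`, `Scheduled`, `FiredAction` and the `host-a` instance are made up for illustration.

```typescript
// Simplified, hypothetical model of recovery-by-omission (not the Kibana API).
type Context = { [key: string]: unknown };

interface Scheduled {
  actionGroup: string;
  context: Context;
}

interface FiredAction extends Scheduled {
  instanceId: string;
}

const RECOVERED = "recovered";

// previouslyActive: instance IDs that were active after the last execution.
// scheduledNow: instances the AlertType activated during this execution.
function resolveActions(
  previouslyActive: string[],
  scheduledNow: { [instanceId: string]: Scheduled }
): FiredAction[] {
  const fired: FiredAction[] = [];

  // Instances the AlertType scheduled keep the context it supplied.
  Object.keys(scheduledNow).forEach((instanceId) => {
    const { actionGroup, context } = scheduledNow[instanceId];
    fired.push({ instanceId, actionGroup, context });
  });

  // Instances that were active but are now omitted recover automatically.
  // Nothing was reported for them, so there is no context to pass on.
  previouslyActive.forEach((instanceId) => {
    if (!Object.prototype.hasOwnProperty.call(scheduledNow, instanceId)) {
      fired.push({ instanceId, actionGroup: RECOVERED, context: {} });
    }
  });

  return fired;
}

// Execution 1: the AlertType explicitly schedules "recovered" with a context.
const run1 = resolveActions(["host-a"], {
  "host-a": { actionGroup: RECOVERED, context: { cpu: 0.2 } },
});

// Execution 2: host-a is omitted, so "recovered" is scheduled a second time.
const run2 = resolveActions(["host-a"], {});
```

Under these assumptions, an AlertType that schedules `recovered` itself ends up with two recovery actions across the two executions, and only the first one carries a context.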
In the meantime, as surfaced by this bug, we have found that the InventoryMetricThreshold have been scheduling actions under the recovered actionGroup. This causes the actions for recovery to fire twice: once explicitly as called by the AlertType, and a second time when the AlertInstance is omitted.
Addressing the 7.11 bug
I don't know if the bug identified in 7.11 is a blocker or not (that is up to @elastic/logs-metrics-ui [cc @Zacqary ]).
If this is a blocker, then I doubt we can find a solution at Alerting Framework level that is safe to include in 7.11 post FF.
As we do intend on supporting this use case in the near future (hence this issue), I'd recommend that Infra remove the explicit scheduling of recovered in their AlertType. Firing actions on recovery will still be supported - it just won't be able to include a context in it.
If this approach is unacceptable to @elastic/logs-metrics-ui, then we'll have to discuss options in @elastic/kibana-alerting-services (keeping in mind this is a behaviour we explicitly chose not to support, but hadn't sufficiently prevented at the time).
Proposed Next Step
We want to add support for AlertInstanceContext on recovery, but we need to reconcile the fact that there is no way to know what the context is when an instance recovers automatically.
Balancing these two states in a sustainable manner will require a solution that can support:
AlertInstanceContext in the manner that would address the need in InventoryMetricThreshold
The next steps we've decided on are thus:
context that doesn't change throughout the lifecycle of the alert.
context as detected by the rule.