Expand the context available within Kibana Alerting in ES 7.7 #69611

Open
sinnotts opened this issue Jun 19, 2020 · 31 comments
Labels
Feature:Alerting, Feature:Logs UI, needs-refinement, Team:obs-ux-logs

Comments

@sinnotts

Describe the feature:
When using Alerting in Kibana with Elasticsearch 7.7, it would be brilliant if it were possible to pull specific field information from an Index when an alert is triggered.

Currently, only the following fields are available:

{{alertName}}
{{alertId}}
{{spaceId}}
{{tags}}
{{alertInstanceId}}
{{context.message}}
{{context.title}}
{{context.group}}
{{context.date}}
{{context.value}}

I believe this is done using mustache (https://mustache.github.io/mustache.5.html) but I can't seem to find out what context/template Kibana has available to it to populate the above.
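
For reference, this is roughly how I understand the substitution to work (a minimal mustache.js sketch; the shape of the view object is my guess at what Kibana builds, which is exactly what I'm trying to find out):

import Mustache from "mustache";

// My guess at the view object Kibana passes to the template engine --
// the variable names are the documented ones, the structure is assumed.
const view = {
  alertName: "disk-usage-alert",
  spaceId: "default",
  context: {
    value: 0.97,
    group: "host-1",
  },
};

// Renders: "Alert disk-usage-alert! value is 0.97"
const message = Mustache.render(
  "Alert {{alertName}}! value is {{context.value}}",
  view
);
console.log(message);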

Describe a specific use case for the feature:
I would like to set up an alert based on metricbeat data:

IF
INDEX metricbeat*
WHEN average()
OF system.filesystem.used.pct
OVER all documents
IS ABOVE 0.95
FOR THE LAST 60 seconds

Send an email of:

Alert {{alertName}}!

Server {{agent.name}} has used {{context.value}} percent of storage!

Summary:
{{system.filesystem.device_name}},
{{system.filesystem.total}}
{{system.filesystem.used.bytes}}
{{system.filesystem.available}}

{{system.filesystem.used.pct}}

Kind Regards,
Kibana

Thanks for the hard work!
S

@timroes added the Team:ResponseOps label on Jun 22, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@chemalopezp

chemalopezp commented Jun 22, 2020

I have a very similar use case for this, only with logs. The alert condition is:

WHEN more than or equals
1 log entry
WITH json.msg
IS SYMBOL_NOT_FOUND
FOR THE LAST 1 minute

I'd like to use other fields from the logged json object in the alert message.

Thank you!

@mikecote added the Team:Infra Monitoring UI label and removed the Team:ResponseOps label on Jun 23, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@pmuellr
Member

pmuellr commented Jul 3, 2020

From issue #70174, a request to add process name to metric threshold.

@chemalopezp

It is great to see some activity on this issue! :D
Just to give the team a bit more background: we are using functionbeat to stream our pino JSON logs into Elasticsearch, and Kibana to report alerts. With the recently released Kibana features we noticed we could actually do much better tuning of those and apply different actions: for example, report to different Slack channels depending on the actual error code, and add much more context to the error message (which function failed and other relevant data that is streamed with the log in different json fields, including the error message).

We would like to include information from any of the log fields in the error message we send to Slack.

Thank you very much!

@jasonrhodes
Member

@Zacqary can you take a look at this and write up what might be possible re: dynamic values per alert instance?

@Zacqary
Contributor

Zacqary commented Aug 3, 2020

I'm not sure if there's a good way to do this without making changes to the Alerting plugin, but here's what I think is possible now.

The available variables for action messages, like {{alertName}} and {{context.stuff}}, are all pre-defined, generic values that always get generated and sent whenever an alert fires. The function in charge of executing the alert doesn't have any knowledge of which variables are actually being requested.

So following @sinnotts's example, let's say the alert parses the metric system.filesystem.used.pct and recognizes that it needs to also fetch the entire hierarchy of data from system. So instead of just fetching the system.filesystem.used.pct value, it'll query Elasticsearch for system, and evaluate the alert by looking up filesystem.used.pct within the hit for system that it receives. Assuming it fires, it can make the entire system object available, like {{context.system.filesystem.available}} (we would definitely have to modify the Alerting plugin to be able to remove the context prefix).

Keep in mind we'd have to fetch the whole system object for all alerts now, because we'd have no way of knowing whether any of that extra data is actually going to be requested in the action message. But I don't think that's all that inefficient.
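
A rough sketch of what I mean (purely illustrative; none of these names are the real Alerting plugin or executor APIs):

import type { Client } from "@elastic/elasticsearch";

// Hypothetical executor change: fetch the whole root object of the
// selected metric and expose it on context.
async function executeMetricAlert(metric: string, esClient: Client) {
  // e.g. "system.filesystem.used.pct" -> root field "system"
  const rootField = metric.split(".")[0];

  // Query for the entire root object instead of just the metric value.
  const result = await esClient.search({
    index: "metricbeat-*",
    size: 1,
    sort: [{ "@timestamp": "desc" }],
    _source: [rootField],
  });

  const source = (result.hits.hits[0]?._source ?? {}) as Record<string, unknown>;

  // Evaluate the alert by looking the metric up inside the fetched object,
  // then (if it fires) make the whole object available to templates,
  // e.g. {{context.system.filesystem.available}}.
  return { context: { [rootField]: source[rootField] } };
}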

My concern here is the UX. We wouldn't be able to make {{context.system}} available in the dropdown since it'd be dynamically determined by the metric you select, and that dropdown is populated by the alerting plugin's registerType function which only runs at Kibana startup. Maybe {{context.data}} would work, but that would extend the prefix to {{context.data.system.filesystem.available}} and I kinda don't like that either.

Also, if the behavior we want is

pull specific field information from an Index when an alert is triggered

then we're not really achieving that, since the user only has access to whatever's in system. That would work fine for this example, but what if another user thinks some other kind of data in an entirely different field is relevant?

Maybe just system (or in the Logs case, json) is good enough, I just want to figure out a way to communicate to the user exactly which fields are available to them. Like if you're querying a metric a.b.c.d, and that means a.e.f.g is available to you — not really sure how to make that intuitive and discoverable.

@chemalopezp

Thank you for looking into it @Zacqary

If it is of any help, it seems fine to me if not all fields are discoverable in the UI. For the logs scenario, we might want to expose different fields in the error message depending on the actual alert that has been triggered (e.g. for an error log with a userId we might want to use json.userId), so it is totally understandable that these customizable fields are not really discoverable, given the large number of them on each index. Thanks again!

@Zacqary
Contributor

Zacqary commented Aug 5, 2020

@mikecote I'm realizing the original post is actually using an Index Threshold alert type as an example, so I think Alerting Services might still want to be tagged on this? We can investigate getting this to work on Logs and Metrics alerts for sure, but whatever we settle on should probably propagate back to the core plugin.

@mikecote
Contributor

mikecote commented Aug 5, 2020

@Zacqary seems like the request for index data can apply to both index threshold and metric alerts. Since context variables are set per alert type, each would have to expose this manually. There isn't UX support yet for dynamic template variables which makes it hard to support this with the index threshold alert. From my understanding, a metric alert has similar issues because the variables change based on the metric?

@chemalopezp

Thank you for looking into this @Zacqary @mikecote! If I'm getting this correctly, it seems we have maybe two different parts? On one side there's the ability to use any field when an alert is triggered. On the other, there's a limitation in the alerts UI that makes it impossible to show the user all the possible fields that are available to use.

Still, we can look in other parts of Kibana to see which fields are available and use them, so it seems to me the first part could already be completed on its own (that would already be an amazing feature to enhance our alerts!). Thanks!

@pmuellr
Member

pmuellr commented Aug 21, 2020

It's going to be difficult to extract more document info from the index threshold alert given the way it does aggregations; I think something like an essql-based alert would be great for this. The idea is that the customer provides an essql query (including an optional query DSL filter, which the essql API supports), the results of which would be used to determine instances to schedule actions for. The instanceId would be one column; the remaining columns could then be passed as context variables, which would mean the user could set those to whatever they can get returned by SQL.
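
For illustration, something along these lines (the query, index pattern, and column names are all made up):

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

// Sketch: the first column of the result becomes the instanceId, and the
// remaining columns become context variables for actions.
async function runAlertQuery() {
  const response = await client.sql.query({
    query: `
      SELECT host.name,
             AVG(system.filesystem.used.pct) AS used_pct
      FROM "metricbeat-*"
      GROUP BY host.name
      HAVING AVG(system.filesystem.used.pct) > 0.95
    `,
    // The optional query DSL filter that the essql API supports:
    filter: { range: { "@timestamp": { gte: "now-60s" } } },
  });

  // Each row would schedule actions for one instance: row[0] is the
  // instanceId, and row[1] would surface as {{context.used_pct}}.
  return response.rows;
}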

@pmuellr
Member

pmuellr commented Aug 21, 2020

I spent a few minutes yesterday playing with the mustache templating we use, to see if we could somehow get it to show all the "context variables" that we make available. I think we can make a meta-variable that would list all the context variables and their values, which customers could use when developing an alert to see what data is actually available as variables and what their values are. See issue #75601

That little exercise also made me realize that we can put functions in the context, which will end up being invoked when accessed from the template. Seems like a bit of a "foot gun" to me, but obviously fairly powerful. Note that these functions are invoked with no arguments, AFAIK. I'm not sure how far we could take this to make it easier for alertTypes to make context variables available without having to have all the data available up front.
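
A minimal mustache.js sketch of the function behavior (not Kibana code):

import Mustache from "mustache";

const context = {
  value: 0.97,
  // Function values are invoked (with no arguments, AFAIK) when the
  // template references them -- powerful, but a bit of a foot gun.
  timestamp: () => new Date().toISOString(),
};

// Renders something like: "value 0.97 at 2020-08-21T17:30:00.000Z"
console.log(Mustache.render("value {{value}} at {{timestamp}}", context));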

@jasonrhodes
Member

@mikecote @pmuellr @Zacqary @chemalopezp as far as a way forward, I'm not sure I have a clear understanding of what we are trying to do here and what our next steps are, so I'll try to summarize my understanding so far:

Problem 1: A user is asking to be able to reference system.* values inside their alert message, and we currently don't provide that as an available context variable. We could potentially make system.* available on all alerts, as context.system.*, but what happens if someone wants host.hostname or something else outside of the system scope? Where is the line for which values we make available on context and which we don't, assuming we can't just allow a user to access anything they want arbitrarily?

Problem 2: If we do allow for some kind of dynamic loading of values in alert messages, how will a user know those values will be available to them when creating their alert message template? We won't be able to provide them as autocomplete if they're dynamic, and we also won't necessarily be able to validate that what they've typed will resolve to a real value. Do they just type what they think is a value and hope it works/test it out?

Am I understanding the state of things right now correctly? Am I missing parts? Thanks!

@pmuellr
Member

pmuellr commented Sep 21, 2020

That seems like a good summary of the problem area today:

  • It's not clear how to expose ES data that isn't necessarily involved in the alerting calculation back to the actions invoked when the alert triggers. This seems like it may be especially problematic/complex for alerts that do aggregations: figuring out how to get that data back after the aggregation.

  • It's not clear how we would expose that to users, given our current "static" story of providing the mustache template variable values in a pop-up list, which is populated from static data provided by the alert type. I wonder if the eventual story, once we solve the first issue above, would be to "test" the alert by running it and seeing what values it made available. Presumably we'd have to re-run this as the alert fields are edited, since the data could change along the way.

@jasonrhodes
Member

@pmuellr Thanks. Do you know if there's an Alerting team ticket for this kind of feature, that we could link for tracking purposes, or is this more likely to be a "wontfix" because of its complexity?

@Zacqary if I understand correctly, in a very limited way we could query for all of the system.* values and provide them to the alert message interface, discoverable at context.system.*, right? Based on what you said above:

Assuming it fires, it can make the entire system object available like {{context.system.filesystem.available}}

Keep in mind we'd have to fetch the whole system object for all alerts now, because we'd have no way of knowing whether any of that extra data is actually going to be requested in the action message. But I don't think that's all that inefficient.

If we just incorporate this into every Metric alert, it would stop being dynamic, right? Would this still be an issue?

We wouldn't be able to make {{context.system}} available in the dropdown since it'd be dynamically determined by the metric you select, and that dropdown is populated by the alerting plugin's registerType function which only runs at Kibana startup.

@Zacqary
Contributor

Zacqary commented Sep 21, 2020

Sorry, when I said we're fetching the whole system object for all alerts, I meant all alerts that used a system.something metric.

I do think it'd be a needless bottleneck if we fetched the root document of every possible metric, regardless of which one the alert actually selects.

@jasonrhodes
Member

@Zacqary yeah that's what I meant by this:

in a very limited way we could query for all of the system.* values and provide them to the alert message interface

it would be extremely limited to only providing the system.* values. I'm just verifying we could do that if we chose to, right?

@Zacqary
Contributor

Zacqary commented Sep 22, 2020

Absolutely, yeah, we can query for system.* regardless of whether the alert in question is alerting on a system.* metric.

@jasonrhodes
Member

@sorantis @mukeshelastic can you weigh in on whether you think providing something static, such as "system.*" or otherwise, to ALL alert instances of our Metrics and Logs alert types would be of value to enough of our users to move forward on that?

If so, let's spin off a new ticket that outlines exactly which static sets of values we want to provide for the Metrics and Logs alert types (the values can be different for each alert type but would be static and consistent for all instances of each type).

If not, I don't think there is an action item for our group on this ticket.

Thanks!

@chemalopezp

Not sure I'm following the discussion. As users, we add different fields to our logs that might be used in an alert (both in the criteria and the message). Since we manually logged those fields, we don't really need a change in the "exposure" of those fields; we just need them available (i.e. added to the context of the alert).

In other words, there's some benefit to adding a few more static fields (e.g. system.*), but at least for my use case most of the alerts are triggered by customizable events (i.e. we need access to fields of the log statement that triggered the alert).

If it is easier to manage that way, I can split this into a different ticket. Thank you for your help! :)

@jasonrhodes
Member

Since we manually logged those fields, we don't really need a change in the "exposure" of those fields; we just need them available

@chemalopezp can you explain the difference between these two things? I've been using those two concepts interchangeably, I think. "Exposing" fields means "making them available on the alert's context", which requires querying them at some point as part of every alert check/run.

@sorantis

sorantis commented Sep 24, 2020

@sorantis @mukeshelastic can you weigh in on whether you think providing something static, such as "system.*" or otherwise, to ALL alert instances of our Metrics and Logs alert types would be of value to enough of our users to move forward on that?

@jasonrhodes The fact that these fields have to be static limits the scope of an otherwise valid use case. We've been getting feedback on how important it is to include fields in notifications when firing alerts, because it can ultimately save a lot of back-and-forth calls between departments trying to figure out which host/service/system is causing trouble. Some of the fields that have been mentioned as relevant are hostname, IP address, and anything org-specific like labels or tags (not the alert tags).

Perhaps a starting point for adding static fields could be ECS fields? Specifically for metric alerts I can see value in adding Host, Container and Cloud fields to context. For log alerts, Event and Log fields could be relevant.
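
For illustration, a notification template could then reference things like this (the variable names are hypothetical, assuming those ECS objects get exposed under context):

Server {{context.host.name}} ({{context.host.ip}}) in {{context.cloud.provider}}/{{context.cloud.region}} triggered this alert.
Container: {{context.container.name}}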

Thoughts?

@ManuelKugelmann

@sorantis I concur, access to the reason for the alert is crucial. If I have an alert triggered by some log messages, for example, I would want at least an excerpt of these in the alert message, and maybe even a link to a search that will return the full set of messages that were the reason for the alert.
Right now I have to fall back to the clunky Watchers without nice UI support...

@chemalopezp

@chemalopezp can you explain the difference between these two things? I've been using those two concepts interchangeably, I think. "Exposing" fields means "making them available on the alert's context", which requires querying them at some point as part of every alert check/run.

@jasonrhodes Then we are on the same page. I was under the impression that there was some concern at the UI level about how a user would be able to select these fields, which seemed like a different issue.

@praveenmak

I have the same issue as described in this ticket.
Another thing we need is the timestamp: how can I send the timestamp in the email alert?
"{{context.timestamp}}" does not work.

@pmuellr
Member

pmuellr commented Nov 10, 2020

We have an open issue to add a timestamp as a "global" mustache variable for templates - #67389

We also have an open issue to add a "helper" for mustache variables that are "objects", to make it easier for customers to see what variables are available in an alert message - #82044. This should help for complex/rich objects added to the context variables, at least while a customer is developing the mustache templates used in their actions.

Beyond that, I think the notion of providing variables for "related" fields in documents an alert is processing is going to have to be alert-specific. E.g., it seems easier to me to add this kind of capability to log-based alerts, but the current shape of the index threshold alerts doesn't lend itself to doing this, because they just run aggs over the indices, so it's not clear how we'd also collect other fields. As mentioned in #69611 (comment), it feels like an essql-based alert might be a good fit for this, as it would presumably allow customers to select fields to be returned to the alert.

@simianhacker added the needs-refinement label and removed the triage_needed label on May 27, 2021
@elastic deleted a comment from eric-olaya on Sep 28, 2021
@SonalJain1707

I also have a case.

I need to pull the hostname below and put it in a connector. When trying to use {{#context.hits}}{{_source}}{{/context.hits}} it returns empty. Can you please let me know how to traverse to agent and hostname, and whether it's possible to do so.

hits" : [
{
"_index" : "metricbeat-7.12.1-000594",
"_type" : "_doc",
"_id" : "YUSMboEBINxiR4tEXwHq",
"_score" : 0.0,
"_source" : {
"@timestamp" : "2022-06-16T22:05:01.392Z",
"ecs" : {
"version" : "1.8.0"
},
"agent" : {
"version" : "7.12.1",
"hostname" : "kks-***********",

@SonalJain1707

Hi,

Also, if the list returns a null value the alert goes into a recovered state. Please let us know how to solve this.

@gbamparop added the Team:obs-ux-logs label and removed the Team:Infra Monitoring UI label on Nov 9, 2023
@elasticmachine
Contributor

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

@botelastic added and removed the needs-team label on Nov 9, 2023
@phirestalker

To me, it seems that the alert config itself could determine the "extra" information to be made available. For instance, say I set an alert on system.filesystem.used.pct like the OP. Inside this alert I have set a grouping on the device name, so the device name should be made available somehow. Would it be easier to only make available the things that were used in the query, filter, and grouping options?
