API to get all active instances from Observability consumers #70169

cauemarcondes · 2020-06-29T08:35:07Z

In the new Observability Overview page, we're planning to show two charts to give the user a clear picture of which alert is active at the moment.

In this chart, we want to show all active instances for all observability plugins (APM/Logs/Uptime/Metrics) grouped by type.

And in this one, we want to show some alert detail and the number of active instances next to it.

Current situation:
With the current API, to get this information I first have to call _find to get all created alerts, then filter by Observability plugins (APM/Logs/Uptime/Metrics), and then make an HTTP call for each alert to get its active instances.
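The request pattern described above can be sketched roughly as follows. This is only an illustration, not the real alerting client: `AlertsApi`, `findAlerts`, `getActiveInstanceIds`, and `createFakeApi` are hypothetical names, with `findAlerts` standing in for the `_find` call and `getActiveInstanceIds` for the per-alert request. The in-memory fake exists only to count requests.

```typescript
// Hypothetical shapes; the real alerting API types are richer.
interface AlertSummary {
  id: string;
  consumer: string;
  alertTypeId: string;
}

// Hypothetical client: findAlerts stands in for `_find`,
// getActiveInstanceIds for the per-alert request.
interface AlertsApi {
  findAlerts(): Promise<AlertSummary[]>;
  getActiveInstanceIds(alertId: string): Promise<string[]>;
}

const OBSERVABILITY_CONSUMERS = ['apm', 'logs', 'uptime', 'metrics'];

// One find call, then one extra call per matching alert: 1 + N requests.
async function fetchActiveInstances(api: AlertsApi): Promise<Record<string, string[]>> {
  const alerts = await api.findAlerts();
  const relevant = alerts.filter((a) => OBSERVABILITY_CONSUMERS.indexOf(a.consumer) !== -1);
  const instanceLists = await Promise.all(relevant.map((a) => api.getActiveInstanceIds(a.id)));
  const byAlertId: Record<string, string[]> = {};
  relevant.forEach((a, i) => {
    byAlertId[a.id] = instanceLists[i];
  });
  return byAlertId;
}

// In-memory fake that counts requests, for illustration only.
function createFakeApi(alerts: AlertSummary[], instances: Record<string, string[]>) {
  const stats = { requests: 0 };
  const api: AlertsApi = {
    async findAlerts() {
      stats.requests += 1;
      return alerts;
    },
    async getActiveInstanceIds(alertId: string) {
      stats.requests += 1;
      return instances[alertId] || [];
    },
  };
  return { api, stats };
}
```

With N matching alerts, `stats.requests` ends up at 1 + N, which is the cost this issue is asking to avoid.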

What I need:
An API that returns all active instances and the alert details, with the possibility to filter by consumer and alert type.

Example API:

alerting.getInstances({ active: true, consumers: ['apm', 'uptime', 'metrics'] })

Example response:

[
  {
    "id": "b5ef31a1-7c9f-47f5-a0d4-69169fc2f407",
    "params": {
      "threshold": 1,
      "aggregationType": "avg",
      "windowSize": 5,
      "windowUnit": "m",
      "transactionType": "request",
      "environment": "ENVIRONMENT_ALL",
      "serviceName": "opbeans-java"
    },
    "consumer": "apm",
    "alertTypeId": "apm.transaction_duration",
    "schedule": {
      "interval": "10s"
    },
    "actions": [
      {
        "actionTypeId": ".webhook",
        "group": "threshold_met",
        "params": {
          "body": "{\"transaction\": \"transaction\"}"
        },
        "id": "4e6a507f-1238-49c1-8b55-c19e42076543"
      }
    ],
    "tags": ["apm", "service.name:opbeans-java"],
    "name": "Transaction duration | opbeans-java",
    "throttle": "15s",
    "enabled": true,
    "apiKeyOwner": "elastic",
    "createdBy": "elastic",
    "updatedBy": "elastic",
    "createdAt": "2020-06-25T14:27:19.820Z",
    "muteAll": false,
    "mutedInstanceIds": [],
    "scheduledTaskId": "fad2cf20-b6ef-11ea-9623-a57005710a46",
    "updatedAt": "2020-06-25T14:27:21.257Z",

    //All active instances
    "alertInstances": [
      {
        "state": {},
        "meta": {
          "lastScheduledActions": {
            "group": "threshold_met",
            "date": "2020-06-29T08:31:38.802Z"
          }
        }
      }
    ]
  }
]
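Assuming an endpoint like the one proposed above existed and returned this shape, the two charts could be derived on the client along these lines. The interfaces below model only a subset of the example response, and the function names are hypothetical.

```typescript
// Subset of the proposed response shape from the example above.
interface AlertInstance {
  state: Record<string, unknown>;
  meta: { lastScheduledActions?: { group: string; date: string } };
}

interface AlertWithInstances {
  id: string;
  name: string;
  consumer: string;
  alertTypeId: string;
  alertInstances: AlertInstance[];
}

// First chart: total number of active instances per alert type.
function countActiveInstancesByType(alerts: AlertWithInstances[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const alert of alerts) {
    counts[alert.alertTypeId] = (counts[alert.alertTypeId] || 0) + alert.alertInstances.length;
  }
  return counts;
}

// Second chart: alert details with the active-instance count next to each.
// Sorting by count (descending) is an assumption, not a stated requirement.
function listAlertsWithCounts(
  alerts: AlertWithInstances[]
): Array<{ id: string; name: string; activeInstances: number }> {
  return alerts
    .map((alert) => ({
      id: alert.id,
      name: alert.name,
      activeInstances: alert.alertInstances.length,
    }))
    .sort((a, b) => b.activeInstances - a.activeInstances);
}
```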


elasticmachine · 2020-06-29T11:54:50Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

ogupte · 2020-06-29T18:21:21Z

Hello, APM service maps also need this capability for 7.9. We need to be able to display health indicators for all services in the service map that have active alert violations. Right now, we can only get it to work by calling getAlertState in parallel for each id we get from find, but that is prohibitively inefficient, especially for very large service maps. We need a way to get all the alert statuses in one go before we can integrate.

sorenlouv · 2020-07-22T09:17:23Z

@mikecote I see you've added this to "Long Term". This is something we hope to be able to have available in 7.10. Is that possible?

mikecote · 2020-07-22T14:41:19Z

@sqren I went over the recording of the triage session we had for this issue. I think we needed more clarification on whether this issue was still needed or whether your requirements have changed based on the scope adjustment the homepage team made for 7.9 / 7.10. We placed it with the bulk APIs story (long term) and had an approach we believe could work for you without waiting on this API (some email thread from a few weeks ago).

@pmuellr can help on this. The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

We can always revisit and prioritize this issue no problem, probably in the scope of 7.11 once our work for GA is complete.

kobelb · 2020-07-23T15:54:25Z

@XavierM has a PR which adds aggregations to the SavedObjectsClient, can we take advantage of this here?

ogupte · 2020-07-28T19:19:21Z

@mikecote

... The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

From another thread:

I believe the workaround that has been suggested was similar to the approach mentioned earlier. It works by fetching the task state for each matching alert returned in the find call. In setups with few alerts configured, it will add a few requests to the page load. But what if we're trying to load a service map with 10, 20, or more services where each has an alert configured? Are we OK adding x number of single requests to our initial page load?

From the alerting plugin context, it might be possible to obtain multiple alerts states in a single request, but it would require querying the task manager index filtered by job ids obtained in the initial find. This would result in the initial page load adding a constant 2 additional requests instead of the suggested 1 + x requests.
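The constant two-request idea above comes down to collecting the task ids from the find result, fetching those tasks in a single query, and joining the results back to the alerts. The join step might look like this minimal sketch. `FoundAlert` and `TaskInstances` are hypothetical shapes, and the bulk task query itself is left out because its API is not specified in this thread.

```typescript
// Hypothetical shapes: only scheduledTaskId comes from the alert saved object.
interface FoundAlert {
  id: string;
  scheduledTaskId?: string;
}

// Hypothetical result row of the single bulk task query,
// with instance ids already extracted from the task state.
interface TaskInstances {
  taskId: string;
  instanceIds: string[];
}

// Request 1 (find) gives the alerts; collect their task ids for request 2.
function collectTaskIds(alerts: FoundAlert[]): string[] {
  const ids: string[] = [];
  for (const alert of alerts) {
    if (alert.scheduledTaskId) {
      ids.push(alert.scheduledTaskId);
    }
  }
  return ids;
}

// Join the bulk task result back to the alerts, keyed by alert id.
function joinInstancesToAlerts(
  alerts: FoundAlert[],
  tasks: TaskInstances[]
): Record<string, string[]> {
  const byTaskId: Record<string, string[]> = {};
  for (const task of tasks) {
    byTaskId[task.taskId] = task.instanceIds;
  }
  const byAlertId: Record<string, string[]> = {};
  for (const alert of alerts) {
    const taskId = alert.scheduledTaskId;
    const instances = taskId !== undefined ? byTaskId[taskId] : undefined;
    byAlertId[alert.id] = instances || [];
  }
  return byAlertId;
}
```

The number of requests stays at two regardless of how many alerts match, since the per-alert work happens in memory.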

pmuellr · 2020-08-18T20:52:51Z

We've re-prioritized some work that - I think - will happen to work out very well for this requirement.

We will be formalizing the notion of an alert "status" per issue #51099 . We'll add a new status object to the alert saved object, which means you should be able to get the status from the alertClient find() API (or equivalent http call), including usual saved object filtering, fields, etc. I think this would mean having to retrieve all the alerts with find(), and manually generating the numbers, based on the alert type and status.

That gets us back down from 1+x or even 2 api calls, down to 1! (but with more data than actually required, I think)
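"Manually generating the numbers" from a single find() result could be as small as the sketch below. The `status` field and its values are assumptions made for illustration; the thread only says that a status object will be added to the alert saved object.

```typescript
// Hypothetical status values; the actual status object is defined in #51099.
type AlertStatus = 'ok' | 'active' | 'error';

interface AlertWithStatus {
  id: string;
  alertTypeId: string;
  status: AlertStatus;
}

// Count alerts per alert type and, within each type, per status.
function countByTypeAndStatus(
  alerts: AlertWithStatus[]
): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const alert of alerts) {
    const perType = counts[alert.alertTypeId] || (counts[alert.alertTypeId] = {});
    perType[alert.status] = (perType[alert.status] || 0) + 1;
  }
  return counts;
}
```

Note that this counts alerts, not alert instances, so it cannot by itself produce per-instance numbers.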

I'm going to start working on this shortly, will note the PR here once it's under way.

formgeist · 2020-08-19T10:01:11Z

@pmuellr Thanks for the heads up - it sounds very exciting!

pmuellr · 2020-09-30T21:46:53Z

Doing some bookkeeping, realized I didn't post the PR with the new 'alert status' field - it's here: #75553

But also, re-reading this, and realizing the original request, that still doesn't give us instance data, just the alert data. So, that still leaves us in a 1 + n requests state - 1 find() request to get the alerts, and then n calls to get the instance data.

It feels to me like we'll end up needing some new APIs, and I don't think we've talked about what those might look like, so here's a rough sketch:

- new method on the alerts client that takes find() parameters and returns instance data about all the matching alerts; this would internally use find(), then make a single call (well, probably have to deal with pagination, but one "virtual" call) to the event log to query against all the alert SO's returned from find(). We'd likely need to process the events returned to get whatever data we're looking for, much like the current "get instance status" API (which returns instance data for a single alert)
- http API that calls that new alerts client API
- some changes to the event log to bypass the current checks on the saved object being queried for event data - that's done for security reasons (you need to be able to read an alert to see its events) - because we've already done that check in the find() call to get the list of alerts

I should note this would be to get instance data beyond just the current state of known instances (eg, it could return data about recent instances which are no longer active, like the current "get instance status" API). If we only need the current list of instances, or count of instances, it's possible we could do a query over task manager to get the current alert instance data. This also wouldn't contain any instance status data like errors. Here's what that task manager data looks like (note, it's stored as a JSON string today, so we'd need to parse it after fetching and can't search over these "fields"); this shows an alert with one active instance, host-1:

{
  "alertInstances": {
    "host-1": {
      "state": {},
      "meta": {
        "lastScheduledActions": {
          "group": "threshold met",
          "date": "2020-09-30T21:40:14.771Z"
        }
      }
    }
  },
  "previousStartedAt": "2020-09-30T21:40:14.664Z"
}
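Since the state is stored as a JSON string, a consumer would have to parse it before reading the instances. A minimal sketch, assuming the serialized state has exactly the shape shown above (the function and interface names are hypothetical):

```typescript
// Shape of the task state shown above, after parsing.
interface RawAlertInstance {
  state: Record<string, unknown>;
  meta?: { lastScheduledActions?: { group: string; date: string } };
}

interface AlertTaskState {
  alertInstances?: Record<string, RawAlertInstance>;
  previousStartedAt?: string;
}

interface ActiveInstance {
  instanceId: string;
  actionGroup?: string;
  lastScheduledAt?: string;
}

// Parse the serialized task state and list the instances it contains.
function extractActiveInstances(serializedState: string): ActiveInstance[] {
  const parsed = JSON.parse(serializedState) as AlertTaskState;
  const instances = parsed.alertInstances || {};
  return Object.keys(instances).map((instanceId) => {
    const meta = instances[instanceId].meta;
    const last = meta ? meta.lastScheduledActions : undefined;
    return {
      instanceId,
      actionGroup: last ? last.group : undefined,
      lastScheduledAt: last ? last.date : undefined,
    };
  });
}
```

Because the parsing happens after the fetch, any filtering on these values also has to happen in memory rather than in the query.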

sorenlouv · 2020-12-09T21:27:31Z

Another issue which depends on being able to retrieve alert instances: #85479

Let me know if everything is clear or I should add more details.

pmuellr · 2020-12-15T00:10:05Z

Thx @sqren !

From #85479:

The alert instance should be displayed at the time it activated. On hover it should be possible to see the threshold and the value that exceeded the threshold.

So you'll need the time, threshold, and actual value.

Today, you can get the active-instance events from the event log to get the alert id, instance, I think action group (and a bit more). We don't currently store the threshold or the actual value, since there's no common value across alerts for those - but I have been thinking that it makes sense, if you can boil everything down to "simple values", and preferably numbers :-). That would be a new concept for alerting, but think it makes sense.

Presumably the application knows the "value that exceeded the threshold", unless it's no longer available (eg, ILM). But then the app wouldn't be able to show a pretty graph to annotate in the first place.

But if we're storing the threshold value (where else would an "older" version of a threshold value, if changed over time, be available?), it makes sense to store the metric value as well, so we should add those both at the same time.

In terms of "progressive enhancement" then, I'd hope we'll make those values available at some point in the future in the event log, but for today, all you'll have is the timestamp of when the alert/instance was "active".

sorenlouv · 2020-12-15T15:14:09Z

but for today, all you'll have is the timestamp of when the alert/instance was "active".

Sounds great! Having the timestamp will still allow us to add alert annotations to charts which is a great start. Then we can enhance this down the road with the actual values.

mikecote · 2020-12-18T19:22:32Z

Some notes from the 7.12 planning session

@peterschretlen

One of the outcomes of the working group workstreams was to have these instances as data, rather than state in a saved object. That might be worth considering here. Pulling the instance state out of a set of rules sounds challenging, and may not align with the long term direction. Having instances as data (or maybe as data in addition to state) might be worth considering.

@pmuellr

We may need to scope this issue down to "an API to get all instances for all visible alerts" - today, clients need to separately get a list of all the visible alerts, and then get the instances for each of them; 1 + N calls; I want to build an API to do this in one call (from the client's perspective). Adding additional data to the instances is something we can do independently.

mikecote · 2020-12-18T19:31:21Z

Moved from 8.x - Candidates to To-Do in order to start working on this in 7.12.

pmuellr · 2021-01-27T15:06:56Z

note: I originally opened this as issue #88908, but moving here since it's really just relevant to this overall issue

It's not clear that this will be needed, but I thought I'd outline how generating instance data might work when searching through multiple alerts. The thought here is that if the best we can do for now is to generate a list of all the events for all the alerts, we'll need a standard way of processing the events.

For the "Alert Details" page, we generate the list of instances and data from them, via this function, which is not currently exposed as an API:

kibana/x-pack/plugins/alerts/server/lib/alert_instance_summary_from_event_log.ts

Lines 11 to 20 in da8abda

    
    export interface AlertInstanceSummaryFromEventLogParams {
      alert: SanitizedAlert<{ bar: boolean }>;
      events: IEvent[];
      dateStart: string;
      dateEnd: string;
    }

    export function alertInstanceSummaryFromEventLog(
      params: AlertInstanceSummaryFromEventLogParams
    ): AlertInstanceSummary {

As we are getting more consumers of the event log coming on line, this function - or similar ones, or perhaps this one with more parameters/capabilities - could be useful if we only end up providing a way to get ALL the event log docs (eg, if we don't support a richer search mechanism). Otherwise, those consumers will be forced to implement similar logic in their own plugins.

We'd need to clean this up a bit to turn it into an API, and presumably if we did this, we'd also change it to support events from multiple alerts, and not just a single alert. And presumably, it would be a function on the alertsClient.
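To make the "events from multiple alerts" idea concrete, here is a rough sketch of folding a mixed list of events into per-alert, per-instance summaries. It is not the existing `alertInstanceSummaryFromEventLog` function. The flattened `InstanceEvent` shape is hypothetical, and of the action names only `active-instance` is mentioned in this thread; `new-instance` and `resolved-instance` are assumptions.

```typescript
// Hypothetical, flattened event shape; real event log documents are nested.
type InstanceAction = 'new-instance' | 'active-instance' | 'resolved-instance';

interface InstanceEvent {
  alertId: string;
  instanceId: string;
  action: InstanceAction;
  timestamp: string; // ISO 8601 in UTC, so string comparison orders by time
}

interface InstanceSummary {
  status: 'active' | 'inactive';
  activeSince?: string;
}

// Fold events from any number of alerts into alertId -> instanceId -> summary.
function summarizeInstancesByAlert(
  events: InstanceEvent[]
): Record<string, Record<string, InstanceSummary>> {
  const sorted = events
    .slice()
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0));
  const result: Record<string, Record<string, InstanceSummary>> = {};
  for (const event of sorted) {
    const perAlert = result[event.alertId] || (result[event.alertId] = {});
    const previous = perAlert[event.instanceId];
    if (event.action === 'resolved-instance') {
      perAlert[event.instanceId] = { status: 'inactive' };
    } else {
      // Keep the earliest timestamp of the current uninterrupted active period.
      const activeSince =
        previous && previous.status === 'active' && previous.activeSince
          ? previous.activeSince
          : event.timestamp;
      perAlert[event.instanceId] = { status: 'active', activeSince };
    }
  }
  return result;
}
```

The `activeSince` timestamp is the kind of value a consumer could use to annotate a chart until richer data is stored with the events.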

mikecote · 2021-02-25T12:06:18Z

After discussing with @sqren yesterday, "alerts as data" is necessary for the 7.x Observability workflows, which removes the need for these APIs as a half-way measure.

cauemarcondes added Feature:Alerting :Alerting labels Jun 29, 2020

pmuellr added the Team:ResponseOps label Jun 29, 2020

pmuellr mentioned this issue Jul 6, 2020

Bulk create, update, delete abilities for the rules client #53144

Closed

sorenlouv added the v7.10.0 label Jul 13, 2020

This was referenced Jul 21, 2020

[APM] Service maps health indicators: Show alert violations Service node popovers #65708

Closed

[APM] Service maps health indicators: Indicate alert-based health status on Service nodes #64144

Open

sorenlouv mentioned this issue Jul 23, 2020

[APM] Investigate alternative Alerting API approach #73148

Closed

pmuellr mentioned this issue Aug 5, 2020

[eventLog] provide bulk query facility #70856

Closed

mikecote mentioned this issue Aug 20, 2020

Dependencies on Kibana Alerting #67992

Open

59 tasks

sorenlouv mentioned this issue Sep 2, 2020

[APM] Investigate alerting event log queries #62711

Closed

gmmorris removed the :Alerting label Oct 29, 2020

sorenlouv mentioned this issue Dec 9, 2020

[APM] Show alert annotations on charts #85479

Closed

mikecote removed the v7.10.0 label Dec 18, 2020

YulNaumenko self-assigned this Jan 5, 2021

YulNaumenko linked a pull request Jan 6, 2021 that will close this issue

[EventLog] Added event log API to get events for multiple saved objects. #87596

Merged

YulNaumenko closed this as completed in #87596 Jan 13, 2021

YulNaumenko reopened this Jan 13, 2021

YulNaumenko mentioned this issue Jan 13, 2021

[Alerts] Added API to find all alerts instances by the filters like consumers, status(active, ...), etc. #88224

Closed

pmuellr mentioned this issue Jan 27, 2021

[alerts] expose API to generate alert instance summary from event log documents #88908

Closed

YulNaumenko mentioned this issue Feb 8, 2021

Poc observability #70169 #90567

Closed

mikecote closed this as completed Feb 25, 2021

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
