[APM] Service maps health indicators: Indicate alert-based health status on Service nodes #64144

formgeist · 2020-04-22T08:49:33Z

Design issue: elastic/apm#218
Related issue: #63574

Summary

Part of the health indicators work in Service maps is to update the health status of a service when there are active threshold alerts for it in the selected time range. We assume that a threshold violation on a service is always a bad scenario for the user, so we'll treat it as a critical health status.

Solution

If there are any active APM-type (duration or error rate) threshold alerts for the service within the selected time range, override the current anomaly score health status indication (#63574) and show the service in a critical state (red outline).

The overall service map will look something like this

As a later iteration we will most likely have different severity levels in the alerts, which means that some alerts might not be critical but simply warnings, in which case we can choose to only show the warning indication (yellow state).
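The override rule described under Solution could be sketched roughly as follows. This is only an illustration of the logic, not the implementation: `HealthStatus`, `ThresholdAlert`, `TimeRange`, `wasActiveIn` and `resolveHealthStatus` are hypothetical names, and it assumes each alert is tied to one service and exposes when it was active.

```typescript
// Hypothetical types for illustration; none of these names come from Kibana.
type HealthStatus = "healthy" | "warning" | "critical";

interface TimeRange {
  start: number; // epoch millis
  end: number; // epoch millis
}

interface ThresholdAlert {
  serviceName: string;
  activeFrom: number; // epoch millis when the alert became active
  activeTo?: number; // undefined means the alert is still active
}

// True when the alert's active interval overlaps the selected time range.
function wasActiveIn(alert: ThresholdAlert, range: TimeRange): boolean {
  const alertEnd = alert.activeTo ?? Number.POSITIVE_INFINITY;
  return alert.activeFrom <= range.end && alertEnd >= range.start;
}

// Any active threshold alert for the service overrides the anomaly score
// status and marks the service as critical; otherwise the anomaly score
// status is shown unchanged.
function resolveHealthStatus(
  serviceName: string,
  anomalyStatus: HealthStatus,
  alerts: ThresholdAlert[],
  range: TimeRange
): HealthStatus {
  const hasActiveAlert = alerts.some(
    (alert) => alert.serviceName === serviceName && wasActiveIn(alert, range)
  );
  return hasActiveAlert ? "critical" : anomalyStatus;
}
```

If alerts later carry severity levels, the last line is where a warning-only alert could map to the yellow state instead of always returning critical.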

elasticmachine · 2020-04-22T08:49:34Z

Pinging @elastic/apm-ui (Team:apm)

smith · 2020-07-21T19:49:59Z

Blocked by #70169.

pmuellr · 2020-08-18T21:04:42Z

Our current thinking on making some forward movement on this is that it's likely doable using the existing alertClient find() API to identify all the alerts, followed by an Elasticsearch search through the event log indices to extract the data you need.

While we'd like to gate all access to the event log through its plugin, for security purposes, the timeframe of getting that in is ... not near. In lieu of a new event log API to fulfill this requirement, accessing the event log indices directly is the next best option.

The event log indices are managed by ILM and available under an Elasticsearch alias, which we currently don't publish, so we'll need to add an API to the event log to return this from its plugin setup() and/or start(). Once you have the alias, and the alert ids to look up, it's just a matter of crafting the search you want (query DSL, esSQL, etc) and running it with the Kibana system ES client.

The event log indices are based on ECS with some extensions for Kibana, and so hopefully the shape is somewhat familiar.

I mentioned "security purposes" above - the event log APIs currently require you to pass in saved object type/ids when searching, and the implementation ensures you have read access to those saved object type/ids before doing the search. So, I'm suggesting bypassing that, given the suggestion of doing a query on the event log indices directly. However, since you do first need to do a find() call to get the alerts, that's the security gate right there. find() won't return anything you can't read. As long as a find() is used as a filter to obtain the readable alerts, we're happy enough that there aren't any other security concerns going down this route.
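To make the second step concrete, here is a rough sketch of a query DSL body that restricts an event log search to the alert ids a prior find() call returned. Everything in it is an assumption for illustration: `buildEventLogSearchBody` and `EventLogSearchParams` are hypothetical names, and the field names (`@timestamp`, a nested `kibana.saved_objects` with `type` and `id`) and the `"alert"` type value should be checked against the actual event log mappings before use.

```typescript
// Hypothetical helper; not part of any Kibana plugin API.
interface EventLogSearchParams {
  alertIds: string[]; // ids of alerts returned by find(), i.e. already readable
  from: string; // ISO timestamp, start of the selected time range
  to: string; // ISO timestamp, end of the selected time range
}

// Builds a search body that matches event log documents in the time range
// that reference one of the given alert saved objects.
// ASSUMPTION: field names and the "alert" type are guesses at the mapping.
function buildEventLogSearchBody({ alertIds, from, to }: EventLogSearchParams) {
  return {
    query: {
      bool: {
        filter: [
          // Only events inside the selected time range.
          { range: { "@timestamp": { gte: from, lte: to } } },
          // Only events linked to the alerts the user is allowed to read.
          {
            nested: {
              path: "kibana.saved_objects",
              query: {
                bool: {
                  filter: [
                    { term: { "kibana.saved_objects.type": "alert" } },
                    { terms: { "kibana.saved_objects.id": alertIds } },
                  ],
                },
              },
            },
          },
        ],
      },
    },
  };
}
```

The body would then be run against the event log alias once that is exposed. Passing only ids that came back from find() is what keeps the security gate described above in place.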

formgeist · 2020-08-19T09:59:00Z

@pmuellr Thanks Patrick for picking this up - it's very exciting to see that we might be able to provide this to our users sooner rather than later.

I just wanted to respond to the following question:

I'm assuming each service has its own alert? Or is there a single alert that has an alert instance for each service? I'm not sure the event log is writing out instance IDs at the moment (but should, may already be an issue for that also).

Indeed - currently each alert is created for and tied to one service. I'm not sure about this, but I can certainly imagine a time when we'd want alerts that aren't tied to a single service, so that we can simply alert on a service instance threshold violation or error instead of having to set individual thresholds per service, or even policies for specific service environments. @nehaduggal perhaps you can clarify what our thinking is around alerts in the near and long term?

Not sure if this changes anything for you at this time.

graphaelli · 2021-06-16T18:18:20Z

Is this still blocked given capabilities introduced via RAC efforts?

sorenlouv · 2021-06-17T07:15:09Z

This will be unblocked by RAC. Given we don't have an exact date for the release of RAC, I suggest we leave it blocked for now.

gmmorris · 2021-06-30T12:29:42Z

@sqren Given this is being delivered by RAC, can I remove this from the Alerting team's dependencies list?

We're trying to groom the backlog of work, and figuring out what is blocked by us is important :)

sorenlouv · 2021-06-30T13:18:10Z

@gmmorris sure 👍

botelastic · 2021-12-27T13:52:53Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gmmorris · 2022-01-10T17:29:55Z

I think this can be unblocked now @sqren 🤔

formgeist · 2022-01-11T08:25:10Z

@sqren I'm going to move this to the design board since it's been 2+ years since we looked at the design for this. Cc @alex-fedotyev

formgeist added Team:APM - DEPRECATED Use Team:obs-ux-infra_services. apm:service-maps Service Map feature in APM v7.8.0 labels Apr 22, 2020

dgieselaar mentioned this issue Apr 22, 2020

[APM] Service maps health indicators: Show anomaly score indicator on Service nodes #63574

Closed

formgeist mentioned this issue Apr 22, 2020

[APM] Design health indicators in service maps elastic/apm#222

Closed

formgeist added the [zube]: (7.8) Planned for release label Apr 22, 2020

alex-fedotyev mentioned this issue Apr 30, 2020

[APM] Improve UI of list of services page and match with latest service map design elastic/apm#262

Closed

sorenlouv added v7.9.0 [zube]: Backlog and removed v7.8.0 [zube]: (7.8) Planned for release labels May 6, 2020

sorenlouv added [zube]: (7.9) Planned for release and removed [zube]: Backlog labels May 19, 2020

ogupte self-assigned this Jun 10, 2020

ogupte added [zube]: In Progress and removed [zube]: (7.9) Planned for release labels Jun 10, 2020

zube bot added [zube]: (7.9) Planned for release and removed [zube]: In Progress labels Jun 10, 2020

ogupte added [zube]: In Progress and removed [zube]: (7.9) Planned for release labels Jun 10, 2020

ogupte added [zube]: (7.9) Planned for release v7.10.0 and removed [zube]: In Progress v7.9.0 labels Jul 4, 2020

sorenlouv added [zube]: (7.10) Planned for release and removed [zube]: Backlog labels Jul 15, 2020

mikecote mentioned this issue Aug 20, 2020

Dependencies on Kibana Alerting #67992

Open

59 tasks

formgeist added apm:alerting and removed :Alerting labels Sep 3, 2020

sorenlouv added v7.12.0 and removed v7.11.0 labels Sep 29, 2020

sorenlouv added v7.13.0 and removed v7.12.0 labels Jan 12, 2021

sorenlouv removed the v7.13.0 label Feb 23, 2021

gmmorris added the NeededFor:apm label Jun 30, 2021

botelastic bot added the stale Used to mark issues that were closed for being stale label Dec 27, 2021

botelastic bot removed the stale Used to mark issues that were closed for being stale label Jan 10, 2022

formgeist assigned formgeist and unassigned ogupte Jan 11, 2022

formgeist added needs design and removed blocked NeededFor:apm labels Jan 11, 2022

formgeist removed the [zube]: Backlog label Mar 14, 2022

formgeist removed their assignment Mar 14, 2022

smith mentioned this issue Mar 6, 2024

[APM] Service maps health indicators: Show alert violations Service node popovers #65708

Closed

smith added the enhancement New value added to drive a business result label Mar 6, 2024
