[APM] Service maps health indicators: Indicate alert-based health status on Service nodes #64144

Open
formgeist opened this issue Apr 22, 2020 · 13 comments
Labels
apm:alerting apm:service-maps Service Map feature in APM enhancement New value added to drive a business result needs design Team:APM - DEPRECATED Use Team:obs-ux-infra_services.

Comments

@formgeist
Contributor

Design issue: elastic/apm#218
Related issue: #63574

Summary

Part of the health indicators work for Service maps is to update the health status of a service if there are any active threshold alerts for that service in the selected time range. We assume that a threshold violation on a service is always a bad scenario for the user, so we'll treat it as a critical health status.

Solution

If there are any active APM-type (duration or error rate) threshold alerts for the service within the selected time range, override the current anomaly score health status indication (#63574) and show the service in a critical state (red outline).
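The override rule above can be sketched as a small pure function. This is only an illustration under stated assumptions: the type names, the `activeThresholdAlertCount` field, and the status values are hypothetical and do not come from the Kibana codebase.

```typescript
// Hypothetical status values; the real service map may use different names.
type HealthStatus = 'healthy' | 'warning' | 'critical' | 'unknown';

// Hypothetical input: the anomaly-based status (#63574) plus the number of
// active APM threshold alerts for the service in the selected time range.
interface ServiceHealthInput {
  anomalyStatus: HealthStatus;
  activeThresholdAlertCount: number;
}

// Any active threshold alert overrides the anomaly-based status with "critical".
function resolveServiceHealth(input: ServiceHealthInput): HealthStatus {
  if (input.activeThresholdAlertCount > 0) {
    return 'critical';
  }
  return input.anomalyStatus;
}
```

With no active alerts the anomaly-based status passes through unchanged, so the existing indication from #63574 keeps working as before.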

Default

The overall service map will look something like this

[Image: Java _ ML anomaly enabled]

As a later iteration we will most likely have different severity levels in the alerts, which means that some alerts might not be critical but simply warnings, in which case we can choose to only show the warning indication (yellow state).
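If alerts do gain severity levels later, the choice of indicator could look roughly like the following. The severity names and the `Indicator` type are assumptions made for this sketch; the issue does not define them.

```typescript
// Hypothetical severity levels for a future iteration of APM alerts.
type AlertSeverity = 'warning' | 'critical';

// Hypothetical indicator colors for a service node outline.
type Indicator = 'none' | 'yellow' | 'red';

// The most severe active alert decides the indicator.
function indicatorForAlerts(severities: AlertSeverity[]): Indicator {
  if (severities.some((s) => s === 'critical')) {
    return 'red';
  }
  if (severities.some((s) => s === 'warning')) {
    return 'yellow';
  }
  return 'none';
}
```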

@formgeist formgeist added Team:APM - DEPRECATED Use Team:obs-ux-infra_services. apm:service-maps Service Map feature in APM v7.8.0 labels Apr 22, 2020
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@smith
Contributor

smith commented Jul 21, 2020

Blocked by #70169.

@pmuellr
Member

pmuellr commented Aug 18, 2020

Our current thinking on making some forward movement on this is that it's likely doable using the existing alertClient find() API to identify all the alerts, followed by an Elasticsearch search through the event log indices to extract the data you need.

While we'd like to gate all access to the event log through its plugin, for security purposes, the timeframe of getting that in is ... not near. In lieu of a new event log API to fulfill this requirement, accessing the event log indices directly is the next best option.

The event log indices are managed by ILM and available under an Elasticsearch alias, which we currently don't publish, so we'll need to add an API to the event log to return this in its plugin setup() and/or start(). Once you have the alias, and the alert IDs to look up, it's just a matter of crafting the search you want (query DSL, esSQL, etc.) and running it with the Kibana system ES client.
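As a rough idea of what "crafting the search" could mean with query DSL, here is a sketch that only builds a request body. The field names (`event.provider`, `kibana.saved_objects.*` as a nested field) and the `alert` saved object type are assumptions about the event log mapping and should be checked against the actual index; the function name and `TimeRange` type are hypothetical.

```typescript
// Hypothetical time range, e.g. the one selected in the APM UI.
interface TimeRange {
  start: string; // ISO timestamp or date math such as "now-15m"
  end: string;
}

// Builds a query DSL body matching event log documents for the given alert IDs
// within the time range. All field names below are assumptions, not confirmed.
function buildAlertEventQuery(alertIds: string[], range: TimeRange) {
  return {
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { range: { '@timestamp': { gte: range.start, lte: range.end } } },
          {
            nested: {
              path: 'kibana.saved_objects',
              query: {
                bool: {
                  filter: [
                    { term: { 'kibana.saved_objects.type': 'alert' } },
                    { terms: { 'kibana.saved_objects.id': alertIds } },
                  ],
                },
              },
            },
          },
        ],
      },
    },
  };
}
```

The returned body would then be sent to the event log alias with whichever Elasticsearch client is available; that call is left out here because the alias is not yet exposed by the plugin.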

The event log indices are based on ECS with some extensions for Kibana, and so hopefully the shape is somewhat familiar.

I mentioned "security purposes" above: the event log APIs currently require you to pass in saved object types/IDs when searching, and the implementation ensures you have read access to those saved objects before doing the search. Querying the event log indices directly bypasses that check. However, since you first need to do a find() call to get the alerts, that's the security gate right there: find() won't return anything you can't read. As long as find() is used as a filter to obtain the readable alerts, we're happy enough that there aren't any other security concerns going down this route.
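One way to keep find() as the only source of alert IDs is to derive the ID list purely from its result, so nothing the user cannot read ever reaches the event log search. The sketch below assumes a simplified result shape with a `params.serviceName` field; both the `FoundAlert` type and the function name are hypothetical.

```typescript
// Hypothetical, simplified shape of one alert returned by find().
interface FoundAlert {
  id: string;
  params: { serviceName?: string };
}

// Groups the readable alert IDs by service name. Only IDs that came back from
// find() appear in the output, which is what makes find() the security gate.
function groupReadableAlertIdsByService(found: FoundAlert[]): Record<string, string[]> {
  const byService: Record<string, string[]> = {};
  for (const alert of found) {
    const service = alert.params.serviceName;
    if (service === undefined) {
      continue; // skip alerts that are not tied to a service
    }
    if (!byService[service]) {
      byService[service] = [];
    }
    byService[service].push(alert.id);
  }
  return byService;
}
```

The grouped IDs can feed the event log search, and the same grouping maps matching events back to service nodes afterwards.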

@formgeist
Contributor Author

@pmuellr Thanks Patrick for picking this up - it's very exciting to see that we might be able to provide this to our users sooner rather than later.

I just wanted to respond to the following question:

I'm assuming each service has its own alert? Or is there a single alert that has an alert instance for each service? I'm not sure the event log is writing out instance IDs at the moment (but it should; there may already be an issue for that).

Indeed - currently alerts are created for and tied to one service. I'm not sure about this, but I can certainly imagine a time when we'd want alerts that aren't tied to a single service, so that users can simply alert on a service instance threshold violation or error instead of having to set individual thresholds per service or even policies for specific service environments. @nehaduggal perhaps you can clarify our thinking around alerts in the near and long term?

Not sure if this changes anything for you at this time.

@graphaelli
Member

Is this still blocked given capabilities introduced via RAC efforts?

@sorenlouv
Member

This will be unblocked by RAC. Given we don't have an exact date for the release of RAC, I suggest we leave it blocked for now.

@gmmorris
Contributor

@sqren Given this is being delivered by RAC, can I remove this from the Alerting team's dependencies list?

We're trying to groom the backlog of work and figuring out what is being blocked by us is important :)

@sorenlouv
Member

@gmmorris sure 👍

@botelastic

botelastic bot commented Dec 27, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the stale Used to mark issues that were closed for being stale label Dec 27, 2021
@gmmorris
Contributor

I think this can be unblocked now @sqren 🤔

@botelastic botelastic bot removed the stale Used to mark issues that were closed for being stale label Jan 10, 2022
@formgeist
Contributor Author

@sqren I'm going to move this to the design board since it's been 2+ years since we looked at the design for this. Cc @alex-fedotyev


8 participants