APPSERV-11 Adds Health Check Alerts to Monitoring Console #4390

jbee · 2019-12-17T15:36:12Z

Summary

This PR extends the monitoring console with a server-side alerting system that tracks the health check status values and creates alerts when they exceed the configured thresholds.
While the thresholds are those of the health check configuration, the interval in that configuration has no effect, as the alerting system evaluates "immediately" (delay < 2 sec).

Many changes are general ones that create the foundation of the alerting so that it can be used for health checks. This approach is motivated in more detail further down.

Server Changes

adds a general concept of server-side evaluated Watches and Alerts: watches track changes in monitoring data metrics and raise alerts
adds an internal API that makes watches accessible, so that a MonitoringDataSource can now also implement MonitoringWatchSource to communicate the detailed conditions without making modules dependent on the monitoring console (which is still an optional module)
changes to health check configurations are automatically reflected in watch conditions using the collection mechanism
adds the ability for a MonitoringDataSource to use the new @MonitoringData annotation on the collection method to put the collected series in a common namespace and to collect only every n seconds (as opposed to every second). This was needed to achieve a slower collection of the health checks, as those can be expensive or slow to compute.
extracts the computation logic from the included health checks so that it can be used both for their own check (see doCheckInternal) and when collecting monitoring data.
extends REST API for returning watches and alerts together with series data
adds REST API for watches and alerts
fixes Garbage Collection HealthCheck never triggers usefully #3291 - the GC health check now computes the percentage of time spent doing GC in the checked time window
fixes an issue in MC Health checks where the check fails due to a setup issue with the web client library used to ping the health check endpoints
the build now creates the merged JS file monitoring-console.js directly in target (removed from source to avoid further confusion during reviews)
adds a watch for data collection duration

Client Changes

adds Health Checks page preset
adds new widget type alert for tables of alerts
adds new configuration for defaults of red, amber, green state colour
adds alert state dependent colouring to graph legend
adds variable graph window size (a side effect of collecting only every n seconds is that more than 1 min of data becomes available for the series using it)
adds active alerts being shown on top of the graph (in short form)
adds acknowledgement of alerts (to remove from list shown in graphs)
uses default colours from configuration for legend labels in critical or alerting state (not from CSS file)
adds the possibility to group inputs in settings (to group default colour settings properly) and prevents splitting of label and input in settings
adds coloured background areas to line charts that show states red, amber, green (from watches) as well as alarming and critical thresholds (from decorations) more clearly.
no longer uses (multiple) data lines to show threshold lines.
grid now supports separate column and row spans
adds a Monitoring page preset (was needed for debugging)
adds a JVM page preset (also for better debugging)
updates the provided colour schemes so their colours used for lines do not collide with colours used to indicate states like red, amber and green.
fixed: secondary settings are now collapsed by default, as originally intended
adds a setting for data line width (thickness)
widget titles are now centred and make selection easier (as they span full width)

Why not use existing Health Check and Notifier system directly?

The goal of the task was to immediately detect and show problems identified by health check logic in the graphs and in a table of alert states. The existing health check design does not really allow this, as the checks happen fairly infrequently and at a user-defined interval. Should the user choose a shorter interval, this would cause a flood of notifications. Also, there is no concept of alert state: each check is unaware of anything that happened before. Again, this is problematic for user experience in connection with frequent checks. Infrequent checks, on the other hand, would not show problems in the UI because they would be identified too late. The alert system fixes these problems and provides a general mechanism that can now be used on any metric.

How does the Watch and Alert System work?

A watch describes start and stop conditions for alerts. Each level, red, amber and green, can have its own conditions. Stop conditions are optional. When used, they make sure an alert isn't started and then immediately stopped due to fluctuations around the threshold. Conditions can also carry a side condition that specifies for how long and in what fashion the threshold needs to be exceeded to start the alert.
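
To illustrate the start/stop idea, here is a minimal sketch, not the classes from this PR. `HysteresisCondition`, its fields and the "n consecutive samples" side condition are hypothetical names and simplifications chosen for the example; the real conditions may be richer.

```java
/**
 * Hypothetical sketch of a watch condition with separate start and stop
 * thresholds. None of these names are taken from the PR.
 */
final class HysteresisCondition {
    private final double startAbove; // alert may start once the value exceeds this
    private final double stopBelow;  // alert stops only once the value drops below this
    private final int forSamples;    // side condition: consecutive samples needed to start
    private int consecutiveAbove;
    private boolean alerting;

    HysteresisCondition(double startAbove, double stopBelow, int forSamples) {
        if (stopBelow > startAbove || forSamples < 1) {
            throw new IllegalArgumentException("expected stopBelow <= startAbove and forSamples >= 1");
        }
        this.startAbove = startAbove;
        this.stopBelow = stopBelow;
        this.forSamples = forSamples;
    }

    /** Feeds one sample and returns whether the alert is active afterwards. */
    boolean update(double value) {
        if (alerting) {
            // values between stopBelow and startAbove keep the alert alive
            if (value < stopBelow) {
                alerting = false;
                consecutiveAbove = 0;
            }
            return alerting;
        }
        consecutiveAbove = value > startAbove ? consecutiveAbove + 1 : 0;
        if (consecutiveAbove >= forSamples) {
            alerting = true;
        }
        return alerting;
    }
}
```

With a start threshold of 80, a stop threshold of 70 and two required samples, a single spike to 85 does not start an alert, and a running alert survives a dip to 75.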

When an alert is started it can transition between red and amber. This is still considered the same alert. The alert ends if it transitions to green or white. The green state can be used to describe a value range that is particularly good. Everything that is not red, amber or green is white.
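
The lifecycle described here can be sketched roughly as follows. `AlertLevel` and `AlertLifecycle` are hypothetical names for this example only; the sketch just counts how many distinct alerts a sequence of levels produces.

```java
/** Hypothetical level enum; the names in the PR may differ. */
enum AlertLevel { WHITE, GREEN, AMBER, RED }

/** Hypothetical sketch: tracks when a sequence of levels starts or ends an alert. */
final class AlertLifecycle {
    private boolean ongoing;
    private int startedAlerts;

    /** Applies the next level of the watched value. */
    void transitionTo(AlertLevel level) {
        boolean alerting = level == AlertLevel.RED || level == AlertLevel.AMBER;
        if (alerting && !ongoing) {
            // white/green -> red/amber starts a new alert
            ongoing = true;
            startedAlerts++;
        } else if (!alerting) {
            // red/amber -> green/white ends the alert
            ongoing = false;
        }
        // red <-> amber while ongoing: still the same alert, nothing to do
    }

    boolean isOngoing() {
        return ongoing;
    }

    int startedAlerts() {
        return startedAlerts;
    }
}
```

In this sketch the sequence amber, red, amber counts as one alert; a later red after green counts as a second one.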

Alerts are per series and instance. That means CPU usage on DAS is connected to one alert, CPU usage on another instance to another alert. Both are watched by the same CPU Usage watch.
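
As a rough illustration of "one watch, one alert per series and instance", the following sketch keys the alert state by the pair of series and instance. The class name and the series and instance names used with it are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one watch keeps separate alert state per (series, instance). */
final class PerInstanceAlerts {
    // key: [series, instance]; value: whether an alert is currently ongoing
    private final Map<List<String>, Boolean> ongoing = new HashMap<>();

    void update(String series, String instance, boolean thresholdExceeded) {
        ongoing.put(Arrays.asList(series, instance), thresholdExceeded);
    }

    boolean isAlerting(String series, String instance) {
        return ongoing.getOrDefault(Arrays.asList(series, instance), false);
    }

    long ongoingCount() {
        return ongoing.values().stream().filter(active -> active).count();
    }
}
```

Updating the same series for two instances yields two independent entries, so the DAS can be alerting while another instance is not.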

The advanced conditions, the transitioning between levels and the connection to series and instances are designed to prevent alerts from flooding the UI by making it possible to describe the alert circumstances in a way that is more similar to how humans would understand a window of data as "one alert situation".

While in its current form the system is used to collect watches from health check modules, it is built so that watches can be installed by users as well. That means that if we build a GUI for composing the data a watch needs, users can add alerting to any metric they want.

In this PR alerts only show in the graphs they belong to or in a dedicated alert list widget. This has been built in a way that makes it simple to also support alerts popping up in a "global" alert list.

Testing

The new health check metrics are collected as soon as the individual health checks are enabled in the Health Checks config. The overall service does not have to be enabled, as that is only responsible for evaluating health checks to create notifier messages. This is intentional, as it allows enabling/disabling the health checks individually without also causing notifications via the notifier.

Steps:

Build and run the server.
use set-monitoring-console-configuration --enabled=true to deploy MC
open MC at http://localhost:8080/monitoring-console/
(if needed, clear local storage to reset)
switch to Health Checks page
open admin console and enable health checks in their configuration
change e.g. memory thresholds in the configuration and check that the change is reflected in MC
change a threshold so current value triggers an alert (e.g. low heap threshold) and check it appears in MC
acknowledge the alert in the list by clicking its checkbox and check the alert disappears from the corresponding line chart

Other things to try:

with some alerts in the list change alerts filters (Settings => Alerts for Alerts widget)
change random widget or data settings
change colour scheme (Settings => Colors)
change decoration thresholds (Settings => Decoration; e.g. for Core page CPU Usage)
enable page rotation (Settings => General; set time to some seconds)
enable request tracing and change config to cause some traces - check MC views work
create and start another instance, check MC shows its data (remember they might need config changes too)

Testing Done

I added unit tests for the watch and alert logic that cover most scenarios and tested MC manually. As this PR changes many details, it makes sense to also perform various tests that aren't directly related to the main feature added.


appserver/jdbc/jdbc-runtime/src/main/java/org/glassfish/jdbc/util/JdbcResourcesUtil.java

jbee · 2019-12-17T16:45:20Z

...onsole/core/src/main/java/fish/payara/monitoring/store/ConsumingMonitoringDataCollector.java

 *
 * @author Jan Bernitt
 */
-public class SinkDataCollector implements MonitoringDataCollector {


NB: I never liked the naming - calling it a consumer is more understandable I think.

jbee · 2019-12-17T16:51:16Z

appserver/monitoring-console/webapp/src/main/java/fish/payara/monitoring/web/ApiResponses.java

+ * 
+ * @author Jan Bernitt
+ */
+public final class ApiResponses {


NB: I intentionally moved all the structs for the JSON API responses into one class because I want to look at them at the same time and I don't want to have numerous files open I have to navigate in-between when checking or changing the API.

jbee · 2019-12-17T16:54:41Z

...hcheck-checker/src/main/java/fish/payara/healthcheck/mphealth/MicroProfileHealthChecker.java

-            });
-
-            try {
-                taskResult.get(options.getTimeout(), TimeUnit.MILLISECONDS);


NB: I believe it was unintentional that the future was resolved within the loop, essentially making the use of asynchronous computation ineffective. The changed implementation dispatches the work first and resolves the futures later so that the checks can run in parallel.
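
The dispatch-first, resolve-later pattern can be sketched as below. This is not the code from MicroProfileHealthChecker; `ParallelChecks`, `runAll` and the `FAILED` marker are hypothetical, and only standard `java.util.concurrent` calls are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: submit all checks first, resolve the futures afterwards. */
final class ParallelChecks {
    static final String FAILED = "FAILED";

    static List<String> runAll(List<Callable<String>> checks, long timeoutMillis) {
        if (checks.isEmpty()) {
            return new ArrayList<>();
        }
        ExecutorService executor = Executors.newFixedThreadPool(checks.size());
        try {
            // 1. dispatch all work so that the checks can run in parallel
            List<Future<String>> pending = new ArrayList<>();
            for (Callable<String> check : checks) {
                pending.add(executor.submit(check));
            }
            // 2. only then resolve the futures, in submission order
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                try {
                    results.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (ExecutionException | TimeoutException e) {
                    future.cancel(true);
                    results.add(FAILED);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(FAILED);
                }
            }
            return results;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Calling `get` inside the submitting loop would instead wait for each task before the next one is even submitted, which is the behaviour this comment describes as unintended.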

jbee · 2019-12-17T16:55:42Z

...config-api/src/test/java/com/sun/enterprise/config/modularity/tests/BasicModularityTest.java

@@ -175,7 +175,7 @@ public void testLoadingAdminFile() throws Exception {

    @Test
    public void testLoadingEmbeddedFile() throws Exception {
-        List<com.sun.enterprise.config.modularity.customization.ConfigBeanDefaultValue> values = values = configModularityUtils.getDefaultConfigurations(ConfigExtensionTwo.class, "embedded");


NB: My IDE didn't like the values = values = :D

jbee · 2019-12-17T16:58:13Z

...e/src/main/java/fish/payara/nucleus/healthcheck/preliminary/GarbageCollectorHealthCheck.java

        }

-        return result;
+        public double percentage(GarbageCollectorMXBean gcBean) {


NB: This logic is intentionally not the same as the one replaced.

See #3291 for more details.
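
For readers without the full diff, a minimal sketch of "percentage of time spent doing GC in the checked time window" could look like this. It is an illustration under assumptions, not the implementation in GarbageCollectorHealthCheck: `GcTimePercentage` and its methods are hypothetical, and only `GarbageCollectorMXBean.getCollectionTime()` and `ManagementFactory.getGarbageCollectorMXBeans()` are real JDK calls.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch: share of wall-clock time spent in GC between two checks. */
final class GcTimePercentage {
    private long lastCollectionTimeMillis;
    private long lastCheckMillis;

    GcTimePercentage(long initialCollectionTimeMillis, long startMillis) {
        this.lastCollectionTimeMillis = initialCollectionTimeMillis;
        this.lastCheckMillis = startMillis;
    }

    /** Percentage of the window since the previous call that was spent collecting. */
    double percentage(long collectionTimeMillis, long nowMillis) {
        long gcDelta = collectionTimeMillis - lastCollectionTimeMillis;
        long windowDelta = nowMillis - lastCheckMillis;
        lastCollectionTimeMillis = collectionTimeMillis;
        lastCheckMillis = nowMillis;
        if (windowDelta <= 0 || gcDelta <= 0) {
            return 0d;
        }
        return Math.min(100d, 100d * gcDelta / windowDelta);
    }

    /** Sums the accumulated collection time of all collectors of this JVM. */
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gcBean.getCollectionTime(); // -1 if undefined for this collector
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }
}
```

A caller would feed it `totalCollectionTimeMillis()` and `System.currentTimeMillis()` on each check; 50 ms of GC within a 1000 ms window gives 5 percent.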

jbee · 2019-12-17T17:00:08Z

...src/main/java/fish/payara/nucleus/healthcheck/preliminary/MachineMemoryUsageHealthCheck.java

+                availableMemory = 0;
+                return;
+            }
+            long otherAvailableMemory = 0;


NB: I introduced otherAvailableMemory as a single variable, as the individual variables I removed were only used to sum them up.


jbee · 2020-01-06T09:49:39Z

...hcheck-checker/src/main/java/fish/payara/healthcheck/mphealth/MicroProfileHealthChecker.java

-    //send request to remote healthcheck endpoint to get the status
-    private HealthCheckResultEntry pingHealthEndpoint(String instanceName, URI remote) {
-        Client jaxrsClient = ClientBuilder.newClient();
-        WebTarget target = jaxrsClient.target(remote);


NB: Replaced this with pure HttpURLConnection usage as it was failing for reasons unrelated to the task itself. Some problem deep in the config of JAX-RS and its dependencies. I figured that if complexity can make this fail the fix is to get rid of it so I did.

dmatej · 2020-01-06

Only a note: Client has a close method, so not closing it is a resource leak. But it does not implement Closeable, so tools do not report it. I learned this when I created JMH performance tests at a previous company and ended up with an OOME ;-)

Another problem haunts us again and again: ClientBuilder uses ServiceLocator, so it usually finds the implementation from the server ... with tracing etc. Custcom currently has several issues around this.
So if you can use HttpURLConnection directly, that is perhaps an even better solution in this case :-)
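
A minimal sketch of such a plain-JDK ping, including the cleanup raised in this thread, might look like the following. `PlainHttpPing`, its `ping` method and the `-1` return value for IO problems are assumptions made for the example, not the code merged in this PR.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical sketch: ping an endpoint with HttpURLConnection only. */
final class PlainHttpPing {

    /** Returns the HTTP status code, or -1 if the endpoint could not be reached. */
    static int ping(String endpoint, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) new URL(endpoint).openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return connection.getResponseCode();
        } catch (IOException e) {
            // malformed URL, refused connection, timeout and similar IO problems
            return -1;
        } finally {
            if (connection != null) {
                connection.disconnect(); // release the connection in every case
            }
        }
    }
}
```

Because no JAX-RS client is involved, nothing is looked up via ClientBuilder and there is no Client instance left to close.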

jbee · 2020-01-06T10:21:16Z

jenkins test please

jbee · 2020-01-06T16:45:16Z

jenkins test please

jbee · 2020-01-06T17:24:36Z

jenkins test please

jbee · 2020-01-08T14:46:26Z

jenkins test please


jbee · 2020-01-08T16:13:05Z

jenkins test please

MeroRai · 2020-01-10T17:00:11Z

@jbee, I built and tested this locally, everything seems to work as intended. I skimmed through the code and it looks alright. Will go through it thoroughly sometime next week.

jbee added 9 commits December 9, 2019 18:18

APPSERV-11 adds basic alert system

a4c25b9

APPSERV-11 adds some health checks as metric (incomplete)

ec9a089

APPSERV-11 adds JS alert tables; updates most health checks to be monitoring data sources

31ba4c4

APPSERV-11 collect health check metrics and watches

4581c85

APPSERV-11 adds health check preset page (and a fix for CP watch)

5744a5e

APPSERV-11 adds missing data status text for health page, fixes filtering away outdated data no longer updates on server

1767ac1

APPSERV-11 fixed NPE when removing watch without alerts; adds try-catch so watch job does not terminte

412e0bb

APPSERV-11 adds alert ack to UI and web API

e8fd1b6

APPSERV-11 MP health check liveliness metric and watch as single percentage; adds alert level legend coloring; fixes chart data range slicing

348ea41

jbee added 3:DevInProgress PR: DO NOT MERGE Don't merge PR until further notice labels Dec 17, 2019

jbee self-assigned this Dec 17, 2019

jbee mentioned this pull request Dec 17, 2019

Garbage Collection HealthCheck never triggers usefully #3291

Closed


jbee added 10 commits December 17, 2019 20:56

APPSERV-11 only draw decoration lines once

0ccad82

APPSERV-11 adds gradient backgrounds for charts

e263249

APPSERV-11 adds line chart backgrounds coloured by watch thresholds

eb4f096

APPSERV-11 coloring line chart backs from watch thresholds and legend… … from watch status

9864a8c

APPSERV-11 better watch indicator on axis

0ad37ad

APPSERV-11 line colors avoid colliding with indicator colors

9ad9607

APPSERV-11 adds alerts settings; fixes alert isStopped()

b873ca3

APPSERV-11 adds alert filter settings

cb644fb

APPSERV-11 adds monitoring page and watch for collection duration

5497c1f

APPSERV-11 adds lines for alert levels in line graphs

7ac9e2e

jbee added 7 commits January 4, 2020 15:21

APPSERV-11 fixed ping execution via plain HTTP connection to avoid errors caused by library complexity rather than real IO errors

e51bbdd

APPSERV-11 close URL connection

d3c725b

APPSERV-11 decoration lines via background areas instead of data series

20a64c0

APPSERV-11 coloring of 'series' now is per widget series

f2ed1fa

APPSERV-11 removes merged JS file from sources by generating it in target directly

e868eb8

APPSERV-11 adds JVM page

90a027f

APPSERV-11 adds seperate rowspan for widgets

51b1b37


APPSERV-11 updates copyright header and adds javadoc

f5cee4a

APPSERV-11 adds more tests, some renames and javadoc

3f9a6e3

jbee changed the title ~~[WIP] APPSERV-11 Adds Health Check Alerts to Monitoring Console~~ APPSERV-11 Adds Health Check Alerts to Monitoring Console Jan 6, 2020

jbee removed 3:DevInProgress PR: DO NOT MERGE Don't merge PR until further notice labels Jan 6, 2020

APPSERV-11 fixed alert table filtering options for level

537c605

jbee requested review from MeroRai and Pandrex247 January 7, 2020 09:01

jbee added 3 commits January 7, 2020 17:36

Merge branch 'master' into APPSERV-11-health-check-alerts

3304dc0

Merge branch 'master' into APPSERV-11-health-check-alerts

3718531

APPSERV-11 reverts changes to JdbcResourcesUtil.java

2ce3ecf

jbee added 2 commits January 8, 2020 17:12

APPSERV-11 APPSERV-14 fixed NPE when health check options are not initialised yet

ef7ef56

APPSERV-11 APPSERV-14 fixed NPE when health check options are not initialised yet (2)

554e6ba

MeroRai approved these changes Jan 17, 2020


jbee merged commit 02a3fbb into payara:master Jan 17, 2020

jbee added this to the 5.201 milestone Jan 23, 2020
