APPSERV-11 Adds Health Check Alerts to Monitoring Console #4390

Merged
merged 34 commits into from
Jan 17, 2020

@jbee jbee commented Dec 17, 2019

Summary

This PR extends the monitoring console with a server-side alerting system that tracks the health check status values and raises alerts when they exceed the configured thresholds.
While the thresholds are those of the health check configuration, the interval in that configuration has no effect because the alerting system evaluates "immediately" (delay < 2 sec).

Many changes are general ones that create the foundation of the alerting so that it can be used for health checks. This approach is motivated in more detail further down.

Server Changes

  • adds a general concept of server-side evaluated Watches and Alerts: watches track changes of monitoring data metrics and cause alerts
  • adds internal API that makes watches accessible so that MonitoringDataSource now can also implement MonitoringWatchSource to communicate the detailed conditions without making modules dependent on monitoring console (as this is still an optional module)
  • changes to health check configurations are automatically reflected in watch conditions using the collection mechanism
  • adds ability to MonitoringDataSource to use the new @MonitoringData annotation on the collection method to put collected series in a common namespace and to only collect every n seconds (as opposed to every second). This was needed to achieve a slower collection of the health checks as those can be more expensive or slow to compute.
  • extracts the computation logic from included health checks so that it can be used both for their own check (see doCheckInternal) and when collecting monitoring data.
  • extends REST API for returning watches and alerts together with series data
  • adds REST API for watches and alerts
  • fixes Garbage Collection HealthCheck never triggers usefully #3291 - GC health check now computes the percentage of time spent doing GC in the checked time window
  • fixes issue in MC Health checks where check fails due to some setup issue with web client library used to ping the health check endpoints
  • build now creates the merged JS file monitoring-console.js directly in target (removed from source to avoid further confusion during reviews)
  • adds a watch for data collection duration

Client Changes

  • adds Health Checks page preset
  • adds new widget type alert for tables of alerts
  • adds new configuration for defaults of red, amber, green state colour
  • adds alert state dependent colouring to graph legend
  • adds variable graph window size (a side effect of collecting only every n seconds is that more than 1 min of data becomes available for the series using it)
  • adds active alerts being shown on top of the graph (in short form)
  • adds acknowledgement of alerts (to remove from list shown in graphs)
  • uses default colours from configuration for legend labels in critical or alerting state (not from CSS file)
  • adds possibility to group inputs in settings (to group default colour settings properly) and prevent splitting of label and input in settings
  • adds coloured background areas to line charts that show states red, amber, green (from watches) as well as alarming and critical thresholds (from decorations) more clearly.
  • no longer uses (multiple) data lines to show threshold lines.
  • grid now supports separate column and row spans
  • adds a Monitoring page preset (was needed for debugging)
  • adds a JVM page preset (also for better debugging)
  • updates the provided colour schemes so their colours used for lines do not collide with colours used to indicate states like red, amber and green.
  • fixed: secondary settings now are collapsed by default as intended before
  • adds a setting for data line width (thickness)
  • widget titles are now centred and make selection easier (as they span full width)

Why not use existing Health Check and Notifier system directly?

The goal of the task was to immediately detect and show problems identified by health check logic in the graphs and a table of alert states. The existing health check design does not really allow this, as the checks happen fairly infrequently and at a user-defined interval. Should the user choose a shorter interval, this would cause a flood of notifications. Also, there is no concept of alert state: each check is unaware of anything that happened before. Again, this is problematic for user experience in connection with frequent checks. Infrequent checks, on the other hand, would not show problems in the UI as they would be identified too late. The alert system fixes these problems and provides a general mechanism that can now be used on any metric.

How does the Watch and Alert System work?

A watch describes start and stop conditions for alerts. Each level, red, amber and green, can have its own conditions. Stop conditions are optional. When used, they make sure an alert isn't started and then immediately stopped due to fluctuations around the threshold. A condition can also carry a side condition that specifies for how long and in what fashion the threshold needs to be exceeded to start the alert.

When an alert is started it can transition between red and amber. This is still considered the same alert. The alert ends if it transitions to green or white. The green state can be used to describe a value range that is particularly good. Everything that is not red, amber or green is white.

Alerts are per series and instance. That means CPU usage on DAS is connected to one alert, CPU usage on another instance to another alert. Both are watched by the same CPU Usage watch.

The advanced conditions, the transitioning between levels and the connection to series and instances are designed to prevent alerts from flooding the UI by making it possible to describe the alert circumstances in a way that is more similar to how humans would understand a window of data as "one alert situation".
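
The behaviour described above can be illustrated with a minimal sketch. It is not the implementation from this PR: the class `WatchSketch`, its fields and the `series@instance` key format are hypothetical names chosen for illustration, the green level is left out, and the side condition is reduced to "threshold reached for n consecutive samples". It shows start and stop thresholds per level, transitions between amber and red within one alert, and separate state per series and instance.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a watch with start/stop conditions; not the PR's classes. */
class WatchSketch {

    enum Level { WHITE, AMBER, RED }

    /** Alert state of one series on one instance. */
    private static final class State {
        Level level = Level.WHITE;
        int aboveAmber;     // consecutive samples at or above the amber start threshold
        int aboveRed;       // consecutive samples at or above the red start threshold
        int alertsStarted;  // number of distinct alerts seen so far
    }

    private final double amberStart;
    private final double amberStop;
    private final double redStart;
    private final double redStop;
    private final int forSamples;
    private final Map<String, State> states = new HashMap<>();

    WatchSketch(double amberStart, double amberStop, double redStart, double redStop, int forSamples) {
        this.amberStart = amberStart;
        this.amberStop = amberStop;
        this.redStart = redStart;
        this.redStop = redStop;
        this.forSamples = forSamples;
    }

    /** Feeds one sample and returns the resulting level for that series and instance. */
    Level check(String seriesAndInstance, double value) {
        State s = states.computeIfAbsent(seriesAndInstance, key -> new State());
        s.aboveAmber = value >= amberStart ? s.aboveAmber + 1 : 0;
        s.aboveRed = value >= redStart ? s.aboveRed + 1 : 0;
        Level before = s.level;
        Level next;
        if (s.aboveRed >= forSamples) {
            next = Level.RED;                       // red start condition met
        } else if (before == Level.RED && value >= redStop) {
            next = Level.RED;                       // red stop condition not yet met
        } else if (s.aboveAmber >= forSamples) {
            next = Level.AMBER;                     // amber start condition met
        } else if (before != Level.WHITE && value >= amberStop) {
            next = Level.AMBER;                     // ongoing alert, amber stop not yet met
        } else {
            next = Level.WHITE;                     // alert (if any) ends
        }
        if (before == Level.WHITE && next != Level.WHITE) {
            s.alertsStarted++;                      // amber <-> red changes stay the same alert
        }
        s.level = next;
        return next;
    }

    int alertsStarted(String seriesAndInstance) {
        State s = states.get(seriesAndInstance);
        return s == null ? 0 : s.alertsStarted;
    }

    public static void main(String[] args) {
        WatchSketch cpu = new WatchSketch(70, 60, 90, 80, 2);
        double[] samples = {75, 75, 65, 95, 95, 85, 75, 50};
        for (double sample : samples) {
            System.out.println(sample + " -> " + cpu.check("cpu-usage@DAS", sample));
        }
        System.out.println("alerts started: " + cpu.alertsStarted("cpu-usage@DAS"));
    }
}
```

In this sketch the sample sequence in `main` fluctuates around both thresholds yet counts as a single alert, which is the effect the stop conditions and level transitions are meant to have.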

While in its current form the system is used to collect watches from health check modules, it is built so that watches can be installed by users as well. That means if we build a GUI for composing the data a watch needs, users can add alerting to any metric they want.

In this PR alerts only show in the graphs they belong to or in a dedicated alert list widget. This has been built in a way that makes it simple to also support alerts popping up in a "global" alert list.

Testing

The new health check metrics are collected as soon as the individual health checks are enabled in the Health Checks config. The overall service does not have to be enabled, as it is only responsible for evaluating health checks to create notifier messages. This is intentional, as it allows enabling or disabling the health checks individually without also causing notifications via the notifier.

Steps:

  • Build and run the server.
  • use set-monitoring-console-configuration --enabled=true to deploy MC
  • open MC at http://localhost:8080/monitoring-console/
  • (if necessary, clear local storage to reset)
  • switch to Health Checks page
  • open admin console and enable health checks in their configuration
  • change e.g. memory thresholds in the configuration and check that the change is reflected in MC
  • change a threshold so current value triggers an alert (e.g. low heap threshold) and check it appears in MC
  • acknowledge the alert in the list by clicking its checkbox and check the alert disappears from the corresponding line chart

Other things to try:

  • with some alerts in the list, change the alert filters (Settings => Alerts for Alerts widget)
  • change random widget or data settings
  • change colour scheme (Settings => Colors)
  • change decoration thresholds (Settings => Decoration; e.g. for Core page CPU Usage)
  • enable page rotation (Settings => General; set time to some seconds)
  • enable request tracing and change config to cause some traces - check MC views work
  • create and start another instance, check MC shows its data (remember they might need config changes too)

Testing Done

I added unit tests for the watch and alert logic that cover most scenarios and tested MC manually. As this PR changes many details, it makes sense to perform various tests that aren't directly related to the main feature added.

@jbee jbee added 3:DevInProgress PR: DO NOT MERGE Don't merge PR until further notice labels Dec 17, 2019
@jbee jbee self-assigned this Dec 17, 2019
*
* @author Jan Bernitt
*/
public class SinkDataCollector implements MonitoringDataCollector {

NB: I never liked the naming - calling it a consumer is more understandable I think.

*
* @author Jan Bernitt
*/
public final class ApiResponses {

NB: I intentionally moved all the structs for the JSON API responses into one class because I want to look at them at the same time and I don't want to have numerous files open I have to navigate in-between when checking or changing the API.

});

try {
taskResult.get(options.getTimeout(), TimeUnit.MILLISECONDS);

NB: I believe it was unintentional that the future was resolved within the loop essentially making the use of asynchronous computation ineffective. The changed implementation will dispatch work first and later resolve the futures so these can happen in parallel.
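
As an illustration of the pattern described in this comment, here is a minimal sketch with hypothetical names (`ParallelChecksSketch`, `runAll`); it is not the changed code itself. All tasks are submitted first, and only afterwards are the futures resolved, so the tasks can run in parallel. Note that in this sketch the timeout applies to each `get` call separately, not to the batch as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: dispatch all work first, resolve the futures afterwards. */
class ParallelChecksSketch {

    static List<String> runAll(List<Callable<String>> tasks, long timeoutMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            // Phase 1: dispatch. Nothing blocks here, so all tasks start right away.
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> task : tasks) {
                futures.add(executor.submit(task));
            }
            // Phase 2: resolve. Waiting on one future no longer delays the start of the others.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (ExecutionException | TimeoutException e) {
                    future.cancel(true);
                    results.add("FAILED");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for callers
                    future.cancel(true);
                    results.add("FAILED");
                }
            }
            return results;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(() -> "instance1: ok");
        tasks.add(() -> "instance2: ok");
        System.out.println(runAll(tasks, 1000));
    }
}
```

Calling `get` inside the same loop that submits the task would wait for each task before the next one is even started, which is the sequential behaviour the comment refers to.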

@@ -175,7 +175,7 @@ public void testLoadingAdminFile() throws Exception {

@Test
public void testLoadingEmbeddedFile() throws Exception {
List<com.sun.enterprise.config.modularity.customization.ConfigBeanDefaultValue> values = values = configModularityUtils.getDefaultConfigurations(ConfigExtensionTwo.class, "embedded");

NB: My IDE didn't like the values = values = :D

}

return result;
public double percentage(GarbageCollectorMXBean gcBean) {

NB: This logic is intentionally not the same as the one replaced.

See #3291 for more details.
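
For readers who do not want to follow the issue, the idea of "percentage of time spent doing GC in the checked time window" can be sketched as follows. This is an assumption-laden illustration, not the `percentage` method from this PR: the class name `GcTimeShareSketch` and the way the previous reading is remembered are hypothetical. Only `GarbageCollectorMXBean.getCollectionTime()` (accumulated collection time in milliseconds, or -1 if undefined) is taken from the JDK.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch: share of wall-clock time spent in GC since the previous check. */
class GcTimeShareSketch {

    private long lastCollectionTimeMillis;
    private long lastCheckMillis;

    GcTimeShareSketch(long initialCollectionTimeMillis, long nowMillis) {
        this.lastCollectionTimeMillis = initialCollectionTimeMillis;
        this.lastCheckMillis = nowMillis;
    }

    /**
     * @param collectionTimeMillis accumulated GC time as reported by the MXBean
     * @param nowMillis            current wall-clock time
     * @return percentage (0-100) of the window since the last call that was spent in GC
     */
    double percentage(long collectionTimeMillis, long nowMillis) {
        long gcMillis = collectionTimeMillis - lastCollectionTimeMillis;
        long windowMillis = nowMillis - lastCheckMillis;
        lastCollectionTimeMillis = collectionTimeMillis;
        lastCheckMillis = nowMillis;
        if (windowMillis <= 0 || gcMillis <= 0) {
            return 0d;
        }
        // Capped at 100 as a defensive assumption of this sketch.
        return Math.min(100d, 100d * gcMillis / windowMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the value is undefined for this collector
            GcTimeShareSketch share = new GcTimeShareSketch(
                    Math.max(0, gcBean.getCollectionTime()), System.currentTimeMillis());
            Thread.sleep(200);
            double percent = share.percentage(
                    Math.max(0, gcBean.getCollectionTime()), System.currentTimeMillis());
            System.out.println(gcBean.getName() + ": " + percent + "% of the last window");
        }
    }
}
```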

availableMemory = 0;
return;
}
long otherAvailableMemory = 0;

NB: I introduced otherAvailableMemory as a single variable, as the individual variables I removed were only ever used to sum them up.

//send request to remote healthcheck endpoint to get the status
private HealthCheckResultEntry pingHealthEndpoint(String instanceName, URI remote) {
Client jaxrsClient = ClientBuilder.newClient();
WebTarget target = jaxrsClient.target(remote);

NB: Replaced this with pure HttpURLConnection usage as it was failing for reasons unrelated to the task itself. Some problem deep in the config of JAX-RS and its dependencies. I figured that if complexity can make this fail the fix is to get rid of it so I did.


@dmatej dmatej Jan 6, 2020


Only a note: Client has close method, so not closing it is a resource leak. But it does not implement Closeable, so tools do not report it. I learned it when I created JMH performance tests in previous company and ended in OOME ;-)


Another problem haunts us again and again - ClientBuilder uses ServiceLocator so it usually finds the implementation from the server ... with tracing etc. Custcom currently have several issues around this.
So if you can use HttpURLConnection directly, it is perhaps even better solution in this case :-)
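
A minimal sketch of what pinging an endpoint with plain `HttpURLConnection` can look like, including the explicit release of the connection that the note about resource leaks asks for. The class `HealthPingSketch`, the method `ping`, the -1 return convention and the URL used in `main` are hypothetical; this is not the code from the PR, which returns a `HealthCheckResultEntry`.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical sketch: ping an HTTP endpoint without JAX-RS client dependencies. */
class HealthPingSketch {

    /**
     * @param remote absolute http(s) URI of the endpoint (other schemes are not handled here)
     * @return the HTTP status code, or -1 if the endpoint could not be reached
     */
    static int ping(URI remote, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) remote.toURL().openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return connection.getResponseCode();
        } catch (IOException e) {
            return -1; // unreachable, refused or timed out
        } finally {
            if (connection != null) {
                connection.disconnect(); // always release the connection
            }
        }
    }

    public static void main(String[] args) {
        // Assumed example URL; adjust to the instance being checked.
        System.out.println(ping(URI.create("http://localhost:8080/health"), 1000));
    }
}
```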


jbee commented Jan 6, 2020

jenkins test please


jbee commented Jan 6, 2020

jenkins test please

@jbee jbee changed the title [WIP] APPSERV-11 Adds Health Check Alerts to Monitoring Console APPSERV-11 Adds Health Check Alerts to Monitoring Console Jan 6, 2020
@jbee jbee removed 3:DevInProgress PR: DO NOT MERGE Don't merge PR until further notice labels Jan 6, 2020

jbee commented Jan 6, 2020

jenkins test please

@jbee jbee requested review from MeroRai and Pandrex247 January 7, 2020 09:01

jbee commented Jan 8, 2020

jenkins test please


jbee commented Jan 8, 2020

jenkins test please


MeroRai commented Jan 10, 2020

@jbee, I built and tested this locally, everything seems to work as intended. I skimmed through the code and it looks alright. Will go through it thoroughly sometime next week.

@jbee jbee merged commit 02a3fbb into payara:master Jan 17, 2020
@jbee jbee added this to the 5.201 milestone Jan 23, 2020

Successfully merging this pull request may close these issues.

Garbage Collection HealthCheck never triggers usefully