-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APPSERV-11 Adds Health Check Alerts to Monitoring Console #4390
Conversation
…itoring data sources
…ring away outdated data no longer updates on server
…ch so watch job does not terminte
…entage; adds alert level legend coloring; fixes chart data range slicing
appserver/jdbc/jdbc-runtime/src/main/java/org/glassfish/jdbc/util/JdbcResourcesUtil.java
Outdated
Show resolved
Hide resolved
* | ||
* @author Jan Bernitt | ||
*/ | ||
public class SinkDataCollector implements MonitoringDataCollector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: I never liked the naming - calling it a consumer is more understandable I think.
* | ||
* @author Jan Bernitt | ||
*/ | ||
public final class ApiResponses { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: I intentionally moved all the structs for the JSON API responses into one class because I want to look at them at the same time and I don't want to have numerous files open I have to navigate in-between when checking or changing the API.
}); | ||
|
||
try { | ||
taskResult.get(options.getTimeout(), TimeUnit.MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: I believe it was unintentional that the future was resolved within the loop essentially making the use of asynchronous computation ineffective. The changed implementation will dispatch work first and later resolve the futures so these can happen in parallel.
@@ -175,7 +175,7 @@ public void testLoadingAdminFile() throws Exception { | |||
|
|||
@Test | |||
public void testLoadingEmbeddedFile() throws Exception { | |||
List<com.sun.enterprise.config.modularity.customization.ConfigBeanDefaultValue> values = values = configModularityUtils.getDefaultConfigurations(ConfigExtensionTwo.class, "embedded"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: My IDE didn't like the values = values =
:D
} | ||
|
||
return result; | ||
public double percentage(GarbageCollectorMXBean gcBean) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: This logic is intentionally not the same as the one replaced.
See #3291 for more details.
availableMemory = 0; | ||
return; | ||
} | ||
long otherAvailableMemory = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: I introduced otherAvailableMemory
as single variable as the individual variables I removed would only be used to summarise them.
… from watch status
…rors caused by library complexity rather than real IO errors
//send request to remote healthcheck endpoint to get the status | ||
private HealthCheckResultEntry pingHealthEndpoint(String instanceName, URI remote) { | ||
Client jaxrsClient = ClientBuilder.newClient(); | ||
WebTarget target = jaxrsClient.target(remote); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: Replaced this with pure HttpURLConnection
usage as it was failing for reasons unrelated to the task itself. Some problem deep in the config of JAX-RS and its dependencies. I figured that if complexity can make this fail the fix is to get rid of it so I did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a note: Client has close method, so not closing it is a resource leak. But it does not implement Closeable, so tools do not report it. I learned it when I created JMH performance tests in previous company and ended in OOME ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The another problem is hunting us again and again - ClientBuilder uses ServiceLocator so it usually finds implementation from the server ... with tracing etc. Custcom have currently several issues around this.
So if you can use HttpURLConnection directly, it is perhaps even better solution in this case :-)
jenkins test please |
jenkins test please |
jenkins test please |
jenkins test please |
jenkins test please |
@jbee, I built and tested this locally, everything seems to work as intended. I skimmed through the code and it looks alright. Will go through it thoroughly sometime next week. |
Summary
This PR extends the monitoring console with a server side alerting system that is used to track the health check status values and create alerts should they exceed the configured thresholds.
While the thresholds are those of the health check configuration the interval in the configuration has no effect as the alerting system has "immediate" evaluation (delay < 2 sec).
Many changes are general ones that create the foundation of the alerting so that it can be used for health checks. This approach is motivated in more detail further down.
Server Changes
Watch
es andAlert
s that track changes of monitoring data metrics and cause alertsMonitoringDataSource
now can also implementMonitoringWatchSource
to communicate the detailed conditions without making modules dependent on monitoring console (as this is still an optional module)MonitoringDataSource
to use the new@MonitoringData
annotation on the collection method to put collected series in a common namespace and to only collect every n seconds (as opposed to every second). This was needed to archive a slower collection of the health checks as those can be more expensive or slow to compute.doCheckInternal
) and when collecting monitoring data.monitoring-console.js
directly in target (removed from source to avoid further confusion during reviews)Client Changes
alert
for tables of alertsWhy not use existing Health Check and Notifier system directly?
The goal of the task was to immediately detect and show problems identified by health check logic in the graphs and a table of alert states. The existing health checks design does not really allow this as the checks happen fairly infrequent and in a user defined interval. Should the user use a faster interval this would cause a flood of notifications. Also there is no concept of alert state. Each check is unaware of anything that happened before. Again, this is problematic for user experience in connection with frequent checks. Infrequent checks on the other hand would not show problems in the UI as they would be identified too late. The alert system fixes these problems and provides a general mechanism that now can be used on any metric.
How does the Watch and Alert System work?
A watch describes start and stop conditions for alerts. Each level, red, amber and green, can have its own conditions. Stop conditions are optional. When used they make sure an alert isn't started and then immediately stop due to fluctuations around the threshold. Conditions also allow to give a side condition that specifies for how long and in what fashion the threshold needs to be exceeded to start the alert.
When an alert is started it can transition between red and amber. This is still considered the same alert. The alert ends if it transitions to green or white. Green state can be used to describe the conditions of the value range that is particularly good. Everything else, not red, amber or green is white.
Alerts are per series and instance. That means CPU usage on DAS is connected to one alert, CPU usage on another instance to another alert. Both are watched by the same CPU Usage watch.
The advanced conditions, the transitioning between levels and the connection to series and instances are designed to prevent alerts from flooding the UI by making it possible to describe the alert circumstances in a way that is more similar to how humans would understand a window of data as "one alert situation".
While in its current form the system is used to collect watches from health check modules it is build so that watches can be installed by users as well. That mean if we build an GUI for composing the data a watch needs users can add alerting to any metric they want.
In this PR alerts only show in the graphs they belong to or in dedicated alert list widget. This has been build in way that makes it simple to also support alerts popping up in a "global" alert list.
Testing
The new health check metrics are collected as soon as the individual health checks are enabled in the Health Checks config. The overall service does not have to be enabled as that is only responsible for evaluating health checks to create notifier messages. This is intentionally as it allows to enable/disable the health checks individually without the need to also cause notifications via notifier.
Steps:
set-monitoring-console-configuration --enabled=true
to deploy MCOther things to try:
Testing Done
I added unit tests for the watch and alert logic that cover the most scenarios and tested MC manually. As this PR changes many details it does make sense to perform various tests that aren't directly related to the main feature added.