APPSERV-19 Adds monitoring of stuck and hogging threads to monitoring console #4452
Conversation
jenkins test please
times.setEndCpuTime(c);
times.setEndUserTime(u);

long checkTime = getOptions().getUnit().toMillis(getOptions().getTime());
NB. Using the interval from the options is only an approximation of the actual time passed in the measurement interval. Since the check run by the health check service and the monitoring data collection can now use different intervals, I changed this to use the actual time passed, which is also more accurate.
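A minimal sketch of that idea, where the class and the `lastCheckMillis` field are assumptions for illustration and not the actual Payara code:

```java
class ElapsedTimeSketch {

    private long lastCheckMillis;

    long actualCheckTimeMillis() {
        long now = System.currentTimeMillis();
        // actual wall-clock time since the previous collection,
        // rather than the interval configured in the options
        long checkTime = lastCheckMillis == 0 ? 0 : now - lastCheckMillis;
        lastCheckMillis = now;
        return checkTime;
    }
}
```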
ConcurrentHashMap<Long, Long> threads = stuckThreadsStore.getThreads();
for (Long thread : threads.keySet()){
    Long timeHeld = threads.get(thread);
NB. As discussed in chat, the `timeHeld` here was a semantic confusion: the map actually contained the timestamp at which the thread started the work. I changed the algorithm accordingly and also changed all computations to be based on milliseconds, since nanosecond precision would only make sense if the check ran often and fast enough (every t with t < 1ms).
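A rough sketch of the corrected computation as described above; apart from the thread-ID-to-timestamp map, the class and parameter names are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StuckThreadCheckSketch {

    // the store maps thread ID -> timestamp (in ms) at which the thread started its work
    void checkStuckThreads(ConcurrentHashMap<Long, Long> threads, long thresholdInMillis) {
        long now = System.currentTimeMillis();
        for (Map.Entry<Long, Long> entry : threads.entrySet()) {
            long startedWorkingAt = entry.getValue();
            long timeHeld = now - startedWorkingAt; // duration held, in milliseconds
            if (timeHeld > thresholdInMillis) {
                // report thread entry.getKey() as stuck for timeHeld milliseconds
            }
        }
    }
}
```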
I think this possibly needs a bit more cleanup, as you can still configure it with a threshold of 5 nanoseconds. Particularly in the monitoring console you get a funny situation where the threshold is listed as `0` in comparison to something like `5ms`. In this case, you could possibly change it to say the threshold is `<1ms`.
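A possible follow-up along those lines could be a small display rule like the sketch below; the class and method are hypothetical, not part of this PR:

```java
import java.util.concurrent.TimeUnit;

class ThresholdDisplaySketch {

    // render sub-millisecond thresholds as "<1ms" instead of "0"
    static String displayThreshold(long thresholdNanos) {
        long millis = TimeUnit.NANOSECONDS.toMillis(thresholdNanos);
        return millis == 0 ? "<1ms" : millis + "ms";
    }
}
```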
Agreed, I think this should possibly be handled by a separate PR.
👍 I'll create a separate Jira
Just some minor comments, otherwise looking good 👌
...onsole/core/src/main/java/fish/payara/monitoring/store/InMemoryMonitoringDataRepository.java (outdated, resolved)
...ore/src/main/java/fish/payara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java (outdated, resolved)
...ore/src/main/java/fish/payara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java (outdated, resolved)
…nitoring/store/InMemoryMonitoringDataRepository.java Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java Co-Authored-By: Andrew Pielage <[email protected]>
jenkins test please
Summary
Changes on server side:
- `keyed` flag (details below)
- `0` (no retry)

Changes on client side:
Keyed Annotations
Usually annotations are stored in a fixed-size queue where the newest annotation eventually overrides the oldest. When annotations are keyed this behaviour is altered slightly: for a keyed annotation, the first annotation attribute value is considered the key, and all annotations with the same key are automatically removed from the queue when a new annotation is added. Only if the size limit is still exceeded does the newest replace the oldest. This addition makes it possible to effectively update the annotation on one thing, identified by its key, without necessarily removing annotations on other things.
This allows annotations on many things for the same metric series, which avoids unnecessarily detailed metrics for each thing.
In the context of this PR, this means annotations on one thread (usually) do not replace annotations on another thread for the same metric, here stuck thread duration or hogging thread duration.
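A minimal sketch of the queue behaviour described above; the class and its representation of an annotation as an attribute-value array are illustrative, not the actual Payara API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class KeyedAnnotationQueue {

    private final int maxSize;
    // each annotation is modelled as its attribute values; index 0 acts as the key
    private final Deque<String[]> queue = new ArrayDeque<>();

    KeyedAnnotationQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    void add(String[] annotation, boolean keyed) {
        if (keyed) {
            // remove all annotations with the same key (first attribute value)
            queue.removeIf(existing -> existing[0].equals(annotation[0]));
        }
        queue.addLast(annotation);
        if (queue.size() > maxSize) {
            queue.removeFirst(); // only then does the newest replace the oldest
        }
    }
}
```

Removing same-key entries before enforcing the size limit is what lets an update to one thread's annotation leave annotations for other threads untouched.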
Testing
New unit tests have been added for the general mechanics of keyed annotations.
Reading the documentation changes https://github.com/payara/Payara-Server-Documentation/pull/699 might help to understand the context of the feature.
The feature was tested manually according to the test instructions below:
General Setup:
- `set-monitoring-console-configuration --enabled=true` to deploy MC

Testing Stuck Threads Health Checks
By using debug mode:
0. Start MC in debug mode
1. Suspend a request-handling thread at a breakpoint (e.g. in `fish.payara.monitoring.web.MonitoringConsoleResource.getSeriesData(SeriesRequest)`)

Alternatively one could deploy an app with a known REST API method that takes several seconds to complete.
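A hypothetical example of such an app, not part of this PR: a JAX-RS resource whose request takes long enough to be flagged as stuck:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("stuck-test")
public class StuckThreadTestResource {

    @GET
    public String slowRequest() throws InterruptedException {
        Thread.sleep(10_000); // hold the request thread for 10 seconds
        return "done";
    }
}
```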
Testing Hogging Threads Health Checks
With a testing app:
By restarting (there is a chance):
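For the "with a testing app" option, a hypothetical endpoint like the following (again not part of this PR) keeps a thread busy on the CPU so the hogging-threads check should pick it up:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("hog-test")
public class HoggingThreadTestResource {

    @GET
    public String hogCpu() {
        long until = System.currentTimeMillis() + 30_000;
        long counter = 0;
        while (System.currentTimeMillis() < until) {
            counter++; // busy loop, burning CPU time for roughly 30 seconds
        }
        return "looped " + counter + " times";
    }
}
```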