APPSERV-19 Adds monitoring of stuck and hogging threads to monitoring console #4452
Conversation
jenkins test please
times.setEndCpuTime(c);
times.setEndUserTime(u);

long checkTime = getOptions().getUnit().toMillis(getOptions().getTime());
NB. Using the interval from the options is only an approximation of the actual time passed in the measurement interval. Since the check run by the health check service and the monitoring data collection can now use different intervals, I changed this to use the actual time passed, which is also more accurate.
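A minimal sketch of that idea, where the class and the `lastCheckMillis` field are assumptions for illustration and not the actual Payara code:

```java
class ElapsedTimeSketch {

    private long lastCheckMillis;

    long actualCheckTimeMillis() {
        long now = System.currentTimeMillis();
        // actual wall-clock time since the previous collection,
        // rather than the interval configured in the options
        long checkTime = lastCheckMillis == 0 ? 0 : now - lastCheckMillis;
        lastCheckMillis = now;
        return checkTime;
    }
}
```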
ConcurrentHashMap<Long, Long> threads = stuckThreadsStore.getThreads();
for (Long thread : threads.keySet()){
    Long timeHeld = threads.get(thread);
NB. As discussed in chat, the `timeHeld` here was a semantic confusion: the map actually contained the timestamp at which the thread started the work. I changed the algorithm accordingly and also changed all computations to be based on milliseconds, since nanosecond precision would only make sense if the check ran often and fast enough (every t with t < 1ms).
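A rough sketch of the corrected computation as described above; apart from the thread-ID-to-timestamp map, the class and parameter names are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StuckThreadCheckSketch {

    // the store maps thread ID -> timestamp (in ms) at which the thread started its work
    void checkStuckThreads(ConcurrentHashMap<Long, Long> threads, long thresholdInMillis) {
        long now = System.currentTimeMillis();
        for (Map.Entry<Long, Long> entry : threads.entrySet()) {
            long startedWorkingAt = entry.getValue();
            long timeHeld = now - startedWorkingAt; // duration held, in milliseconds
            if (timeHeld > thresholdInMillis) {
                // report thread entry.getKey() as stuck for timeHeld milliseconds
            }
        }
    }
}
```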
I think this possibly needs a bit more cleanup, as you can still configure it with a threshold of 5 nanoseconds. Particularly in the monitoring console you get a funny situation where the threshold is listed as `0` in comparison to something like `5ms`. In this case, you could possibly change it to say the threshold is `<1ms`.
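A possible follow-up along those lines could be a small display rule like the sketch below; the class and method are hypothetical, not part of this PR:

```java
import java.util.concurrent.TimeUnit;

class ThresholdDisplaySketch {

    // render sub-millisecond thresholds as "<1ms" instead of "0"
    static String displayThreshold(long thresholdNanos) {
        long millis = TimeUnit.NANOSECONDS.toMillis(thresholdNanos);
        return millis == 0 ? "<1ms" : millis + "ms";
    }
}
```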
Agreed, I think this should possibly be handled by a separate PR.
👍 I'll create a separate Jira
Just some minor comments, otherwise looking good 👌
...onsole/core/src/main/java/fish/payara/monitoring/store/InMemoryMonitoringDataRepository.java (outdated, resolved)
...ore/src/main/java/fish/payara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java (outdated, resolved)
...ore/src/main/java/fish/payara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java (outdated, resolved)
…nitoring/store/InMemoryMonitoringDataRepository.java Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java Co-Authored-By: Andrew Pielage <[email protected]>
jenkins test please
Summary
Changes on server side:
- `keyed` flag (details below)
- `0` (no retry)

Changes on client side:
Keyed Annotations
Usually annotations are stored in a fixed-size queue where the newest annotation eventually overrides the oldest. When annotations are keyed this behaviour is altered slightly: for a keyed annotation, the first annotation attribute value is considered the key, and all annotations with the same key are automatically removed from the queue when a new annotation is added. Only if the size limit is still exceeded does the newest replace the oldest. This addition makes it possible to effectively update the annotation on one thing, identified by its key, without necessarily removing annotations on other things.
This allows annotations on many things for the same metric series, which avoids unnecessarily detailed metrics for each thing.
In the context of this PR, this means annotations on one thread (usually) do not replace annotations on another thread for the same metric, here stuck thread duration or hogging thread duration.
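A minimal sketch of the queue behaviour described above; the class and its representation of an annotation as an attribute-value array are illustrative, not the actual Payara API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class KeyedAnnotationQueue {

    private final int maxSize;
    // each annotation is modelled as its attribute values; index 0 acts as the key
    private final Deque<String[]> queue = new ArrayDeque<>();

    KeyedAnnotationQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    void add(String[] annotation, boolean keyed) {
        if (keyed) {
            // remove all annotations with the same key (first attribute value)
            queue.removeIf(existing -> existing[0].equals(annotation[0]));
        }
        queue.addLast(annotation);
        if (queue.size() > maxSize) {
            queue.removeFirst(); // only then does the newest replace the oldest
        }
    }
}
```

Removing same-key entries before enforcing the size limit is what lets an update to one thread's annotation leave annotations for other threads untouched.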
Testing
New unit tests have been added for the general mechanics of keyed annotations.
Reading the documentation changes https://github.com/payara/Payara-Server-Documentation/pull/699 might help to understand the context of the feature.
The feature was tested manually according to the test instructions below:
General Setup:
- `set-monitoring-console-configuration --enabled=true` to deploy MC

Testing Stuck Threads Health Checks
By using debug mode:
0. Start MC in debug mode
1. Suspend a request-handling thread at a breakpoint (e.g. in `fish.payara.monitoring.web.MonitoringConsoleResource.getSeriesData(SeriesRequest)`)

Alternatively one could deploy an app with a known REST API method that takes several seconds to complete.
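A hypothetical example of such an app, not part of this PR: a JAX-RS resource whose request takes long enough to be flagged as stuck:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("stuck-test")
public class StuckThreadTestResource {

    @GET
    public String slowRequest() throws InterruptedException {
        Thread.sleep(10_000); // hold the request thread for 10 seconds
        return "done";
    }
}
```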
Testing Hogging Threads Health Checks
With a testing app:
By restarting (there is a chance):
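For the "with a testing app" option, a hypothetical endpoint like the following (again not part of this PR) keeps a thread busy on the CPU so the hogging-threads check should pick it up:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("hog-test")
public class HoggingThreadTestResource {

    @GET
    public String hogCpu() {
        long until = System.currentTimeMillis() + 30_000;
        long counter = 0;
        while (System.currentTimeMillis() < until) {
            counter++; // busy loop, burning CPU time for roughly 30 seconds
        }
        return "looped " + counter + " times";
    }
}
```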