APPSERV-19 Adds monitoring of stuck and hogging threads to monitoring console #4452

Merged: 8 commits merged on Jan 24, 2020

Contributor

@jbee jbee commented Jan 23, 2020

Summary

Changes on server side:

  • extends the annotation mechanism added in APPSERV-14 with the keyed flag (details below)
  • fixes the calculation of how long a thread has been working in the stuck thread check
  • changes stuck thread calculations to be based on millisecond timestamps and durations only (nanosecond precision makes no sense, as it would also require running the check at intervals below 1ms)
  • adds collection of stuck threads metrics and annotations
  • adds watch for stuck threads (based on threshold duration)
  • cleans up and extends the hogging thread calculation
  • lowers hogging thread retry minimum to 0 (no retry)
  • adds collection of hogging threads metrics and annotations
  • adds watch for hogging threads (based on count of hogging threads)

Changes on client side:

  • adds a Threads page preset that shows stuck and hogging thread incidents

Keyed Annotations

Usually annotations are stored in a fixed-size queue where the newest annotation eventually overwrites the oldest. When annotations are keyed, this behaviour is altered slightly. For a keyed annotation the first annotation attribute value is used as the key. When an annotation is added, all existing annotations with the same key are automatically removed from the queue. Only if the size limit is still exceeded does the newest annotation replace the oldest. This makes it possible to effectively update the annotation on the thing identified by the key without necessarily removing annotations on other things.
As a result, a single metric series can carry annotations on many things, which avoids unnecessarily detailed metrics for each thing.

In the context of this PR this means annotations on one thread (usually) do not replace annotations on another thread for the same metric, here stuck thread duration or hogging thread duration.
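
A minimal sketch of the keyed behaviour described above, for illustration only. The names `AnnotationQueue`, `Annotation`, `add` and `snapshot` are hypothetical and not taken from the actual implementation; attributes are assumed to be stored as alternating name/value pairs, so the first attribute value sits at index 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical fixed-size annotation queue (illustration, not the Payara code). */
final class AnnotationQueue {

    /** An annotation; when keyed, the first attribute value acts as the key. */
    static final class Annotation {
        final long time;
        final boolean keyed;
        final String[] attrs; // assumed layout: name, value, name, value, ...

        Annotation(long time, boolean keyed, String... attrs) {
            this.time = time;
            this.keyed = keyed;
            this.attrs = attrs;
        }

        /** Returns the key of a keyed annotation, or null if it has none. */
        String key() {
            return keyed && attrs.length >= 2 ? attrs[1] : null;
        }
    }

    private final int maxSize;
    private final Deque<Annotation> queue = new ArrayDeque<>();

    AnnotationQueue(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
    }

    synchronized void add(Annotation added) {
        String key = added.key();
        if (key != null) {
            // Remove earlier annotations on the same thing first.
            queue.removeIf(old -> key.equals(old.key()));
        }
        // Only if the queue is still full does the oldest entry get dropped.
        while (queue.size() >= maxSize) {
            queue.pollFirst();
        }
        queue.addLast(added);
    }

    /** Returns the current annotations, oldest first. */
    synchronized List<Annotation> snapshot() {
        return new ArrayList<>(queue);
    }
}
```

With a queue of size 2, adding annotations for threads t1, t2 and then t1 again leaves the t2 annotation and the newer t1 annotation; the update on t1 did not push out t2.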

Testing

New unit tests have been added for the general mechanics of keyed annotations.

Reading the documentation changes https://github.com/payara/Payara-Server-Documentation/pull/699 might help to understand the context of the feature.

The feature was tested manually according to the test instructions below:

General Setup:

  1. build, install and start the server
  2. use set-monitoring-console-configuration --enabled=true to deploy MC
  3. open MC at http://localhost:8080/monitoring-console/
  4. make sure browser cache for JS/CSS is cleared for MC's domain

Testing Stuck Threads Health Checks

By using debug mode:

  0. Start MC in debug mode
  1. Open admin console at http://localhost:4848/
  2. navigate to Configurations => server-config => HealthCheck
  3. open tab Stuck Threads, check Enabled, set a Threshold of 1 second and save
  4. put a breakpoint in a method you know is called (e.g. fish.payara.monitoring.web.MonitoringConsoleResource.getSeriesData(SeriesRequest))
  5. after some seconds at the breakpoint, continue execution
  6. check in MC Threads page that annotation(s) occur in widget Stuck Thread Incidents
  7. check in MC Alerts and Health Checks page that a related alert exists

Alternatively one could deploy an app with a known REST API method that takes several seconds to complete.

Testing Hogging Threads Health Checks

With a testing app:

  1. Open admin console at http://localhost:4848/
  2. navigate to Configurations => server-config => HealthCheck
  3. open tab Hogging Threads, check Enabled, set a Threshold of 50% and a Retry Count of 0-1 and save
  4. deploy a test application (see JIRA) with a method that does a busy loop for a given length of time
  5. invoke the method for some seconds
  6. check in MC Threads page that widget Hogging Thread Incidents contains entries, check that Method refers to your method with the loop
  7. check in MC Alerts and Health Checks page that an alert exists
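
For step 4, the busy-loop method could look roughly like the sketch below. This is not the test application referenced in JIRA; `BusyLoop` and `spin` are hypothetical names, and exposing the method (for example through a REST endpoint) is left out.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that keeps the calling thread busy on the CPU. */
final class BusyLoop {

    /**
     * Spins without sleeping or blocking for roughly the given duration,
     * so the calling thread shows up with high CPU usage.
     *
     * @return the number of loop iterations performed
     */
    static long spin(long duration, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(duration);
        long iterations = 0;
        // Compare via subtraction so the check stays correct on nanoTime overflow.
        while (System.nanoTime() - deadline < 0) {
            iterations++;
        }
        return iterations;
    }
}
```

Calling `BusyLoop.spin(10, TimeUnit.SECONDS)` from the invoked method should keep one thread well above a 50% threshold for the duration of the call, assuming the machine is not otherwise saturated.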

By restarting (there is a chance that startup produces hogging threads, but it is not guaranteed):

  1. Open admin console at http://localhost:4848/
  2. navigate to Configurations => server-config => HealthCheck
  3. open tab Hogging Threads, check Enabled, set a Threshold of 1% and a Retry Count of 0-1 and save
  4. restart the server, and wait until server is up again
  5. check in MC Threads page that widget Hogging Thread Incidents contains entries
  6. check in MC Alerts and Health Checks page that an alert exists

@jbee jbee self-assigned this Jan 23, 2020
Contributor Author

jbee commented Jan 23, 2020

jenkins test please

@jbee jbee added this to the 5.201 milestone Jan 23, 2020
@jbee jbee requested review from Cousjava and Pandrex247 January 23, 2020 16:35
times.setEndCpuTime(c);
times.setEndUserTime(u);

long checkTime = getOptions().getUnit().toMillis(getOptions().getTime());
Contributor Author

NB. Using the interval from the options only approximates the actual time passed in the measurement interval. Since the check run by the health check service and the monitoring data collection can now use different intervals, I changed this to use the actual time passed, which is also more accurate.
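
To illustrate the idea (a sketch only, not the code in this PR): the CPU percentage is computed against the wall-clock time that actually elapsed between two samples instead of the configured interval. `CpuUsage` and `percentage` are hypothetical names; the CPU times are assumed to be in nanoseconds, as returned by `ThreadMXBean.getThreadCpuTime(long)`.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for the hogging thread percentage. */
final class CpuUsage {

    /**
     * Percentage of one core a thread used between two samples, based on
     * the wall-clock time that actually passed between them.
     */
    static int percentage(long startCpuNanos, long endCpuNanos,
                          long startWallMillis, long endWallMillis) {
        long elapsedMillis = endWallMillis - startWallMillis;
        if (elapsedMillis <= 0) {
            return 0; // no time passed, nothing meaningful to report
        }
        long cpuMillis = TimeUnit.NANOSECONDS.toMillis(endCpuNanos - startCpuNanos);
        long percent = cpuMillis * 100 / elapsedMillis;
        // Clamp, since the two clocks are sampled at slightly different moments.
        return (int) Math.max(0, Math.min(100, percent));
    }
}
```

For example, 500ms of CPU time over 1000ms of elapsed time gives 50, regardless of what interval the check was configured with.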

ConcurrentHashMap<Long, Long> threads = stuckThreadsStore.getThreads();
for (Long thread : threads.keySet()){
    Long timeHeld = threads.get(thread);
Contributor Author

NB. As discussed in chat, timeHeld here was a semantic confusion: the map contains the timestamp at which the thread started the work, not a duration. I changed the algorithm accordingly and also changed all computations to be based on milliseconds, as nanosecond precision would only make sense if we ran the check often and fast enough (every t with t < 1ms).
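
Roughly, the corrected logic treats the map values as start timestamps and derives the duration from them. The sketch below is illustrative only; `StuckThreads` and `findStuck` are hypothetical names and the map is assumed to hold millisecond timestamps keyed by thread id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for the stuck thread check. */
final class StuckThreads {

    /**
     * Returns thread id to working duration (ms) for every thread that has
     * been working longer than the threshold.
     *
     * @param startedAtMillis thread id to timestamp (ms) when its work started
     */
    static Map<Long, Long> findStuck(Map<Long, Long> startedAtMillis,
                                     long nowMillis, long thresholdMillis) {
        Map<Long, Long> stuck = new LinkedHashMap<>();
        for (Map.Entry<Long, Long> entry : startedAtMillis.entrySet()) {
            // The stored value is a start time, so the duration is derived here.
            long workedMillis = nowMillis - entry.getValue();
            if (workedMillis > thresholdMillis) {
                stuck.put(entry.getKey(), workedMillis);
            }
        }
        return stuck;
    }
}
```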

Member

I think this possibly needs a bit more cleanup, as you can still configure it with a threshold of 5 nanoseconds.

Particularly in the monitoring console you get a funny situation where the Threshold is listed as 0 next to a duration of something like 5ms. In this case, you could possibly change it to say the threshold is <1ms.
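
Something along these lines might do for the display, purely as a sketch: `ThresholdLabel` and `format` are hypothetical names, and the conversion relies on `TimeUnit.toMillis` truncating sub-millisecond values to 0.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical formatter for threshold values shown in the console. */
final class ThresholdLabel {

    /** Renders a threshold in ms, using "<1ms" for positive sub-millisecond values. */
    static String format(long time, TimeUnit unit) {
        long millis = unit.toMillis(time); // truncates, e.g. 5ns becomes 0
        if (millis == 0 && time > 0) {
            return "<1ms";
        }
        return millis + "ms";
    }
}
```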

Contributor Author

Agreed, I think this should possibly be handled by a separate PR.

Member

👍 I'll create a separate Jira

Member

@Pandrex247 Pandrex247 left a comment

Just some minor comments, otherwise looking good 👌

jbee and others added 3 commits January 24, 2020 14:14
…nitoring/store/InMemoryMonitoringDataRepository.java

Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java

Co-Authored-By: Andrew Pielage <[email protected]>
…ara/nucleus/healthcheck/preliminary/HoggingThreadsHealthCheck.java

Co-Authored-By: Andrew Pielage <[email protected]>
Contributor Author

jbee commented Jan 24, 2020

jenkins test please

@jbee jbee merged commit ff7bf07 into payara:master Jan 24, 2020