[Broker] Make health check fail if dead locked threads are detected #15155

lhotari · 2022-04-13T08:57:01Z

Motivation

The Admin API contains a health check endpoint. A common problem with Pulsar has been that there's a bug that causes a dead lock. This thread dead lock might not be detected by the health check that sends a message and consumes it.
The Java JVM contains methods in the JMX API to detect dead locked threads. It's useful to make the health check fail if any dead locked threads are detected.

Modifications

add additional check to health check endpoint to detect deadlocked threads
there was a small issue in the existing ThreadDumpUtil so that it used a limited method for detecting thread dead locks, "findMonitorDeadlockedThreads". The correct method to use is "findDeadlockedThreads" which will detect more case (javadoc: "Finds cycles of threads that are in deadlock waiting to acquire object monitors or ownable synchronizers.").
log the full thread dump of all threads when a dead lock is detected. A timestamp field is used to track when the thread dump was logged so that subsequent checks won't log the full thread dump for the next 10 minutes since the thread dump could overwhelm the logging system.
- in common use cases when the health check is used for k8s liveness check would be that k8s calls the check a few times and after a configured number of failed checks, k8s would stop the failed pod and start a new one

pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java

lhotari · 2022-04-20T16:09:21Z

I was able to add a unit test org.apache.pulsar.broker.admin.AdminApiHealthCheckTest#testHealthCheckupDetectsDeadlock.

It demonstrates what gets logged when a deadlock is detected.

Here's some of the logging:

Found one Java-level deadlock:
=============================
"deadlockthread-1":
  waiting for ownable synchronizer 0x00000000d896b550, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "deadlockthread-2"
"deadlockthread-2":
  waiting for ownable synchronizer 0x00000000d896b4d8, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "deadlockthread-1"

Java stack information for the threads listed above:
===================================================
"deadlockthread-1":
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000d896b550> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.deadlock(AdminApiHealthCheckTest.java:133)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.lambda$testHealthCheckupDetectsDeadlock$4(AdminApiHealthCheckTest.java:109)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest$$Lambda$720/0x00000001008abc40.run(Unknown Source)
	at java.lang.Thread.run([email protected]/Thread.java:829)
"deadlockthread-2":
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000d896b4d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.deadlock(AdminApiHealthCheckTest.java:133)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.lambda$testHealthCheckupDetectsDeadlock$5(AdminApiHealthCheckTest.java:113)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest$$Lambda$721/0x00000001008ab040.run(Unknown Source)
	at java.lang.Thread.run([email protected]/Thread.java:829)

Found 1 deadlock.

dlg99 · 2022-04-20T16:49:15Z

@lhotari What are the performance implications of this change?

https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#findDeadlockedThreads() says "It might be an expensive operation."
I think it may require a safepoint and can effectively pause JVM/result in the latency spikes every time health check runs (every minute?).
I'd at least put it behind a feature flag.

It is a good thing to have for the admin API/rest API to do this on demand, for troubleshooting.

lhotari · 2022-04-20T17:08:34Z

@lhotari What are the performance implications of this change?

https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#findDeadlockedThreads() says "It might be an expensive operation." I think it may require a safepoint and can effectively pause JVM/result in the latency spikes every time health check runs (every minute?). I'd at least put it behind a feature flag.

It is a good thing to have for the admin API/rest API to do this on demand, for troubleshooting.

I don't think that there's a performance concern in this case. it takes about 50ms to run this:

@Test(timeOut = 5000L)
public void testDeadlockDetectionOverhead() {
    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    for (int i=0; i < 1000; i++) {
        long[] threadIds = threadBean.findDeadlockedThreads();
    }
}

The healthcheck endpoint gets called usually a few times a minute at most. The existing healthcheck is very heavyweight compared to this check that takes 50 microseconds to execute.

michaeljmarshall · 2022-04-20T20:23:13Z

The healthcheck endpoint gets called usually a few times a minute at most. The existing healthcheck is very heavyweight compared to this check that takes 50 microseconds to execute.

I agree with @dlg99 that we should be careful not to introduce a performance regression here. However, I think @lhotari is correct to point out that the current health check is very expensive. It uses a pulsar client to produce and consume a message. Also, given the unfortunate frequency of deadlock, I think this will enhance the value of the health check.

…finish

…pache#15155) * [Broker] Make health check fail if dead locked threads are detected * Add unit test for detecting a dead lock * Use lockInterruptibly to unlock the deadlock and wait for threads to finish * Add test for testing the deadlock detection overhead (cherry picked from commit df0c110)

…pache#15155) * [Broker] Make health check fail if dead locked threads are detected * Add unit test for detecting a dead lock * Use lockInterruptibly to unlock the deadlock and wait for threads to finish * Add test for testing the deadlock detection overhead (cherry picked from commit df0c110) (cherry picked from commit 6b163f1)

…pache#15155) * [Broker] Make health check fail if dead locked threads are detected * Add unit test for detecting a dead lock * Use lockInterruptibly to unlock the deadlock and wait for threads to finish * Add test for testing the deadlock detection overhead (cherry picked from commit df0c110)

[Broker] Make health check fail if dead locked threads are detected

925dbd4

lhotari added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages area/broker doc-not-needed Your PR changes do not impact docs labels Apr 13, 2022

lhotari self-assigned this Apr 13, 2022

lhotari requested review from eolivelli, Technoboy-, Jason918, addisonj, massakam and michaeljmarshall April 13, 2022 08:57

lhotari mentioned this pull request Apr 13, 2022

[Issue 13129] [pulsar-metadata] Add watchdog thread in metadata store and track long running tasks. #13130

Closed

2 tasks

Jason918 reviewed Apr 13, 2022

View reviewed changes

pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java Show resolved Hide resolved

RobertIndie added this to the 2.11.0 milestone Apr 13, 2022

RobertIndie approved these changes Apr 13, 2022

View reviewed changes

shibd reviewed Apr 13, 2022

View reviewed changes

pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java Show resolved Hide resolved

michaeljmarshall approved these changes Apr 13, 2022

View reviewed changes

pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java Show resolved Hide resolved

pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java Show resolved Hide resolved

lhotari requested a review from dlg99 April 20, 2022 16:18

nicoloboschi approved these changes Apr 20, 2022

View reviewed changes

lhotari added 3 commits April 21, 2022 08:36

Add unit test for detecting a dead lock

8d6fcc4

Use lockInterruptibly to unlock the deadlock and wait for threads to …

52328b1

…finish

Add test for testing the deadlock detection overhead

612b6ba

lhotari force-pushed the lh-detect-deadlocked-threads branch from ae574d2 to 612b6ba Compare April 21, 2022 05:39

lhotari merged commit df0c110 into apache:master Apr 21, 2022

lhotari mentioned this pull request Apr 22, 2022

Fix flaky test AdminApiHealthCheckTest. Wait for deadlocked status to clear before continuing #15266

Merged

tisonkun mentioned this pull request Dec 10, 2022

Add a watchdog monitor to monitor and log long-running metadata-store callback tasks #13129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Broker] Make health check fail if dead locked threads are detected #15155

[Broker] Make health check fail if dead locked threads are detected #15155

lhotari commented Apr 13, 2022

lhotari commented Apr 20, 2022

dlg99 commented Apr 20, 2022

lhotari commented Apr 20, 2022

michaeljmarshall commented Apr 20, 2022

[Broker] Make health check fail if dead locked threads are detected #15155

[Broker] Make health check fail if dead locked threads are detected #15155

Conversation

lhotari commented Apr 13, 2022

Motivation

Modifications

lhotari commented Apr 20, 2022

dlg99 commented Apr 20, 2022

lhotari commented Apr 20, 2022

michaeljmarshall commented Apr 20, 2022