Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Broker] Make health check fail if dead locked threads are detected #15155

Merged
merged 4 commits into from
Apr 21, 2022

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Apr 13, 2022

Motivation

The Admin API contains a health check endpoint. A common problem with Pulsar has been that there's a bug that causes a dead lock. This thread dead lock might not be detected by the health check that sends a message and consumes it.
The Java JVM contains methods in the JMX API to detect dead locked threads. It's useful to make the health check fail if any dead locked threads are detected.

Modifications

  • add additional check to health check endpoint to detect deadlocked threads
  • there was a small issue in the existing ThreadDumpUtil so that it used a limited method for detecting thread dead locks, "findMonitorDeadlockedThreads". The correct method to use is "findDeadlockedThreads" which will detect more case (javadoc: "Finds cycles of threads that are in deadlock waiting to acquire object monitors or ownable synchronizers.").
  • log the full thread dump of all threads when a dead lock is detected. A timestamp field is used to track when the thread dump was logged so that subsequent checks won't log the full thread dump for the next 10 minutes since the thread dump could overwhelm the logging system.
    • in common use cases when the health check is used for k8s liveness check would be that k8s calls the check a few times and after a configured number of failed checks, k8s would stop the failed pod and start a new one

@lhotari lhotari added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages area/broker doc-not-needed Your PR changes do not impact docs labels Apr 13, 2022
@lhotari lhotari self-assigned this Apr 13, 2022
@RobertIndie RobertIndie added this to the 2.11.0 milestone Apr 13, 2022
@lhotari
Copy link
Member Author

lhotari commented Apr 20, 2022

I was able to add a unit test org.apache.pulsar.broker.admin.AdminApiHealthCheckTest#testHealthCheckupDetectsDeadlock.

It demonstrates what gets logged when a deadlock is detected.

Here's some of the logging:

Found one Java-level deadlock:
=============================
"deadlockthread-1":
  waiting for ownable synchronizer 0x00000000d896b550, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "deadlockthread-2"
"deadlockthread-2":
  waiting for ownable synchronizer 0x00000000d896b4d8, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "deadlockthread-1"

Java stack information for the threads listed above:
===================================================
"deadlockthread-1":
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000d896b550> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.deadlock(AdminApiHealthCheckTest.java:133)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.lambda$testHealthCheckupDetectsDeadlock$4(AdminApiHealthCheckTest.java:109)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest$$Lambda$720/0x00000001008abc40.run(Unknown Source)
	at java.lang.Thread.run([email protected]/Thread.java:829)
"deadlockthread-2":
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000000d896b4d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.deadlock(AdminApiHealthCheckTest.java:133)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest.lambda$testHealthCheckupDetectsDeadlock$5(AdminApiHealthCheckTest.java:113)
	at org.apache.pulsar.broker.admin.AdminApiHealthCheckTest$$Lambda$721/0x00000001008ab040.run(Unknown Source)
	at java.lang.Thread.run([email protected]/Thread.java:829)

Found 1 deadlock.

@lhotari lhotari requested a review from dlg99 April 20, 2022 16:18
@dlg99
Copy link
Contributor

dlg99 commented Apr 20, 2022

@lhotari What are the performance implications of this change?

https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#findDeadlockedThreads() says "It might be an expensive operation."
I think it may require a safepoint and can effectively pause JVM/result in the latency spikes every time health check runs (every minute?).
I'd at least put it behind a feature flag.

It is a good thing to have for the admin API/rest API to do this on demand, for troubleshooting.

@lhotari
Copy link
Member Author

lhotari commented Apr 20, 2022

@lhotari What are the performance implications of this change?

https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#findDeadlockedThreads() says "It might be an expensive operation." I think it may require a safepoint and can effectively pause JVM/result in the latency spikes every time health check runs (every minute?). I'd at least put it behind a feature flag.

It is a good thing to have for the admin API/rest API to do this on demand, for troubleshooting.

I don't think that there's a performance concern in this case. it takes about 50ms to run this:

@Test(timeOut = 5000L)
public void testDeadlockDetectionOverhead() {
    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    for (int i=0; i < 1000; i++) {
        long[] threadIds = threadBean.findDeadlockedThreads();
    }
}

The healthcheck endpoint gets called usually a few times a minute at most. The existing healthcheck is very heavyweight compared to this check that takes 50 microseconds to execute.

@michaeljmarshall
Copy link
Member

The healthcheck endpoint gets called usually a few times a minute at most. The existing healthcheck is very heavyweight compared to this check that takes 50 microseconds to execute.

I agree with @dlg99 that we should be careful not to introduce a performance regression here. However, I think @lhotari is correct to point out that the current health check is very expensive. It uses a pulsar client to produce and consume a message. Also, given the unfortunate frequency of deadlock, I think this will enhance the value of the health check.

@lhotari lhotari force-pushed the lh-detect-deadlocked-threads branch from ae574d2 to 612b6ba Compare April 21, 2022 05:39
@lhotari lhotari merged commit df0c110 into apache:master Apr 21, 2022
lhotari added a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
…pache#15155)

* [Broker] Make health check fail if dead locked threads are detected

* Add unit test for detecting a dead lock

* Use lockInterruptibly to unlock the deadlock and wait for threads to finish

* Add test for testing the deadlock detection overhead

(cherry picked from commit df0c110)
lhotari added a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
…pache#15155)

* [Broker] Make health check fail if dead locked threads are detected

* Add unit test for detecting a dead lock

* Use lockInterruptibly to unlock the deadlock and wait for threads to finish

* Add test for testing the deadlock detection overhead

(cherry picked from commit df0c110)
lhotari added a commit to datastax/pulsar that referenced this pull request Apr 22, 2022
…pache#15155)

* [Broker] Make health check fail if dead locked threads are detected

* Add unit test for detecting a dead lock

* Use lockInterruptibly to unlock the deadlock and wait for threads to finish

* Add test for testing the deadlock detection overhead

(cherry picked from commit df0c110)
(cherry picked from commit 6b163f1)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Apr 28, 2022
…pache#15155)

* [Broker] Make health check fail if dead locked threads are detected

* Add unit test for detecting a dead lock

* Use lockInterruptibly to unlock the deadlock and wait for threads to finish

* Add test for testing the deadlock detection overhead

(cherry picked from commit df0c110)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker doc-not-needed Your PR changes do not impact docs type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants