[Broker] Make health check fail if dead locked threads are detected #15155
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java
I was able to add a unit test. It demonstrates what gets logged when a deadlock is detected. Here's some of the logging:
@lhotari What are the performance implications of this change? https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html#findDeadlockedThreads() says "It might be an expensive operation." It is a good thing for the admin API/REST API to be able to do this on demand, for troubleshooting.
I don't think that there's a performance concern in this case. It takes about 50 ms to run 1000 iterations of this:

    @Test(timeOut = 5000L)
    public void testDeadlockDetectionOverhead() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        // Call the deadlock detection repeatedly to measure its overhead
        for (int i = 0; i < 1000; i++) {
            long[] threadIds = threadBean.findDeadlockedThreads();
        }
    }

The healthcheck endpoint usually gets called a few times a minute at most. The existing healthcheck is very heavyweight compared to this check, which takes about 50 microseconds per call.
I agree with @dlg99 that we should be careful not to introduce a performance regression here. However, I think @lhotari is correct to point out that the current health check is very expensive. It uses a Pulsar client to produce and consume a message. Also, given the unfortunate frequency of deadlocks, I think this will enhance the value of the health check.
…pache#15155) * [Broker] Make health check fail if dead locked threads are detected * Add unit test for detecting a dead lock * Use lockInterruptibly to unlock the deadlock and wait for threads to finish * Add test for testing the deadlock detection overhead (cherry picked from commit df0c110)
…pache#15155) * [Broker] Make health check fail if dead locked threads are detected * Add unit test for detecting a dead lock * Use lockInterruptibly to unlock the deadlock and wait for threads to finish * Add test for testing the deadlock detection overhead (cherry picked from commit df0c110) (cherry picked from commit 6b163f1)
Motivation
The Admin API contains a health check endpoint. A recurring problem in Pulsar has been bugs that cause thread deadlocks. Such a deadlock might not be detected by the existing health check, which sends a message and consumes it.
The JVM exposes methods in the JMX API to detect deadlocked threads. It's useful to make the health check fail if any deadlocked threads are detected.
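As a rough illustration of the idea, the sketch below uses `ThreadMXBean.findDeadlockedThreads()` to fail a check when deadlocked threads exist. It is not the code from `BrokersBase.java`; the class `DeadlockCheckSketch` and all of its method names are hypothetical and exist only for this example. It also shows one way a test might create a deadlock on purpose and later break it by interrupting threads that wait in `lockInterruptibly()`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration only; not part of Pulsar.
final class DeadlockCheckSketch {

    /** Returns the ids of deadlocked threads, or an empty array when there are none. */
    static long[] deadlockedThreadIds() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] ids = threadBean.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? new long[0] : ids;
    }

    /** Assumed shape of a health check step: throws if any thread is deadlocked. */
    static void checkNoDeadlocks() {
        long[] ids = deadlockedThreadIds();
        if (ids.length == 0) {
            return;
        }
        StringBuilder message = new StringBuilder("Deadlocked threads detected:");
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().getThreadInfo(ids)) {
            if (info != null) { // a thread may have terminated in the meantime
                message.append(' ').append(info.getThreadName());
            }
        }
        throw new IllegalStateException(message.toString());
    }

    /** Starts two threads that acquire the same two locks in opposite order. */
    static Thread[] startDeadlockedThreads() {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockBoth(lockA, lockB, bothHoldFirstLock), "deadlock-1");
        Thread t2 = new Thread(() -> lockBoth(lockB, lockA, bothHoldFirstLock), "deadlock-2");
        t1.start();
        t2.start();
        return new Thread[] {t1, t2};
    }

    private static void lockBoth(ReentrantLock first, ReentrantLock second, CountDownLatch latch) {
        try {
            first.lockInterruptibly();
            try {
                latch.countDown();
                latch.await();              // wait until the other thread holds its first lock
                second.lockInterruptibly(); // blocks until this thread is interrupted
                second.unlock();
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit the thread, which breaks the deadlock
        }
    }

    /** Polls until a deadlock is reported or the timeout elapses. */
    static boolean awaitDeadlock(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (deadlockedThreadIds().length > 0) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return deadlockedThreadIds().length > 0;
    }

    /** Interrupts the given threads and waits for them to finish. */
    static void interruptAndJoin(Thread[] threads) {
        for (Thread t : threads) {
            t.interrupt();
        }
        for (Thread t : threads) {
            try {
                t.join(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The locks are taken with `lockInterruptibly()` rather than `lock()` so that a test can clean up after itself: interrupting the two threads makes them give up waiting, release the lock they hold, and exit.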
Modifications