Add heartbeat to detect down slaves #927
Conversation
The PUSH and PULL sockets being used caused hatch messages to be routed to slaves that may have become unresponsive or crashed. This change includes the client id in the messages sent out from the master, which ensures that hatch messages go only to slaves that are READY or RUNNING. This should also fix issue locustio#911, where slaves are not receiving the stop message. I think these issues are a result of the round-robin approach that PUSH-PULL sockets use.
The server checks whether clients have expired and, if they have, updates their status to "missing". The client has a worker that sends a heartbeat on a regular interval. The heartbeat also relays the slave state back to the master so that the two stay in sync.
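To illustrate the routing idea (this is a minimal sketch using pyzmq over an inproc transport, not Locust's actual implementation; the socket names and messages are hypothetical): with ROUTER/DEALER sockets the master can address a specific slave by its identity, instead of PUSH/PULL handing messages round-robin to whichever peer is next, alive or not.

```python
import zmq

# Hypothetical sketch: a ROUTER "master" addressing one "slave" by identity.
ctx = zmq.Context.instance()

master = ctx.socket(zmq.ROUTER)
master.bind("inproc://demo")

slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("inproc://demo")

# Slave announces itself; the ROUTER sees its identity in the envelope.
slave.send(b"ready")
identity, msg = master.recv_multipart()

# Master sends a hatch message to that specific slave by identity prefix,
# rather than letting a PUSH socket pick the next peer round-robin.
master.send_multipart([b"slave-1", b"hatch"])
received = slave.recv()
print(received)  # b'hatch'
```

With this pattern the master only sends to identities it has recently heard a heartbeat from, which is what keeps hatch and stop messages away from dead slaves.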
Wait until all slaves report in as ready before marking the master as stopped.
Codecov Report
@@            Coverage Diff            @@
##           master     #927     +/-  ##
=========================================
+ Coverage   66.55%   67.52%   +0.97%
=========================================
  Files          14       14
  Lines        1438     1472      +34
  Branches      226      230       +4
=========================================
+ Hits          957      994      +37
+ Misses        430      429       -1
+ Partials       51       49       -2
Continue to review full report at Codecov.
I think this looks better than using msg[1].
Using parse_options during test setup can conflict with test runners like pytest. Essentially it swallows the options meant for the test runner and instead treats them as options passed to the test itself.
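A small illustration of the conflict (the parser and flags here are hypothetical, not Locust's actual options): an argparse-style parser that defaults to reading `sys.argv` during test setup will try to consume arguments that were meant for pytest.

```python
import argparse

# Hypothetical parser standing in for something like parse_options.
parser = argparse.ArgumentParser()
parser.add_argument("--host", default="localhost")

# Bad in tests: parser.parse_args() with no arguments reads sys.argv[1:],
# so runner flags like pytest's `-k pattern` would reach this parser and
# either error out or be swallowed as test options.

# Safer: pass an explicit argument list so the runner's argv is untouched.
opts = parser.parse_args(["--host", "example.com"])
print(opts.host)  # example.com
```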
Coverage breaks with gevent and does not fully report green threads as having been tested. Setting concurrency in .coveragerc will fix the issue. https://bitbucket.org/ned/coveragepy/issues/149/coverage-gevent-looks-broken
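For reference, the fix described there is a one-line setting in `.coveragerc` (shown here as the relevant fragment only):

```ini
[run]
concurrency = gevent
```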
Checks that start_hatching sends messages to ready, running, and hatching slaves.
Hi @Jonnymcc, it's a very useful feature. So far, the master cannot detect slaves that exit unexpectedly. But can you keep it compatible with previous versions of locust and third-party tools? What about binding to another port?
@myzhan I think we can support previous versions. Let me know if I understand what is needed though.
By compatible with previous versions, are you implying running a master/client with a master/client of another version? That sounds like asking for bugs regardless of the changes made in this PR.
@Jonnymcc 👍 this is really nice!
@cgoldberg Any plans for a release? I checked, and the last release was in September 2018. This is quite a major fix; do you think we should have a release now?
@mangatmodi this is now released 🎉
I guess the fix to stop the test after hitting the stop button in the UI did not make it into the 0.11.0 release.
I have noticed some problems bringing up master and slave nodes within Kubernetes. In some other GitHub issues for Locust, it was suggested that these may be solved by adding a heartbeat between the nodes.
The changes include the addition of a heartbeat thread on the master. The slaves check in with the master by means of their own heartbeat worker thread. There is also a new state (missing) that will show in the UI if a slave fails to check in with the heartbeat to the master.
I also think that having the slaves send their current state along with the heartbeat prevents some stale state on the master. If a slave fails to check in and goes missing, but later resumes checking in, how else would the master know what that slave is now doing?
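The master-side expiry check described above can be sketched roughly like this (a minimal illustration; the names, interval, and liveness count are assumptions, not Locust's actual code):

```python
# Hypothetical constants: how many missed beats are tolerated, and how
# often the master runs its expiry check.
HEARTBEAT_LIVENESS = 3
HEARTBEAT_INTERVAL = 1.0  # seconds between expiry checks

class SlaveNode:
    def __init__(self, client_id):
        self.client_id = client_id
        self.state = "ready"
        self.heartbeat = HEARTBEAT_LIVENESS

slaves = {"slave-1": SlaveNode("slave-1")}

def on_heartbeat(client_id, state):
    # Each beat refreshes the countdown and syncs the slave's reported state.
    node = slaves[client_id]
    node.heartbeat = HEARTBEAT_LIVENESS
    node.state = state

def check_expired():
    # Runs once per HEARTBEAT_INTERVAL on the master: decrement each
    # slave's countdown and mark it "missing" when it reaches zero.
    for node in slaves.values():
        node.heartbeat -= 1
        if node.heartbeat <= 0 and node.state != "missing":
            node.state = "missing"

# Three missed intervals flip the slave to "missing"; a later heartbeat
# with the slave's reported state brings the master back in sync.
for _ in range(HEARTBEAT_LIVENESS):
    check_expired()
print(slaves["slave-1"].state)  # missing
```

A missing slave that starts beating again is restored to whatever state it reports, which is the stale-state fix mentioned above.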
I got my inspiration from the ZMQ guide, specifically this pattern...
http://zguide.zeromq.org/page:all#Robust-Reliable-Queuing-Paranoid-Pirate-Pattern
Thoughts, comments?