
Web UI does not stop slave servers #911

Closed

giantryansaul opened this issue Nov 7, 2018 · 24 comments

Comments

@giantryansaul
Contributor

giantryansaul commented Nov 7, 2018

Description of issue / feature request

When clicking "Stop" in the web UI, the users on the slave servers do not stop sending requests. The master instance still reports the test as "running", but the stop button has disappeared.

Expected behavior

All slave servers are stopped and the user count goes back to 0.

Actual behavior

All slave servers remain active and the count of users is over 0.

Environment settings (for bug reports)

  • OS: Ubuntu 16.04
  • Python version: 3.6
  • Locust version: 0.9.0

Steps to reproduce (for bug reports)

(I can't share my current code, but it is a very simple mix of GET and POST requests with no processing or validation of data; a rough approximation is sketched after the steps below.)

  • Set up a simple requests task.
  • Set up 3 or more slave servers with a master server to distribute the task.
  • Run in web UI mode.
  • Run for over 20 minutes.
  • Click the Stop button in the web UI.
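Since the original code could not be shared, here is a hypothetical minimal locustfile approximating the setup described above (Locust 0.9-style API; the endpoints, task weights, and wait times are assumptions, not the reporter's actual code):

from locust import HttpLocust, TaskSet, task

class SimpleTasks(TaskSet):
    @task(3)
    def get_index(self):
        # plain GET, response is not inspected or validated
        self.client.get("/")

    @task(1)
    def post_item(self):
        # plain POST with a small JSON body, response not inspected
        self.client.post("/items", json={"name": "example"})

class WebsiteUser(HttpLocust):
    task_set = SimpleTasks
    min_wait = 1000  # ms
    max_wait = 3000  # ms

Started roughly as "locust -f locustfile.py --master" on the master and "locust -f locustfile.py --slave --master-host=<master-ip>" on three or more slave machines.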
Jonnymcc pushed a commit to Jonnymcc/locust that referenced this issue Dec 10, 2018
The PUSH and PULL sockets being used caused hatch messages to get routed
to slaves that may have become unresponsive or crashed. This change
includes the client id in the messages sent out from the master which
ensures that hatch messages are going to slaves that are READY or
RUNNING.

This should also fix the issue locustio#911 where slaves are not receiving the
stop message. I think these issues are a result of PUSH-PULL sockets
using a round robin approach.
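As an aside, here is a rough, self-contained pyzmq sketch (not Locust's actual code) of the difference this commit describes: a ROUTER socket lets the master address a message to one specific slave identity, whereas a PUSH socket round-robins messages across whichever peers are connected, including dead ones. The port and identity values below are assumptions.

import zmq

ctx = zmq.Context()

# "Master" side: a ROUTER socket can target a specific connected peer.
master = ctx.socket(zmq.ROUTER)
master.bind("tcp://*:5557")

# "Slave" side: a DEALER socket with an explicit identity.
slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("tcp://localhost:5557")

# The slave announces itself; the ROUTER sees the sender's identity frame.
slave.send(b"client_ready")
identity, msg = master.recv_multipart()      # identity == b"slave-1"

# The master can now send "stop" to exactly this slave instead of
# whichever peer the round-robin happens to pick.
master.send_multipart([identity, b"stop"])
print(slave.recv())                          # b"stop"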
@Jonnymcc
Contributor

Hi @giantryansaul, I have created PR #927, which I believe will solve the issue you are experiencing. If you have time, can you test it out? Thanks!

@giantryansaul
Contributor Author

Thanks @Jonnymcc, I don't have any time currently, but I'll see if I can set up a test next week.

@andrerivas

Hey @Jonnymcc, I've tried to use the branch you created but it still does not kill slaves. Not permanently, anyway. I have seen that pressing Stop will briefly stop all requests, but after a few seconds, the slaves start sending requests again.

Is this the correct way to install your branch for testing?

pip install git+https://github.com/Jonnymcc/locust.git@heartbeat#egg=locustio

cgoldberg pushed a commit that referenced this issue Feb 6, 2019
* Replace zmq sockets with one DEALER-ROUTER socket

The PUSH and PULL sockets being used caused hatch messages to get routed
to slaves that may have become unresponsive or crashed. This change
includes the client id in the messages sent out from the master which
ensures that hatch messages are going to slaves that are READY or
RUNNING.

This should also fix the issue #911 where slaves are not receiving the
stop message. I think these issues are a result of PUSH-PULL sockets
using a round robin approach.

* Remove client_id parameter from send_multipart method

* Add heartbeat worker to server and client

The server checks to see if clients have expired and, if they have,
updates their status to "missing".

The client has a worker that will send a heartbeat on a regular
interval. The heartbeat also relays the slave state back to the
master so that they stay in sync.

* Use new clients.all property in heartbeat worker

* Fix reporting of stopped state

Wait until all slaves are reporting in as ready before stating
that the master is stopped.

* Fix tests after changing ZMQ sockets to DEALER-ROUTER

* Change heartbeat log msg to info so that it does not appear in tests

* Add tests for zmqrpc.py

* Remove commented imports, add note about sleep

* Support str/unicode diff in py2 vs py3

* Ensure failed zmqrpc tests clean up bound sockets

* Create throwaway variable for identity from ZMQ message

I think this looks better than using msg[1].

* Replace usage of parse_options in tests with mock options

Using parse_options during test setup can conflict with test runners
like pytest. Essentially, it will swallow up the options that are
meant to be passed to the test runner and instead treat them
as options being passed to the test.

* Set coverage concurrency to gevent

Coverage breaks with gevent and does not fully report green threads
as having been tested. Setting concurrency in .coveragerc will
fix the issue. https://bitbucket.org/ned/coveragepy/issues/149/coverage-gevent-looks-broken

* Add test that shows master heartbeat worker marks slaves missing

* Add assertions to test_zmqrpc.py

* Use unittest assertions

* Change assertion value to bytes object

* Add cmdline options for heartbeat liveness and interval

* Add new option heartbeat_liveness to test_runners mock options

* Ensure SlaveNode class uses heartbeat_liveness default or passed

* Ensure hatch data can be updated for slaves currently hatching

* Add test for start hatching accepted slave states

Checks that start_hatching sends messages to ready, running, and
hatching slaves.

* Remove unneeded imports of mock
@Jonnymcc
Copy link
Contributor

Jonnymcc commented Feb 8, 2019

That is odd. Looks like you are installing it the right way. In my testing, even today, I cannot replicate the problem you are seeing. Provided the stop signal is sent and the slave receives it and stops, I do not know how the tasks would start up again, unless the master continued to send hatch jobs.

What do you see in the master logs? Here is a test to see if you are using the right install. I created a master in one shell and a slave in another. Then I closed the other shell (without first stopping the slave). Eventually the master misses the heartbeats and logs the slave as disconnected.

[2019-02-08 08:41:09,617] JonathanMBP.local/INFO/locust.main: Starting web monitor at *:8089
[2019-02-08 08:41:09,618] JonathanMBP.local/INFO/locust.main: Starting Locust 0.9.0
[2019-02-08 08:41:32,529] JonathanMBP.local/INFO/locust.runners: Client 'JonathanMBP.local_13eb0a9bb33744248001d5df851768cd' reported as ready. Currently 1 clients ready to swarm.
[2019-02-08 08:41:57,778] JonathanMBP.local/INFO/locust.runners: Slave JonathanMBP.local_13eb0a9bb33744248001d5df851768cd failed to send heartbeat, setting state to missing.
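As a rough sketch of the heartbeat/liveness idea behind this (names and numbers are assumed for illustration, not the actual Locust implementation): each slave reports in on a fixed interval, the master decrements a per-slave counter, and a slave whose counter runs out before its next heartbeat arrives is marked missing, as in the log above.

import gevent

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats (assumed value)
HEARTBEAT_LIVENESS = 3   # missed intervals before a slave is "missing"

def slave_heartbeat_loop(send_to_master, get_state):
    # Runs on each slave: periodically report liveness and current state.
    while True:
        send_to_master({"type": "heartbeat", "state": get_state()})
        gevent.sleep(HEARTBEAT_INTERVAL)

def master_heartbeat_loop(clients):
    # Runs on the master. Receiving a heartbeat elsewhere resets
    # client.heartbeat to HEARTBEAT_LIVENESS and syncs client.state.
    while True:
        for client in clients.values():
            if client.state != "missing":
                client.heartbeat -= 1
                if client.heartbeat < 0:
                    client.state = "missing"
        gevent.sleep(HEARTBEAT_INTERVAL)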

@LRAbbade

I'm having the same problem: RPS drops to zero but then starts to climb again, sometimes creating even more users than originally designated. The only "solution" I've found so far is deleting the kube deployment.

@Jonnymcc
Contributor

Jonnymcc commented May 13, 2019 via email

@LRAbbade

I think you are right, the problem might be that I'm running from the Python image, and the container probably exits once the application exits, making kubernetes start another instance. I will look into this later today, thanks for the help!

@jayudey-vertex

jayudey-vertex commented Jun 3, 2019

I'm still seeing this behavior when running Locust 0.11.0 in Docker Swarm.

UPDATE: I started to look further into the issue this morning. When I installed the current version of master, it seems to work as expected.

@clapero

clapero commented Jul 26, 2019

I am also seeing this issue with Locust 0.11.0. Same as above, I am deploying the master and slaves on different pods on a Kubernetes cluster.
The comment from @Jonnymcc, "Provided the stop signal is sent and the slave receives it and stops", makes me wonder if there are specific ports to open on the slave pods to actually receive the signals. I currently only have ports 80, 443 and 8089 open.

@liuchunming033

any update?

@tamilhce

I'm also facing this issue, any update on this?

@max-rocket-internet
Contributor

Also seeing this issue. Quite annoying. The only way to solve it is to delete all Locust pods.

@tsykora-verimatrix

(btw) this is not limited to Kubernetes deployments, and the issue still persists in Locust 0.11.0.

@tamilhce

tamilhce commented Sep 8, 2019

Any update on this issue?

@Jonnymcc
Contributor

Jonnymcc commented Sep 9, 2019

Just tested this out with v0.8.1 on my Mac with two slaves. I let it run for two hours and then stopped it and the slaves stayed stopped without sending more requests. 🤷‍♂

@tamilhce

tamilhce commented Sep 9, 2019

@Jonnymcc, the problem here is: if you delete or power off the slave machines, the Locust master still shows the 2 slave machines in the ready state.
Could you delete the slave machines and check whether the slave count in the master changes to zero?

@Jonnymcc
Contributor

Jonnymcc commented Sep 9, 2019

I stopped a slave abruptly by terminating the shell instead of sending sigterm. This is what I see and is as expected. If I restart the test, only the one ready slave will begin sending requests.
(screenshot attached: Screen Shot 2019-09-09 at 2 27 56 PM)

@max-rocket-internet
Contributor

Just tested this out with v0.8.1 on my Mac with two slaves

For me it seems to work fine with a very low number of slaves, i.e. 2-6. But usually we are running 20-80 and it doesn't work at all.

@yp-photobox

yp-photobox commented Sep 12, 2019

I'm seeing the same issue running Locust 0.11.0 as Docker containers on AWS ECS.

@sonja455

I think you are right, the problem might be that I'm running from the Python image, and the container probably exits once the application exits, making kubernetes start another instance. I will look into this later today, thanks for the help!

Did you find a solution for this problem so far? Is there a way to stop kubernetes from starting another instance?

@cgoldberg
Member

Any update on this issue?

Updates would be posted here and the issue would be closed.

@finchmeister

Installing the latest master version in the Dockerfile worked for me:

RUN pip install -e git://github.com/locustio/locust.git@master#egg=locustio

Background: The fix was made in #982 and the version was incremented to 0.11.1; however, right now pip will by default give you an older version without the fix.

@jebrage

jebrage commented Sep 25, 2019

This hack worked for me; it can also be done in the web UI itself.

Hit 'Edit' or 'New test' below the status in the top right.

Then set the number of users to 0.

All the users should stop, and thus the test should also stop running. (A programmatic sketch of the same workaround follows.)
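As a hedged sketch of a programmatic equivalent: the web UI's 'Edit' / 'New test' form posts to the master's /swarm endpoint, so the same effect can likely be scripted. The field names locust_count and hatch_rate are assumptions based on the 0.9/0.11-era web form.

import requests

MASTER_URL = "http://localhost:8089"  # adjust to your master's web UI address

# Re-swarm with zero users, mimicking 'Edit' -> number of users = 0.
requests.post(MASTER_URL + "/swarm",
              data={"locust_count": 0, "hatch_rate": 1})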

@jmattiace

Just confirmed it works for me as well

lvdh added a commit to lvdh/distributed-locustio-on-aws that referenced this issue Oct 20, 2019
Various fixes and improvements, including a fix for the slaves which keep sending requests when stopping the load test: locustio/locust#911