
Web UI does not stop slave servers #911

Closed

giantryansaul opened this issue Nov 7, 2018 · 24 comments

Comments

@giantryansaul
Contributor

giantryansaul commented Nov 7, 2018

Description of issue / feature request

When clicking "Stop" in the web UI, the users on the slave servers do not stop sending requests. The master instance still reports the test as "running", but the stop button has disappeared.

Expected behavior

All slave servers are stopped and the user count goes back to 0.

Actual behavior

All slave servers remain active and the count of users is over 0.

Environment settings (for bug reports)

  • OS: Ubuntu 16.04
  • Python version: 3.6
  • Locust version: 0.9.0

Steps to reproduce (for bug reports)

(I can't share my current code, but it is a very simple mix of GET and POST requests with no processing or validation of data; a rough approximation is sketched after the steps below.)

  • Set up a simple requests task.
  • Set up 3 or more slave servers with a master server to distribute the task.
  • Run in web UI mode.
  • Run for over 20 minutes.
  • Click the Stop button in the web UI.
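Since the original code could not be shared, here is a hypothetical minimal locustfile approximating the setup described above (Locust 0.9-style API; the endpoints, task weights, and wait times are assumptions, not the reporter's actual code):

from locust import HttpLocust, TaskSet, task

class SimpleTasks(TaskSet):
    @task(3)
    def get_index(self):
        # plain GET, response is not inspected or validated
        self.client.get("/")

    @task(1)
    def post_item(self):
        # plain POST with a small JSON body, response not inspected
        self.client.post("/items", json={"name": "example"})

class WebsiteUser(HttpLocust):
    task_set = SimpleTasks
    min_wait = 1000  # ms
    max_wait = 3000  # ms

Started roughly as "locust -f locustfile.py --master" on the master and "locust -f locustfile.py --slave --master-host=<master-ip>" on three or more slave machines.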
Jonnymcc pushed a commit to Jonnymcc/locust that referenced this issue Dec 10, 2018
The PUSH and PULL sockets being used caused hatch messages to get routed
to slaves that may have become unresponsive or crashed. This change
includes the client id in the messages sent out from the master which
ensures that hatch messages are going to slaves that are READY or
RUNNING.

This should also fix the issue locustio#911 where slaves are not receiving the
stop message. I think these issues are a result of PUSH-PULL sockets
using a round robin approach.
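As an aside, here is a rough, self-contained pyzmq sketch (not Locust's actual code) of the difference this commit describes: a ROUTER socket lets the master address a message to one specific slave identity, whereas a PUSH socket round-robins messages across whichever peers are connected, including dead ones. The port and identity values below are assumptions.

import zmq

ctx = zmq.Context()

# "Master" side: a ROUTER socket can target a specific connected peer.
master = ctx.socket(zmq.ROUTER)
master.bind("tcp://*:5557")

# "Slave" side: a DEALER socket with an explicit identity.
slave = ctx.socket(zmq.DEALER)
slave.setsockopt(zmq.IDENTITY, b"slave-1")
slave.connect("tcp://localhost:5557")

# The slave announces itself; the ROUTER sees the sender's identity frame.
slave.send(b"client_ready")
identity, msg = master.recv_multipart()      # identity == b"slave-1"

# The master can now send "stop" to exactly this slave instead of
# whichever peer the round-robin happens to pick.
master.send_multipart([identity, b"stop"])
print(slave.recv())                          # b"stop"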
@Jonnymcc
Contributor

Hi @giantryansaul, I have created PR #927, which I believe will solve the issue you are experiencing. If you have time, can you test it out? Thanks!

@giantryansaul
Contributor Author

Thanks @Jonnymcc, I don't have any time currently, but I'll see if I can set up a test next week.

@andrerivas

Hey @Jonnymcc, I've tried to use the branch you created but it still does not kill slaves. Not permanently, anyway. I have seen that pressing Stop will briefly stop all requests, but after a few seconds, the slaves start sending requests again.

Is this the correct way to install your branch for testing?

pip install git+https://github.com/Jonnymcc/locust.git@heartbeat#egg=locustio

cgoldberg pushed a commit that referenced this issue Feb 6, 2019
* Replace zmq sockets with one DEALER-ROUTER socket

The PUSH and PULL sockets being used caused hatch messages to get routed
to slaves that may have become unresponsive or crashed. This change
includes the client id in the messages sent out from the master which
ensures that hatch messages are going to slaves that are READY or
RUNNING.

This should also fix the issue #911 where slaves are not receiving the
stop message. I think these issues are a result of PUSH-PULL sockets
using a round robin approach.

* Remove client_id parameter from send_multipart method

* Add heartbeat worker to server and client

The server checks to see if clients have expired and, if they have,
updates their status to "missing".

The client has a worker that will send a heartbeat on a regular
interval. The heartbeat also relays the slave state back to the
master so that they stay in sync.

* Use new clients.all property in heartbeat worker

* Fix reporting of stopped state

Wait until all slaves are reporting in as ready before stating
that the master is stopped.

* Fix tests after changing ZMQ sockets to DEALER-ROUTER

* Change heartbeat log msg to info so that it does not appear in tests

* Add tests for zmqrpc.py

* Remove commented imports, add note about sleep

* Support str/unicode diff in py2 vs py3

* Ensure failed zmqrpc tests clean up bound sockets

* Create throwaway variable for identity from ZMQ message

I think this looks better than using msg[1].

* Replace usage of parse_options in tests with mock options

Using parse_options during test setup can conflict with test runners
like pytest. Essentially, it will swallow up the options that are
meant to be passed to the test runner and instead treat them
as options being passed to the test.

* Set coverage concurrency to gevent

Coverage breaks with gevent and does not fully report green threads
as having been tested. Setting concurrency in .coveragerc will
fix the issue. https://bitbucket.org/ned/coveragepy/issues/149/coverage-gevent-looks-broken

* Add test that shows master heartbeat worker marks slaves missing

* Add assertions to test_zmqrpc.py

* Use unittest assertions

* Change assertion value to bytes object

* Add cmdline options for heartbeat liveness and interval

* Add new option heartbeat_liveness to test_runners mock options

* Ensure SlaveNode class uses heartbeat_liveness default or passed

* Ensure hatch data can be updated for slaves currently hatching

* Add test for start hatching accepted slave states

Checks that start_hatching sends messages to ready, running, and
hatching slaves.

* Remove unneeded imports of mock
@Jonnymcc
Copy link
Contributor

Jonnymcc commented Feb 8, 2019

That is odd. Looks like you are installing it the right way. In my testing, even today, I cannot replicate the problem you are seeing. Provided the stop signal is sent and the slave receives it and stops, I do not know how the tasks would start up again, unless the master continued to send hatch jobs.

What do you see in the master logs? Here is a test to see if you are using the right install. I created a master in one shell and a slave in another. Then I closed the other shell (without first stopping the slave). Eventually the master misses the heartbeats and logs the slave as disconnected.

[2019-02-08 08:41:09,617] JonathanMBP.local/INFO/locust.main: Starting web monitor at *:8089
[2019-02-08 08:41:09,618] JonathanMBP.local/INFO/locust.main: Starting Locust 0.9.0
[2019-02-08 08:41:32,529] JonathanMBP.local/INFO/locust.runners: Client 'JonathanMBP.local_13eb0a9bb33744248001d5df851768cd' reported as ready. Currently 1 clients ready to swarm.
[2019-02-08 08:41:57,778] JonathanMBP.local/INFO/locust.runners: Slave JonathanMBP.local_13eb0a9bb33744248001d5df851768cd failed to send heartbeat, setting state to missing.
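As a rough sketch of the heartbeat/liveness idea behind this (names and numbers are assumed for illustration, not the actual Locust implementation): each slave reports in on a fixed interval, the master decrements a per-slave counter, and a slave whose counter runs out before its next heartbeat arrives is marked missing, as in the log above.

import gevent

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats (assumed value)
HEARTBEAT_LIVENESS = 3   # missed intervals before a slave is "missing"

def slave_heartbeat_loop(send_to_master, get_state):
    # Runs on each slave: periodically report liveness and current state.
    while True:
        send_to_master({"type": "heartbeat", "state": get_state()})
        gevent.sleep(HEARTBEAT_INTERVAL)

def master_heartbeat_loop(clients):
    # Runs on the master. Receiving a heartbeat elsewhere resets
    # client.heartbeat to HEARTBEAT_LIVENESS and syncs client.state.
    while True:
        for client in clients.values():
            if client.state != "missing":
                client.heartbeat -= 1
                if client.heartbeat < 0:
                    client.state = "missing"
        gevent.sleep(HEARTBEAT_INTERVAL)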

@LRAbbade

I'm having the same problem: RPS drops to zero but then starts to climb again, sometimes creating even more users than originally designated. The only "solution" I've found so far is deleting the kube deployment.

@Jonnymcc
Contributor

Jonnymcc commented May 13, 2019 via email

@LRAbbade

I think you are right, the problem might be that I'm running from the Python image, and the container probably exits once the application exits, making kubernetes start another instance. I will look into this later today, thanks for the help!

@jayudey-vertex

jayudey-vertex commented Jun 3, 2019

I'm still seeing this behavior when running Locust 0.11.0 in Docker Swarm.

UPDATE: I started to look further into the issue this morning. When I installed the current version of master, it seems to work as expected.

@clapero

clapero commented Jul 26, 2019

I am also seeing this issue with Locust 0.11.0. Same as above, I am deploying the master and slaves on different pods on a Kubernetes cluster.
The comment from @Jonnymcc, "Provided the stop signal is sent and the slave receives it and stops", makes me wonder if there are specific ports to open on the slave pods to actually receive the signals. I currently only have ports 80, 443 and 8089 open.

@liuchunming033

any update?

@tamilhce

I'm also facing this issue, any update on this?

@max-rocket-internet
Contributor

Also seeing this issue. Quite annoying. The only way to solve it is to delete all Locust pods.

@tsykora-verimatrix

(btw) this is not limited to Kubernetes deployments, and the issue still persists in Locust 0.11.0.

@tamilhce

tamilhce commented Sep 8, 2019

Any update on this issue?

@Jonnymcc
Contributor

Jonnymcc commented Sep 9, 2019

Just tested this out with v0.8.1 on my Mac with two slaves. I let it run for two hours and then stopped it and the slaves stayed stopped without sending more requests. 🤷‍♂

@tamilhce

tamilhce commented Sep 9, 2019

@Jonnymcc, the problem here is: if you delete or power off the slave machines, the Locust master still shows the 2 slave machines in the ready state.
Could you delete the slave machines and check whether the slave count in the master changes to zero?

@Jonnymcc
Contributor

Jonnymcc commented Sep 9, 2019

I stopped a slave abruptly by terminating the shell instead of sending sigterm. This is what I see and is as expected. If I restart the test, only the one ready slave will begin sending requests.
(screenshot attached: Screen Shot 2019-09-09 at 2 27 56 PM)

@max-rocket-internet
Contributor

Just tested this out with v0.8.1 on my Mac with two slaves

For me it seems to work fine with a very low number of slaves, i.e. 2-6. But usually we are running 20-80 and it doesn't work at all.

@yp-photobox

yp-photobox commented Sep 12, 2019

I'm seeing the same issue running Locust 0.11.0 as Docker containers on AWS ECS.

@sonja455

I think you are right, the problem might be that I'm running from the Python image, and the container probably exits once the application exits, making kubernetes start another instance. I will look into this later today, thanks for the help!

Did you find a solution for this problem so far? Is there a way to stop kubernetes from starting another instance?

@cgoldberg
Member

Any update on this issue?

Updates would be posted here and the issue would be closed.

@finchmeister

Installing the latest master version in the Dockerfile worked for me:

RUN pip install -e git://github.com/locustio/locust.git@master#egg=locustio

Background: The fix was made in #982 and the version was incremented to 0.11.1; however, right now pip will by default give you an older version without the fix.

@jebrage

jebrage commented Sep 25, 2019

This hack worked for me; it can also be done in the web UI itself.

Hit 'Edit' or 'New test' below the status in the top right.

Then set the number of users to 0.

All the users should stop, and thus the test should also stop running. (A programmatic sketch of the same workaround follows.)
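As a hedged sketch of a programmatic equivalent: the web UI's 'Edit' / 'New test' form posts to the master's /swarm endpoint, so the same effect can likely be scripted. The field names locust_count and hatch_rate are assumptions based on the 0.9/0.11-era web form.

import requests

MASTER_URL = "http://localhost:8089"  # adjust to your master's web UI address

# Re-swarm with zero users, mimicking 'Edit' -> number of users = 0.
requests.post(MASTER_URL + "/swarm",
              data={"locust_count": 0, "hatch_rate": 1})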

@jmattiace

Just confirmed it works for me as well

lvdh added a commit to lvdh/distributed-locustio-on-aws that referenced this issue Oct 20, 2019
Various fixes and improvements, including a fix for the slaves which keep sending requests when stopping the load test: locustio/locust#911