
When a slave process crashes and restarts, the master counts and waits for input from both #699

Closed
kathleentully opened this issue Nov 28, 2017 · 23 comments


@kathleentully

kathleentully commented Nov 28, 2017

Description of issue / feature request

I have 2 instances running - 1 master and 1 slave. I start running a load test and the UI shows 1 slave. Once that slave hits an unrecoverable exception, the process crashes and I automatically bring it back up. The new process registers with the master as a new slave. The UI now shows 2 slaves - the dead one and the new one, both from the same instance.

In my case, the unrecoverable error is a MemoryError. I have many users uploading and downloading large files. I am otherwise working on managing these files more efficiently, but Locust should still account for failing slaves.

My instances are set up in a similar way as https://github.com/awslabs/eb-locustio-sample

Expected behavior

  • As I only have 2 instances, the UI should only show 1 slave at all times.
  • Clicking stop in the UI stops all processes and allows the user to start a new test.
  • Starting a new test will change the number of users to the new value entered.

Actual behavior

  • UI shows 2 slaves after one process crashes and is brought back up.
  • After 2 slaves are showing in the UI, clicking stop in the UI can only stop the processes that are currently alive. It hangs waiting for processes that are no longer running.
  • Starting a new test increases the number of users displayed by the value entered. For example, if I run a test with 20 users, "stop" it, and then try to run a new one with 10 users, it will display 30 users.

Environment settings (for bug reports)

  • OS: Amazon Linux (RedHat-ish)
  • Python version: 2.7
  • Locust version: 0.8.1

Steps to reproduce (for bug reports)

Set up 2 instances and start a load test. As it is going, kill the slave instance and start it up again.

@cgoldberg
Member

but Locust should still account for failing slaves.

This is not currently implemented. After an unrecoverable error, you will have to restart from scratch.

@AnotherDevBoy
Contributor

Hi @cgoldberg,

Is there a reason why this is not implemented yet? I find it pretty useful, especially for cloud environments, where your nodes can be deleted and recreated somewhere else in a different order.

If there isn't, I was thinking of implementing this based on heartbeats and sending a pull request.

@cgoldberg
Member

cgoldberg commented Jan 31, 2018

@coderlifter

Is there a reason why this is not implemented yet?

because nobody has implemented it :)

I also don't see a use case that warrants adding the complexity to Locust... maybe you can explain in more detail the problem this solves? Specifically, why is it ever useful to re-add failed slaves without restarting the master?

@kathleentully
Author

kathleentully commented Jan 31, 2018 via email

@cgoldberg
Member

Once you hit memory constraints on a slave and it crashes, isn't it already useless?

@heyman
Member

heyman commented Jan 31, 2018

I think a good first step could be to add an additional tab in the web UI where one can see the current status of the slave/worker nodes. The workers already send a kind of heartbeat when they report in the statistics. If the master node hasn't gotten a ping in X seconds, it could display "worker lost" or something similar. We could also display potential delay time in the reporting which would indicate that the slave machine has too high CPU load.

That would at least make it easier to understand what has gone wrong if a slave node crashes.
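
A rough sketch of how that could look on the master side (purely illustrative; `WORKER_TIMEOUT`, the dictionary, and the function names are assumptions, not existing Locust internals):

```python
# Hypothetical sketch of the "worker lost" idea: the master records when each slave
# last reported statistics, and anything silent for longer than WORKER_TIMEOUT is flagged.
import time

WORKER_TIMEOUT = 10  # seconds without a stats report before a slave is considered lost

last_seen = {}  # client_id -> timestamp of that slave's most recent stats report

def on_stats_report(client_id):
    """Called whenever a slave reports in its statistics."""
    last_seen[client_id] = time.time()

def worker_status():
    """Status string per slave, e.g. for the proposed workers tab in the web UI."""
    now = time.time()
    return {
        client_id: ("worker lost" if now - ts > WORKER_TIMEOUT else "ok")
        for client_id, ts in last_seen.items()
    }
```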

@AnotherDevBoy
Contributor

I am also running in a container-based environment (not Beanstalk but similar) and I had a similar problem.

Imagine you have 1 master and 1 slave and suddenly your slave node dies. The container orchestrator will respawn a new slave, but the master thinks it is an additional one and adds it to the total count.

Now the master thinks there are 2 slaves and keeps sending instructions to both of them, even though only one of them is actually responsive.

I would like to use heartbeats as a way for the master to notice when a slave dies and remove it from the client list. I agree with @HeyHugo, reusing the statistics report sounds like the right thing to do.

On top of this (not sure if it is an environment issue yet), I noticed that the master and slaves need to be started in sequence. A heartbeat mechanism would add a little more flexibility there as well.

@cgoldberg
Member

We could also display potential delay time in
the reporting which would indicate that the slave
machine has too high CPU load.

if we add psutil, slaves could report back real metrics.
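
Something like the following could be piggybacked on each stats report (a sketch only; psutil is not currently a Locust dependency and the field names are made up):

```python
# Hypothetical: extra health metrics a slave could attach to its stats report
# if psutil were added as a dependency (field names are illustrative only).
import psutil

def collect_worker_metrics():
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),   # CPU usage since the previous call
        "memory_percent": psutil.virtual_memory().percent,  # percentage of RAM in use
    }
```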

@cgoldberg
Member

@coderlifter

Now the master thinks there are 2 slaves and keeps sending
instructions to both of them, even though only one of them
is actually responsive.

I guess my point was that once a slave dies, all bets are off and your results are no good.... so why continue generating load at all?

@AnotherDevBoy
Contributor

It depends what you are trying to achieve with the tests really.

If your goal is to run a continuous load for X hours, then you probably want the slaves to recover after a restart and come back to work.

In that scenario, having 1 slave down for a few minutes shouldn't make the results useless.

@bubanoid

bubanoid commented Mar 2, 2018

I guess my point was that once a slave dies, all bets are off and your results are no good....

Why? I run the slaves under Circus. When a slave dies it stops loading the target server, but that is not important: it only lasts a short period of time until Circus brings it up again. All statistics from that slave are already saved on the master, and the new slave then continues working.

Of course, such a scenario is only useful if the master can detect dead slaves and stop sending instructions to and gathering statistics from them.

Do I misunderstand something?

@AnotherDevBoy
Contributor

AnotherDevBoy commented Mar 2, 2018

I agree with @bubanoid. In my case, I am using it with Docker.

Personally, I noticed some improvement when I ran Locust in exec form (CMD [...]). That way, the SIGTERM signal was handled properly and the slaves notified the master when they were killed.
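
For reference, the difference is roughly the following (a hypothetical Dockerfile snippet; the flags shown are assumptions about the slave invocation). In shell form the command runs under /bin/sh, so SIGTERM is delivered to the shell rather than to Locust, while in exec form Locust is PID 1 and receives the signal directly:

```dockerfile
# Shell form: /bin/sh is PID 1, so SIGTERM from the orchestrator never reaches locust
CMD locust --slave --master-host=locust-master

# Exec form: locust itself is PID 1, handles SIGTERM, and can notify the master on shutdown
CMD ["locust", "--slave", "--master-host=locust-master"]
```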

However, I still think there is value in an extra mechanism that is able to:

  • On the master node: detect when a slave disappears and redistribute the load (users) with the remaining slaves.
  • On the slave nodes: detect when the master disappears, stop shooting load and send 'ready' signal to master until it comes back

@heyman
Member

heyman commented Mar 2, 2018

On the master node: detect when a slave disappears and redistribute the load (users) with the remaining slaves.

Then the question is: when should a slave be considered lost? There's a possible scenario where a slave node is so loaded that its stats reporting is delayed long enough for the master to think it's dead. The master node would then order all the other slave nodes to spawn more users, even though the original slave node still has users running.

Perhaps this would be OK, as long as it's made extremely clear in both the web UI and the console (for headless tests) that a worker has been lost, that its users have been redistributed, and that one might want to consider restarting the whole test.

On the slave nodes: detect when the master disappears, stop shooting load and send 'ready' signal to master until it comes back

For this, I'm definitely -1. The master node holds all the state for the currently running test and it shouldn't die. To both try to reconstruct the master node's state from the pings that the slave nodes send, and to make the slave nodes "smart" by having them detect when they think the master node has gone down, would add a lot of complexity.
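
For illustration, the redistribution step discussed above could look roughly like this (a hypothetical sketch, not Locust code; the function name and arguments are assumptions):

```python
# Hypothetical sketch: once a worker is declared lost, split the target user count
# across the remaining workers. Note that if the "lost" worker is actually just
# overloaded and still running users, the real total will exceed target_users,
# which is exactly the concern raised above.
def redistribute_users(target_users, workers, lost_worker):
    alive = [w for w in workers if w != lost_worker]
    if not alive:
        return {}
    per_worker, remainder = divmod(target_users, len(alive))
    # the first `remainder` workers each take one extra user so the counts sum to target_users
    return {w: per_worker + (1 if i < remainder else 0) for i, w in enumerate(alive)}
```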

@AnotherDevBoy
Contributor

AnotherDevBoy commented Mar 2, 2018

Thanks for your response @heyman !

The scenario I was thinking about for the second point is the following:

I currently have a deployment of distributed Locust which I am using for a few days in a row for different tests.

  1. On day 1, everything works fine and I get some numbers that I need to study.
  2. I come back on day 2 and, because the master node has died/rebooted, I now need to redeploy the slave nodes before continuing with the tests (which is a pain).

TL;DR: Master recovery doesn't make sense in the context of a single load test, but it does make sense when you are running multiple tests over a longer period of time.

@heyman
Member

heyman commented Mar 2, 2018

@coderlifter I understand the problem but I currently don't think it's worth implementing master recovery (for the reasons stated above).

How come the master node dies? If it's crashing, that's something we should address in Locust. If it's due to your environment, then I would suggest you fix the environment.

@AnotherDevBoy
Contributor

Hmm, not sure if I would call this specific to my environment.

I am deploying Locust on a few Docker containers. In the container world, the orchestrator could decide to move your containers to another box (host machine reboot, patching, etc.).

In that case, the orchestrator will kill the container, create a new one, and update whatever service mechanism it uses (e.g. repointing a DNS record).

Implementing a solution to this problem would make Locust more Docker/Cloud friendly, which is a plus in my opinion.

@heyman
Member

heyman commented Mar 2, 2018

@coderlifter There are a lot of different stateful applications that don't support getting arbitrarily killed at any time. Off the top of my head I can think of Redis, PostgreSQL (unless you configure a quite complex high-availability setup with streaming replicas and your own solution for promoting a standby server to new master), MySQL, etc.

It sounds hard to run those kinds of services in a server/container environment where you can't "pin" a certain service/container so that it won't get arbitrarily killed.

@AnotherDevBoy
Contributor

All those products you are listing are databases, which are inherently stateful. Given that Locust doesn't persist any test data (outside of memory), I was looking at it as a stateless product.

Without a master recovery mechanism, when the master comes back, I see 2 main problems which currently require a slave restart:

  • If there isn't a running test, the master can't command the slaves to start a test.
  • If there is a running test, the master can't command the slaves to stop the test.

Implementing a solution for both will add complexity, as you pointed out. However, I think there is still value in implementing a solution for the second problem: when the slaves detect that the master is down, they stop the test.

That way, you won't have machine guns shooting "out of control".
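
A minimal sketch of that slave-side behaviour, assuming a runner object that tracks when the master was last heard from (the attribute and method names are invented for illustration and are not Locust's actual API):

```python
# Hypothetical slave-side watchdog: if nothing has been received from the master for
# MASTER_TIMEOUT seconds, stop generating load and go back to a "ready" state.
# It would run in a background greenlet/thread alongside the normal slave runner.
import time

MASTER_TIMEOUT = 30  # seconds of silence before the slave assumes the master is gone

def master_watchdog(runner):
    while True:
        time.sleep(1)
        if time.time() - runner.last_master_message > MASTER_TIMEOUT:
            runner.stop()           # stop all running users ("machine guns")
            runner.state = "ready"  # wait for the master to come back and start a new test
```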

@heyman
Member

heyman commented Mar 2, 2018

I know they are DBs. The point I was trying to make was that it sounds unusual to have a container environment that doesn't support running any stateful applications. And currently, both the Locust master and slave nodes are stateful.

@bubanoid

bubanoid commented Mar 3, 2018

@coderlifter, @heyman

Personally, I noticed some improvement when I ran Locust in exec form (CMD [...]). That way, the SIGTERM signal was handled properly and the slaves notified the master when they were killed.

Unfortunately, that doesn't indicate correct behaviour. A simple experiment shows that if you send SIGTERM to a slave and then restart it, the master stops gathering statistics, at least from that slave. We need to completely restart the user spawning on the master to get reliable results. However, the master will indicate the correct number of slaves, which means nothing. If you send SIGKILL to the slave and start it again, the master will indicate that an additional slave has appeared. Of course, statistics will not be gathered and a full test restart is needed again.

On the slave nodes: detect when the master disappears, stop shooting load and send 'ready' signal to master until it comes back

In common use the master will not die unexpectedly, as it is not under significant load and doesn't take a lot of memory. So it is really unlikely to encounter a master death.

On the master node: detect when a slave disappears and redistribute the load (users) with the remaining slaves.

I think a good first step could be to add an additional tab in the web UI where one can see the current status of the slave/worker nodes. The workers already send a kind of heartbeat when they report in the statistics. If the master node hasn't gotten a ping in X seconds, it could display "worker lost" or something similar. We could also display potential delay time in the reporting which would indicate that the slave machine has too high CPU load.

OK, a heartbeat would be great, but there is a question: why would a slave stop responding? I see two answers: CPU overload or memory errors (lack of memory). Now imagine that the slave responds again, or the dead slave is restarted (my case with Circus), and the master can properly handle such a slave. But the critical load hasn't disappeared. The slave is operating on the edge between correct work and hanging and, hence, doesn't generate the load on the target that we expect. Even if the master gives us some statistics, it wouldn't be the result we need. So I see only one correct solution in such a situation: decrease the number of spawned users or increase the number of slaves.

@AnotherDevBoy
Contributor

I see two answers: CPU overload or memory errors (lack of memory).

I would also add network issues to your breakdown: disconnects, max network bandwidth reached (frequent when running in cloud environments), max number of connections reached, etc.

In my opinion, if a slave loses connectivity to the master for a relatively long period of time, it might as well consider the master down and go back to the ready state instead of continuing to shoot requests.

@fanjindong

I need this feature

@bill-within

I would also find this useful. Google is holding out Locust on Kubernetes as a way of implementing distributed load testing (https://github.com/GoogleCloudPlatform/distributed-load-testing-using-kubernetes). I'd like to run a single node with a master and slave pods, then auto-scale the slaves out (and back in) using a horizontal pod autoscaler and the cluster autoscaler. It was disappointing to see that it only kinda half-works, because the master doesn't gracefully handle slaves disappearing. Otherwise it's a very slick system and I appreciate the work you folks have put into it.
