
Adding TCP Keep Alive to guarantee master-slave communication after idle periods #1020

Closed

AnotherDevBoy wants to merge 58 commits
Conversation

AnotherDevBoy
Contributor

Hi Locust folks!

This PR revives a previous one, PR 740, from about a year ago.

For context, the cloud infrastructure my company uses has a firewall that closes idle connections after 5 minutes. For us, that means that if we leave a test running for a while and come back to it, we are unable to control the slaves through the master UI.

If you look through the comments on PR 740, you can see that this is easily fixed by adding TCP keepalive at the ZMQ socket level.
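
For reference, here is a minimal sketch of the idea using pyzmq; the socket type, endpoint, and timing values are illustrative rather than the exact ones used in this PR:

```python
import zmq

context = zmq.Context()
socket = context.socket(zmq.DEALER)  # illustrative socket type

# Enable TCP keepalive so the kernel probes idle connections before the
# firewall's 5-minute idle timeout kicks in. These are pyzmq's standard
# TCP keepalive options; the values below are just examples.
socket.setsockopt(zmq.TCP_KEEPALIVE, 1)
socket.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 30)    # seconds of idle before the first probe
socket.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 30)   # seconds between probes
socket.setsockopt(zmq.TCP_KEEPALIVE_CNT, 3)      # failed probes before the peer is considered dead

socket.connect("tcp://master-host:5557")  # placeholder endpoint
```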

We thought that PR 927 would make this change redundant, but after testing Locust 0.11.0 (which includes that change) we were able to reproduce the problem pretty consistently.

The change was reviewed by @heyman and I believe his concerns have been addressed.

Thanks for your time and sorry for the confusion!

heyman and others added 30 commits December 19, 2017 15:22
…ests work without adding a dependency on having a working C compiler environment set up when installing Locust (since geventhttpclient package doesn't have wheels).
…. The text property function tries to decode the text in the same way as python-requests. This makes the API more similar to HttpSession.

Added more tests.
… requests with FastHttpSession, in order to manually control if a request should result in a success or failure
Improved documentation of arguments for FastHttpSession.request() method.
# Conflicts:
#	docs/writing-a-locustfile.rst
…the response was very quick (not a failure) and that 0.0 was a valid response time for a local server. Because of this I changed it to assertGreaterEqual.
@Jonnymcc
Contributor

@albertowar Is this statement from the ZMQ guide not accurate then?
"Do heartbeating on the same socket you use for messages, so your heartbeats also act as a keep-alive to stop the network connection from going stale (some firewalls can be unkind to silent connections)."
Could it be that heartbeats are only being sent to the master and nothing besides hatch messages gets sent from the master?

@AnotherDevBoy
Contributor Author

AnotherDevBoy commented Jun 19, 2019

I'd say the recommendation from the guide (http://zguide.zeromq.org/php:chapter4) is good. However, the current implementation of heartbeats in Locust is not cutting it for this particular use case (private clouds).

Could it be that heartbeats are only being sent to the master and nothing besides hatch messages gets sent from the master?
I suspect that could be it, because the issues I experienced in the UI occur when I try to send a command to the slaves (increase users, start, stop, etc.).

Adding TCP Keep Alive seems trivial enough, and I have personally been running my fork (which includes this change) for over a year without any issues.

@AnotherDevBoy
Contributor Author

AnotherDevBoy commented Jun 19, 2019

@Jonnymcc, after reading the code a bit more, I think we can confirm what you suspected: https://github.com/locustio/locust/blob/master/locust/runners.py#L420

The slaves report heartbeats to the master, but the master does not report heartbeats to the slaves. Given that the communication happens over different ports, one of the connections is kept alive by the heartbeats while the other one gets closed.
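
To illustrate the asymmetry (this is not Locust's actual code, just a sketch of what the slave-side heartbeat loop amounts to; `send_to_master`, the message format, and the interval are made up):

```python
import gevent

HEARTBEAT_INTERVAL = 1  # seconds, made-up value


def slave_heartbeat_loop(client, slave_id):
    """Periodic slave -> master messages: this traffic keeps the
    slave -> master connection from ever going idle."""
    while True:
        client.send_to_master({"type": "heartbeat", "id": slave_id})  # hypothetical helper
        gevent.sleep(HEARTBEAT_INTERVAL)

# There is no equivalent master -> slave loop: that channel only carries
# occasional control messages (hatch, stop), so an idle-connection firewall
# can silently drop it.
```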

@Jonnymcc
Contributor

Jonnymcc commented Jun 19, 2019 via email

@AnotherDevBoy
Contributor Author

AnotherDevBoy commented Jun 19, 2019

That's definitely one approach and it could work (I would have to test it to be 100% sure). However, I am not sure it's the optimal way to do it.

IIRC, the current slave heartbeats were added to improve the distribution of load between slaves when some of them go missing. As a side effect, those heartbeats keep one of the connections alive.

This scenario is a bit different because I only want to ensure the master-slave connection stays healthy.

Implementing this at the application level (with heartbeats) adds CPU overhead (another thread doing work in the background) that is not yet required. It could be handy if we ever wanted the slaves to react to a dead master, but that's not a feature people have shown interest in (as far as I know).

If we use TCP Keep Alive, it is handled at the transport layer, which should be more efficient CPU-wise. On top of that, I have been using this for a long time and at scale (80+ slaves), so I am confident it solves the problem I highlighted.
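
To make the "handled at the transport layer" point concrete, here is a standalone sketch using only the standard library: once the options are set, the kernel sends the probes by itself, with no extra application thread. The option names are Linux-specific, and the host, port, and timings are placeholders:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn TCP keepalive on and tune it (TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-only).
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds of idle before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the kernel drops the connection

s.connect(("master-host", 5557))  # placeholder address

# From here on, the kernel sends keepalive probes during idle periods;
# the application does not need to run any extra thread for it.
```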

cclauss and others added 16 commits June 26, 2019 10:06
Discovered via: __flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics__

Legacy __print__ statements are syntax errors in Python 3 but __print()__ function works as expected in both Python 2 and Python 3.
__sudo: required__ no longer is...  [Travis are now recommending removing the __sudo__ tag](https://blog.travis-ci.com/2018-11-19-required-linux-infrastructure-migration).

"_If you currently specify __sudo:__ in your __.travis.yml__, we recommend removing that configuration_"
:s/it/is
The fail percentage was calculated incorrectly. Something that failed
all of the time would be reported as failing 50% total.
added spacing
…en-running-distributed

Ensure that the last samples get sent by slave and received by master.
Using docker multi-stage for 50% smaller image
@AnotherDevBoy
Contributor Author

@mbeacom, if you have time, would you mind reviewing this one as well?

@mbeacom
Member

mbeacom commented Oct 1, 2019

@albertowar It appears you have a wonky rebase. Any chance you could squash commits here or cherry pick to another branch?

@AnotherDevBoy
Contributor Author

Will do that in a separate PR then!

@AnotherDevBoy
Contributor Author

@mbeacom I will close this one in favor of: #1101
