Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure the connection between master and slave in heartbeat #1280

Merged
merged 6 commits into from
Apr 5, 2020

Conversation

delulu
Copy link
Contributor

@delulu delulu commented Mar 9, 2020

With heartbeating enabled, I still noticed network issues (packet drop, invalid byte stream) in long-term running on k8s cluster (overlay network).

And a straightforward fix is to reestablish the connection when any network issue is detected.

I've applied this fix in my locust tests for one week's running, and it works as expected.

@codecov
Copy link

codecov bot commented Mar 9, 2020

Codecov Report

Merging #1280 into master will decrease coverage by 0.08%.
The diff coverage is 65.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1280      +/-   ##
==========================================
- Coverage   80.21%   80.12%   -0.09%     
==========================================
  Files          23       23              
  Lines        2118     2179      +61     
  Branches      321      324       +3     
==========================================
+ Hits         1699     1746      +47     
- Misses        339      350      +11     
- Partials       80       83       +3     
Impacted Files Coverage Δ
locust/runners.py 75.88% <61.36%> (+0.03%) ⬆️
locust/rpc/zmqrpc.py 81.66% <69.69%> (-2.55%) ⬇️
locust/exception.py 100.00% <100.00%> (ø)
locust/web.py 89.33% <0.00%> (-0.67%) ⬇️
locust/core.py 95.98% <0.00%> (+0.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a0a6ef...0f9070b. Read the comment docs.

try:
self.client.close()
self.client = rpc.Client(self.master_host, self.master_port, self.client_id)
except Exception as e:
Copy link
Collaborator

@cyberw cyberw Mar 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to change this to only catch specific exceptions?

I've never looked into this part of the code base very much, so I feel unqualified to approve/decline the PR though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intended to catch all the exceptions, to make it reliable with respect to all the possible failures.

also it's not against the current logic, this reset_connection is newly introduced in function loop and it's not supposed to throw any exception out unless there's a strong reason.

I understand you concern, will add tests to cover this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching all types of exceptions is generally considered bad practice as it may hide more serious issues or put the program in an unknown state, causing hard-to-debug problems later on. But in this case it may be warranted, I can't really tell :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, will update the catching based on the exceptions I observed during my tests.

@delulu delulu force-pushed the ensureconnection branch from 105d15a to 9b79520 Compare March 9, 2020 13:22
@max-rocket-internet
Copy link
Contributor

@delulu does this fix the occasional missing slave? I see this now and again and I'm also running on k8s.

#1158

@delulu
Copy link
Contributor Author

delulu commented Mar 10, 2020

@delulu does this fix the occasional missing slave? I see this now and again and I'm also running on k8s.

#1158

Yes, it fixes it well.

@delulu delulu force-pushed the ensureconnection branch from c70eaea to dbd5b87 Compare March 16, 2020 06:55
@cyberw
Copy link
Collaborator

cyberw commented Mar 16, 2020

Do we really need to wrap the exceptions in our own exception class? I dont really see what value that adds..

@delulu delulu force-pushed the ensureconnection branch from dbd5b87 to f523314 Compare March 16, 2020 08:53
@heyman
Copy link
Member

heyman commented Mar 16, 2020

Do we really need to wrap the exceptions in our own exception class? I dont really see what value that adds..

Also, if we're wrapping the exceptions we should use raise ... from e.

@delulu
Copy link
Contributor Author

delulu commented Mar 18, 2020

Do we really need to wrap the exceptions in our own exception class? I dont really see what value that adds..

Also, if we're wrapping the exceptions we should use raise ... from e.

good suggestion! I will update it.

The main purpose is to handle these exceptions in the same place rather than scattered in runners.py. Also it shows how to deal with RPCError, it reduces the maintenance effort.

@delulu
Copy link
Contributor Author

delulu commented Mar 18, 2020

I would like to add test case to test reset_connection in test_runners.py, but I haven't figured out a good way, free to let me know if you have any ideas.

@delulu
Copy link
Contributor Author

delulu commented Mar 21, 2020

I've added a test case of test_reset_connection:

  1. check connection_broken will be true when there's a RPCError in recv_from_client.

    client_id, msg = self.server.recv_from_client()

  2. make sure RPCError is tolerated in reset_connection.

Please have a review.

@cyberw
Copy link
Collaborator

cyberw commented Mar 21, 2020

Looks nice! But I still dont understand what wrapping the exception helps with? I would prefer catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc in the places where it is relevant instead of catching RPCError. Less code, less magic.

@delulu
Copy link
Contributor Author

delulu commented Mar 23, 2020

I'm exactly "catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc in the places where it is relevant".

I mean in runners.py, it deals with rpc and it has no idea or context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) and how to handle them.

@cyberw
Copy link
Collaborator

cyberw commented Mar 23, 2020

I'm exactly "catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc in the places where it is relevant".

I mean in runners.py, it deals with rpc and it has no idea or context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) and how to handle them.

I dont quite understand what you mean by "it has no idea or context about these errors and how to handle them"?

Why can't runner do except (zmq.error.ZMQError, <and whatever other exceptions we are wrappring>) as e: instead of except RPCError as e: and handle it exactly the same way? Wrapping them adds no value imo, it just complicates the code and makes it less obvious what really happened.

@delulu
Copy link
Contributor Author

delulu commented Mar 23, 2020

I'm exactly "catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc in the places where it is relevant".
I mean in runners.py, it deals with rpc and it has no idea or context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) and how to handle them.

I dont quite understand what you mean by "it has no idea or context about these errors and how to handle them"?

Why can't runner do except (zmq.error.ZMQError, <and whatever other exceptions we are wrappring>) as e: instead of except RPCError as e: and handle it exactly the same way? Wrapping them adds no value imo, it just complicates the code and makes it less obvious what really happened.

rpc is dealing with zmq, msg decode and msgpack directly, so it has context of these exceptions and know the possible causes of them. so it can wrap them together and add the cause info. while runners don't have these context, and it have no knowledge in what scenarios these exceptions will be raised.

And I totally disagree that it complicates the code.

The handling of these three exceptions are the same, so wrap them in a unified exception with description and the callers only need to care this one exception, rather than handling them in different scenarios.

For example in some place one of these three should be caught, in another place two of these three or all the three should be caught. This is complicate and it require caller to understand the details of rpc.

@cyberw
Copy link
Collaborator

cyberw commented Mar 27, 2020

Hi @delulu ! I'm sorry we were not able to agree on this. I would love to see more robust connection handling, but I won't merge with the (IMO) needless exception wrapping.

Perhaps I can at least convince you that the wrapping is not so important to you as to stop the fix?

If you make the requested changes & have a look at possibly speeding up the test case I'll be happy to merge.

@heyman
Copy link
Member

heyman commented Mar 27, 2020

Unfortunately I don't have time to review the full PR at the moment. However I don't see any problem with with raising a common RPCError exception (from the specific one), as long as the proper way to handle them in runners.py is the same. I actually think it makes the abstraction less leaky.

@max-rocket-internet
Copy link
Contributor

I would love to see something merged to ensure better communication between slaves and masters 🙏

I still can't reproduce it reliably but we still see now and again missing or slaves that don't stop hatching.

@cyberw
Copy link
Collaborator

cyberw commented Apr 1, 2020

If I'm the only one who thinks the exception wrapping is weird, and someone resolves the conflicts then I'm ok with merging.

@delulu
Copy link
Contributor Author

delulu commented Apr 3, 2020

sorry for my late response.

@cyberw I have to say that the wrap is as important as the fix, because I think this is the right thing to do. if you see anything wrong you need to point it out and convince me, and I'll be happy to make the changes.

as for test case you mentioned above, it tests three scenarios:

  1. RPCError is handled and the connection_broken will be set to "True".
  2. When a normal msg is well received, the connection_broken will be set back to "False".
  3. Other exceptions ( HeyAnException ) wont be handled, so no change on connection_broken.

for each scenario, it takes about 3 seconds to get the msg and the test case doesn't work as expected when I try to reduce the sleep time to 2 seconds.

I'll rebase it with latest master branch and test it. will commit the change when test looks good.

@cyberw
Copy link
Collaborator

cyberw commented Apr 3, 2020

@cyberw I have to say that the wrap is as important as the fix

I disagree, but as I said, if I'm the only one who considers it less clean than just catching the underlying exceptions I wont argue any more. Maybe I am missing something but let's move on. It's not that important.

As for the test cases, couldn't you just reduce the timeouts before the test (and reduce the sleeps)? Or is that not possible for some reason?

@delulu delulu force-pushed the ensureconnection branch from ae7961d to 0f9070b Compare April 5, 2020 01:53
@delulu
Copy link
Contributor Author

delulu commented Apr 5, 2020

@cyberw I have to say that the wrap is as important as the fix

I disagree, but as I said, if I'm the only one who considers it less clean than just catching the underlying exceptions I wont argue any more. Maybe I am missing something but let's move on. It's not that important.

As for the test cases, couldn't you just reduce the timeouts before the test (and reduce the sleeps)? Or is that not possible for some reason?

when a msg is sent, it seems take 3 seconds to go to the code line of updating connection_broken. because when I reduce it to 2 seconds, the status of connection_broken is not updated as expected.

so I can't reduce the sleep time or sleeps.

conflicts have been resolved and the latest branch has been tested with one day's running, please take a further check.

@cyberw
Copy link
Collaborator

cyberw commented Apr 5, 2020

Thanks for your contribution!

@cyberw cyberw merged commit 8685a4b into locustio:master Apr 5, 2020
@heyman
Copy link
Member

heyman commented Apr 5, 2020

when a msg is sent, it seems take 3 seconds to go to the code line of updating connection_broken. because when I reduce it to 2 seconds, the status of connection_broken is not updated as expected

The reason for this was the FALLBACK_INTERVAL that was introduced by this pull request. The first sleep wasn't needed at all, but when both the first and second was 3, they surpassed the FALLBACK_INTERVAL that is 5 (that's why it failed when both sleeps were 2). I've now fixed this in 2ac0a84 by temporarily setting down the FALLBACK_INTERVAL.

What's the purpose of the FALLBACK_INTERVAL btw, and how was the 5 seconds selected?

I also removed the case with an unhandled exception in MasterLocustRunner.client_listener because it was causing an error stack trace to be printed when the test was run, and if an unhandled exception happens in client_listener we're fucked anyways, so we don't need to check for it (unless we're planning to handle the exception in some way).

@delulu
Copy link
Contributor Author

delulu commented Apr 6, 2020

@heyman you explains it well and your fix looks nice! I forgot the mocked rpc context in test.

as for FALLBACK_INTERVAL in my env HEARTBEAT_INTERVAL has been set to 3, and connection will be reset in heartbeat if connection broken is detected. so I simply choose the prime number to make sure reset_connection has been executed and the interval does not collide with the HEARTBEAT_INTERVAL.

and I'm fine that you removed the case. this case is to check that only RPCError will trigger the reset_connection.

@heyman
Copy link
Member

heyman commented Apr 6, 2020

as for FALLBACK_INTERVAL in my env HEARTBEAT_INTERVAL has been set to 3, and connection will be reset in heartbeat if connection broken is detected. so I simply choose the prime number to make sure reset_connection has been executed and the interval does not collide with the HEARTBEAT_INTERVAL.

Why do we need to call reset_connection() in heartbeat_worker()? Couldn't we just call reset_connection() in the exception handler (in client_listener) and skip the sleep() call?

@delulu
Copy link
Contributor Author

delulu commented Apr 7, 2020

Why do we need to call reset_connection() in heartbeat_worker()? Couldn't we just call reset_connection() in the exception handler (in client_listener) and skip the sleep() call?

Because I prefer to make it consistent with WorkerLocustRunner that the connection status check and connection reset are done in heartbeat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants