ensure the connection between master and slave in heartbeat #1280
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1280      +/-   ##
==========================================
- Coverage   80.21%   80.12%   -0.09%
==========================================
  Files          23       23
  Lines        2118     2179      +61
  Branches      321      324       +3
==========================================
+ Hits         1699     1746      +47
- Misses        339      350      +11
- Partials       80       83       +3
Continue to review full report at Codecov.
locust/runners.py (outdated)
try:
    self.client.close()
    self.client = rpc.Client(self.master_host, self.master_port, self.client_id)
except Exception as e:
Would it be possible to change this to only catch specific exceptions?
I've never looked into this part of the code base very much, so I feel unqualified to approve/decline the PR though :)
It's intended to catch all exceptions, to make the reconnection reliable with respect to every possible failure.
Also, this doesn't go against the current logic: reset_connection is newly introduced in the runner's loop, and it isn't supposed to let any exception escape unless there's a strong reason.
I understand your concern; I will add tests to cover this change.
Catching all types of exceptions is generally considered bad practice as it may hide more serious issues or put the program in an unknown state, causing hard-to-debug problems later on. But in this case it may be warranted, I can't really tell :)
I see, will update the catching based on the exceptions I observed during my tests.
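For illustration, a minimal sketch of what that narrower catch might look like, assuming the exception types named later in this thread (zmq.error.ZMQError, msgpack's ExtraData, UnicodeDecodeError) and the attribute names from the excerpt above; this is a hedged sketch, not the exact PR code:

```python
import logging

import msgpack.exceptions as msgerr
import zmq.error as zmqerr

from locust import rpc  # assumption: module exposing the Client used in the excerpt above

logger = logging.getLogger(__name__)


class SlaveLocustRunner:
    # master_host, master_port, client_id and self.client are assumed to be set in __init__

    def reset_connection(self):
        logger.info("Reset connection to master")
        try:
            self.client.close()
            self.client = rpc.Client(self.master_host, self.master_port, self.client_id)
        except (zmqerr.ZMQError, msgerr.ExtraData, UnicodeDecodeError) as e:
            logger.error("Temporary failure when resetting connection: %s, will retry later.", e)
```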
Do we really need to wrap the exceptions in our own exception class? I don't really see what value that adds.
Also, if we're wrapping the exceptions we should use …
Good suggestion! I will update it. The main purpose is to handle these exceptions in one place rather than scattering them across runners.py. It also makes clear how to deal with RPCError, which reduces the maintenance effort.
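As a rough sketch of the wrapping being described here (class and module layout are assumptions based on this thread, not the actual locust/rpc code):

```python
import msgpack
import msgpack.exceptions as msgerr
import zmq
import zmq.error as zmqerr


class RPCError(Exception):
    """Raised when the underlying transport or message decoding fails."""


class Client:
    def __init__(self, host, port, identity):
        context = zmq.Context()
        self.socket = context.socket(zmq.DEALER)
        self.socket.setsockopt(zmq.IDENTITY, identity.encode())
        self.socket.connect("tcp://%s:%i" % (host, port))

    def send(self, msg):
        try:
            self.socket.send(msgpack.packb(msg))
        except zmqerr.ZMQError as e:
            raise RPCError("ZMQ send failure") from e

    def recv(self):
        try:
            data = self.socket.recv()
            return msgpack.unpackb(data, raw=False)
        except msgerr.ExtraData as e:
            raise RPCError("ZMQ received a corrupted message") from e
        except UnicodeDecodeError as e:
            raise RPCError("ZMQ failed to decode the message") from e
```

With something like this, runners.py only ever sees RPCError, regardless of which underlying library actually raised.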
I would like to add a test case for reset_connection in test_runners.py, but I haven't figured out a good way to do it. Feel free to let me know if you have any ideas.
I've added a test case, test_reset_connection. Please have a review.
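A hedged sketch of what such a test might look like (the mock targets, class name and calling convention are assumptions for illustration; the actual test_reset_connection may differ):

```python
import unittest
from unittest import mock

from locust.runners import SlaveLocustRunner  # assumption: where the runner class lives


class TestResetConnection(unittest.TestCase):
    def test_reset_connection_replaces_client(self):
        # use a bare Mock as the runner so no real sockets are opened
        runner = mock.Mock()
        old_client = runner.client
        with mock.patch("locust.rpc.Client") as client_cls:  # assumed import path
            # call the unbound method against the mocked runner instance
            SlaveLocustRunner.reset_connection(runner)
        old_client.close.assert_called_once()
        client_cls.assert_called_once_with(
            runner.master_host, runner.master_port, runner.client_id
        )
```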
Looks nice! But I still don't understand what wrapping the exception helps with. I would prefer catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc. in the places where it is relevant instead of catching RPCError. Less code, less magic.
That is exactly what I'm doing: catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError etc. in the places where they are relevant. My point is that runners.py deals with rpc but has no idea or context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) or how to handle them.
I don't quite understand what you mean by "it has no idea or context about these errors and how to handle them". Why can't the runner do …?
The rpc module deals with zmq, message decoding and msgpack directly, so it has the context for these exceptions and knows their possible causes; it can wrap them together and add the cause information. The runners don't have that context and have no knowledge of which scenarios raise these exceptions. I also totally disagree that it complicates the code. The handling of these three exceptions is the same, so they are wrapped in a unified exception with a description, and the callers only need to care about that one exception rather than handling the underlying ones case by case. Otherwise, in one place one of the three would have to be caught, in another place two of the three or all three. That is complicated, and it requires the caller to understand the details of rpc.
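To illustrate the caller side this argument describes, a rough sketch of a worker heartbeat loop that only has to know about RPCError (HEARTBEAT_INTERVAL, Message, connection_broken and the import paths are assumptions based on the thread, not the exact PR code):

```python
import logging

import gevent

from locust.exception import RPCError    # assumption: where the wrapped exception lives
from locust.rpc.protocol import Message  # assumption

logger = logging.getLogger(__name__)
HEARTBEAT_INTERVAL = 1  # seconds; assumed default


class SlaveLocustRunner:
    # __init__ is assumed to set self.client, self.client_id, self.slave_state, self.connection_broken

    def heartbeat(self):
        while True:
            try:
                self.client.send(Message("heartbeat", {"state": self.slave_state}, self.client_id))
            except RPCError as e:
                logger.error("RPCError found when sending heartbeat: %s", e)
                self.connection_broken = True
                self.reset_connection()
            gevent.sleep(HEARTBEAT_INTERVAL)
```

At this level it doesn't matter whether the underlying cause was a ZMQ error or a decoding error; the recovery path is the same.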
Hi @delulu ! I'm sorry we were not able to agree on this. I would love to see more robust connection handling, but I won't merge with the (IMO) needless exception wrapping. Perhaps I can at least convince you that the wrapping is not so important to you as to stop the fix? If you make the requested changes & have a look at possibly speeding up the test case I'll be happy to merge.
Unfortunately I don't have time to review the full PR at the moment. However, I don't see any problem with raising a common RPCError exception (from the specific one), as long as the proper way to handle them in runners.py is the same. I actually think it makes the abstraction less leaky.
I would love to see something merged to ensure better communication between slaves and masters 🙏 I still can't reproduce it reliably, but we still see, now and again, missing slaves or slaves that don't stop hatching.
If I'm the only one who thinks the exception wrapping is weird, and someone resolves the conflicts, then I'm OK with merging.
Sorry for my late response. @cyberw I have to say that the wrapping is as important as the fix, because I think it is the right thing to do. If you see anything wrong, please point it out and convince me, and I'll be happy to make the changes. As for the test case you mentioned above, it tests three scenarios:
For each scenario, it takes about 3 seconds to get the message, and the test case doesn't work as expected when I try to reduce the sleep time to 2 seconds. I'll rebase on the latest master branch and test it, and will commit the change when the tests look good.
I disagree, but as I said, if I'm the only one who considers it less clean than just catching the underlying exceptions, I won't argue any more. Maybe I am missing something, but let's move on; it's not that important. As for the test cases, couldn't you just reduce the timeouts before the test (and reduce the sleeps)? Or is that not possible for some reason?
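For reference, a sketch of the kind of timeout reduction being suggested (patching module-level HEARTBEAT_INTERVAL/HEARTBEAT_LIVENESS constants in locust.runners is an assumption about where the timing lives; the next comment explains why this may not help here):

```python
import unittest
from unittest import mock


class TestResetConnectionFast(unittest.TestCase):
    def test_heartbeat_with_short_interval(self):
        # shrink the (assumed) timing constants so the heartbeat loop and the
        # test's sleeps can both be much shorter than the default 3+ seconds
        with mock.patch("locust.runners.HEARTBEAT_INTERVAL", 0.1), \
             mock.patch("locust.runners.HEARTBEAT_LIVENESS", 3):
            pass  # start the master/slave runners and assert on connection state here
```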
When a message is sent, it seems to take 3 seconds to reach the line of code that updates connection_broken: when I reduce the sleep to 2 seconds, the status of connection_broken is not updated as expected, so I can't reduce the sleep times. The conflicts have been resolved, and the latest branch has been tested with a full day of running; please take a further look.
Thanks for your contribution!
The reason for this was the … What's the purpose of the …? I also removed the case with an unhandled exception in …
@heyman you explain it well, and your fix looks nice! I forgot the mocked rpc context in the test. As for …, I'm fine with you removing that case; it was there to check that only RPCError will trigger the …
Why do we need to call …?
Because I prefer to make it consistent with WorkerLocustRunner, where the connection status check and connection reset are done in the heartbeat.
With heartbeating enabled, I still noticed network issues (packet drops, invalid byte streams) during long-term runs on a k8s cluster (overlay network).
A straightforward fix is to reestablish the connection whenever a network issue is detected.
I've applied this fix in my locust tests over a week of continuous running, and it works as expected.
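Putting the pieces together, a minimal sketch of the master-side counterpart of this idea, assuming an rpc.Server that raises the same wrapped RPCError as above (names, intervals and message types are illustrative, not the exact PR code):

```python
import logging

import gevent

from locust import rpc                  # assumption
from locust.exception import RPCError   # assumption

logger = logging.getLogger(__name__)
FALLBACK_INTERVAL = 5  # seconds to back off after a broken connection; assumed value


class MasterLocustRunner:
    # __init__ is assumed to create self.server = rpc.Server(self.master_bind_host, self.master_bind_port)

    def reset_connection(self):
        logger.info("Reset connection to slave")
        try:
            self.server.close()
            self.server = rpc.Server(self.master_bind_host, self.master_bind_port)
        except RPCError as e:
            logger.error("Temporary failure when resetting connection: %s, will retry later.", e)

    def client_listener(self):
        while True:
            try:
                client_id, msg = self.server.recv_from_client()
            except RPCError as e:
                logger.error("RPCError found when receiving from client: %s", e)
                self.connection_broken = True
                gevent.sleep(FALLBACK_INTERVAL)
                self.reset_connection()
                continue
            self.connection_broken = False
            # ... dispatch msg ("client_ready", "heartbeat", "stats", "quit", ...) ...
```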