UnboundLocalError after receiving ZMQ corrupted message #2260

Closed
marcinh opened this issue Nov 18, 2022 · 17 comments
@marcinh (Contributor) commented Nov 18, 2022

PR #2120 introduced a problem with the handling of corrupted ZMQ messages, which leads to:
UnboundLocalError: local variable 'client_id' referenced before assignment.

locust-master: 2022-11-11T07:03:39.637+0000 |CRITICAL|locust.runners           |  Unhandled exception in greenlet: <Greenlet at 0x7fab1317dae0: <bound method MasterRunner.client_listener of <locust.runners.MasterRunner object at 0x7fab0767d580>>>
locust-master: Traceback (most recent call last):
locust-master: File "/root/.local/share/virtualenvs/locust-1T7N9san/lib/python3.9/site-packages/locust/rpc/zmqrpc.py", line 44, in recv_from_client
locust-master: addr = data[0].decode()
locust-master: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 9: invalid start byte
locust-master: The above exception was the direct cause of the following exception:
locust-master: Traceback (most recent call last):
locust-master: File "/root/.local/share/virtualenvs/locust-1T7N9san/lib/python3.9/site-packages/locust/runners.py", line 960, in client_listener
locust-master: client_id, msg = self.server.recv_from_client()
locust-master: File "/root/.local/share/virtualenvs/locust-1T7N9san/lib/python3.9/site-packages/locust/rpc/zmqrpc.py", line 46, in recv_from_client
locust-master: raise RPCReceiveError("ZMQ interrupted or corrupted message") from e
locust-master: locust.exception.RPCReceiveError: ZMQ interrupted or corrupted message
locust-master: During handling of the above exception, another exception occurred:
locust-master: Traceback (most recent call last):
locust-master: File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
locust-master: File "/root/.local/share/virtualenvs/locust-1T7N9san/lib/python3.9/site-packages/locust/runners.py", line 962, in client_listener
locust-master: logger.error(f"RPCError when receiving from client: {e}. Will reset client {client_id}.")
locust-master: UnboundLocalError: local variable 'client_id' referenced before assignment
locust-master: <gevent._socket3.socket at 0x7fab12f7ac40 object, fd=19, family=2, type=1, proto=0>: Expected GET method: '\x16\x03\x01\x010\x01\x00\x01,\x03\x03\x9a.2ã*:ª\x86\x94\x91}Mbv1\x18¶\xa0HC\x80÷väRòM\x08&û!¯\x00\x00¬À0À,À(À$À\x14À\n

Locust: 2.10.1
Python: 3.9
OS: CentOS

@cyberw (Collaborator) commented Nov 18, 2022

Please only file bugs for the latest version, 2.13 at the moment. Is it still an issue there?

@marcinh (Contributor, Author) commented Nov 18, 2022

That part of the code is the same.

@cyberw (Collaborator) commented Nov 18, 2022

@solowalker27 Maybe you have a minute to look at this?

@marcinh (Contributor, Author) commented Nov 24, 2022

@cyberw

I've done some analysis of the problematic code, and it seems to me that it does not work the way it's supposed to at all.

        while True:
            try:
                client_id, msg = self.server.recv_from_client()
            except RPCReceiveError as e:
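                # NOTE: client_id may be unbound here (first iteration) or still hold the
                # value from a previous iteration; see the two cases below.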
                logger.error(f"RPCError when receiving from client: {e}. Will reset client {client_id}.")
                try:
                    self.server.send_to_client(Message("reconnect", None, client_id))
                except Exception as e:
                    logger.error(f"Error sending reconnect message to worker: {e}. Will reset RPC server.")
                    self.connection_broken = True
                    gevent.sleep(FALLBACK_INTERVAL)
                    continue

Case 1:
The exception is raised in the first loop iteration - client_id has never been assigned, so we hit the issue described in the first post.

Case 2:
The exception appears in a subsequent iteration - client_id still holds the value from the previous iteration, so a random (wrong) worker gets reconnected.

Additionally, because of this code Locust has no protection against arbitrary external traffic on its bind port. I made a simple test: 1 master + 1 worker, sent a plain GET to the master bind port, and the whole ZMQ communication collapsed (see the sketch below).
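For reference, a minimal sketch of that test, assuming the master runs locally on the default master bind port 5557 (adjust host and port to your setup):

    # Rough repro sketch: connect to the master's ZMQ bind port and send plain HTTP bytes.
    # Assumption: a locally running master on the default bind port 5557.
    import socket

    with socket.create_connection(("127.0.0.1", 5557), timeout=5) as s:
        # Anything that isn't a valid ZMQ frame; as described above, non-ZMQ traffic
        # on this port was enough to break the master/worker ZMQ communication.
        s.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")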

In my opinion this code should be replaced with a simple warning in the exception handling, and the message simply ignored. A worker should only be reconnected when its heartbeat is missing. What do you think?
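A rough sketch of what that simpler handling could look like (just an illustration, not a final patch; the exact log wording is a placeholder):

        while True:
            try:
                client_id, msg = self.server.recv_from_client()
            except RPCReceiveError as e:
                # We can't reliably tell which client (if any) sent the corrupted message,
                # so just warn and keep listening; dead workers are already detected via
                # missing heartbeats, so no reconnect or RPC-server reset is attempted here.
                logger.warning(f"Discarding corrupted message received by master: {e}")
                continue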

@marcinh (Contributor, Author) commented Nov 24, 2022

I opened a bug for pyzmq about the sender address not being available when a message is corrupted, which is why it is not included in RPCReceiveError. If they fix it, we could think about more specific error handling.

@cyberw (Collaborator) commented Nov 24, 2022

Oh yeah, I see it now, that code looks very strange. Tbh, I'm not sure PR #2120 was a good idea at all.

@solowalker27 (Contributor) commented:
Sorry for the bug. Perhaps it shouldn't try to reference the client_id? Or maybe still try if it's not None but give a warning that it might not be the right address. At any rate, the code worked for me in solving the issues I had. Perhaps a simpler try/catch with just a warning might work, too, without trying to reset any connections? I'm not going to have time to revisit this for a long while, though, so someone else would need to make a new PR to fix this better, if desired.

@marcinh (Contributor, Author) commented Nov 25, 2022

I will probably have some time to fix it in the near future.

@cyberw (Collaborator) commented Nov 26, 2022

The new/current implementation (as I understand it now) doesn't actually know which connection to reset - it just resets the last one it got a message from, so I'm not sure the current approach is salvageable.

@marcinh (Contributor, Author) commented Nov 29, 2022

I would just log a warning and ignore such messages. When the ZMQ implementation changes and provides the sender address, we can add a reconnect mechanism. What do you think?

@cyberw (Collaborator) commented Nov 29, 2022

Tbh, I don't know whether the old implementation was better or worse. Waiting for a change in the ZMQ implementation that can tell us which connection to reset might take a long time.

I think the main change in #2120 (apart from attempting to reset worker connections) is that it no longer resets the RPC server on errors, which may be a good thing or a bad thing (I am not very familiar with this part of the code). Maybe @delulu (who made some changes in this area before) or @heyman has an opinion?

I tried reverting the old PR and it was easy enough. I think that is the right thing to do here, but I'm not sure.

@marcinh (Contributor, Author) commented Nov 29, 2022

From my perspective, when you use Locust inside a company with security restrictions and periodic scans for open ports, resetting the whole RPC server is not a good thing. This is what we are facing now: our results are broken during such scans because of the bouncing RPC connection. That's why I'm pushing for a simple warning in the exception handling, or at least for a way to control this behavior via some kind of config entry (like in the case of HEARTBEAT_INTERVAL).

@cyberw (Collaborator) commented Nov 29, 2022

But if we don't reset the RPC server, won't the connection just stay broken? This was introduced in #1280.

What is your security scan actually doing? Just closing ports it doesn't like?

@marcinh (Contributor, Author) commented Nov 29, 2022

Catching RPCReceiveError doesn't mean the connection is broken - it means the server couldn't parse an incoming message, so imho resetting the connection is not needed here.

As posted in the first message, you can send a simple GET to the server to trigger such an exception. Frankly, I don't know what the scans are actually doing - we're investigating that at the moment.

@cyberw (Collaborator) commented Nov 29, 2022

Oh, now I get it - it's just a port scan; I thought it was doing something more intrusive. Yeah, restarting the server because of that makes no sense.

@cyberw (Collaborator) commented Nov 29, 2022

We can do a proper fix later. For now, just changing it to a warning is good; looking forward to your PR.

@marcinh (Contributor, Author) commented Dec 1, 2022

Thanks for the quick release!
