UnboundLocalError after receiving ZMQ corrupted message #2260
Comments
Please only file bugs for the latest version, 2.13 at the moment. Is it still an issue there?
That part of the code is the same
@solowalker27 Maybe you have a minute to look at this?
I've done some analysis of the problematic code, and to me it seems it does not work the way it's supposed to at all.
Case 1: Case 2:
Additionally, Locust doesn't have any protection against external communication on the bind port because of this code. I made a simple test -> 1 master + 1 worker, and sent a simple GET to the master's bind port; the whole ZMQ communication collapsed. In my opinion this code should be replaced with a simple warning in the exception handling, and the message should simply be ignored. Worker reconnection should only happen when a heartbeat is missing. What do you think?
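Something like this minimal sketch is what I mean (the helper names such as handle_message are made up; it only assumes the RPC server returns (client_id, msg) from recv_from_client() and raises RPCReceiveError when it cannot parse a message, as the recent code does):

```python
import logging

from locust.exception import RPCReceiveError  # assumed location, as in recent Locust versions

logger = logging.getLogger(__name__)


def receive_loop(server, handle_message):
    """Hypothetical master-side receive loop for the bind socket."""
    while True:
        try:
            client_id, msg = server.recv_from_client()
        except RPCReceiveError:
            # A message we cannot parse (e.g. a stray HTTP GET hitting the
            # bind port) does not mean any worker connection is broken, so
            # just warn and keep serving; reconnection stays heartbeat-driven.
            logger.warning("Failed to parse an incoming ZMQ message; ignoring it.")
            continue
        handle_message(client_id, msg)
```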
I opened a bug for pyzmq, since the address is not included in RPCReceiveError. If they fix it, we could think about more specific error handling.
Oh yea, I see now, that code looks very strange. Tbh, I'm not sure PR #2120 was a good idea at all.
Sorry for the bug. Perhaps it shouldn't try to reference the client_id? Or maybe still try if it's not None but give a warning that it might not be the right address. At any rate, the code worked for me in solving the issues I had. Perhaps a simpler try/catch with just a warning might work, too, without trying to reset any connections? I'm not going to have time to revisit this for a long while, though, so someone else would need to make a new PR to fix this better, if desired.
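A sketch of that "only if it's not None" idea, purely illustrative and not the actual Locust code (the loop shape and handle_message are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


def client_listener(server, handle_message):
    """Illustrative only: keep client_id always bound and guard the except handler."""
    client_id = None  # ensure the name exists before the first receive
    while True:
        try:
            client_id, msg = server.recv_from_client()
        except Exception:
            if client_id is not None:
                # client_id is just the last sender we heard from, which may
                # not be the peer that produced the corrupted message.
                logger.warning(
                    "Failed to receive a message; last known sender was %s "
                    "(may not be the offending connection)", client_id
                )
            else:
                logger.warning("Failed to receive a message before any sender was seen")
            continue
        handle_message(client_id, msg)
```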
I would probably have some space to fix it in the near future.
The new/current implementation (as I understand it now) doesn't actually know which connection to reset - it just resets the last one it got a message from, so I'm not sure the current approach is salvageable.
I would just log a warning and simply ignore such messages - when the ZMQ implementation changes and provides address data, we can add a reconnect mechanism. What do you think?
Tbh, I don't know whether the old implementation was better or worse. Waiting for a change in the ZMQ implementation that can tell us which connection to reset might take a long time. I think the main change in 2120 (apart from attempting to reset worker connections) is that it no longer resets the rpc server on errors, which may be a good thing or a bad thing (I am not very familiar with this part of the code). Maybe @delulu (who made some changes in this area before) or @heyman has an opinion? I tried reverting the old PR and it was easy enough. I think that is the right thing to do here, but I'm not sure.
From my perspective, when you use Locust inside a company with security restrictions and periodic scans for open ports, resetting the whole RPC server is not a good thing. This is what we are facing now, and our results during such a scan are broken because of the bouncing RPC connection. That's why I'm pushing to have a simple warning as the exception handling, or at least the possibility to control this behavior via some kind of config entry (like in the case of HEARTBEAT_INTERVAL).
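For example, something like the following (the setting and environment variable names are invented here, just to show the kind of knob I mean, mirroring how constants such as HEARTBEAT_INTERVAL are defined at module level):

```python
import os

# Hypothetical setting: whether to reset a connection when an RPC receive error
# occurs. The name LOCUST_RESET_CONNECTION_ON_RPC_ERROR is made up for this sketch.
RESET_CONNECTION_ON_RPC_ERROR = os.getenv(
    "LOCUST_RESET_CONNECTION_ON_RPC_ERROR", "true"
).lower() in ("1", "true", "yes")
```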
But if we don't reset the rpc server, won't the connection just stay broken? This was introduced in #1280. What is your security scan actually doing? Just closing ports it doesn't like?
Catching RPCReceiveError doesn't mean the connection is broken - it means the server couldn't parse an incoming message, so imho resetting the connection is not needed here. As posted in the first message, you can send a simple GET to the server to trigger such an exception. Frankly, I don't know what such scans are actually doing - we're investigating it at the moment.
Oh, now I get it - it's just a port scan, I thought it was doing something more intrusive. Yea, restarting the server because of that makes no sense.
We can do a proper fix later. For now just changing it to a warning is good, looking forward to your PR.
Thanks for the quick release!
PR #2120 introduced a problem with wrong handling of corrupted ZMQ messages, which leads to:
UnboundLocalError: local variable 'client_id' referenced before assignment.
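For illustration, this is the classic pattern of a name assigned inside a try block but referenced in its except handler; a minimal, self-contained reproduction of the pattern (not the actual Locust code):

```python
def handle_incoming(raw: bytes) -> None:
    try:
        # If the unpacking fails, client_id is never bound in this scope...
        client_id, payload = raw.split(b":", 1)
        print("got message from", client_id)
    except ValueError:
        # ...so referencing it here raises
        # UnboundLocalError: local variable 'client_id' referenced before assignment
        print("resetting connection for", client_id)


handle_incoming(b"corrupted-message-without-separator")
```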
Locust: 2.10.1
Python: 3.9
OS: CentOS