Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Receptor does not handle exceptions properly when starting the server #133

Closed
elyezer opened this issue Feb 7, 2020 · 6 comments
Closed
Assignees
Labels
bug Something isn't working needs_test This item needs to be tested

Comments

@elyezer
Copy link
Member

elyezer commented Feb 7, 2020

Receptor is ignoring any exception raised when starting the server and continue if everything was alright. It should check if it could start the server before trying to do anything else.

To reproduce this consider that you got a controller running on port 9999 by running the following command:

$ receptor --debug \
    --node-id=controller \
    -d /tmp/controller \
    controller \
    --listen=receptor://127.0.0.1:9999 

Then try to start a node and also make it listen on port 9999:

$ receptor --debug \
    --node-id=node-a \
    -d /tmp/node-a \
    node \
    --listen=receptor://127.0.0.1:9999 \
    --peer=receptor://localhost:9999

INFO 2020-02-07 14:48:08,722 node-a entrypoints Running as Receptor node with ID: node-a
INFO 2020-02-07 14:48:08,722 node-a controller Serving on receptor://127.0.0.1:9999
INFO 2020-02-07 14:48:08,723 node-a controller Connecting to peer receptor://localhost:9999
Task exception was never retrieved
future: <Task finished coro=<start_server() done, defined at /usr/lib64/python3.6/asyncio/streams.py:86> exception=OSError(98, "error while attempting to bind on address ('127.0.0.1', 9999): address already in use")>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1073, in create_server
    sock.bind(sa)
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/streams.py", line 119, in start_server
    return (yield from loop.create_server(factory, host, port, **kwds))
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1077, in create_server
    % (sa, err.strerror.lower()))
OSError: [Errno 98] error while attempting to bind on address ('127.0.0.1', 9999): address already in use
DEBUG 2020-02-07 14:48:08,727 node-a base sending HI
DEBUG 2020-02-07 14:48:08,728 node-a base waiting for HI
DEBUG 2020-02-07 14:48:08,730 node-a base sending routes
DEBUG 2020-02-07 14:48:08,730 node-a receptor Emitting Route Advertisements, excluding set()
DEBUG 2020-02-07 14:48:08,731 node-a base starting normal loop
DEBUG 2020-02-07 14:48:08,732 node-a receptor spawning message_handler
DEBUG 2020-02-07 14:48:08,733 node-a receptor message_handler: FramedMessage(msg_id=107637885988034463241505519927166033911, header={'cmd': 'ROUTE', 'id': 'controller', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 1]], 'seen': ['controller', 'node-a']}, payload=None)
DEBUG 2020-02-07 14:48:08,733 node-a receptor Emitting Route Advertisements, excluding {'node-a', 'controller'}

So it tried to bind on a address that was already in use but never checked that it failed to connect and continued as if everything was fine. When the above happen the following was shown on the controller logs:

DEBUG 2020-02-07 14:48:08,727 controller base waiting for HI
DEBUG 2020-02-07 14:48:08,728 controller base sending HI
DEBUG 2020-02-07 14:48:08,729 controller base sending routes
DEBUG 2020-02-07 14:48:08,729 controller receptor Emitting Route Advertisements, excluding set()
DEBUG 2020-02-07 14:48:08,731 controller base starting normal loop
DEBUG 2020-02-07 14:48:08,731 controller receptor spawning message_handler
DEBUG 2020-02-07 14:48:08,735 controller receptor message_handler: FramedMessage(msg_id=221019859519773369567608306767646276616, header={'cmd': 'ROUTE', 'id': 'node-a', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 100]], 'seen': ['controller', 'node-a']}, payload=None)
DEBUG 2020-02-07 14:48:08,736 controller receptor Emitting Route Advertisements, excluding {'controller', 'node-a'}
DEBUG 2020-02-07 14:48:08,738 controller receptor message_handler: FramedMessage(msg_id=111384449847426542261153884145803424085, header={'cmd': 'ROUTE', 'id': 'node-a', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 1]], 'seen': ['node-a', 'controller']}, payload=None)
DEBUG 2020-02-07 14:48:08,738 controller receptor Emitting Route Advertisements, excluding {'controller', 'node-a'}

So the node-a connected to the controller and it was considered to be a node of the network no problem since it connected to the controller but it could not serve since the address was already in use.

On the scenario above you can actually send work to node-a but you can't have nodes connected to it since it failed to start the listener server. In other words it handler the --peer option no problem but failed to properly handle the --listen.

With that said, should receptor fail in case any of the expected behavior fails to start? Maybe failing only if --listen fails is the way to go since --peer could point to a --peer that is temporarily unavailable and will come back at some point. Not sure here but getting receptor running even though it had initialization issues may bring some headache in the future, specially when trying to debug things.

@ghjm
Copy link
Contributor

ghjm commented Mar 2, 2020

The duplicate --listen issue is fixed by ansible/receptor#149, which also makes some changes to how exceptions are handled on server startup. I think this issue is resolved but let me know if you still see a problem.

@ghjm ghjm self-assigned this Mar 2, 2020
@ghjm ghjm added bug Something isn't working needs_test This item needs to be tested labels Mar 2, 2020
@Ichimonji10
Copy link
Contributor

Ichimonji10 commented Mar 5, 2020

Unfortunately, this issue isn't solved in the latest version of receptor. If I start two receptor processes and tell them to bind on the same socket address (by default: 8888), then one will bind, and the other will emit an error about being unable to bind but continue running.

$ poetry run receptor --data-dir="$(mktemp --directory)" node
$ poetry run receptor --data-dir="$(mktemp --directory)" node
ERROR 2020-03-05 13:57:20,800  controller [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use
Traceback (most recent call last):
  File "/home/ichimonji10/code/receptor/receptor/controller.py", line 46, in exit_on_exceptions_in
    await task
  File "/usr/lib/python3.8/asyncio/streams.py", line 94, in start_server
    return await loop.create_server(factory, host, port, **kwds)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1459, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use

...unless this behavior is intentional? That doesn't seem to be the case, though. From #149:

When two receptor nodes attempt to bind to the same port, the second one should error out and exit.

@elyezer
Copy link
Member Author

elyezer commented Mar 5, 2020

I am experiencing the same as @Ichimonji10 is experiencing.

One thing to add is that I tried to stop both nodes and got the following stack trace:

^CERROR 2020-03-05 16:33:49,409 controller __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/__main__.py", line 59, in main
    config.go()
  File "/home/elyezer/code/receptor/receptor/receptor/config.py", line 525, in go
    self._parsed_args.func(self)
  File "/home/elyezer/code/receptor/receptor/receptor/entrypoints.py", line 72, in run_as_controller
    cleanup_tmpdir()
TypeError: cleanup_tmpdir() missing 1 required positional argument: 'controller'

Edit: I've submitted ansible/receptor#154 to fix that.

@Ichimonji10
Copy link
Contributor

If I start two receptor nodes and ask them to bind to the same port, then the second one will exit:

$ poetry run receptor --data-dir="$(mktemp --directory)" node
ERROR 2020-03-16 14:31:46,753  controller [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use
Traceback (most recent call last):
  File "/home/ichimonji10/code/receptor/receptor/controller.py", line 44, in exit_on_exceptions_in
    await task
  File "/usr/lib/python3.8/asyncio/streams.py", line 94, in start_server
    return await loop.create_server(factory, host, port, **kwds)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1463, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use

However, the receptor process which bails exits with a status code of zero. That seems like a bug. Would you like to open a new issue for incorrect exit status codes, or does this issue suffice?

@ghjm
Copy link
Contributor

ghjm commented Mar 16, 2020

I agree it should set the exit code. I have no preference whether we track this on the current issue or a new issue.

@Ichimonji10
Copy link
Contributor

This issue seems to focus on the issue of "receptor should bail but doesn't," which is fixed, and which I'm therefore closing.

#180 tracks "receptor is bailing with the wrong status code."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs_test This item needs to be tested
Projects
None yet
Development

No branches or pull requests

3 participants