Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core/swarm] Emit events for active connection close and fix disconnect(). #1619

Merged
merged 16 commits into from
Aug 4, 2020

Conversation

romanb
Copy link
Contributor

@romanb romanb commented Jun 19, 2020

The Network does currently not emit events for actively closed connections, e.g. via EstablishedConnection::close
or ConnectedPeer::disconnect(). As a result, when actively closing connections, there will be ConnectionEstablished
events emitted without eventually a matching ConnectionClosed event. This seems undesirable and has the consequence that
the Swarm::ban_peer_id feature in libp2p-swarm does not result in appropriate calls to NetworkBehaviour::inject_connection_closed and NetworkBehaviour::inject_disconnected. Furthermore, the disconnect() functionality in libp2p-core is currently broken as it leaves the Pool in an inconsistent state.

This PR does the following:

  1. When connection background tasks are dropped (i.e. removed from the Manager), they always terminate immediately, without attempting an orderly close of the connection.

  2. An orderly close is sent to the background task of a connection as a regular command. The background task emits a Closed event before terminating.

  3. Pool::disconnect() removes all connection tasks for the affected peer from the Manager, i.e. without an orderly close, thereby also fixing the discovered state inconsistency due to not removing the corresponding entries in the Pool itself after removing them from the Manager. The Pool ensures that ConnectionClosed events are emitted for these connections. The former NetworkEvent::ConnectionError has been renamed to NetworkEvent::ConnectionClosed with the error field being an Option, thus error: None indicates an active (but not necessarily orderly) close.

  4. A new test is added to libp2p-swarm that exercises the ban/unban functionality (currently somewhat broken, see Banning a connected peer does not result in calls to NetworkBehaviour methods #1584) and places assertions on the number and order of calls to the NetworkBehaviour. In that context some new testing utilities have been added to libp2p-swarm as well.

Points (1)-(3) ensure that each NetworkEvent::ConnectionEstablished is eventually paired with a NetworkEvent::ConnectionClosed, also in the case of actively closed connections (with orderly shutdown or not), thus addressing #1584. (1) and (2) went along with some internal simplifications in the state machine implementation of the background Tasks.

The `Network` does currently not emit events for actively
closed connections, e.g. via `EstablishedConnection::close`
or `ConnectedPeer::disconnect()`. As a result, when actively
closing connections, there will be `ConnectionEstablished`
events emitted without eventually a matching `ConnectionClosed`
event. This seems undesirable and has the consequence that
the `Swarm::ban_peer_id` feature in `libp2p-swarm` does not
result in appropriate calls to `NetworkBehaviour::inject_connection_closed`
and `NetworkBehaviour::inject_disconnected`. Furthermore,
the `disconnect()` functionality in `libp2p-core` is currently
broken as it leaves the `Pool` in an inconsistent state.

This commit does the following:

  1. When connection background tasks are dropped
     (i.e. removed from the `Manager`), they
     always terminate immediately, without attempting
     an orderly close of the connection.
  2. An orderly close is sent to the background task
     of a connection as a regular command. The
     background task emits a `Closed` event
     before terminating.
  3. `Pool::disconnect()` removes all connection
     tasks for the affected peer from the `Manager`,
     i.e. without an orderly close, thereby also
     fixing the discovered state inconsistency
     due to not removing the corresponding entries
     in the `Pool` itself after removing them from
     the `Manager`.
  4. A new test is added to `libp2p-swarm` that
     exercises the ban/unban functionality and
     places assertions on the number and order
     of calls to the `NetworkBehaviour`. In that
     context some new testing utilities have
     been added to `libp2p-swarm`.

This addresses libp2p#1584.
@tomaka
Copy link
Member

tomaka commented Jun 19, 2020

I haven't read the code changes yet, but it is completely intentional that no event is emitted if close or disconnect is called.

Unless that has been changed, the API of close and disconnect guarantees that the connection in question will no longer be visible in the API of Network afterwards. The user can assume that calling close/disconnect implicitly generates an instantaneous ConnectionClosed event.

Unless there is a reason why the closing has to be asynchronous on the API level, I do strongly prefer when disconnecting appears to be synchronous.

@romanb
Copy link
Contributor Author

romanb commented Jun 19, 2020

I haven't read the code changes yet, but it is completely intentional that no event is emitted if close or disconnect is called.

I think that is not a good idea, as can be seen by #1584 which is just incorrect behaviour. The Swarm relies on Established/Closed events coming in pairs and it seems cleaner to not make a distinction there between active/passive close.

Unless that has been changed, the API of close and disconnect guarantees that the connection in question will no longer be visible in the API of Network afterwards.

That is still the case for disconnect (used by Swarm::ban_peer_id), which is essentially an ungraceful "go away" signal. A proper close, however, is only complete once the background task has completed the orderly close. Implying an immediate "connection closed" success here, as is done right now, is just incorrect in my opinion, as the orderly shutdown may also fail. As I wrote in the PR description, this PR is making a clear distinction between an orderly shutdown/close of a connection and an abrupt disconnect.

Unless there is a reason why the closing has to be asynchronous on the API level, I do strongly prefer when disconnecting appears to be synchronous.

Nothing much changed on the API-level, since the background tasks are still there. The difference is that a) disconnect() ensures there is never an attempt at a graceful close in the background task, i.e. it stops immediately and b) For each closed connection that has been actively closed (disconnect or close) an event is also emitted that pairs up with the earlier ConnectionEstablished.

@romanb
Copy link
Contributor Author

romanb commented Jun 19, 2020

I want to hightlight again here the motivations of the PR description: a) The disconnect functionality in libp2p-core is currently broken and b) The banning functionality in the Swarm is also broken, partly as a consequence of a) but also because the NetworkBehaviour callbacks are not invoked for closed connections resulting from an active close (including disconnect).

@tomaka
Copy link
Member

tomaka commented Jun 19, 2020

I think that is not a good idea, as can be seen by #1584 which is just incorrect behaviour. The Swarm relies on Established/Closed events coming in pairs and it seems cleaner to not make a distinction there between active/passive close.

My argument is that we should change the Swarm to no longer rely on that. The Swarm knows when close/disconnect is called and can generate a SwarmEvent itself. I don't see any reason to have a 1:1 mapping between the events from the Network and the events from the Swarm.

Implying an immediate "connection closed" success here, as is done right now, is just incorrect in my opinion, as the orderly shutdown may also fail.

I don't understand why it is important to know whether the shutdown has been successful or not. There is nothing the user can do if for example we fail to shut down a connection because of an I/O error.

a) The disconnect functionality in libp2p-core is currently broken

That's not clear to me. I would expect disconnect() (and close) to perform a clean shut down.

but also because the NetworkBehaviour callbacks are not invoked for closed connections resulting from an active close (including disconnect).

Which can be fixed by making the NetworkBehaviour call the callback from within ban_peer_id. We don't need to modify the behaviour of disconnect() for that.

@romanb
Copy link
Contributor Author

romanb commented Jun 20, 2020

I don't see any reason to have a 1:1 mapping between the events from the Network and the events from the Swarm.

Such a an overall 1:1 mapping between Network and Swarm events is not the motivation or goal here, this is only about connection established/closed events. I do see a good reason for having Network::ConnectionEstablishedand Network::ConnectionClosed (and similarly Swarm::ConnectionEstablished/Closed) always come in pairs, because the symmetry makes it easy to understand and write correct code that is handling these events. As witnessed by #1584, it is easy to make the mistake of forgetting that you don't get Closed-events for connections that have been closed as a result of some code path using close() or disconnect(). Now as a user of libp2p-core's Network like the Swarm here, instead of just writing code that handles closed connections in a single place, where you handle Network::ConnectionClosed, you need to remember that this code won't run if you explicitly close() or disconnect(). That is just not a nice API to program against in my opinion.

I don't understand why it is important to know whether the shutdown has been successful or not. There is nothing the user can do if for example we fail to shut down a connection because of an I/O error.

How do you know? If I try to perform a clean shutdown of a connection, which implies flushing and sending all remaining buffered data, I may certainly be interested in whether that succeeded. If it doesn't, I certainly have to assume that some of the data was not sent and may want to take action on that. If I don't care or know at all about whether an orderly connection shutdown succeeds, there isn't much point in doing it in the first place, I can just drop the connection. The current code (before this PR) always tries to perform an orderly close() in the background task when it notices that the Manager has dropped the task, but at that point no-one will or can know about the outcome, which seems pointless to me. If errors on close/shutdown were irrelevant as you say, surely there would be no need to even return an error from futures::io::AsyncWrite::poll_close or tokio::io::AsyncWrite::poll_shutdown.

a) The disconnect functionality in libp2p-core is currently broken

That's not clear to me. I would expect disconnect() (and close) to perform a clean shut down.

What is not clear? Why it is broken? As I hinted at in the PR description and should be seen in the diff (and has been reproduced by the included new libp2p-swarm test), Pool::disconnect() currently leaves the Pool in an inconsistent state because it does not remove the corresponding connection entries it removed from the Manager from its own established and pending lists.
If instead you mean it is not clear why I made disconnect() and close() behave differently, that is motivated by two things:

  1. The current use of disconnect() in the context of Swarm::ban_peer_id. When you ban a peer, I don't think you want to have the connections to that peer potentially hanging around for quite some time to complete the orderly close(), which is what the background tasks will currently try to do. Rather, you want to drop the connections to the peer immediately.

  2. If you want to cleanly shut down all connections to a peer, you can already do that by iterating through the connections to that peer and either abort() (for pending connections) or close() them (for established connections). There is really no need for a separate disconnect() functionality just to achieve that.

My argument is that we should change the Swarm to no longer rely on that.

The thing is, the Swarm already does and needs to rely on Established / Closed coming in pairs at least for the case where connections are not actively closed, since that is instrumental to deciding when to call inject_disconnected(), i.e. when the last connection closes. You're thus proposing it shouldn't rely on them coming in pairs for the case of actively closed connections, as you say:

The Swarm knows when close/disconnect is called and can generate a SwarmEvent itself.
[..]
Which can be fixed by making the NetworkBehaviour call the callback from within ban_peer_id. We don't need to modify the behaviour of disconnect() for that.

Yes, it can, as long as you can get access to all the information you need to feed into the callbacks and / or events. I think that is possible now, but must also always remain possible in the future (that is, in principle, any information you get in a Network::ConnectionClosed event must be obtainable through other APIs, since any such information may need to be passed to inject_connection_closed or _disconnected). However, because of that and for reasons stated at the beginning of this post whereby you have to repeat the code that needs to run on connection closure in multiple places, I don't think this is a good solution in terms of API design and hence did not propose it but rather an approach that I find preferable, which is the pairing of established/closed events no matter how connections are closed. If you find it preferable to address this in the Swarm only in the way you suggest, feel free to take from this PR what you like and leave what you don't (just don't forget the necessary changes to core::connection::Pool::disconnect(), which I think are necessary in any case).

@romanb
Copy link
Contributor Author

romanb commented Jul 7, 2020

Any further comments / questions? If my arguments are not convincing, and no-one else seems to have an opinion, feel free to close the PR. Otherwise I'm happy to rebase it.

core/src/connection/manager.rs Outdated Show resolved Hide resolved
core/src/connection/manager/task.rs Show resolved Hide resolved
core/src/connection/manager/task.rs Outdated Show resolved Hide resolved
core/src/connection/manager/task.rs Outdated Show resolved Hide resolved
core/src/connection/pool.rs Show resolved Hide resolved
core/src/connection/pool.rs Show resolved Hide resolved
swarm/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@twittner twittner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

core/src/connection/manager.rs Outdated Show resolved Hide resolved
romanb pushed a commit to romanb/rust-libp2p that referenced this pull request Jul 28, 2020
Building on the ability to wait for connection shutdown to
complete introduced in libp2p#1619,
this commit extends the ability for performing graceful
shutdowns in the following ways:

  1. The `ConnectionHandler` (and thus also `ProtocolsHandler`) can
  participate in the shutdown, via new `poll_close` methods. The
  muxer and underlying transport connection only starts closing once
  the connection handler signals readiness to do so.

  2. A `Network` can be gracefully shut down, which involves a
  graceful shutdown of the underlying connection `Pool`. The `Pool`
  in turn proceeds with a shutdown by rejecting new connections
  while draining established connections.

  3. A `Swarm` can be gracefully shut down, which involves a
  graceful shutdown of the underlying `Network` followed by
  polling the `NetworkBehaviour` until it returns `Poll::Pending`,
  i.e. it has no more output.

In particular, the following are important details:

  * Analogous to new inbound and outbound connections during shutdown,
  while a single connection is shutting down, it rejects new inbound substreams
  and, by the return type of `ConnectionHandler::poll_close`,
  no new outbound substreams can be requested.

  * The `NodeHandlerWrapper` managing the `ProtocolsHandler`
  always waits for already ongoing inbound and outbound substream
  upgrades to complete. Since the `NodeHandlerWrapper` is a
  `ConnectionHandler`, the previous point applies w.r.t. new inbound
  and outbound substreams.

  * When the `connection_keep_alive` expires, a graceful shutdown
  is initiated.
romanb pushed a commit to romanb/rust-libp2p that referenced this pull request Jul 29, 2020
Building on the ability to wait for connection shutdown to
complete introduced in libp2p#1619,
this commit extends the ability for performing graceful
shutdowns in the following ways:

  1. The `ConnectionHandler` (and thus also `ProtocolsHandler`) can
  participate in the shutdown, via new `poll_close` methods. The
  muxer and underlying transport connection only starts closing once
  the connection handler signals readiness to do so.

  2. A `Network` can be gracefully shut down, which involves a
  graceful shutdown of the underlying connection `Pool`. The `Pool`
  in turn proceeds with a shutdown by rejecting new connections
  while draining established connections.

  3. A `Swarm` can be gracefully shut down, which involves a
  graceful shutdown of the underlying `Network` followed by
  polling the `NetworkBehaviour` until it returns `Poll::Pending`,
  i.e. it has no more output.

In particular, the following are important details:

  * Analogous to new inbound and outbound connections during shutdown,
  while a single connection is shutting down, it rejects new inbound substreams
  and, by the return type of `ConnectionHandler::poll_close`,
  no new outbound substreams can be requested.

  * The `NodeHandlerWrapper` managing the `ProtocolsHandler`
  always waits for already ongoing inbound and outbound substream
  upgrades to complete. Since the `NodeHandlerWrapper` is a
  `ConnectionHandler`, the previous point applies w.r.t. new inbound
  and outbound substreams.

  * When the `connection_keep_alive` expires, a graceful shutdown
  is initiated.
@romanb
Copy link
Contributor Author

romanb commented Jul 29, 2020

@twittner Thanks for the review, I incorporated pretty much all your suggestions.

@tomaka I didn't hear back from you yet - do you also have a final verdict? I would go ahead with merging eventually unless you still have strong objections and my answers to earlier questions were not convincing. If you just didn't find the time to take a closer look yet, I'm happy to leave this PR open for a while longer. The gist of the changes is still that *Established events are always paired with *Closed events, regardless of how connections are closed. It is true that with these changes connections that are gracefully closed (EstablishedConnection::start_close) do not immediately disappear from the API until they are, well, actually closed, which surely isn't non-sensical. This shouldn't really be a drastic change for a user either, because due to the background tasks any "established" connection one has at hand can suddenly actually be closed, which shows in the API (e.g. the possible errors of notify_handler). It may make even more sense if I mention here that in #1682 while the connection is performing a graceful shutdown but before it is ultimately closed, the handler is still allowed to receive and emit events (but not get or create any new substreams, see that PR). Of course, whether this is in itself a good idea or not should be discussed in that PR and not here.

@romanb romanb merged commit 8e1d4ed into libp2p:master Aug 4, 2020
@romanb romanb deleted the active-close branch September 9, 2020 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants