-
Notifications
You must be signed in to change notification settings - Fork 788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove all batches related to a peer on disconnect #5969
Remove all batches related to a peer on disconnect #5969
Conversation
For block lookups we should iterate the active requests here and fail them lighthouse/beacon_node/network/src/sync/network_context.rs Lines 120 to 123 in f1d88ba
Plus add tests to ensure that lookups can handle getting late errors for a requests that has already been failed for disconnection. Current lookup code is quite strict in that it expects a single event per network request. |
Note that sync doesn't need to check if the peer is connected before sending a request. Consider these two cases: Case 1: (Receive disconnect after request)
Case 2: (Receive disconnect before request)
|
I've re-targeted this at |
I think we can probably also now remove the RPC Error logic we added. I think its not that necessary if we are now relying on the PeerDisconnection message. i.e I think we can remove this entire function: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/rpc/handler.rs#L356 |
I can also remove this extra code in a future PR if we prefer to test this standalone? |
@AgeManning removed in cd17f9f @dapplion I changed lookups peer_disconnected behaviour such that it drops a lookup only if there are no peers to drive the request forward or if there are no existing downloaded components. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for driving this Pawan!
…n-disconnect-revert
Squashed commit of the following: commit 0c8efc1 Merge: 62167fb 758b58c Author: Michael Sproul <[email protected]> Date: Tue Jun 25 11:21:39 2024 +1000 Merge remote-tracking branch 'origin/release-v5.2.1' into rpc-error-on-disconnect-revert commit 62167fb Author: Pawan Dhananjay <[email protected]> Date: Tue Jun 25 02:30:53 2024 +0530 Fix test commit b930c7c Author: Pawan Dhananjay <[email protected]> Date: Tue Jun 25 01:58:47 2024 +0530 Cleanup commit dfaf238 Author: Pawan Dhananjay <[email protected]> Date: Tue Jun 25 01:56:29 2024 +0530 Revert pretty response types commit 3e1c41a Author: Pawan Dhananjay <[email protected]> Date: Tue Jun 25 01:45:20 2024 +0530 Fix issue raised by lion commit 6ead176 Author: Pawan Dhananjay <[email protected]> Date: Mon Jun 24 18:13:24 2024 +0530 Remove redundant test commit cd17f9f Author: Pawan Dhananjay <[email protected]> Date: Mon Jun 24 15:39:34 2024 +0530 Remove poll_close on rpc behaviour commit bc10fb2 Author: Pawan Dhananjay <[email protected]> Date: Mon Jun 24 15:25:56 2024 +0530 Fix tests commit a8f64f2 Author: Pawan Dhananjay <[email protected]> Date: Mon Jun 24 13:03:30 2024 +0530 Remove lookup if it cannot progress commit c7fd21d Author: Pawan Dhananjay <[email protected]> Date: Mon Jun 24 10:50:37 2024 +0530 Fix lints commit a38caf4 Author: Pawan Dhananjay <[email protected]> Date: Fri Jun 21 21:43:41 2024 +0530 fmt commit 312be2c Author: Pawan Dhananjay <[email protected]> Date: Fri Jun 21 21:43:33 2024 +0530 Pretty response types commit 70af7d1 Author: Pawan Dhananjay <[email protected]> Date: Fri Jun 21 18:02:40 2024 +0530 Allow lookups to continue in case of disconnections commit 9d90e39 Author: Pawan Dhananjay <[email protected]> Date: Fri Jun 21 17:34:01 2024 +0530 Cleanup map entries after disconnect commit 92a6e77 Author: Pawan Dhananjay <[email protected]> Date: Thu Jun 20 12:50:19 2024 +0530 Remove all batches related to a peer on disconnect
@pawanjay176 Added some comments relevant to the API expectations of lookup sync. I think they are helpful to understand bug sources and overall behaviour |
Add comments about expected event api
@mergify queue |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at bf4cbd3 |
Issue Addressed
N/A
Proposed Changes
In #5680 , we decoupled a peer disconnect from an rpc error to simplify the logic in sync a bit.
The peer disconnect message from the network was responsible for removing the peer from all the sync data structures.
The rpc error was responsible for re-requesting the failed batch/lookup.
We assumed that an rpc error would always go together with a peer disconnect, however we learned the hard way that there are many edge cases where the rpc error never gets generated when the peer disconnects. e.g. the peer manager disconnects a peer that has its rpc response mid-flight, peer disconnects before a rpc request reaches the network etc. There are many other cases we may not know about.
This leads to multiple bugs in sync where sync stalls because a batch/lookup never gets re-requested on peer disconnects.
#5967 was an attempt at adding timeouts to the rpc requests sent from sync that trigger re-requests when the timer elapses. However, as @AgeManning noted, this would lead to many more edge cases.
This PR triggers re-request of batches/lookups on peer disconnects partially reverting #5680 . This isn't the cleanest but considering that we cannot always rely on
lighthouse_network
to generate rpc errors to go with peer disconnects, this seems the best option.I intend to test this extensively on mainnet before merging