Starting Connection Forward infinite loop #413
Comments
Is it infinite or is it just …
It's infinite as far as I can tell. When I saw it, it just kept going until the agent eventually shut down (it was happening several times every few minutes for almost an hour).
Yeah, this absolutely can't be right: this agent started up over 15 hours ago and it's still trying to connect to a deployed agent (that's now dead) and my own agent (also dead).
Looks like a message every 20 seconds, and it contacts these two addresses in turn.
As suggested here #415 (comment), these logs might be from the deployed agent trying to reconnect to an agent it previously had an open connection with, after that agent goes offline. If this is the case then it shouldn't be doing this - if an agent goes offline we should stop trying to connect back to it. This is especially the case if the agent is behind a NAT, because even if the agent came back online we wouldn't be able to reach it. We need to wait for that agent to connect back to us instead.
It is intentional that there's no GC of an existing node in the nodegraph until it is pushed out by new nodes. This means a node can go offline, and other nodes may attempt reconnection later. However, the number of reconnection attempts looks incorrect. It's doing it every 20 seconds, which is the time given to start a connection. We should only try to restart a connection to the node upon refreshing the bucket.

The problem is similar to #415, except this time from the perspective of the seed node. There may be 256 buckets, all trying to refresh, and they all end up contacting the one other node the seed node knows about; those attempts should all coalesce during connection startup. What we want to change here is that, now that the task manager is concurrent, they should try to start a connection at roughly the same time, assuming no jitter. If there is jitter, there will be staggered attempts. Either way, we should not see reconnection attempts every 20 seconds. But otherwise, reconnecting to an existing node in the nodegraph is intentional.

We can also add an exponential delay to reconnecting to nodes that no longer respond. An exponential delay would have to be done by the … So if it is within a certain time range, we do nothing. If it exceeds it, we do contact that node. This time range increases by an exponential factor. You can do this just by incrementing the time range by a multiplication factor of …
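A minimal sketch of that exponential window, assuming the 5 minute default and doubling factor mentioned in the later commit message (all names and signatures here are illustrative, not the actual Polykey implementation):

```ts
// Illustrative sketch only: an exponential backoff window per node.
// The default delay and doubling factor are assumptions taken from the
// commit message ("defaults to 5 min and doubles each failure").
type BackoffInfo = {
  lastAttempt: number; // timestamp (ms) of the last failed attempt
  delay: number; // current backoff window (ms)
};

const defaultDelay = 5 * 60 * 1000; // assumed 5 minute starting window
const backoffFactor = 2; // assumed multiplication factor

// If we are still within the time range, do nothing; if it has been
// exceeded, we are allowed to contact the node again.
function canAttemptConnection(
  info: BackoffInfo | undefined,
  now: number = Date.now(),
): boolean {
  if (info == null) return true;
  return now - info.lastAttempt >= info.delay;
}

// Each failed attempt multiplies the window by the backoff factor.
function recordFailedAttempt(
  info: BackoffInfo | undefined,
  now: number = Date.now(),
): BackoffInfo {
  if (info == null) return { lastAttempt: now, delay: defaultDelay };
  return { lastAttempt: now, delay: info.delay * backoffFactor };
}
```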
We could apply this "delay" to both here and #415. Here it would be an exponential delay, but in #415 it'd be a fixed delay. This is because if something is not responding, it's likely to keep not responding, whereas in the other case it's just simply unlikely to return different data (but only if it is within the threshold of a small network).
Updated the description with tasks. Time for this one is a little fuzzy; I'm thinking 0.5-1 day.
I decided to make this an in-memory map to track the backoff period. It seems unnecessary to make this persistent between node restarts, so I don't see the need to put it in the database. As for locking, I think it would be needed; however, updating the backoff data only happens inside of …
Updating in-memory data shouldn't experience race conditions though, due to the single-threaded main thread.
Don't forget to tick off the tasks above too. |
…ip recently offline nodes
This is done with an in-memory map of `NodeIdString` to some data tracking the backoff period. It defaults to 5 min and doubles each failure. #413
Describe the bug
This bug has been observed on a deployed agent on AWS attempting to connect to a local agent that has been stopped. The deployed agent continuously logs the message
INFO:ConnectionForward [host]:[port]:Starting Connection Forward
(where the host and port are those of the local agent that was shut down) until the deployed agent eventually stops. There appears to be some sort of infinite loop here.
To Reproduce
npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json
Expected behavior
We do not expect to see any sort of infinite looping or reconnecting. If an agent goes offline, we shouldn't continue trying to reconnect to it.
Tasks
- [ ] Track the backoff period with an in-memory map of `NodeIdString` -> `{lastAttempt: number, delay: number}`.
- [ ] Skip recently offline nodes in `findNode` and `getClosestGlobalNodes`. We may want the ability to ignore these offline nodes or check every time, so we can add a flag to toggle skipping these nodes (see the sketch below).
- [ ] Update the backoff data in `getClosestGlobalNodes` when pinging nodes.
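A hypothetical sketch of the skip flag from the second task, filtering out nodes that are still inside their backoff window (none of these names or signatures come from the actual Polykey codebase):

```ts
// Hypothetical sketch: skip nodes that are still inside their backoff
// window during lookups, with a flag to disable the skipping.
type BackoffInfo = { lastAttempt: number; delay: number };

const backoff: Map<string, BackoffInfo> = new Map();

function isInBackoff(nodeId: string, now: number = Date.now()): boolean {
  const info = backoff.get(nodeId);
  return info != null && now - info.lastAttempt < info.delay;
}

// The real findNode / getClosestGlobalNodes would perform a Kademlia-style
// lookup; this only shows where the toggle flag would plug in.
function filterCandidates(
  candidates: Array<string>,
  ignoreRecentlyOffline: boolean = true,
): Array<string> {
  if (!ignoreRecentlyOffline) return candidates;
  return candidates.filter((nodeId) => !isInBackoff(nodeId));
}
```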