Starting Connection Forward infinite loop #413

Closed
5 tasks done
emmacasolin opened this issue Jul 13, 2022 · 10 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin
Contributor

emmacasolin commented Jul 13, 2022

Describe the bug

This bug has been observed on a deployed agent on AWS attempting to connect to a local agent that has been stopped. The deployed agent continuously logs the message INFO:ConnectionForward [host]:[port]:Starting Connection Forward (where the host and port are those of the local agent that was shut down) until the deployed agent eventually stops. There appears to be some sort of infinite loop here.

...
  | 2022-07-13T16:29:58.863+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:30:19.864+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:30:40.865+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:01.880+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:22.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:43.874+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:04.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:25.855+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:46.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
...

To Reproduce

  1. Connect to a deployed agent (npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json)
  2. At some point after a connection is established, kill the local agent

Expected behavior

We do not expect to see any sort of infinite looping or reconnecting. If an agent goes offline, we shouldn't continue trying to reconnect to it.

Tasks

  • 1. Add an exponential backoff delay to contacting offline nodes. With implementation and testing, this could take 0.5 - 1 day.
    • Use an in-memory Map for NodeIdString -> {lastAttempt: number, delay: number}.
    • If a node is within its backoff period then we don't attempt to connect to it. This will have to be applied to findNode and getClosestGlobalNodes. We may want the ability to ignore these offline nodes or check every time, so we can add a flag to toggle skipping these nodes (see the sketch after this list).
    • This should be updated in getClosestGlobalNodes when pinging nodes.
    • Create a test for this feature.
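
A minimal sketch of what this backoff tracking could look like, assuming TypeScript. The names (`NodeBackoff`, `shouldSkipNode`, `recordFailedConnection`, `ignoreBackoff`) and the 5-minute default are illustrative assumptions, not the actual Polykey implementation.

```ts
// Hypothetical sketch of the in-memory backoff map described in the tasks above.
// Names are illustrative, not the actual Polykey API.
type NodeIdString = string;

interface NodeBackoff {
  lastAttempt: number; // timestamp (ms) of the last failed connection attempt
  delay: number;       // current backoff period (ms)
}

const defaultBackoffDelay = 300_000; // assumed 5-minute initial backoff

const nodeBackoffs: Map<NodeIdString, NodeBackoff> = new Map();

/**
 * Returns true if the node is still within its backoff period.
 * The `ignoreBackoff` flag corresponds to the toggle mentioned in the tasks,
 * letting findNode/getClosestGlobalNodes check offline nodes anyway.
 */
function shouldSkipNode(nodeId: NodeIdString, ignoreBackoff = false): boolean {
  if (ignoreBackoff) return false;
  const backoff = nodeBackoffs.get(nodeId);
  if (backoff == null) return false;
  return Date.now() - backoff.lastAttempt < backoff.delay;
}

/** Record a failed connection attempt, doubling the backoff each time. */
function recordFailedConnection(nodeId: NodeIdString): void {
  const backoff = nodeBackoffs.get(nodeId);
  if (backoff == null) {
    nodeBackoffs.set(nodeId, { lastAttempt: Date.now(), delay: defaultBackoffDelay });
  } else {
    backoff.lastAttempt = Date.now();
    backoff.delay *= 2;
  }
}

/** Clear the backoff once a connection succeeds. */
function recordSuccessfulConnection(nodeId: NodeIdString): void {
  nodeBackoffs.delete(nodeId);
}
```
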
@tegefaulkes
Contributor

Is it infinite, or is it just syncNodeGraph and the refreshBucketQueue doing their thing as part of the startup procedure?

@emmacasolin
Contributor Author

It's infinite as far as I can tell. When I saw it, it just kept going until the agent eventually shut down (it was happening several times every few minutes for almost an hour).

@emmacasolin
Contributor Author

emmacasolin commented Jul 13, 2022

Yeah, this absolutely can't be right: this agent started up over 15 hours ago and it's still trying to connect to a deployed agent (that's now dead) and my own agent (also dead).

  | 2022-07-14T09:49:36.816+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:49:57.725+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:50:18.618+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:50:39.527+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:51:00.419+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:51:21.326+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:51:42.220+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:52:03.127+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward

Looks like a message every 20 seconds, and it contacts these two addresses in turn.

@emmacasolin
Contributor Author

As suggested in #415 (comment), these logs might be from the deployed agent trying to reconnect to an agent it previously had an open connection with after that agent goes offline. If this is the case then it shouldn't be doing this: if an agent goes offline, we should stop trying to connect back to it. This is especially the case if the agent is behind a NAT, because even if the agent came back online we wouldn't be able to reach it. We need to wait for that agent to connect back to us instead.

@CMCDragonkai
Member

It is intentional that there's no GC of an existing node in the nodegraph until it is pushed out by new nodes.

This means a node can go offline, and other nodes may attempt reconnection later.

However, the number of reconnection attempts looks incorrect. It's doing it every 20 seconds, which is the time given to start a connection.

We should only try to restart a connection to the node upon refreshing the bucket.

The problem is similar to #415, except this time from the perspective of the seed node.

So there may be 256 buckets all trying to refresh; they all end up contacting the one other node they know about, and they should all coalesce during connection startup.

What we want to change here is that, now that the task manager is concurrent, they should try to start a connection at roughly the same time, assuming no jitter. If there is jitter, there will be staggered attempts.

Either way, we should not see reconnection attempts every 20 seconds.

But otherwise, reconnecting to an existing node in the nodegraph is intentional...

We can also add an exponential delay to reconnecting to nodes that no longer respond. An exponential delay would have to be done by the NodeConnectionManager.findNode function. This would require one to keep track of when the last attempt was done.

So if the last attempt is within a certain time range, we do nothing. If it exceeds it, we do contact that node. This time range increases exponentially: you can do this just by multiplying the time range by a factor of 2.
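
A rough sketch of how that check could sit inside findNode-style logic, reusing the hypothetical `shouldSkipNode`/`recordFailedConnection` helpers from the earlier sketch; the `pingNode` parameter and surrounding structure are assumptions, not the real NodeConnectionManager code.

```ts
// Hypothetical placement of the backoff check during a node contact attempt.
// shouldSkipNode/recordFailedConnection/recordSuccessfulConnection are the
// illustrative helpers sketched earlier; pingNode stands in for whatever
// actually attempts contact.
async function tryContactNode(
  nodeId: NodeIdString,
  pingNode: (nodeId: NodeIdString) => Promise<boolean>,
): Promise<boolean> {
  // Still within the backoff window: do nothing for this node.
  if (shouldSkipNode(nodeId)) return false;
  const alive = await pingNode(nodeId);
  if (alive) {
    recordSuccessfulConnection(nodeId); // reset the backoff
    return true;
  }
  // Failed again: the backoff window doubles (e.g. 5 min -> 10 min -> 20 min).
  recordFailedConnection(nodeId);
  return false;
}
```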

@CMCDragonkai
Member

We could apply this "delay" to both here and #415.

Here would use exponential delay.

But in #415, it'd be a fixed delay.

This is because if something is not responding, it's likely to keep not responding.

Whereas in the other case, it's just simply unlikely to return different data (but only if it is within the threshold of a small network).

@tegefaulkes
Contributor

Updated the description with tasks. The time estimate for this one is a little fuzzy; I'm thinking 0.5-1 day.

@tegefaulkes
Contributor

I decided to make this an in-memory map to track the backoff period. It seems unnecessary to make this persistent between node restarts, so I don't see the need to put it in the database.

As for locking, I think it would be needed; however, updating the backoff data only happens inside of NodeConnectionManager.acquireConnection, and that's covered by the object map locking, so there's no issue here.

@CMCDragonkai
Member

Updating in-memory data shouldn't experience race conditions though, due to the single-threaded main thread.

@CMCDragonkai
Member

Don't forget to tick off the tasks above too.

tegefaulkes added a commit that referenced this issue Sep 16, 2022
…ip recently offline nodes

This is done with an in-memory map of `NodeIdString` to some data tracking the backoff period. It defaults to 5 min and doubles each failure.

#413
tegefaulkes added a commit that referenced this issue Sep 21, 2022
tegefaulkes added a commit that referenced this issue Sep 21, 2022
@CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 10, 2023