Starting Connection Forward infinite loop #413

Closed
5 tasks done
emmacasolin opened this issue Jul 13, 2022 · 10 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin
Contributor

emmacasolin commented Jul 13, 2022

Describe the bug

This bug has been observed on a deployed agent on AWS attempting to connect to a local agent that has been stopped. The deployed agent continuously logs the message INFO:ConnectionForward [host]:[port]:Starting Connection Forward (where the host and port are those of the local agent that was shut down) until the deployed agent eventually stops. There appears to be some sort of infinite loop here.

...
  | 2022-07-13T16:29:58.863+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:30:19.864+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:30:40.865+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:01.880+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:22.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:31:43.874+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:04.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:25.855+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
  | 2022-07-13T16:32:46.858+10:00 | INFO:ConnectionForward 120.18.194.227:3806:Starting Connection Forward
...

To Reproduce

  1. Connect to a deployed agent (npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json)
  2. At some point after a connection is established, kill the local agent

Expected behavior

We do not expect to see any sort of infinite looping or reconnecting. If an agent goes offline, we shouldn't continue trying to reconnect to it.

Tasks

  • 1. Add an exponential backoff delay to contacting offline nodes. With implementation and testing, this could take 0.5 - 1 day.
    • Use an in-memory Map for NodeIdString -> {lastAttempt: number, delay: number}.
    • If a node is within its backoff period then we don't attempt to connect to it. This will have to be applied to findNode and getClosestGlobalNodes. We may want the ability to ignore these offline nodes or check every time, so we can add a flag to toggle skipping these nodes (see the sketch after this list).
    • This should be updated in getClosestGlobalNodes when pinging nodes.
    • Create a test for this feature.
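
A minimal sketch of what this backoff tracking could look like, assuming TypeScript. The names (`NodeBackoff`, `shouldSkipNode`, `recordFailedConnection`, `ignoreBackoff`) and the 5-minute default are illustrative assumptions, not the actual Polykey implementation.

```ts
// Hypothetical sketch of the in-memory backoff map described in the tasks above.
// Names are illustrative, not the actual Polykey API.
type NodeIdString = string;

interface NodeBackoff {
  lastAttempt: number; // timestamp (ms) of the last failed connection attempt
  delay: number;       // current backoff period (ms)
}

const defaultBackoffDelay = 300_000; // assumed 5-minute initial backoff

const nodeBackoffs: Map<NodeIdString, NodeBackoff> = new Map();

/**
 * Returns true if the node is still within its backoff period.
 * The `ignoreBackoff` flag corresponds to the toggle mentioned in the tasks,
 * letting findNode/getClosestGlobalNodes check offline nodes anyway.
 */
function shouldSkipNode(nodeId: NodeIdString, ignoreBackoff = false): boolean {
  if (ignoreBackoff) return false;
  const backoff = nodeBackoffs.get(nodeId);
  if (backoff == null) return false;
  return Date.now() - backoff.lastAttempt < backoff.delay;
}

/** Record a failed connection attempt, doubling the backoff each time. */
function recordFailedConnection(nodeId: NodeIdString): void {
  const backoff = nodeBackoffs.get(nodeId);
  if (backoff == null) {
    nodeBackoffs.set(nodeId, { lastAttempt: Date.now(), delay: defaultBackoffDelay });
  } else {
    backoff.lastAttempt = Date.now();
    backoff.delay *= 2;
  }
}

/** Clear the backoff once a connection succeeds. */
function recordSuccessfulConnection(nodeId: NodeIdString): void {
  nodeBackoffs.delete(nodeId);
}
```
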
@tegefaulkes
Contributor

Is it infinite, or is it just syncNodeGraph and the refreshBucketQueue doing their thing as part of the startup procedure?

@emmacasolin
Contributor Author

It's infinite as far as I can tell. When I saw it, it just kept going until the agent eventually shut down (it was happening several times every few minutes for almost an hour).

@emmacasolin
Contributor Author

emmacasolin commented Jul 13, 2022

Yeah, this absolutely can't be right: this agent started up over 15 hours ago and it's still trying to connect to a deployed agent (that's now dead) and my own agent (also dead).

  | 2022-07-14T09:49:36.816+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:49:57.725+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:50:18.618+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:50:39.527+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:51:00.419+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:51:21.326+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward
  | 2022-07-14T09:51:42.220+10:00 | INFO:ConnectionForward 129.94.8.25:53225:Starting Connection Forward
  | 2022-07-14T09:52:03.127+10:00 | INFO:ConnectionForward 54.253.9.3:1314:Starting Connection Forward

Looks like a message every 20 seconds, and it contacts these two addresses in turn.

@emmacasolin
Contributor Author

As suggested in #415 (comment), these logs might be from the deployed agent trying to reconnect to an agent it previously had an open connection with after that agent goes offline. If this is the case then it shouldn't be doing this: if an agent goes offline, we should stop trying to connect back to it. This is especially the case if the agent is behind a NAT, because even if the agent came back online we wouldn't be able to reach it. We need to wait for that agent to connect back to us instead.

@CMCDragonkai
Member

It is intentional that there's no GC of an existing node in the nodegraph until it is pushed out by new nodes.

This means a node can go offline, and other nodes may attempt reconnection later.

However, the number of reconnection attempts looks incorrect. It's doing it every 20 seconds, which is the time given to start a connection.

We should only try to restart a connection to the node upon refreshing the bucket.

The problem is similar to #415, except this time from the perspective of the seed node.

So there may be 256 buckets all trying to refresh; they all end up contacting the one other node they know about, and they should all coalesce during connection startup.

What we want to change here is that, now that the task manager is concurrent, they should try to start a connection at roughly the same time, assuming no jitter. If there is jitter, there will be staggered attempts.

Either way, we should not see reconnection attempts every 20 seconds.

But otherwise, reconnecting to an existing node in the nodegraph is intentional...

We can also add an exponential delay to reconnecting to nodes that no longer respond. An exponential delay would have to be done by the NodeConnectionManager.findNode function. This would require one to keep track of when the last attempt was done.

So if the last attempt is within a certain time range, we do nothing. If it exceeds it, we do contact that node. This time range increases exponentially: you can do this just by multiplying the time range by a factor of 2.
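
A rough sketch of how that check could sit inside findNode-style logic, reusing the hypothetical `shouldSkipNode`/`recordFailedConnection` helpers from the earlier sketch; the `pingNode` parameter and surrounding structure are assumptions, not the real NodeConnectionManager code.

```ts
// Hypothetical placement of the backoff check during a node contact attempt.
// shouldSkipNode/recordFailedConnection/recordSuccessfulConnection are the
// illustrative helpers sketched earlier; pingNode stands in for whatever
// actually attempts contact.
async function tryContactNode(
  nodeId: NodeIdString,
  pingNode: (nodeId: NodeIdString) => Promise<boolean>,
): Promise<boolean> {
  // Still within the backoff window: do nothing for this node.
  if (shouldSkipNode(nodeId)) return false;
  const alive = await pingNode(nodeId);
  if (alive) {
    recordSuccessfulConnection(nodeId); // reset the backoff
    return true;
  }
  // Failed again: the backoff window doubles (e.g. 5 min -> 10 min -> 20 min).
  recordFailedConnection(nodeId);
  return false;
}
```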

@CMCDragonkai
Member

We could apply this "delay" to both here and #415.

Here would use exponential delay.

But in #415, it'd be a fixed delay.

This is because if something is not responding, it's likely to keep not responding.

Whereas in the other case, it's just simply unlikely to return different data (but only if it is within the threshold of a small network).

@tegefaulkes
Contributor

Updated the description with tasks. The time estimate for this one is a little fuzzy; I'm thinking 0.5-1 day.

@tegefaulkes
Contributor

I decided to make this an in-memory map to track the backoff period. It seems unnecessary to make this persistent between node restarts, so I don't see the need to put it in the database.

As for locking, I think it would be needed; however, updating the backoff data only happens inside of NodeConnectionManager.acquireConnection, and that's covered by the object map locking, so there's no issue here.

@CMCDragonkai
Member

Updating in-memory data shouldn't experience race conditions though, due to the single-threaded main thread.

@CMCDragonkai
Member

Don't forget to tick off the tasks above too.

tegefaulkes added a commit that referenced this issue Sep 16, 2022
…ip recently offline nodes

This is done with an in-memory map of `NodeIdString` to some data tracking the backoff period. It defaults to 5 min and doubles each failure.

#413
tegefaulkes added a commit that referenced this issue Sep 21, 2022
tegefaulkes added a commit that referenced this issue Sep 21, 2022
@CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 10, 2023