Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decentralized Signalling #618

Merged
merged 19 commits into from
Dec 4, 2023
Merged

Conversation

CMCDragonkai
Copy link
Member

@CMCDragonkai CMCDragonkai commented Nov 7, 2023

Description

This introduces decentralized signalling which allows the PK nodes to:

  1. Use SRV records to resolve the seed nodes - this supports Testnet & Mainnet Status Page testnet.polykey.com mainnet.polykey.com #599, but is also necessary to allow dynamic seed nodes
  2. Connect to only a subset of the seed nodes, rather than always connecting to all seed nodes - thus allowing us to scale the seed nodes as the network gets larger
  3. Find the nearest signalling node to use to try and help connect to a target node, this process is iterative, starting from the final target node, and walking backwards through a chain of nodes that are connected to the final target node
  4. It will have to remove the hardcoded seed node list in order to use SRV records, however this PR won't try to solve the CSR feature for Keys domain to allow KN root certificates to be signed by an external CA #154 problem with having a root certificate of the nodes, so that will be done later to re-introduce trusted seed nodes
  5. It should mean that the load on the seed nodes decreases as it can rely on other nodes to participating in the signalling coordination - this should be optional, as some nodes may not want to do so
  6. This will also be maintaining random connections in order to maintain connectivity of the network graph assuming all nodes were to be behind the NAT. This randomness may be directed with some policy.

Issues Fixed

Tasks

  • 1. NodeGraph.getOldestNode is replaced with NodeGraph.getBucket with additional optionallimit parameter - update all tests
  • 2. NodeGraph.setNode now respects the NodeGraph.nodeBucketLimit and will throw ErrorNodeGraphBucketLimit if it goes over the limit when a new node is being set.
  • 3. The NodeGraph.nodeBucketLimit is no longer hardcoded, it can be set during creation/construction.
  • 4. Removed debug logs from NodeGraph as these shouldn't be needed after behaviour is verified.
  • 5. Added lastUpdated parameter to NodeGraph.setNode, allowing one to pass down last updated time from a higher context, and the ability to get rid of sleeps by manually setting the time. It defaults to utils.getUnixtime().
  • 6. Removed utils.sleep usage in NodeGraph.test.ts and now all tests execute much faster.
  • 7. Removed the global dataDir from jest.config.js which was being used by NodeManager.test.ts but this is vestigial code, since we are supposed to always be creating dataDir for each test module.
  • 8. NodeManager task handler IDs do not require hardcoded names, the function names do not disappear. Also when using esbuild make sure to use the keepNames option as true - https://esbuild.github.io/api/#keep-names (we have done this already in PK CLI)
  • 9. Added readiness test to NodeManager.test.ts and ensured that you can NodeManager.start and NodeManager.stop immediately afterwards, so this means stop always cancels all node manager tasks.
  • 10. The NodeGraph will now support multiple addresses - this provides new API to specifically deal with "contacts"

remaining tasks.

  • 1. finish fixing up NodeManager
  • 2. Update findNode to take a pingTimeoutTime for setting smaller timeouts for each connection attempt
  • 3. Implement sticky connection logic. This is a combo of deciding which nodes to keep long-running connections to. And to attempt re-connecting when a long-running connection drops. I need to work out some details for this. Potentially scale timeouts by the closeness metric.
  • 4. re-implement the MDNS functionality.
  • 5. Implement periodic updating the connectedTime in the NodeGraph for active nodes. - differed till later, will create an issue
  • 6. Review remaining tests. Cull irrelevant ones and convert ones we want to keep.
  • 7. review usage of the nodes domain in other parts of the code and fix their usage.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@CMCDragonkai CMCDragonkai self-assigned this Nov 7, 2023
@ghost
Copy link

ghost commented Nov 7, 2023

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 7, 2023

I'm thinking that whatever function does the connection, let's call it connectTo(nodeId, ip, port) needs attempt a direct connection, and in the background run a signalling operation.

If the DC succeeds, it needs to cancel the signalling operation.

The background/concurrent signalling operation, needs to try and find an appropriate signaller. If there is an appropriate signaller, we just start using them straight away.

If there isn't an appropriate signaller, and the direct connection didn't work, we now can try to find another node that is close to the target node and try the same thing with that target node.

This how we walk backwards (backtrack) from the target node towards a close enough node to connect to first, and the forward track from there towards the target node. You'd backtrack to a node that is close to the target, but also closer to your own node.

However to even call connectTo we would need to already have the IP and port. In the scenario where we don't have the IP and port at all, and we have to use kademlia's find node operation to get close nodes that will eventually give us information about the target node. In this case, we're always forward tracking towards the target node, repeatedly doing connectTo(nodeId, ip, port) with some node close to the target node until we can try with the target node. In this case, we may have built up information about which signalling node is appropriate to use.


  • Related to maze traversal and path finding here

@CMCDragonkai
Copy link
Member Author

It seems NodeGraph.getOldestNode can be replaced with NodeGraph.getBucket by adding one more parameter limit: number = Infinity.

You just have to use sort = 'lastUpdated';

The limit can be injected into the 2 iterators over the NodeData. The good thing is that this call would get you an entire NodeBucket to work from.

@CMCDragonkai
Copy link
Member Author

I've replaced getOldestNodeId with getBucket instead with a limit parameter. Will need to update tests later.

@CMCDragonkai
Copy link
Member Author

In reviewing NodeGraph, I think the most important function to get a handle on is the NodeGraph.getClosestNodes. That ultimately is the basis of returning close nodes.

If we add extra data to the NodeGraph that applies to the NodeData type.

/**
 * This is the record value stored in the NodeGraph.
 */
type NodeData = {
  /**
   * The address of the node.
   */
  address: NodeAddress;
  /**
   * Unix timestamp of when it was last updated.
   */
  lastUpdated: number;
};

There are some suggestions that can help here:

  1. Making use of the scopes in reference to MDNS integration and NodeGraph structure expansion #537 (comment) - this could be used to imply some special nature of the records, perhaps that they are a public node? The reality it's not a proof of a public node, it is just simply that you believe it is a public IP address.
  2. Adding provenance information - so we know who provided us this information. This would make the lastUpdated a bit more useful.
  3. Maybe whether an address should be stable and not idled out? Actually this seems like something that the NodeManager would instead maintain.
  4. Anything else?

@CMCDragonkai CMCDragonkai force-pushed the feature-decentralized-signalling branch from 34b15c5 to 81e9034 Compare November 9, 2023 16:36
@CMCDragonkai
Copy link
Member Author

A quick and dirty solution that @tegefaulkes suggested is for us to only maintain connections to our closest seed node. That way, we continue to be centralized, but it will allow us to scale the seed nodes for MatrixAI/Polykey-CLI#40

@CMCDragonkai
Copy link
Member Author

I noticed that NodeGraph.nodeBucketLimit was hardcoded to 20, this is now configurable with during creation and construction.

@CMCDragonkai
Copy link
Member Author

Now if NodeGraph does have nodeBucketLimit set. That should mean that NodeGraph.setNode should also be aware of this throw an exception if one is setting a node more than the bucket limit.

It doesn't make sense to have that setting be in NodeGraph but not respected by NodeGraph.setNode.

@CMCDragonkai
Copy link
Member Author

But nodeBucketLimit is only used in 2 places. NodeGraph.resetBuckets and NodeGraph.getClosestNodes.

For NodeGraph.getClosestNodes it could just have no limit at all, or a limit of 1 to default it. So then we can leave it out.

In NodeGraph.resetBuckets, it uses it to compare against countNew, and if it is greater than the countNew, it will increment the count.

It is needed here because during remapping, there could more nodes being added to a bucket greater than the required bucket limit.

In that case, NodeGraph.setNode should be throwing an exception if the count exceeds the bucket limit thus the constraint is maintained here.

@CMCDragonkai
Copy link
Member Author

There's even an exception for this already ErrorNodeGraphOversizedBucket but it is never been used! @tegefaulkes there's nothing ensuring that the buckets size has to be kept to 20.

@CMCDragonkai
Copy link
Member Author

I'm changing it to ErrorNodeGraphBucketLimit and it will now be thrown on NodeGraph.setNode.

@CMCDragonkai
Copy link
Member Author

In terms of bucket refresh, I think:

image

That is buckets are refreshed if no lookups have occurred.

This condition cannot be dependent on the lastUpdated value for each record in the bucket.

This is because lastUpdated just means the time the record was last updated (set or updated).

This doesn't tell us anything about whether a bucket has had a lookup occur in the last hour.

I haven't got around to checking NodeManager yet, but comments @tegefaulkes?

@CMCDragonkai
Copy link
Member Author

In terms of bucket refresh, I think:

image

That is buckets are refreshed if no lookups have occurred.

This condition cannot be dependent on the lastUpdated value for each record in the bucket.

This is because lastUpdated just means the time the record was last updated (set or updated).

This doesn't tell us anything about whether a bucket has had a lookup occur in the last hour.

I haven't got around to checking NodeManager yet, but comments @tegefaulkes?

A proper condition on this would have to be maintained by the NodeManager, not NodeGraph.

We could even make it a completely in-memory data, rather than storing it into the DB. If we persist this information, this can go into the NodeGraph.nodeGraphMetaDbPath.

However if we do this, then NodeManager needs to be resetting up the background tasks with this persisted data.

Thoughts? @tegefaulkes

@CMCDragonkai
Copy link
Member Author

Looking at NodeGraph.getClosestNodes @tegefaulkes your comment here is hard to understand. Can you clarify what you are talking about here?

    // Buckets map to the target node in the following way;
    // 1. 0, 1, ..., T-1 -> T
    // 2. T -> 0, 1, ..., T-1
    // 3. T+1, T+2, ..., 255 are unchanged
    // We need to obtain nodes in the following bucket order
    // 1. T
    // 2. iterate over 0 ---> T-1
    // 3. iterate over T+1 ---> K

@CMCDragonkai
Copy link
Member Author

In NodeGraph.getClosestNodes why do you add the last bucket to the results array. Why do you do this and then do a slicing/truncation to the limit at the end? @tegefaulkes

@CMCDragonkai
Copy link
Member Author

Introduction of ErrorNodeGraphBucketLimit broke a few tests:

    ✕ get bucket with multiple nodes (2268 ms)
    ✕ get all buckets (1910 ms)
    ✕ reset buckets (with seed=2137511502) (120 ms)
    ✕ reset buckets to an existing node should remove node (with seed=2137511502) (154 ms)
    ✕ reset buckets is persistent (70 ms)

It's possible they were not respecting the bucket limit.

@CMCDragonkai
Copy link
Member Author

Ok I can see a sleep being used in get bucket with multiple nodes, and the reason is simple, the lastUpdated is set by NodeGraph. We actually need to parameterise this, so that it can be tested.

Furthermore, the exact time we update our NodeGraph.setNode can actually be dependent on the NodeManager, the main reason is if you need to compare timing from different parts of the program, if concurrent things are to have occurred at the same time, it might make more sense if they have matching timestamps.

So setNode will be expanded with an optional lastUpdated parameter defaulted to the getUnixtime().

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 9, 2023

Replacing sleep with:

    const lastUpdatedNow = utils.getUnixtime();
    const lastUpdatedTimes = Array.from({ length: nodeBucketLimit }, (_, i) => {
      return lastUpdatedNow - i * 100;
    });

The get bucket with multiple nodes now runs in 172 ms as opposed to originally 2279ms.

@CMCDragonkai
Copy link
Member Author

@amydevs the introduction of scopes wasn't done properly, as the tests in NodeGraph.test.ts was asserting that they were NodeAddress types when scopes now must be at the very least an empty array.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Nov 9, 2023

In terms of bucket refresh, I think:

image

That is buckets are refreshed if no lookups have occurred.

This condition cannot be dependent on the lastUpdated value for each record in the bucket.

This is because lastUpdated just means the time the record was last updated (set or updated).

This doesn't tell us anything about whether a bucket has had a lookup occur in the last hour.

I haven't got around to checking NodeManager yet, but comments @tegefaulkes?

I think based on this, even unsuccessful node lookups is a valid bucket operation, in which case the bucket refresh delay should be refreshed.

Refresh buckets is already occurring in memory.

@CMCDragonkai CMCDragonkai force-pushed the feature-decentralized-signalling branch from 8a4b405 to 62c578d Compare November 10, 2023 16:09
@CMCDragonkai
Copy link
Member Author

In the NodeGraph.test.ts we can no longer use as NodeAddress, as that is breaking the real type, which now includes scopes. Beware of this @amydevs @tegefaulkes, I'm making the change here.

@CMCDragonkai
Copy link
Member Author

Currently the NodeGraph still isn't capable of storing multiple addresses for the same node. I thought we were meant to have this as per #537? @amydevs

@amydevs
Copy link
Contributor

amydevs commented Nov 10, 2023

Currently the NodeGraph still isn't capable of storing multiple addresses for the same node. I thought we were meant to have this as per #537? @amydevs

Yes, this is meant to be done by the Step 2 of #537 but #584 only completed Step 1. This was done for the sake of time, as it was deemed that expanding the NodeGraph required more consideration and many changes.

@amydevs
Copy link
Contributor

amydevs commented Nov 10, 2023

In the NodeGraph.test.ts we can no longer use as NodeAddress, as that is breaking the real type, which now includes scopes. Beware of this @amydevs @tegefaulkes, I'm making the change here.

@CMCDragonkai where we're using as NodeAddress, i think we should be using satisfies NodeAddress instead.

@CMCDragonkai
Copy link
Member Author

image

@tegefaulkes
Copy link
Contributor

I squashed as much as I could. I left these as wip commits and just made sure the included a description as well.

test-bottom.ts Outdated Show resolved Hide resolved
@CMCDragonkai
Copy link
Member Author

The second WIP commit could be dropped. It adds scaffolding.

@CMCDragonkai
Copy link
Member Author

Generally my NodeGraph changes should be put together.

Make sure no spurious comments are left there, I had a bunch of stuff relating to nodes/types.ts.

CMCDragonkai and others added 17 commits December 4, 2023 19:11
small name and commentary updates to `NodeGraph`

[ci skip]
This includes creating a `connectionsQueue` utility class for coordinating rate limits and shared queues between the direct connection loop and the signalled connection loop. The two loops run concurrently while sharing found data between each other. When the connection is found, any pending connections are cancelled and awaited for clean up.

[ci skip]
…Node` loop

Previously it used the timeout provided for the whole `findNode` operations. This means that a failing connection would take up the whole timeout.

[ci skip]
closest nodes timeout after 2 hours, furthest after 1 min.

[ci skip]
[ci skip]

wip: fixing delay names

[ci skip]
Also added some basic multi-connection logic to `NodeConnectionManager`

[ci skip]
@tegefaulkes tegefaulkes force-pushed the feature-decentralized-signalling branch from 4b2e2fb to 6eb23ac Compare December 4, 2023 08:11
@tegefaulkes tegefaulkes force-pushed the feature-decentralized-signalling branch from 6eb23ac to 37ef800 Compare December 4, 2023 08:25
@tegefaulkes tegefaulkes merged commit 28f0001 into staging Dec 4, 2023
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices
Development

Successfully merging this pull request may close these issues.

Decentralised NAT Signalling
4 participants