Testnet Node Deployment (testnet.polykey.io) #194

Closed
4 tasks done
Zachaccino opened this issue Jul 4, 2021 · 49 comments
Labels
epic Big issue with multiple subissues ops Operations and administration procedure Action that must be executed production Affects a production deployment that involves customers r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices security Security risk

Comments

@Zachaccino

Zachaccino commented Jul 4, 2021

Specification

It is now time for our second attempt at testnet deployment.

image

We had previously already done a PK deployment on AWS using ECS back when PK was at 0.0.41. While the AWS deployment worked, we hit a lot of problems, which meant we had to go through an 8-month-long refactoring process over the entire codebase.

Now that the codebase is finally refactored, we're ready for the second attempt.

The AWS architecture is basically the same as before, but our configuration should be a lot simpler. There are some changes though.

  • Before, we had to deal with Node root certificates; now root certificates are no longer relevant to the testnet/mainnet deployment.
  • We are now separating into 2 clusters of PK seed nodes: mainnet.polykey.io and testnet.polykey.io. The mainnet is intended for production use. We will prototype our deployment on the testnet first, and the testnet will be where new versions of PK are tested before being released to production.
  • Both mainnet and testnet seed nodes will be trusted by default, but the PK releases should default to use the mainnet and have a switch to use the testnet.
  • We don't know yet whether we should be using an NLB; we may decide not to use one at all. But there shouldn't be any sort of session state that is required for P2P functionality.
  • NLBs cannot be used with PK clients that are debugging the testnet/mainnet nodes, because they would resolve to any possible node, and in this case there is in fact network session state. Instead PK client debugging has to be done with the container IPs.
  • We know that IPv6 isn't supported yet so we will have IPv4 and DNS support.
  • We should be using well known ports here of 1314 UDP and 1315 TCP for the ingress port and the client port respectively.
  • The PK nodes are not stateless, they do require node state. However this node state is not important to us to persist. So any EBS volume mounted into the ECS container should work. Basically we just need a mutable temporary directory. What kind of mutations are there? Well the kademlia node graph is persisted atm and is not in-memory.
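
A minimal sketch of the "default to mainnet, switch to testnet" idea from the list above. The type and function names (`Network`, `SeedNetworkConfig`, `selectSeedNetwork`) are hypothetical and not taken from the codebase; only the two domains and the port numbers come from this issue.

```typescript
// Hypothetical names throughout; only the domains and ports come from the spec above.
type Network = 'mainnet' | 'testnet';

interface SeedNetworkConfig {
  domain: string; // DNS name of the seed cluster
  ingressPort: number; // UDP port used for P2P ingress
  clientPort: number; // TCP port used by the client service
}

const seedNetworks: Record<Network, SeedNetworkConfig> = {
  mainnet: { domain: 'mainnet.polykey.io', ingressPort: 1314, clientPort: 1315 },
  testnet: { domain: 'testnet.polykey.io', ingressPort: 1314, clientPort: 1315 },
};

// Releases default to mainnet; passing 'testnet' is the switch.
function selectSeedNetwork(network: Network = 'mainnet'): SeedNetworkConfig {
  return seedNetworks[network];
}
```

How the switch is exposed (CLI option, env variable, or config) is left open here.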

Additional context

Tasks

  1. Upload image to ECR (Elastic Container Registry)
  2. Create ECS (Elastic Container Service) Task Definition for the new image uploaded to ECR
  3. Start the ECS service, just a cluster of 1, and test that it is working by using the PK CLI and directly contacting the ECS IP address and port for PK_PORT.
  4. Integrate firewall (security group), NLB and elastic IP to the NLB and then attach the testnet.polykey.io domain to the NLB
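
For task 2, a rough sketch of the parts of a task definition that matter here, written as a typed object. The field names follow the ECS task definition JSON shape, but the family, container name, volume name, mount path and image URI are placeholders, and this is only a subset of what a real TD needs.

```typescript
// Subset of the ECS task definition JSON shape, typed for illustration only.
interface PortMapping { containerPort: number; protocol: 'tcp' | 'udp' }
interface MountPoint { sourceVolume: string; containerPath: string; readOnly: boolean }
interface ContainerDefinition {
  name: string;
  image: string;
  portMappings: PortMapping[];
  mountPoints: MountPoint[];
}
interface TaskDefinition {
  family: string;
  containerDefinitions: ContainerDefinition[];
  volumes: { name: string }[];
}

const testnetTaskDefinition: TaskDefinition = {
  family: 'polykey-testnet', // placeholder name
  containerDefinitions: [
    {
      name: 'polykey', // placeholder name
      image: '<account>.dkr.ecr.<region>.amazonaws.com/polykey:latest', // placeholder ECR URI
      portMappings: [
        { containerPort: 1314, protocol: 'udp' }, // ingress port
        { containerPort: 1315, protocol: 'tcp' }, // client port
      ],
      // Mutable scratch directory for node state; the mount path is an assumption.
      mountPoints: [{ sourceVolume: 'node-state', containerPath: '/srv', readOnly: false }],
    },
  ],
  volumes: [{ name: 'node-state' }],
};
```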
@CMCDragonkai CMCDragonkai changed the title Bootstrap Deployment Bootstrap Node Deployment (bootstrap.polykey.io) Jul 4, 2021
@CMCDragonkai
Member

@joshuakarp we need to agree on the right wording/name for this concept in both code and the URL.

  • bootstrap - this confuses it with the bootstrap phase and the bootstrap command
  • seed - I like this word as this makes sense with respect to other P2P apps
  • broker - this word is too overloaded in the machine messaging ecosystem

If we use seed, we need to use that terminology everywhere so that way we don't have any confusion. That means seed.polykey.io for the domain name, plus in the nodes domain code as well as the certificate or seed node IDs and initial IP address.

We may not even need the seed certificates if we just hardcode the seed into kademlia with both the seed Node IDs and seed domains instead of IPs.

Oh and importantly we must be able to have kademlia refer to NodeAddresses through a domain name instead of an IP. A domain name can resolve to an IP or a list of IP addresses. That's going to be important for after MVP, since our domain seed.polykey.io may point to other IPs in the future or to a globally distributed set of IPs.
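
To make that concrete, here is a small sketch of a node address whose host may be either an IPv4 literal or a domain name. The `NodeAddress` shape and both helpers are hypothetical, not the existing types in the nodes domain, and the DNS lookup itself is left out: `resolved` stands in for whatever the A-record lookup returned.

```typescript
// Hypothetical shape; not the actual type used in the nodes domain.
type NodeAddress = { host: string; port: number };

// True when host is a dotted-quad IPv4 literal (IPv6 is out of scope for now).
function isIPv4(host: string): boolean {
  const parts = host.split('.');
  return (
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)
  );
}

// Expand one configured address into concrete IP addresses.
// `resolved` is the list of IPs a DNS lookup returned for a hostname.
function expandAddress(address: NodeAddress, resolved: string[]): NodeAddress[] {
  if (isIPv4(address.host)) return [address]; // already an IP, nothing to resolve
  return resolved
    .filter((ip) => isIPv4(ip))
    .map((ip) => ({ host: ip, port: address.port }));
}
```

For example, `expandAddress({ host: 'seed.polykey.io', port: 1314 }, ips)` yields one address per resolved IP, all sharing the configured port.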

@CMCDragonkai CMCDragonkai added this to the Release Polykey CLI 1.0.0 milestone Jul 4, 2021
@CMCDragonkai CMCDragonkai added ops Operations and administration production Affects a production deployment that involves customers security Security risk labels Jul 5, 2021
@joshuakarp
Contributor

Agreed - seed makes most sense to me too, and distinguishes it from the bootstrap phase.

Can you clarify this a bit more?

Oh and importantly we must be able to have kademlia refer to NodeAddresses through a domain name instead of an IP. A domain name can resolve to an IP or a list of IP addresses. That's going to be important for after MVP, since our domain seed.polykey.io may point to other IPs in the future or to a globally distributed set of IPs.

@CMCDragonkai
Member

To solve this issue, we're going to have to deploy a test-net for the PK network. We can use the nixpkgs development style. The master branch represents the current development PK, whereas a release branch can be used for staging into production.

The master branch can have CI/CD continuously deploy it to the test net. So whatever is the current master branch is also released to the testnet. This does mean master is sort of unstable.

Additional branches can be marked for releasing to production net. I wonder if this is best done with a separate release or prod branch, or just done via tagging. The CI/CD for this should be manual though, as a manual QA should be done for this.

@CMCDragonkai
Member

CMCDragonkai commented Sep 8, 2021

The testnet will require its own domain:

seed.polykey.io - seed nodes for production releases
seed-test.polykey.io or testnet.polykey.io - testnet for master

This should allow us to maintain an incremental working system, and invite others to join in testing on the testnet.

@CMCDragonkai CMCDragonkai changed the title Bootstrap Node Deployment (bootstrap.polykey.io) Bootstrap/Seed Node Deployment (seed.polykey.io) Sep 8, 2021
@CMCDragonkai
Member

CMCDragonkai commented Sep 8, 2021

The master branch should be auto deployed to a testnet. A separate branch will be made for production deployment.

This gives a few advantages:

  1. Most PRs and new devs will assume that feature requests get merged to master. Doesn't require much explanation.
  2. Shows development activity front and center, and people looking at GitHub, who are devs, will probably want to use the testnet.
  3. No accidental pushes to production when it is a special branch requiring special privileges to push to it. Testing must be extensive for master before enabling a merge to a prod branch.

@CMCDragonkai
Member

CMCDragonkai commented Sep 8, 2021

We can reuse the terminology from blockchains and use:

mainnet.polykey.io
testnet.polykey.io

I like this as this captures the intent and intuition of P2P and blockchain devs.

@CMCDragonkai
Member

CMCDragonkai commented Oct 20, 2021

There are probably code changes necessary when we hit problems:

  1. Environment variable setting to automate port/host settings, and also the root password and node state location during bootstrapping, should involve the config.js
  2. We need to make use of reserved domains in our seed node list. Right now it expects that the node ids are known. We may say that we don't know the node ids, but we will trust whatever node id we get from resolving the testnet.polykey.io domain.
    • there are 2 things that are dynamic here: node ids and IP addresses, both of which can be resolved by trusting a reserved domain name
    • I don't like this idea, so we will instead pre-generate root certificates and root public key pairs. Then hand out the node ids and embed these into the source code through a config and PK_SEED. We as a company need to backup and maintain these root keys and such, and that means during the bootstrap phase, we need to be able to generate "deterministic" root keys. This is already possible by the KeyManager, however I believe the bootstrap phase CLI procedure needs to be updated to support this.
      • this has an advantage in that we only have to backup a set of 24/12 word phrases which we will keep secret as the keys to the testnet!!!!
  3. The nodes domain will need to be able to resolve DNS domain names. A domain name may present a set of IPs, and each IP may belong to any possible NodeId. At this point in time, nodes expects a list of node ID to IP address mappings. If we want to allow IP addresses to be dynamic, we will need to change this to discover the mappings by querying each IP address as if it were a node, asking what its node ID is, and checking whether that node ID is in the list of trusted node IDs. This is a similar procedure to our LAN "multicast discovery" (Local Network Traversal - Multicast Discovery, js-mdns#1).
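
The discovery step in point 3 could look roughly like the sketch below. It only covers the filtering: each resolved IP has already been asked for its node ID, and we keep the ones whose ID is in the trusted list. `Claim` and `trustedMappings` are hypothetical names, and how the reported node ID is authenticated (for example against the node's certificate) is assumed to happen before this step.

```typescript
type NodeId = string;

// What a resolved IP reported when queried for its node ID.
// Assumption: the claim has already been authenticated by the connection layer.
type Claim = { ip: string; nodeId: NodeId };

// Build nodeId -> IPs mappings, keeping only node IDs from the trusted set.
function trustedMappings(
  claims: Claim[],
  trusted: ReadonlySet<NodeId>,
): Map<NodeId, string[]> {
  const mappings = new Map<NodeId, string[]>();
  for (const { ip, nodeId } of claims) {
    if (!trusted.has(nodeId)) continue; // unknown identity: ignore this IP
    const ips = mappings.get(nodeId) ?? [];
    ips.push(ip);
    mappings.set(nodeId, ips);
  }
  return mappings;
}
```

The trusted set here corresponds to the pre-generated seed node IDs embedded in the source, so only the IP side stays dynamic.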

@tegefaulkes

@CMCDragonkai CMCDragonkai added the procedure Action that must be executed label Oct 20, 2021
@CMCDragonkai CMCDragonkai changed the title Bootstrap/Seed Node Deployment (seed.polykey.io) Testnet Node Deployment (testnet.polykey.io) Oct 20, 2021
@CMCDragonkai
Member

Scheduled for after merge of https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/184 to master. Then we start from master branch of this repo!

@joshuakarp please notify here when ready!

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2021

image

@joshuakarp please start on a component diagram here as well.

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2021

Meeting with @tegefaulkes yesterday indicated that pk agent start would run the agent process directly in the foreground.

This would be what we would want to do inside our container image as well when launching via the task definition.

To make this deterministic and automatic (un-attended) the pk agent start command would either require the agent to already be bootstrapped, or it would need to be able to perform bootstrapping automatically and then start the agent.

Right now pk bootstrap is a specific command that bootstraps the PK node state. If pk agent start is called before bootstrapping is done, what is the expected behaviour? If we want to necessarily separate the 2 commands, we would have to program our TD to instead run the shell command:

pk bootstrap && pk agent start

Where the node state is already bootstrapped, the bootstrap step would have to be a no-op. However these commands are also used by end users, so we have to balance UX concerns for users of the CLI itself as well.
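
One way to picture the un-attended startup: a small wrapper decides which commands to run based on whether node state already exists. This is only a sketch of the decision, with a hypothetical `startupCommands` helper; the alternative is for pk bootstrap itself to be safe to re-run, in which case the plain `&&` chain is enough.

```typescript
// Decide which CLI steps the container should run.
// `stateExists` is an assumption: some check that the node state directory is already bootstrapped.
function startupCommands(stateExists: boolean): string[] {
  return stateExists
    ? ['pk agent start'] // state already bootstrapped: skip bootstrap
    : ['pk bootstrap', 'pk agent start'];
}

// Join into a single shell command line for the task definition.
function startupCommandLine(stateExists: boolean): string {
  return startupCommands(stateExists).join(' && ');
}
```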

I prefer the separation of responsibilities here. The pk bootstrap should be reserved for bootstrapping agent/node state. The location of this state can be anywhere: by default for human users, it's in the home directory, but in our TD, this can be provided via a /srv directory mounted with a persistent volume either by AWS elastic file system or EBS volume. The specific state here doesn't actually matter, our agents should be able to work with just temporary state. That is, they are not stateless services since they need to mutate the disk for storing data in kademlia, but the state is not something we need to keep persisted. @joshuakarp is it actually necessary for us to keep our node state in the DB? I'm wondering about the design of the kademlia and whether that can just be in-memory. I forgot if we had a discussion about this.

Whichever style we go for, we're going to need some environment variables.

In #269 we already talked about the addition of PK_SEED_NODES.

For this, we're going to need env variables for:

  • root recovery code, related to "Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD" (#202)
  • port bindings for grpc server, forward proxy, reverse proxy (see: https://github.com/MatrixAI/js-polykey/wiki/network#network-architecture)
    • we selected a default "preferred" port as 1314 for agent seed connections, but by default the ports are randomly selected by the OS, so 1314 is only for our usage in the seed nodes
    • old details here https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/issues/237#note_530324589
    • PK_INGRESS_PORT - ingress port is the reverse proxy port used for P2P
    • PK_INGRESS_HOST - ingress host is the reverse proxy host used for P2P
    • for NAT-busting, egress host and port may be relevant but I'm not sure; we'll know when we start "NAT-Traversal Testing with testnet.polykey.io" (#159)
    • the testnet.polykey.io would resolve to a PK_INGRESS_HOST as this is what is needed, however the expectation would be that the same IP would be used for both the ingress host and the client server host
    • is the grpc client server on the same port as the agent server? Or can they be separate? Either way accessing the client does not need to go through the proxies, therefore a specific port can be used here too:
    • PK_SERVER_PORT and PK_SERVER_HOST if there's a shared host/port being used by the client service
    • if 1314 is used for PK_INGRESS_PORT, then the PK CLI cannot be used to access it, as the CLI expects direct contact with the service; it would instead require PK_SERVER_PORT, therefore an additional port 1315 could be used
  • All env variable parsing should be centralised into src/config.ts. That should be the only place where process.env is read.
    • beware the difference between the PK CLI environment and the PK agent environment: env variables set before the pk agent start foreground process take effect, but env variables set for other commands make no difference to the agent process
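
As an illustration of centralising env parsing, here is a sketch of what such a function might look like. The env variable names are the ones discussed above; everything else (`parseEnv`, `AgentEnvConfig`, the `0.0.0.0` host fallback, and port `0` meaning "let the OS pick") is an assumption for the example and not the actual src/config.ts.

```typescript
// Env is passed in explicitly; in the real agent this would be process.env.
type Env = Record<string, string | undefined>;

interface AgentEnvConfig {
  password?: string;
  recoveryCode?: string;
  ingressHost: string;
  ingressPort: number;
  serverHost: string;
  serverPort: number;
}

// Parse a port, falling back when unset; rejects non-integers and out-of-range values.
function parsePort(name: string, value: string | undefined, fallback: number): number {
  if (value === undefined || value === '') return fallback;
  const port = Number(value);
  if (!Number.isInteger(port) || port < 0 || port > 65535) {
    throw new RangeError(`${name} must be an integer in 0..65535, got "${value}"`);
  }
  return port;
}

function parseEnv(env: Env): AgentEnvConfig {
  return {
    password: env['PK_PASSWORD'],
    recoveryCode: env['PK_RECOVERY_CODE'],
    ingressHost: env['PK_INGRESS_HOST'] ?? '0.0.0.0', // assumed fallback
    ingressPort: parsePort('PK_INGRESS_PORT', env['PK_INGRESS_PORT'], 0), // 0: OS-selected port
    serverHost: env['PK_SERVER_HOST'] ?? '0.0.0.0', // assumed fallback
    serverPort: parsePort('PK_SERVER_PORT', env['PK_SERVER_PORT'], 0), // 0: OS-selected port
  };
}
```

On the seed nodes the TD would set PK_INGRESS_PORT=1314 and PK_SERVER_PORT=1315, while a regular user with nothing set gets OS-selected ports.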

Network architecture diagram reposted here:

image

@tegefaulkes @joshuakarp can you review and address my questions above with snippets pointing to the code where relevant

@emmacasolin
Contributor

I'm currently using PK_PASSWORD as the place where the root password is stored, but I can change it to PK_ROOT_PASSWORD?

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2021 via email

@joshuakarp
Contributor

is it actually necessary for us to keep our node state in the DB? I'm wondering about the design of the kademlia and whether that can just be in-memory.

Are you specifically referring to the seed node state here? If so, perhaps this doesn't need to be persisted. In-memory would be satisfactory for this case. It would mean that you can't "discover" the seed nodes as part of the kademlia process.

If you're referring to the global node state here (i.e. all known nodes from Kademlia) then this needs to be persisted. Otherwise, we completely reset our known nodes each time a keynode is restarted.

@joshuakarp
Contributor

joshuakarp commented Nov 16, 2021

Start date changed from Tuesday Nov 16th to Thursday Nov 18th (delays in #269, #231, and CLI MR on Gitlab). Also adjusted end date to cater for weekend (from Sunday Nov 21st to Tuesday Nov 23rd).

@joshuakarp
Contributor

Start date changed from Thursday Nov 18th to Wednesday Dec 1 (delayed from refactoring work in #283).

@CMCDragonkai
Member

This is scheduled to be done as soon as #310 is merged into master. Then the src/config.ts will have properly maintained testnet nodes.

@joshuakarp
Contributor

The PR that addresses this issue should also split the nodes agent service tests (nodesChainDataGet and nodesClosestLocalNode) as per #310 (comment).

@joshuakarp joshuakarp mentioned this issue Feb 8, 2022
29 tasks
@CMCDragonkai
Member

Not split tests, no tests have been written there, so this PR should introduce new tests.

The only nodes tests that I haven't split yet are the agent service tests for nodesChainDataGet, nodesClosestLocalNode, and nodesHolePunchMessage because there are no tests for those written yet.

@CMCDragonkai
Member

Make sure that #326 has the task list updated.

@joshuakarp
Contributor

Not split tests, no tests have been written there, so this PR should introduce new tests.

The only nodes tests that I haven't split yet are the agent service tests for nodesChainDataGet, nodesClosestLocalNode, and nodesHolePunchMessage because there are no tests for those written yet.

My mistake - I misinterpreted this as only nodesHolePunchMessage not having tests yet. Will update.

@joshuakarp
Contributor

The latest pipeline for master (6 days ago when #321 was merged): https://gitlab.com/MatrixAI/open-source/js-polykey/-/commit/a54598eca29a9b89d1d169e2fd9326c469139dc4

It is a working build, but some test failures are apparent (presumably will be fixed once #310 and #320 are merged into master).

@CMCDragonkai
Member

We only need the nodes tests to be passing to proceed. The vaults and discovery stuff won't block testnet deployment. So you should continue reviewing the docs. You should start on AWS (you have the login?) once #310 is merged.

@joshuakarp
Contributor

Our most recent pipeline now that #310 and #320 have been merged: https://gitlab.com/MatrixAI/open-source/js-polykey/-/pipelines/470214101

Only a few test failures that will be handled in external PRs. So can push ahead with this.

@joshuakarp
Contributor

I'm going to start by simply attempting to deploy TypeScript-Demo-Lib on AWS and playing around with it. Will spin it down once it's finished.

@CMCDragonkai
Member

The PK agent will need a writable directory to hold the node state. If we don't specify anything, this will just be a temporary scratch layer provided by ECS, so we should be using a volume mount of some sort. The data inside this PK agent is not important, therefore any AWS volume should be ok; however, an NFS/EFS volume might help us in case we want to debug things.

The previous TD had an EFS attached here. However some things are different now:

  • We have recovery code which allows us to deterministically regenerate the keypair. This is essential when redeploying new versions of the seed nodes, we need to keep this.
  • The internal state of seed nodes is not super important. Things like the GG, ACL... etc should all pretty much remain pristine. The only thing the seed nodes are doing is mutating its node graph.
  • The node graph is more cache than anything. It doesn't necessarily need to be preserved.
  • Because caching the state can be useful for performance, instead of relying on the scratch layer which writes to memory, we can add an EBS volume mount rather than NFS/EFS.

@joshuakarp you should make the modification to the task definition now that we have successfully tested contacting the remote agent.

@CMCDragonkai
Member

Testnet fully deployed... but ongoing tests are needed.

@CMCDragonkai
Member

Initial AWS architecture diagram:

initial aws arch

This is not fully comprehensive simply because the cloudcraft doesn't sufficiently capture all the VPC and subnet architecture and the association with 3 NLBs and availability zones.

We won't do that until we have moved to using Pulumi, and then we can get a better auto generated architecture. For now the intermediate diagram is sufficient for us to close this issue.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 24, 2022
@CMCDragonkai
Member

A summary...

Relay/Seed/Bootstrap/Testnet Deployment (it's taken at least 4 attempts):

  • 2021 February - March: The First Attempt for relay.polykey.io: Ning and Robert and Gideon: Using AWS and ECS with certificates
  • 2021 July - November: The Second Attempt for bootstrap/seed.polykey.io after Major Refactoring: New problems involving unattended deterministic bootstrapping with recovery key, CI/CD building the docker image, trusted seed nodes
  • 2022 February - The Third Attempt for testnet.polykey.io: Here we have a successful connection from client to agent running on AWS through a manual deployment
  • 2022 July - The Fourth Attempt for testnet.polykey.io: Automatic deployment now with NLBs, CI/CD and Polykey-Infrastructure, new problem is fixing the P2P network

@CMCDragonkai
Member

CMCDragonkai commented Jul 29, 2023

  • October 2022 - The Fifth Attempt using Polykey-Infrastructure but switched from ECS instances to EC2 instances with fixed pool of recovery codes for deterministic seed node keypairs. Removed NLBs. 20 minute bootstrapping time went to 20 seconds bootstrapping time, then with crypto upgrade less than 5 seconds. New problems:
    • GRPC needs to be removed in favour of our own RPC solution (transport agnostic RPC)
    • uTP must be removed in favour of QUIC (data transfer layer)
    • Nodes and connection management needs to be bidirectional
    • Local discovery needs MDNS
    • Local traversal needs multi-address with local address prioritisation, otherwise we hit things like NAT hair pinning
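
For the last point, a rough sketch of what "local address prioritisation" could mean when a node advertises several addresses: try RFC 1918 private addresses first, then public ones, so two nodes behind the same NAT connect directly instead of hairpinning. The helper names are hypothetical and IPv4-only; this is not how the connection logic is actually implemented.

```typescript
// True for RFC 1918 private IPv4 addresses (10/8, 172.16/12, 192.168/16).
// Assumes well-formed dotted-quad input.
function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split('.').map(Number);
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Order candidate addresses so local ones are attempted first.
// Relative order within each group is preserved.
function prioritiseLocal(addresses: string[]): string[] {
  const local = addresses.filter((ip) => isPrivateIPv4(ip));
  const remote = addresses.filter((ip) => !isPrivateIPv4(ip));
  return [...local, ...remote];
}
```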


4 participants