Configurable k8s clusters for testing and deploying networks #12
Conversation
in-progress/7588-spartan-clusters.md (outdated)
**boot node**

There will be a statefulset with a single replica for the boot node. As part of its init container it will deploy the enshrined L1 contracts. Other nodes in the network will be able to resolve the boot node's address via its stable DNS name, e.g. `boot-node-0.aztec-network.svc.cluster.local`.
IIUC this would be the P2P boot node? I'd personally move bootstrapping the L1 contracts to a k8s job that runs independently/before the rest of the network.
Later edit: not sure how we'd distribute the L1 contract address to the other services? The addresses will only be known after deploying the contracts so I think we'd need some higher level tool to orchestrate extracting the addresses and creating configmaps/secrets.
Also we need to deploy protocol contracts to L2.
I thought about taking the job route, but since nodes can just ask for addresses this seems simpler for now.
Good point on deploying L2 protocol contracts. I'll include that. 👍
I think @just-mitch intends for the boot node to be a regular full node, not the P2P boot node.
But yeah, we will need to run a bootstrapping process similar to what we are currently doing on our other networks.
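For concreteness, a minimal sketch of what the boot node StatefulSet described above could look like; the image name, init-container entrypoint, and port numbers here are assumptions for illustration, not part of the design doc:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: boot-node
  namespace: aztec-network
spec:
  serviceName: boot-node        # a matching headless Service gives the pod a stable, predictable DNS name
  replicas: 1
  selector:
    matchLabels:
      app: boot-node
  template:
    metadata:
      labels:
        app: boot-node
    spec:
      initContainers:
        - name: deploy-l1-contracts
          image: aztecprotocol/aztec:latest      # image name is an assumption
          # Illustrative command: whatever entrypoint deploys the enshrined L1 contracts before the node starts.
          command: ["aztec", "deploy-l1-contracts"]
      containers:
        - name: boot-node
          image: aztecprotocol/aztec:latest
          ports:
            - containerPort: 8080     # node API port (assumed)
            - containerPort: 40400    # p2p port (assumed)
```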
in-progress/7588-spartan-clusters.md (outdated)
There will be a statefulset for the full nodes. The number of replicas will be configurable. Each full node will have a service exposing its p2p port and node port.

As part of their init container, they will get config from the boot node, including its ENR (which will require exposing this on the `get-node-info` endpoint).
Ah so nodes would just ask the bootnode for the L1 addresses 👍
Exactly!
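One way the full nodes' init container could pull that config over the boot node's stable DNS name is sketched below. The port, the output path, and whether `get-node-info` is a plain HTTP GET rather than a JSON-RPC call are all assumptions:

```yaml
initContainers:
  - name: fetch-boot-node-info
    image: curlimages/curl:8.8.0
    command:
      - sh
      - -c
      - |
        # Poll the boot node until it serves its node info (ENR + L1 contract addresses),
        # then write it to a shared volume for the main container to read at startup.
        until curl -sf http://boot-node-0.aztec-network.svc.cluster.local:8080/get-node-info \
            -o /shared/node-info.json; do
          echo "waiting for boot node..."; sleep 5
        done
    volumeMounts:
      - name: shared
        mountPath: /shared
```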
RUN kind load docker-image aztecprotocol/end-to-end:$AZTEC_DOCKER_TAG
RUN helm install aztec-network helm-charts/aztec-network --set $network_values --namespace $namespace
RUN helm install aztec-chaos helm-charts/aztec-chaos --set $chaos_values --namespace $namespace
RUN helm test aztec-network --namespace $namespace
So these e2e tests would run on the existing test runner but they leverage docker to create a k8s cluster?
Yes that's exactly right
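In case it helps to visualize the `helm test aztec-network` step above: `helm test` runs pods in the chart annotated with the test hook. A sketch, assuming the aztec-network chart ships such a pod (the template name, values wiring, and test command are made up):

```yaml
# templates/tests/network-smoke-test.yaml in the hypothetical aztec-network chart
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-smoke-test"
  annotations:
    "helm.sh/hook": test      # this pod only runs when `helm test` is invoked
spec:
  restartPolicy: Never
  containers:
    - name: e2e
      image: aztecprotocol/end-to-end:{{ .Values.image.tag }}
      # Illustrative entrypoint: run an e2e suite against PXEs/nodes inside the cluster.
      command: ["yarn", "test"]
```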
Existing e2e tests will continue to work as they do now.

We will gradually port the tests to work with either the existing setup or the new setup, configurable via an environment variable; this likely can be done by simply pointing the wallets that run in the tests to PXEs in the k8s cluster.
🙌 This is great!
Plan looks good! I don't have enough k8s experience to commit to it and I'm biased towards the current setup, but if you get buy-in for switching to it, then let's go for it.
There will be a statefulset with a single replica for the boot node. As part of its init container it will deploy the enshrined L1 contracts. Other nodes in the network will be able to resolve the boot node's address via its stable DNS name, e.g. `boot-node-0.aztec-network.svc.cluster.local`.

**full node**
Should we differentiate full nodes, validator nodes, sequencing nodes, and prover nodes?
Also, should we spin up prover agents separately from prover nodes?
Yep! These should be separated. I will add that.
Side note, I think validator/sequencing nodes should be the same for this purpose.
> I think validator/sequencing nodes should be the same for this purpose.
True, that's a protocol-level decision, not an infra one
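If the roles are split as discussed, the chart's values could expose them as separately scalable pools. A hedged sketch; the key names and defaults below are made up for illustration and not the actual chart schema:

```yaml
# values.yaml (illustrative keys only)
bootNode:
  replicas: 1
fullNode:
  replicas: 3
validator:          # validator == sequencing node for this purpose
  replicas: 2
proverNode:
  replicas: 1
proverAgent:        # agents scale independently of the prover node
  replicas: 4
```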
### Production Network

There will be a separate long-lived k8s cluster deployed on AWS. This cluster will be used for running the public `spartan` network.
Is Spartan SequencerNet?
I've never heard of SequencerNet: `spartan` will be a public, permissioned network that will support multiple validators (and provers). So I think they're the same thing.
There will be a deployment for grafana. It will have a single replica, and be exposed via ClusterIP.
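A minimal sketch of such a grafana deployment and its cluster-internal service; the namespace, image tag, and port mapping are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: aztec-network
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: aztec-network
spec:
  type: ClusterIP        # cluster-internal only; reachable from a workstation via kubectl port-forward
  selector:
    app: grafana
  ports:
    - port: 80
      targetPort: 3000
```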
### Staging Network |
If we go down this route, I'd use this to deploy all of our networks (ie alphanet, devnet, and provernet), not just sequencer nets.
For sure. I'd like to demonstrate it working in isolation with this one network and then migrate the others once we have confidence.
We use docker-compose already in some tests. The problem is that it is very difficult to test a network with more than a few nodes. It is also difficult to simulate network conditions. Lastly, we wouldn't be able to use the same tooling to deploy a public network.

The thinking here is to use the same tooling that we use for production deployments to test our networks. This should result in less code and more confidence.
To confirm: this would replace much of our tf templates then?
Yes. TF can/should be used to set up the k8s cluster itself, and public networking infrastructure, but our software deployment configuration can live in helm.
There will be a GitHub Actions workflow that deploys the network to this cluster on every push to the `staging` branch.

The grafana dashboard will be exposed via a public IP, but password protected.
Can we just use our existing prom/grafana setup?
Yes. Local clusters used in CI/CD can deploy their own grafana/prom within the k8s cluster, and we can disable that for `staging` and update the existing/external prometheus to scrape from the opentel collector within the `staging` k8s cluster.
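The staging deploy workflow could look something like the sketch below. The workflow filename, secret names, cluster name/region, and the telemetry toggle are all assumptions for illustration:

```yaml
# .github/workflows/deploy-staging.yml (illustrative)
name: Deploy staging network
on:
  push:
    branches: [staging]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubeconfig for the staging cluster
        run: aws eks update-kubeconfig --name spartan-staging --region us-east-1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Deploy the network chart
        run: |
          helm upgrade --install aztec-network helm-charts/aztec-network \
            --namespace staging --create-namespace \
            --set telemetry.useLocalGrafana=false   # hypothetical flag: rely on the external prom/grafana instead
```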
## Documentation Plan

We will write documentation on how people can join the `spartan` network.
I'm not familiar with k8s but all of this seems much easier than anything we have at the moment.
Got a verbal +1 from @charlielye.
We will: