Simple HA TrustDomain #5611
Comments
I have a viable prototype of the spire-ha-agent that I've been running in a home lab for a few days now. I have a Kubernetes cluster built on top of it that uses SPIRE-issued JWTs for kubelet -> kube-apiserver attestation. It's able to continue functioning when tokens expire and one of the SPIRE servers is offline. The code is really rough, but in a good enough state to consider contributing.
I'd like to chime in on this and say that this is something I'd like to see SPIRE be able to do for the deployments I manage. The shared database in particular is one of the parts that we are still trying to figure out how to deal with, not only in the case of an outage, but also in the case of maintenance such as upgrades (in a lot of cases these are still recommended to be done with some downtime). I've asked in the past for something like this and I think this is definitely a step in the right direction. It would be even better if it were possible to avoid having to run multiple agents, which at least doubles the load on the system. Would be glad to help with any work needed for this.
It's going to be important too for making the argument that something HA like Kubernetes should depend on SPIRE in certain setups. If it's significantly harder to make SPIRE HA and then put an HA Kubernetes on top, it will seriously hamper adoption of that use case. Kubernetes is pretty easy to make HA, so SPIRE needs to be too.
I'm thinking we start small. Yeah, it doubles up the required resources, but it has a good failure profile and doesn't touch the existing, working infrastructure, so only those who want to test out the new spire-ha-agent bits take on the risk of trying something new. Once we have it working reliably, we can look at ways of reducing the overhead.
❤️ Thinking maybe we start this off as its own repo, so it can evolve without bothering the main SPIRE maintainers (unless they want to be involved)? We can mark it as experimental too, so it's clear it's kind of an early thing. @sorindumitru and I can be maintainers, along with anyone else from the SPIRE team who wants to work on it.
Maintainer consensus is to give this its own repo under github.com/spiffe, mark it dev/experimental, and explore the idea there.
For SPIRE to continue to make inroads into the enterprise, there need to be good HA solutions. For example, if a spire-server goes offline, JWT support breaks, as there is no way to get JWTs issued. This is important functionality and is a showstopper for some.
Currently, High Availability is left as an exercise for the reader in SPIRE. It usually involves some kind of clustered SQL server and some kind of load balancing tier. It's pretty complex to deploy and manage.
Even under such a setup, there are still single points of failure (e.g. the SQL database is replicated, but what happens if it gets corrupted during a failed schema update?). If you are trying to make SPIRE the bottom turtle security-wise, it's hard to use SPIRE to secure the SQL cluster, as you need an HA SPIRE to set up your SQL cluster. It's a chicken-and-egg issue. And there are more issues.
Instead, I propose we provide the option to adopt the Prometheus philosophy of HA: two identically configured servers providing the same service. "Just run 2."
So, say we create a new term: an HA Trust Domain. It's made up of 2 independent SPIRE deployments configured identically. Each spire-server is configured the same way, with the same entries and the same TrustDomain setting. Each node has 2 agents, each talking to one server.
For the sake of discussion, let's call the server instances ServerA and ServerB. On a worker node, let's call the agent connected to ServerA AgentA, and the one connected to ServerB AgentB.
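As a rough illustration of the layout described above, the sketch below models the two deployments and checks the "configured identically" invariant. The names `Deployment`, `entries`, and `is_ha_pair` are hypothetical and are not SPIRE configuration options or APIs; registration entries are reduced to (SPIFFE ID, selector) pairs, and `example.org` is a placeholder trust domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One independent SPIRE deployment: a server plus the agent that talks to it.

    Hypothetical model for discussion only, not a SPIRE type.
    """
    server: str            # e.g. "ServerA"
    agent: str             # e.g. "AgentA"
    trust_domain: str      # must match on both sides of the HA pair
    entries: frozenset     # simplified registration entries: (spiffe_id, selector)


def is_ha_pair(a: Deployment, b: Deployment) -> bool:
    """True if the two deployments are separate instances with identical config."""
    return (
        a.server != b.server
        and a.agent != b.agent
        and a.trust_domain == b.trust_domain
        and a.entries == b.entries
    )


# Placeholder entry; real entries carry more fields than this.
entries = frozenset({("spiffe://example.org/kubelet", "unix:uid:0")})
side_a = Deployment("ServerA", "AgentA", "example.org", entries)
side_b = Deployment("ServerB", "AgentB", "example.org", entries)

print(is_ha_pair(side_a, side_b))
```

A check like this would fail as soon as the two sides drift apart, for example if an entry is registered on ServerA but not on ServerB.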
A workload won't care whether its certificate was issued by ServerA or ServerB if its TrustBundle == ServerA(TrustBundle) + ServerB(TrustBundle). It also doesn't care whether it is talking to AgentA or AgentB. This allows an entire underlying TrustDomain (Server and related Agents) to go offline while the other one stays up. No clustered DB or load balancers needed.
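The two properties above (a merged trust bundle, and indifference to which agent answers) can be sketched as follows. This is a toy model under stated assumptions, not code from the prototype: bundles are plain sets of root authority names rather than real X.509 or JWT keys, agents are simulated by plain functions, and every name here (`merged_bundle`, `is_trusted`, `fetch_with_failover`, `AgentUnavailable`) is hypothetical.

```python
class AgentUnavailable(Exception):
    """Raised by a simulated agent whose side of the HA pair is offline."""


def merged_bundle(bundle_a: set, bundle_b: set) -> set:
    """TrustBundle == ServerA(TrustBundle) + ServerB(TrustBundle)."""
    return bundle_a | bundle_b


def is_trusted(issuer: str, bundle: set) -> bool:
    """Accept a credential if its issuing authority is in the bundle."""
    return issuer in bundle


def fetch_with_failover(agents):
    """Try each agent in order and return the first credential obtained."""
    last_error = None
    for fetch in agents:
        try:
            return fetch()
        except AgentUnavailable as err:
            last_error = err  # remember the failure and try the next agent
    raise AgentUnavailable("no agent could issue a credential") from last_error


def agent_a():
    # Simulates AgentA while ServerA is down.
    raise AgentUnavailable("ServerA is offline")


def agent_b():
    # Simulates AgentB returning a credential signed by ServerB's authority.
    return {"spiffe_id": "spiffe://example.org/kubelet", "issuer": "root-B"}


bundle = merged_bundle({"root-A"}, {"root-B"})
svid = fetch_with_failover([agent_a, agent_b])
print(svid["issuer"], is_trusted(svid["issuer"], bundle))
```

Because the verifier holds the union of both bundles, the credential from AgentB is accepted even though ServerA never signed it, which is what lets one whole side go offline.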
If you want to build a scaled out SPIRE setup using a clustered SQL database, you can use such an HA TrustDomain to build the certs for the SQL cluster and root this SPIRE setup on the HA TrustDomain via nesting.
For an initial implementation with the smallest amount of effort, we add a new agent. Let's call it spire-ha-agent.
The spire-ha-agent does the following:
Other enhancements can be made in the future on the server side to make this workflow better too, but let's save those for another issue.