Simple HA TrustDomain #5611

kfox1111 · 2024-10-29T13:10:27Z

For SPIRE to continue to make inroads into the enterprise, there need to be good HA solutions. For example, if a spire-server goes offline, JWT support breaks, as there is no way to get JWTs issued. This is important functionality, and its absence is a showstopper for some.

Currently, high availability is left as an exercise for the reader in SPIRE. It usually involves some kind of clustered SQL server and some kind of load-balancing tier. It's pretty complex to deploy and manage.

Even under such a setup, there are still single points of failure (e.g., the SQL database is replicated, but what happens if it gets corrupted during a failed schema update?). If you are trying to make SPIRE the bottom turtle security-wise, it's hard to use SPIRE to secure the SQL cluster, as you need an HA SPIRE to set up your SQL cluster: a chicken-and-egg problem. And there are more issues.

Instead, I propose we provide the option to adopt the Prometheus philosophy of HA: two identically configured servers providing the same service. "Just run 2."

So, say we create a new term, an HA Trust Domain. It's made up of 2 independent SPIRE deployments configured identically. Each deployment's spire-server is configured the same way, with the same entries and the same TrustDomain setting. Each node has 2 agents, each talking to one of the servers.

For the sake of discussion, let's call the server instances ServerA and ServerB. On a worker node, let's call the agent connected to ServerA AgentA, and the one connected to ServerB AgentB.

A workload won't care whether its certificate was issued by ServerA or ServerB as long as its TrustBundle == ServerA(TrustBundle) + ServerB(TrustBundle). It also doesn't care whether it is talking to AgentA or AgentB. This allows an entire underlying TrustDomain (a server and its related agents) to go offline while the other one stays up. No clustered DB or load balancers needed.
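
To make the bundle argument concrete, here is a minimal toy sketch of it. It is not real X.509 validation: roots are modeled as plain strings, an SVID as a dict with an `issuer` field, and the names `merge_bundles` and `is_trusted` are hypothetical, invented only for illustration.

```python
def merge_bundles(bundle_a, bundle_b):
    """Return the union of two trust bundles, in order, without duplicates."""
    merged = []
    for root in [*bundle_a, *bundle_b]:
        if root not in merged:
            merged.append(root)
    return merged


def is_trusted(svid, bundle):
    """Toy check: an SVID is trusted if its issuing root is in the bundle."""
    return svid["issuer"] in bundle


# Hypothetical roots for the two independent deployments.
bundle_a = ["root-ServerA"]
bundle_b = ["root-ServerB"]
combined = merge_bundles(bundle_a, bundle_b)

# The same workload identity, issued once by each server.
svid_from_a = {"spiffe_id": "spiffe://example.org/web", "issuer": "root-ServerA"}
svid_from_b = {"spiffe_id": "spiffe://example.org/web", "issuer": "root-ServerB"}

print(is_trusted(svid_from_a, combined), is_trusted(svid_from_b, combined))
print(is_trusted(svid_from_b, bundle_a))
```

With the combined bundle, both SVIDs validate; with only ServerA's bundle, the SVID issued by ServerB does not, which is why the workload needs the union of the two.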

If you want to build a scaled out SPIRE setup using a clustered SQL database, you can use such an HA TrustDomain to build the certs for the SQL cluster and root this SPIRE setup on the HA TrustDomain via nesting.

For an initial implementation with the smallest amount of effort, we add a new agent. Let's call it spire-ha-agent.

The spire-ha-agent does the following:

- Connects to AgentA and AgentB via the Delegated Identity API, passing the PID of the caller.
- Listens on the main Workload API Unix socket. When requests come in, it proxies them to AgentA and/or AgentB:
  - For TrustBundle requests, hand out the TrustBundles from both agents.
  - Cache the TrustBundles locally. On spire-ha-agent restart, if both bundles are on disk, it can start with one of the agents offline.
  - For an X.509/JWT request, send it to AgentA or AgentB and return the result from whichever is online at the time.
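
The proxy behavior in the list above could be sketched roughly as follows. This is only an illustration of the failover and caching logic, under loud assumptions: it does not use the real Delegated Identity API or gRPC, `FakeAgent` is a hypothetical stand-in for a client talking to AgentA or AgentB, and `AgentUnavailable`, `fetch_svid`, `fetch_bundles`, and the JSON cache layout are all made up for the example.

```python
import json
from pathlib import Path


class AgentUnavailable(Exception):
    """Raised when an underlying agent cannot be reached."""


class FakeAgent:
    """Hypothetical stand-in for a client talking to AgentA or AgentB."""

    def __init__(self, name, bundle, online=True):
        self.name = name
        self.bundle = bundle
        self.online = online

    def fetch_bundle(self):
        if not self.online:
            raise AgentUnavailable(self.name)
        return list(self.bundle)

    def fetch_svid(self, pid):
        if not self.online:
            raise AgentUnavailable(self.name)
        return f"svid-for-pid-{pid}-from-{self.name}"


def fetch_svid(agents, pid):
    """Return the result from the first agent that is online."""
    for agent in agents:
        try:
            return agent.fetch_svid(pid)
        except AgentUnavailable:
            continue  # this agent is down, try the next one
    raise AgentUnavailable("all agents are offline")


def fetch_bundles(agents, cache_file):
    """Return the bundles of all agents, using the on-disk cache for any
    agent that is offline."""
    cache_file = Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    for agent in agents:
        try:
            cache[agent.name] = agent.fetch_bundle()
        except AgentUnavailable:
            if agent.name not in cache:
                raise  # no cached copy of this bundle, cannot serve both
    cache_file.write_text(json.dumps(cache))
    return [root for agent in agents for root in cache[agent.name]]
```

The design choice worth noting is that a bundle request fails only when an agent is offline and its bundle has never been cached, while an X.509/JWT request fails only when every agent is offline.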

Other enhancements can be made in the future on the server side to make this workflow better too. But let's save those for another issue.

kfox1111 · 2024-11-03T16:28:45Z

I have a viable prototype of the spire-ha-agent that I've been running in a home lab for a few days now. I have a Kubernetes cluster built on top of it for kubelet -> kube-apiserver attestation using SPIRE-issued JWTs. It's able to continue functioning when tokens expire and one of the SPIRE servers is offline.

The code is really rough, but in a good enough state to consider contributing.

sorindumitru · 2024-11-05T13:12:09Z

I'd like to chime in on this and say that this is something that I'd like to see SPIRE be able to do for the deployments I manage. The shared database in particular is one of the parts that we are still trying to figure out how to deal with, not only in the case of an outage, but also in the case of maintenance such as upgrades (in a lot of cases these are still recommended to be done with some downtime).

I've asked in the past for something like this and I think this is definitely a step in the right direction. It would be even better if it was possible to avoid having to run multiple agents, which at least doubles the load on the system.

Would be glad to help with any work needed for this.

kfox1111 · 2024-11-05T14:28:29Z

> I'd like to chime in on this and say that this is something that I'd like to see SPIRE be able to do for the deployments I manage. The shared database in particular is one of the parts that we are still trying to figure out how to deal with, not only in the case of an outage, but also in the case of maintenance such as upgrades (in a lot of cases these are still recommended to be done with some downtime).

It's going to be important too for making the argument that something HA like Kubernetes should depend on SPIRE in certain setups. If it's significantly harder to make SPIRE HA and then put an HA Kubernetes on top, it will seriously hamper adoption of that use case. Kubernetes is pretty easy to make HA, so SPIRE needs to be too.

> I've asked in the past for something like this and I think this is definitely a step in the right direction. It would be even better if it was possible to avoid having to run multiple agents, which at least doubles the load on the system.

I'm thinking we start small. Yeah, it doubles the required resources, but it has a good failure profile and doesn't touch the existing, working infrastructure, so only those who want to test the new spire-ha-agent bits take on the risk of trying something new. Once we have it working reliably, we can look at ways of reducing the overhead.

> Would be glad to help with any work needed for this.

❤️

I'm thinking maybe we start this off as its own repo, so it can evolve without bothering the main SPIRE maintainers (unless they want to be involved)? We can mark it as experimental too, so it's clear it's kind of an early thing. @sorindumitru and I can be maintainers, along with anyone else from the SPIRE team who wants to work on it.

kfox1111 · 2024-11-15T18:57:08Z

Maintainer consensus is to give this its own repo under github.com/spiffe, mark it dev/experimental, and explore the idea there.

kfox1111 mentioned this issue Oct 29, 2024

The Bottom Turtle Reference Architecture(s) #5206

Open

5 tasks

amartinezfayo added the triage/in-progress Issue triage is in progress label Oct 29, 2024

kfox1111 self-assigned this Oct 29, 2024

amartinezfayo assigned evan2645 Oct 31, 2024

kfox1111 closed this as completed Nov 15, 2024
