Simple HA TrustDomain #5611

Closed
kfox1111 opened this issue Oct 29, 2024 · 4 comments
Labels
triage/in-progress Issue triage is in progress

Comments

@kfox1111
Contributor

For SPIRE to continue to make inroads into the enterprise, there need to be good HA solutions. For example, if a spire-server goes offline, JWT support breaks because there is no way to get JWTs issued. This is important functionality, and losing it is a showstopper for some.

Currently, high availability is left as an exercise for the reader in SPIRE. It usually involves some kind of clustered SQL server and some kind of load-balancing tier. It's pretty complex to deploy and manage.

Under such a setup, though, there are still single points of failure (e.g., the SQL database is replicated, but what happens if it gets corrupted during a failed schema update?). If you are trying to make SPIRE the bottom turtle security-wise, it's hard to use SPIRE to secure the SQL cluster, because you need an HA SPIRE to set up your SQL cluster: a chicken-and-egg problem. And there are more issues.

Instead, I propose we provide the option to adopt the Prometheus philosophy of HA: two identically configured servers providing the same service. "Just run 2."

So, say we create a new term: an HA Trust Domain. It's made up of 2 independent SPIRE deployments configured identically. Each deployment's spire-server is configured the same way, with the same entries and the same TrustDomain setting. Each node runs 2 agents, each talking to one of the servers.

For the sake of discussion, let's call the server instances ServerA and ServerB. On a worker node, let's call the agent connected to ServerA AgentA, and the one connected to ServerB AgentB.
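
To make the naming concrete, here is a minimal sketch of the topology as plain data. The `Deployment` class, the `ha_trust_domain` helper, and the `example.org` trust domain are all hypothetical names used for illustration; nothing here is part of SPIRE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One independent SPIRE deployment: a server plus the node agent that talks to it."""
    server: str
    agent: str
    trust_domain: str


def ha_trust_domain(a: Deployment, b: Deployment) -> str:
    """Return the shared trust domain, or fail if the two deployments disagree."""
    if a.trust_domain != b.trust_domain:
        raise ValueError("both deployments must use the same trust domain")
    return a.trust_domain


# Hypothetical values: the two sides differ only in which server/agent pair they use.
side_a = Deployment(server="ServerA", agent="AgentA", trust_domain="example.org")
side_b = Deployment(server="ServerB", agent="AgentB", trust_domain="example.org")
```

The only invariant checked is the one the proposal relies on: both sides must be configured with the same TrustDomain.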

A workload won't care whether its certificate was issued by ServerA or ServerB if its TrustBundle == ServerA(TrustBundle) + ServerB(TrustBundle). It also doesn't care whether it is talking to AgentA or AgentB. This allows an entire underlying TrustDomain (a server and its related agents) to go offline while the other one stays up. No clustered DB or load balancers needed.
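
The bundle equation can be illustrated with a toy model. This is only a sketch: real bundles hold X.509 roots and JWT keys and verification involves signature checks, whereas here a bundle is just a set of made-up root identifiers and a certificate only records which root issued it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    """Toy stand-in for an X.509-SVID: it only records which root issued it."""
    spiffe_id: str
    issuer_root: str


def merged_bundle(*bundles: frozenset[str]) -> frozenset[str]:
    """Union of the root identifiers from every underlying trust bundle."""
    return frozenset().union(*bundles)


def is_trusted(cert: Certificate, bundle: frozenset[str]) -> bool:
    """Accept a certificate when its issuing root is present in the bundle."""
    return cert.issuer_root in bundle


# Hypothetical root identifiers for the two independent servers.
bundle_a = frozenset({"root-A"})
bundle_b = frozenset({"root-B"})
ha_bundle = merged_bundle(bundle_a, bundle_b)

svid_from_a = Certificate("spiffe://example.org/web", issuer_root="root-A")
svid_from_b = Certificate("spiffe://example.org/web", issuer_root="root-B")
```

With `ha_bundle`, a peer accepts `svid_from_a` and `svid_from_b` alike, while a peer holding only `bundle_a` would reject anything issued by ServerB. That is why the workload must always receive the combined bundle.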

If you want to build a scaled-out SPIRE setup using a clustered SQL database, you can use such an HA TrustDomain to issue the certs for the SQL cluster and root that SPIRE setup on the HA TrustDomain via nesting.

For an initial implementation with the smallest amount of effort, we add a new agent. Let's call it spire-ha-agent.

The spire-ha-agent does the following:

  • Connects to AgentA and AgentB via the Delegated Identity API, passing the PID of the caller.
  • Listens on the main Workload API Unix socket and, when requests come in, proxies them to AgentA and/or AgentB.
    • For a TrustBundle request, hands out the TrustBundles from both agents.
    • Caches the TrustBundles locally. On restart, if both bundles are on disk, spire-ha-agent can start with one of the agents offline.
    • For an X.509 or JWT request, sends it to AgentA or AgentB and returns the result from whichever is online at the time.
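
The behaviour listed above could be sketched roughly as follows. This is not the prototype's code: the agents are stood in for by plain Python callables rather than Delegated Identity API clients, and `AgentUnavailable`, `fetch_svid`, `load_merged_bundle`, and the JSON cache file are hypothetical names and choices made for illustration.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Sequence


class AgentUnavailable(Exception):
    """Raised by a (hypothetical) agent client when its agent cannot be reached."""


def fetch_svid(fetchers: Sequence[Callable[[int], str]], pid: int) -> str:
    """Return the SVID from the first agent that answers for the caller's PID."""
    for fetch in fetchers:
        try:
            return fetch(pid)
        except AgentUnavailable:
            continue  # this agent is offline, try the next one
    raise AgentUnavailable("no agent is reachable")


def load_merged_bundle(
    fetchers: dict[str, Callable[[], list[str]]], cache_file: Path
) -> list[str]:
    """Combine every agent's bundle, using the on-disk cache for offline agents."""
    cache: dict[str, list[str]] = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text())
    for name, fetch in fetchers.items():
        try:
            cache[name] = fetch()  # refresh from a live agent
        except AgentUnavailable:
            if name not in cache:
                raise  # bundle never seen, so a complete set cannot be served
    cache_file.write_text(json.dumps(cache))
    return sorted({root for roots in cache.values() for root in roots})
```

The asymmetry is deliberate in this sketch: an X.509 or JWT request needs only one live agent, while a bundle request needs both bundles, which is why the bundles are cached and the SVIDs are not.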

Other enhancements can be made on the server side in the future to make this workflow better too, but let's save those for another issue.

[Image: spire-ha]

@kfox1111
Contributor Author

kfox1111 commented Nov 3, 2024

I have a viable prototype of the spire-ha-agent that I've been running in a home lab for a few days now. I have a Kubernetes cluster built on top of it, using SPIRE-issued JWTs for kubelet -> kube-apiserver attestation. It's able to continue functioning when tokens expire and one of the SPIRE servers is offline.

The code is really rough, but in a good enough state to consider contributing.

@sorindumitru
Contributor

I'd like to chime in on this and say that this is something that I'd like to see SPIRE be able to do for the deployments I manage. The shared database in particular is one of the parts that we are still trying to figure out how to deal with, not only in cases of it having an outage, but also in the case of maintenance such as upgrades (in a lot of cases these are still recommended to be done with some downtime).

I've asked in the past for something like this and I think this is definitely a step in the right direction. It would be even better if it was possible to avoid having to run multiple agents, which at least doubles the load on the system.

Would be glad to help with any work needed for this.

@kfox1111
Contributor Author

kfox1111 commented Nov 5, 2024

> I'd like to chime in on this and say that this is something that I'd like to see SPIRE be able to do for the deployments I manage. The shared database in particular is one of the parts that we are still trying to figure out how to deal with, not only in cases of it having an outage, but also in the case of maintenance such as upgrades (in a lot of cases these are still recommended to be done with some downtime).

It's going to be important too for making the argument that something HA like Kubernetes should depend on SPIRE in certain setups. If it's significantly harder to make SPIRE HA and then put an HA Kubernetes on top, it will seriously hamper adoption of that use case. Kubernetes is pretty easy to make HA, so SPIRE needs to be too.

> I've asked in the past for something like this and I think this is definitely a step in the right direction. It would be even better if it was possible to avoid having to run multiple agents, which at least doubles the load on the system.

I'm thinking we start small. Yeah, it doubles the required resources, but it has a good failure profile and doesn't touch the existing, working infrastructure, so only those who want to test out the new spire-ha-agent bits take on the risk of trying something new. Once we have it working reliably, we can look at ways of reducing the overhead?

> Would be glad to help with any work needed for this.

❤️

I'm thinking maybe we start this off as its own repo, so it can evolve without bothering the main SPIRE maintainers (unless they want to be involved)? We can mark it as experimental too, so it's clear it's kind of an early thing. @sorindumitru and I can be maintainers, along with anyone else from the SPIRE team who wants to work on it?

@kfox1111
Contributor Author

kfox1111 commented Nov 15, 2024

Maintainer consensus is to give this its own repo under github.com/spiffe, mark it dev/experimental, and explore the idea there.
