Workload not able to fetch the prepared X509 Authority certificate during the "grace period". #2704
Comments
I don't think this is the case, because your server is configured to use a static x509 upstream authority (root.key and root.crt). This is the certificate that ends up being in the bundle ... and when you see SPIRE Server rotating signing certs, it's rotating intermediates (since the roots are statically defined). That means that the bundle is not actually changing during these rotations.
This is expected. You have your statically defined root, your rotating intermediate (managed internally by SPIRE Server), and then your leaf SVID. The number of intermediates won't grow when an intermediate rotates - instead, the "Intermediate #1" will simply be replaced with an updated intermediate. And, you should not see the intermediate flip over to the new one until SPIRE activates it.
root.crt is what you should be finding in the bundle. The prepared/activated intermediate CAs only appear in SVID chains. I suspect that you are expecting SPIRE to manage the root, and for the roots to rotate. You can do this by not configuring an UpstreamAuthority plugin on SPIRE Server.
At this point, I'm not quite sure what could be causing your TLS errors. Any chance you can share the agent and server logs? Are you proxying traffic between agents and servers?
Sorry about the confusion. I forgot to mention we use spiffe-helper with
We also understand that the root CA cert won't be rotated with the existing configuration, but we do expect SPIRE to rotate the intermediate CA cert and leaf SVIDs. Our issue was, as shown in the figure below:
However, according to the video presented by you @evan2645 and Andrew @azdagron:
I think there is a misunderstanding. Since you have an upstream authority configured, the bundles only contain the upstream root. Each workload receives the following over the workload API:
The upstream root is the trust anchor for the trust domain. Therefore, in order for workloads to verify each other, they must present the entire SVID certificate chain when doing TLS handshakes (i.e. every certificate parsed from the x509_svid field). In other words, when a client connects to the server, the server should present not just its own SVID, but also the intermediate that signed it. The client should likewise present its own SVID and the intermediate that signed it. Each side can then form a complete chain of trust back to the upstream root. The SPIFFE helper concatenates all of the certificates parsed from the
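To illustrate the point about presenting the whole chain, here is a minimal Python sketch, not taken from SPIRE or spiffe-helper. It assumes the SVID file is a concatenated PEM with the leaf first and the intermediate(s) after it, and that the bundle file holds only the upstream root. The function names `split_chain` and `make_mtls_context` and the file path parameters are hypothetical, chosen for this example only.

```python
import re
import ssl

# Matches each PEM-encoded certificate in a concatenated file.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_chain(pem_text: str) -> tuple[str, list[str]]:
    """Split concatenated PEM text into (leaf, intermediates).

    Assumption: the leaf SVID comes first and the intermediates follow.
    """
    certs = PEM_CERT.findall(pem_text)
    if not certs:
        raise ValueError("no certificates found")
    return certs[0], certs[1:]


def make_mtls_context(chain_file: str, key_file: str, bundle_file: str) -> ssl.SSLContext:
    """Build a server-side context that presents leaf + intermediates
    and trusts only the certificates in bundle_file (the upstream root)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # certfile may hold the leaf followed by the CA certs needed to chain it.
    ctx.load_cert_chain(certfile=chain_file, keyfile=key_file)
    # Only the trust anchor goes here; intermediates arrive in the handshake.
    ctx.load_verify_locations(cafile=bundle_file)
    ctx.verify_mode = ssl.CERT_REQUIRED  # require a client certificate
    return ctx
```

If `split_chain` returns an empty list of intermediates for the SVID file, the peer receives only the leaf and cannot chain it to the root in the bundle, which would match the handshake failures described in this thread. The sketch does not check SPIFFE IDs; it only shows where the chain and the trust anchor go.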
This should really not be supported by spiffe-helper. Can you try without it set?
The problem went away after removing
Great to hear!
I noticed SSL errors between our server and client (both set up with spire-agent for cert rotation) after SPIRE's X509 Authority started to rotate. But later, the errors went away.
Our speculation is that propagation of the SPIRE CA bundle (with the prepared X509 Authority cert) didn't work during the 1/2-5/6 grace period (the new X509 Authority is prepared at 1/2 of the TTL of the currently active X509 Authority, and becomes active for signing SVIDs at 5/6 of that TTL). Our server was using the previously active X509 Authority cert (not yet expired) as its SSL CA bundle to accept SSL connections. But our client had already received an SVID signed by the newly activated X509 Authority, and it used that SVID to connect to our server. The previously active X509 Authority cert on the server was not able to recognize the SVID presented by the client, so the SSL handshake failed.
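As a rough illustration of the timing described above, the following Python sketch computes when the next X509 Authority would be prepared and activated, using only the 1/2 and 5/6 fractions quoted in this issue. The function name `rotation_schedule` and the 24-hour TTL are assumptions made for the example, not SPIRE defaults or SPIRE code.

```python
from datetime import datetime, timedelta, timezone


def rotation_schedule(issued_at: datetime, ttl: timedelta) -> dict:
    """Return the key moments in the active X509 Authority's lifetime.

    Fractions are the ones quoted in the issue: the next authority is
    prepared at 1/2 of the TTL and activated at 5/6 of the TTL.
    """
    return {
        "prepare_next": issued_at + ttl / 2,
        "activate_next": issued_at + ttl * 5 / 6,
        "expires": issued_at + ttl,
    }


if __name__ == "__main__":
    # Hypothetical authority issued at midnight UTC with a 24-hour TTL.
    start = datetime(2022, 1, 1, tzinfo=timezone.utc)
    for name, when in rotation_schedule(start, timedelta(hours=24)).items():
        print(f"{name}: {when.isoformat()}")
```

The window between `prepare_next` and `activate_next` is the "grace period" referred to in this report.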
Below is what we did to verify our speculation.
We ran spire-agent api watch to monitor the change of X509 Authority and workload SVID, and below is an example output. Besides Intermediate #1, we were expecting to see something like Intermediate #2 as the prepared X509 Authority in the output during the grace period. But Intermediate #2 never showed up in the output, and the validity time of Intermediate #1 changed only after the prepared X509 Authority became active and the SPIRE agent got a renewed SVID from the SPIRE server.
We ran spire-agent api fetch x509 -write to persist the certs during the grace period. But we were not able to find the prepared X509 Authority cert in those persisted certs (bundle.0.pem and svid.0.pem).
Our understanding of the SPIRE X509 Authority rotation is that during the grace period, the SPIRE agent will poll the SPIRE server intermittently, and the SPIRE server will send back both the active and the prepared X509 Authority certs. But we are missing the prepared X509 Authority cert here.
We are able to consistently reproduce this issue on spire-0.12.0, which our production deployment is running on, and spire-1.1.3, which is the latest release. Below are the configuration files we used for reproducing.