Workload not able to fetch the prepared X509 Authority certificate during the "grace period". #2704

Closed
dzhou3 opened this issue Jan 28, 2022 · 6 comments
@dzhou3

dzhou3 commented Jan 28, 2022

I noticed SSL errors between our server and client (both set up with spire-agent for cert rotation) after SPIRE’s X509 Authority started to rotate. Later, the errors went away.

Our speculation is that propagation of the SPIRE CA bundle (with the prepared X509 Authority cert) didn’t work during the 1/2-5/6 grace period (the new X509 Authority is prepared at 1/2 of the TTL of the currently active X509 Authority, and becomes active for signing SVIDs at 5/6 of that TTL). Our server was using the previously active X509 Authority cert (not yet expired) as its SSL CA bundle to accept SSL connections. But our client had already received an SVID signed by the newly activated X509 Authority and used that SVID to connect to our server. The previously active X509 Authority cert on the server wasn't able to validate the SVID presented by the client, so the SSL handshake failed.
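
To make the timing concrete, here is a minimal sketch of the arithmetic, assuming the 1/2 and 5/6 fractions stated above hold exactly (actual SPIRE behavior may differ between versions). The function name `rotation_schedule` and the start time are hypothetical; the 6-minute TTL matches the `ca_ttl = "6m"` setting used for reproduction.

```python
from datetime import datetime, timedelta, timezone


def rotation_schedule(not_before, not_after):
    """Return (prepare_at, activate_at) for the next X509 Authority.

    Assumption: the next authority is prepared at 1/2 and activated at
    5/6 of the current authority's lifetime, as described in this issue.
    """
    ttl = not_after - not_before
    prepare_at = not_before + ttl / 2
    activate_at = not_before + ttl * 5 / 6
    return prepare_at, activate_at


# Hypothetical issuance time; only the 6-minute TTL comes from the config.
issued = datetime(2022, 1, 28, 1, 0, 0, tzinfo=timezone.utc)
expires = issued + timedelta(minutes=6)

prepare_at, activate_at = rotation_schedule(issued, expires)
print("prepared at: ", prepare_at.isoformat())
print("activated at:", activate_at.isoformat())
```

With a 6-minute TTL this puts preparation 3 minutes and activation 5 minutes after issuance, so the grace period under discussion lasts 2 minutes.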

Below is what we did to verify our speculation.

  1. We used the command spire-agent api watch to monitor changes to the X509 Authority and workload SVID; an example output is below. Besides Intermediate #1, we were expecting to see something like Intermediate #2 as the prepared X509 Authority in the output during the grace period. But Intermediate #2 never showed up in the output, and the validity time of Intermediate #1 changed only after the prepared X509 Authority became active and the SPIRE agent got a renewed SVID from the SPIRE server.
Received 1 svid after 2.063973ms

SPIFFE ID:              spiffe://example.org/a0-wl
SVID Valid After:       2022-01-28 01:21:35 +0000 UTC
SVID Valid Until:       2022-01-28 01:22:45 +0000 UTC
Intermediate #1 Valid After:    2022-01-28 01:21:19 +0000 UTC
Intermediate #1 Valid Until:    2022-01-28 01:27:29 +0000 UTC
CA #1 Valid After:      2021-10-20 08:28:23 +0000 UTC
CA #1 Valid Until:      2022-10-20 08:28:23 +0000 UTC

  2. We also tried restarting spire-agent and then running spire-agent api fetch x509 -write to persist the certs during the grace period. But we were not able to find the prepared X509 Authority cert in those persisted certs (bundle.0.pem and svid.0.pem).

Our understanding of the SPIRE X509 Authority rotation is that during the grace period, the SPIRE agent will try to poll the SPIRE server intermittently, and the SPIRE server will send back both the active and the prepared X509 Authority certs. But we are missing the prepared X509 Authority cert here.

We are able to consistently reproduce this issue on spire-0.12.0, which our production deployment runs, and on spire-1.1.3, which is the latest release. Below are the server and agent configuration files we used for reproducing.

server {
    bind_address = "0.0.0.0"
    bind_port = "1101"
    trust_domain = "example.org"
    data_dir = "/home/ubuntu/spire-test/1/server"
    log_file = "/home/ubuntu/spire-test/1/server/log/log"
    log_format = "TEXT"
    log_level = "DEBUG"
    ca_ttl = "6m"
    default_svid_ttl = "1m"
    socket_path = "/home/ubuntu/spire-test/1/server/sock/api.sock"
    ca_key_type = "rsa-4096"
    ca_subject {
        country = ["SPIRE|CA_SUBJECT|COUNTRY"]
        organization = ["SPIRE|CA_SUBJECT|ORG"]
        common_name = "SPIRE|COMMON_NAME"
    }
}

plugins {
    DataStore "sql" {
        plugin_data {
            database_type = "sqlite3"
            connection_string = "/home/ubuntu/spire-test/1/server/datastore/datastore.sqlite3"
        }
    }

    KeyManager "disk" {
        plugin_data {
            keys_path = "/home/ubuntu/spire-test/1/server/keys/keys.json"
        }
    }

    NodeAttestor "join_token" {
        plugin_data {}
    }

    UpstreamAuthority "disk" {
        plugin_data {
            cert_file_path = "/home/ubuntu/spire-test/ca/root.crt"
            key_file_path  = "/home/ubuntu/spire-test/ca/root.key"
        }
    }
}

agent {
    data_dir = "/home/ubuntu/spire-test/1/agents/svids/a0"
    admin_socket_path = "/home/ubuntu/spire-test/1/agents/admin_sockets/a0.sock"
    socket_path = "/home/ubuntu/spire-test/1/agents/api_sockets/a0.sock"
    log_file = "/home/ubuntu/spire-test/1/agents/logs/a0.log"
    log_level = "INFO"
    log_format = "TEXT"
    trust_domain = "example.org"
    server_address = "spire-server.com"
    server_port = 1101
    join_token = "3ed02471-cac4-4f13-bdd7-479f581c2636"
    insecure_bootstrap = false
    trust_bundle_path = "/home/ubuntu/spire-test/ca/root.crt"
}

plugins {
   KeyManager "disk" {
        plugin_data {
            directory = "/home/ubuntu/spire-test/1/agents/svids/a0"
        }
    }

    NodeAttestor "join_token" {
        plugin_data {}
    }

    WorkloadAttestor "unix" {
        plugin_data {
            discover_workload_path = false
        }
    }
}
@dzhou3 dzhou3 changed the title SPIRE agent is not fetching the prepared X509 CA in the "grace period". SPIRE agent is not fetching the prepared X509 CA during the "grace period". Jan 28, 2022
@dzhou3 dzhou3 changed the title SPIRE agent is not fetching the prepared X509 CA during the "grace period". Workload not able to fetching the prepared X509 CA during the "grace period". Jan 28, 2022
@dzhou3 dzhou3 changed the title Workload not able to fetching the prepared X509 CA during the "grace period". Workload not able to fetch the prepared X509 CA during the "grace period". Jan 28, 2022
@dzhou3 dzhou3 changed the title Workload not able to fetch the prepared X509 CA during the "grace period". Workload not able to fetch the prepared X509 Authority certificate during the "grace period". Jan 28, 2022
@evan2645
Member

evan2645 commented Feb 4, 2022

> Our speculation is that SPIRE CA bundle (with prepared X509 Authority cert) propagation didn’t work during the 1/2-5/6 grace period (The new X509 Authority is prepared at 1/2 TTL of the currently active X509 Authority, and becomes active for signing SVIDs at 5/6 TTL of the currently active X509 Authority).

I don't think this is the case, because your server is configured to use a static x509 upstream authority (root.key and root.crt). This is the certificate that ends up being in the bundle ... and when you see SPIRE Server rotating signing certs, it's rotating intermediates (since the roots are statically defined). That means that the bundle is not actually changing during these rotations.

> Besides Intermediate #1, we were expecting to see something like Intermediate #2 as the prepared X509 Authority in the output during the grace period. But Intermediate #2 never showed up in the output, and the validity time of Intermediate #1 changed only after the prepared X509 Authority became active and the SPIRE agent got a renewed SVID from the SPIRE server.

This is expected. You have your statically defined root, your rotating intermediate (managed internally by SPIRE Server), and then your leaf SVID. The number of intermediates won't grow when an intermediate rotates - instead, the "Intermediate #1" will simply be replaced with an updated intermediate.

And, you should not see the intermediate flip over to the new one until SPIRE activates it.

> But we were not able to find the prepared X509 Authority cert in those persisted certs (bundle.0.pem and svid.0.pem).

root.crt is what you should be finding in the bundle. The prepared/activated intermediate CAs only appear in SVID chains.

I suspect that you are expecting SPIRE to manage the root, and for the roots to rotate. You can do this by not configuring an UpstreamAuthority plugin on SPIRE Server.

> We are able to consistently reproduce this issue on spire-0.12.0, which our production deployment runs, and on spire-1.1.3, which is the latest release.

At this point, I'm not quite sure what could be causing your TLS errors. Any chance you can share the agent and server logs? Are you proxying traffic between agents and servers?

@dzhou3
Author

dzhou3 commented Feb 7, 2022

> I don't think this is the case, because your server is configured to use a static x509 upstream authority (root.key and root.crt). This is the certificate that ends up being in the bundle ... and when you see SPIRE Server rotating signing certs, it's rotating intermediates (since the roots are statically defined). That means that the bundle is not actually changing during these rotations.

Sorry about the confusion. I forgot to mention we use spiffe-helper with addIntermediatesToBundle = true to pull the bundle from spire-agent in our production environment, so our bundle file has both the static root CA cert and the rotating intermediate CA cert. Here is the conf file for spiffe-helper:

agentAddress = "/tmp/agent.sock"
cmd = "python"
cmdArgs = "/home/ubuntu/update_cert.py"
certDir = "/opt/spire-agent/certs"
renewSignal = "SIGUSR1"
svidFileName = "workload.crt"
svidKeyFileName = "workload.key"
svidBundleFileName = "workload-bundle.crt"
addIntermediatesToBundle = true


> I suspect that you are expecting SPIRE to manage the root, and for the roots to rotate. You can do this by not configuring an UpstreamAuthority plugin on SPIRE Server.

We also understand that the root CA cert won't be rotated with the existing configuration, but we do expect SPIRE to rotate the intermediate CA cert and leaf SVIDs.


Our issue, as shown in the figure below, was the following:

[Figure: Screen Shot 2022-02-07 at 00 52 17]

  1. At time B (right before the new intermediate CA CA_B became active) our server renewed its SVID and received (through spiffe-helper):

    • new workload SVID server_svid_signed_by_CA_A.pem (signed by CA_A)
    • bundle file server_bundle.pem (contains only the root CA cert and CA_A's cert; we were expecting CA_B's cert as well)
  2. At time D (right after the new intermediate CA CA_B became active), our client renewed its SVID and received (through spiffe-helper):

    • new workload SVID client_svid_signed_by_CA_B.pem (signed by CA_B)
    • bundle file client_bundle.pem (contains only the root CA cert and CA_B's cert; we were expecting CA_A's cert as well)
  3. Before the server received (loaded) its next SVID signed by the activated CA_B, the client started an mTLS connection with the server, which failed because:

    • The server bundle server_bundle.pem didn't contain CA_B's cert, and the server couldn't verify the client's cert client_svid_signed_by_CA_B.pem
    • The client bundle client_bundle.pem didn't contain CA_A's cert, and the client couldn't verify the server's cert server_svid_signed_by_CA_A.pem

However, according to the video presented by you @evan2645 and Andrew @azdagron:

  • the retired (but still valid) intermediate CA cert CA_A should be kept for some time (26:21). For our case, it should have been retrieved by the client at time D.
  • the prepared intermediate CA cert CA_B should be propagated between times A and C (27:19). In our case, it should have been retrieved by the server at time B.

@azdagron
Member

azdagron commented Feb 7, 2022

I think there is a misunderstanding. Since you have an upstream authority configured, the bundles only contain the upstream root. Each workload receives the following over the workload API:

  1. The SVID certificate chain, which has the SVID (i.e. leaf cert), and the server intermediate that signed the SVID
  2. The bundle, containing the upstream root certificate.

The upstream root is the trust anchor for the trust domain. Therefore, in order for workloads to verify each other, they must present the entire SVID certificate chain when doing TLS handshakes (i.e. every certificate parsed from the x509_svid field). In other words, when a client connects to the server, the server should present not just its own SVID, but also the intermediate that signed it. The client should likewise present its own SVID, and the intermediate which signed it. Each side can then form a complete chain of trust back to the upstream root.
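
A toy model may make the difference concrete. It is only a sketch: certificates are reduced to subject and issuer names, nothing cryptographic is checked, and the names root, CA_A, CA_B, server and client simply mirror the scenario described earlier in this thread. It assumes that with addIntermediatesToBundle = true each peer presents only its leaf and trusts the root plus whichever intermediate it happened to receive.

```python
from typing import NamedTuple


class Cert(NamedTuple):
    subject: str
    issuer: str


def verifies(presented, trusted):
    """Toy path building from the leaf towards a trusted certificate.

    presented: certs sent in the handshake, leaf first.
    trusted:   certs loaded as the verifier's CA bundle.
    Only names are compared; signatures and validity are NOT checked.
    """
    known = {c.subject: c for c in list(presented[1:]) + list(trusted)}
    trusted_subjects = {c.subject for c in trusted}
    current = presented[0]
    seen = set()
    while current.subject not in seen:
        seen.add(current.subject)
        if current.issuer in trusted_subjects:
            return True
        parent = known.get(current.issuer)
        if parent is None:
            return False
        current = parent
    return False


root = Cert("root", "root")
ca_a = Cert("CA_A", "root")       # retired but still valid intermediate
ca_b = Cert("CA_B", "root")       # newly activated intermediate
server_leaf = Cert("server", "CA_A")
client_leaf = Cert("client", "CA_B")

# Leaf-only handshakes, intermediates moved into each side's bundle.
print(verifies([client_leaf], [root, ca_a]))   # server verifying client
print(verifies([server_leaf], [root, ca_b]))   # client verifying server

# Full chains presented, bundle holds only the upstream root.
print(verifies([client_leaf, ca_b], [root]))
print(verifies([server_leaf, ca_a], [root]))
```

In this model the two leaf-only checks fail and the two full-chain checks succeed, which is the behavior reported around the rotation.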

The SPIFFE helper concatenates all of the certificates parsed from the x509_svid field into a single file (containing one or more PEM blocks). If your software is naive, it may only be loading the first PEM block and ignoring the rest.
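
One way to check for this is to count the PEM blocks in the SVID file. The helper names `split_pem_certs` and `describe_svid_file` below are hypothetical, and the certificate bodies are placeholders rather than real certificates; this is a sketch using only the standard library, not part of spiffe-helper.

```python
import re

# Matches each certificate PEM block, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_certs(pem_text):
    """Return every certificate PEM block in pem_text, in file order."""
    return PEM_CERT.findall(pem_text)


def describe_svid_file(pem_text):
    """Split an SVID file into the leaf and any intermediates after it."""
    blocks = split_pem_certs(pem_text)
    if not blocks:
        raise ValueError("no certificate PEM blocks found")
    return {"leaf": blocks[0], "intermediates": blocks[1:]}


def fake_block(body):
    # Placeholder body; a real file holds base64-encoded DER here.
    return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"


sample = fake_block("PLACEHOLDER_LEAF") + fake_block("PLACEHOLDER_INTERMEDIATE")
parts = describe_svid_file(sample)
print("blocks in file:", len(split_pem_certs(sample)))
print("intermediates: ", len(parts["intermediates"]))
```

If a loader only ever sees the first block, the intermediate never reaches the peer during the handshake.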

@evan2645
Member

evan2645 commented Feb 7, 2022

> I forgot to mention we use spiffe-helper with addIntermediatesToBundle = true

This should really not be supported by spiffe-helper. Can you try without it set?

@dzhou3
Author

dzhou3 commented Mar 11, 2022

The problem went away after removing addIntermediatesToBundle = true from the spiffe-helper conf file. Thank you! 🙇
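
For readers with a Python workload, a minimal sketch of loading the resulting files might look like the following. It assumes the file names from the spiffe-helper conf above (workload.crt, workload.key, workload-bundle.crt) and that, with the flag removed, workload.crt holds the leaf followed by its intermediate while the bundle holds only the upstream root. The function name `build_mtls_server_context` is hypothetical.

```python
import os
import ssl


def build_mtls_server_context(cert_dir):
    """Build a server-side TLS context that requires client certificates.

    Assumes the file names configured for spiffe-helper in this thread.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED
    # load_cert_chain reads the leaf and any certificates that follow it
    # in the same file, so the intermediate is sent in the handshake.
    ctx.load_cert_chain(
        certfile=os.path.join(cert_dir, "workload.crt"),
        keyfile=os.path.join(cert_dir, "workload.key"),
    )
    # The bundle is the trust anchor set used to verify peers.
    ctx.load_verify_locations(
        cafile=os.path.join(cert_dir, "workload-bundle.crt")
    )
    return ctx
```

A call such as `build_mtls_server_context("/opt/spire-agent/certs")` would need to be repeated whenever spiffe-helper rewrites the files, for example from the handler for the configured renew signal.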

@dzhou3 dzhou3 closed this as completed Mar 11, 2022
@evan2645
Member

Great to hear!
