Linkerd policy container of destination Pod and proxy-injector Pod crashing #7011

Closed
BobyMCbobs opened this issue Oct 1, 2021 · 26 comments

@BobyMCbobs

BobyMCbobs commented Oct 1, 2021

Bug Report

What is the issue?

The Linkerd core components destination and proxy-injector keep crashing and never become ready.

How can it be reproduced?

  1. Install Talos on a Raspberry Pi and initialise Kubernetes on it
  2. Download the latest Linkerd CLI
  3. linkerd install | kubectl apply -f -

Logs, error output, etc

Some events from one of the destination Pods are:

  Normal   Pulled     50s                kubelet            Container image "cr.l5d.io/linkerd/policy-controller:stable-2.11.0" already present on machine
  Normal   Created    50s                kubelet            Created container policy
  Normal   Started    49s                kubelet            Started container policy
  Warning  Unhealthy  11s (x7 over 48s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  11s (x3 over 31s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500

and logs from the policy container of the same Pod

...
2021-10-01T22:13:45.708082Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-01T22:13:45.722498Z  WARN serverauthorizations: rustls::session: Sending fatal alert BadCertificate    
2021-10-01T22:13:45.723177Z ERROR serverauthorizations: kube::client: failed with error error trying to connect: invalid certificate: UnknownIssuer
2021-10-01T22:13:45.723301Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: invalid certificate: UnknownIssuer
2021-10-01T22:13:45.728281Z  WARN pods: rustls::session: Sending fatal alert BadCertificate    
2021-10-01T22:13:45.728985Z ERROR pods: kube::client: failed with error error trying to connect: invalid certificate: UnknownIssuer
2021-10-01T22:13:45.729145Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: invalid certificate: UnknownIssuer

Some events from one of the proxy-injector Pods are:

  Normal   Started    2m7s                 kubelet            Started container proxy-injector
  Warning  Unhealthy  89s (x3 over 109s)   kubelet            Liveness probe failed: HTTP probe failed with statuscode: 502
  Normal   Killing    89s                  kubelet            Container proxy-injector failed liveness probe, will be restarted
  Warning  Unhealthy  69s (x10 over 2m7s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 502
  Normal   Pulled     67s (x2 over 2m8s)   kubelet            Container image "cr.l5d.io/linkerd/controller:stable-2.11.0" already present on machine
  Normal   Created    66s (x2 over 2m7s)   kubelet            Created container proxy-injector

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
/ pod/linkerd-destination-6bc8d8f9f8-7qpdn container policy is not ready

Environment

  • Kubernetes Version: v1.22.1
  • Cluster Environment: (GKE, AKS, kops, ...)
  • Host OS: Talos v0.12.0
  • Linkerd version: stable-2.11.01

Possible solution

Unsure

Additional context

This is running on my Raspberry Pi cluster.
I'm installing with linkerd install --ha | kubectl apply -f -, but the result is the same without HA mode.

@adleong
Member

adleong commented Oct 1, 2021

hi @BobyMCbobs. stable-2.11.01 does not exist, the latest release is stable-2.11.0. With that said, it seems like there is some problem with the way that the identity certificates have been configured. linkerd install should take care of this for you, but perhaps there's some unfortunate interaction with Talos; I'm not sure.

One thing to try would be to run linkerd check --pre on a fresh Talos cluster (without Linkerd installed) to see if it detects any reasons why Linkerd wouldn't be able to be installed.

@BobyMCbobs
Author

> hi @BobyMCbobs. stable-2.11.01 does not exist, the latest release is stable-2.11.0. With that said, it seems like there is some problem with the way that the identity certificates have been configured. linkerd install should take care of this for you, but perhaps there's some unfortunate interaction with Talos; I'm not sure.
>
> One thing to try would be to run linkerd check --pre on a fresh Talos cluster (without Linkerd installed) to see if it detects any reasons why Linkerd wouldn't be able to be installed.

Hi @adleong, thanks for your reply.

Oh, that seems to be a typo. It is stable-2.11.0 as suggested.
I ran linkerd check --pre before attempting to install and it returned no errors.

@adleong
Member

adleong commented Oct 2, 2021

@BobyMCbobs Interesting. Unfortunately I don't have a Pi/Talos cluster to test on and I can't reproduce this error in any of my clusters. Since this seems to be some kind of problem related to the issuer certificate, I'd recommend looking at the identity controller's logs for errors as well as looking at the linkerd-identity-issuer secret and ensuring that the certificate there validates against the trust root in the linkerd-identity-trust-roots ConfigMap.

@BobyMCbobs
Author

> @BobyMCbobs Interesting. Unfortunately I don't have a Pi/Talos cluster to test on and I can't reproduce this error in any of my clusters.

It doesn't appear to be specific to the architecture; I tried it in a VM too and got the same result.

> Since this seems to be some kind of problem related to the issuer certificate, I'd recommend looking at the identity controller's logs for errors

The logs for the identity controller look fine

🐚 kubectl -n linkerd logs linkerd-identity-749d4b55c4-gx9zf -c identity -f
time="2021-10-02T00:33:12Z" level=info msg="running version stable-2.11.0"
time="2021-10-02T00:33:13Z" level=info msg="starting gRPC server on :8080"
time="2021-10-02T00:33:13Z" level=info msg="starting admin server on :9990"
time="2021-10-02T00:33:14Z" level=info msg="issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2021-10-03 00:33:34 +0000 UTC: 822f2f58e496f31293081f9e1d0de946"
time="2021-10-02T00:33:17Z" level=info msg="issued certificate for linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local until 2021-10-03 00:33:37 +0000 UTC: ba1ab84c6294a09f8900fe2bb71ffb5a"
time="2021-10-02T00:33:22Z" level=info msg="issued certificate for linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local until 2021-10-03 00:33:42 +0000 UTC: f467373ddbc7bc415f65cfbeeabc7106"

> as well as looking at the linkerd-identity-issuer secret and ensuring that the certificate there validates against the trust root in the linkerd-identity-trust-roots ConfigMap.

I've copied the three files out to /tmp/linkerd/ca-bundle.crt, /tmp/linkerd/crt.pem, and /tmp/linkerd/key.pem, but I haven't checked certs like this before. May I please have an openssl command for validating them?

@olix0r
Member

olix0r commented Oct 2, 2021

@adleong these errors probably don't have anything to do with the identity certificates. Rather, the policy controller's Kubernetes client is having trouble establishing TLS with the Kubernetes API.

@BobyMCbobs Thanks for the helpful report! Can you share the output of the following?

:; kubectl get secret $(kubectl get sa default -o json | jq  -r '.secrets[0].name') -o json | jq -r '.data["ca.crt"] | @base64d'

This will dump the CA certificate for your cluster (which is totally safe to share).

Also, you could try running the controller with additional logging information by using --set policyController.logLevel=kube=trace\\,rustls=trace\\,info when you install/upgrade the control plane -- this should get the policy controller to log a lot more, hopefully.

@BobyMCbobs
Author

> @adleong these errors probably don't have anything to do with the identity certificates. Rather, the policy controller's Kubernetes client is having trouble establishing TLS with the Kubernetes API.
>
> @BobyMCbobs Thanks for the helpful report! Can you share the output of the following?
>
> :; kubectl get secret $(kubectl get sa default -o json | jq  -r '.secrets[0].name') -o json | jq -r '.data["ca.crt"] | @base64d'
>
> This will dump the CA certificate for your cluster (which is totally safe to share).

Here's the CA from a cluster with the issue

-----BEGIN CERTIFICATE-----
MIIBijCCAS+gAwIBAgIQZhnJB0CCrwDf7wy7z5JVMTAKBggqhkjOPQQDBDAVMRMw
EQYDVQQKEwprdWJlcm5ldGVzMB4XDTIxMDkwNzAxMzMzOFoXDTMxMDkwNTAxMzMz
OFowFTETMBEGA1UEChMKa3ViZXJuZXRlczBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABGoKpIoxOluzcwkPhpCmWDFeDNcOxAVgFnX7uJ6dEmq656SRfaUtavuFC0EW
zQVeQchNAGcWLOqdrJABALeUYBKjYTBfMA4GA1UdDwEB/wQEAwIChDAdBgNVHSUE
FjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4E
FgQUKbVHaF1ACqlyTCjdAuaMgSBCiKQwCgYIKoZIzj0EAwQDSQAwRgIhAN6YBxBb
9S9QalcFEViQ2eqhZIHj5u+HNbCMtpL3ygaGAiEAy2jYmrxHWg4luamVVHRY5QLa
EH/HCt9vnyiLXFXYc4I=
-----END CERTIFICATE-----

> Also, you could try running the controller with additional logging information by using --set policyController.logLevel=kube=trace\\,rustls=trace\\,info when you install/upgrade the control plane -- this should get the policy controller to log a lot more, hopefully.

Here's a snippet from the end of it

...
2021-10-02T20:04:48.224215Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::hs: Using ciphersuite TLS13_CHACHA20_POLY1305_SHA256    
2021-10-02T20:04:48.224452Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::tls13: Not resuming    
2021-10-02T20:04:48.224476Z TRACE pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client: EarlyData rejected    
2021-10-02T20:04:48.224546Z TRACE pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client: Dropping CCS    
2021-10-02T20:04:48.224566Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::tls13: TLS1.3 encrypted extensions: []    
2021-10-02T20:04:48.224581Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::hs: ALPN protocol is None    
2021-10-02T20:04:48.224621Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::tls13: Got CertificateRequest CertificateRequestPayloadTLS13 { context: PayloadU8([]), extensions: [Unknown(UnknownExtension { typ: StatusRequest, payload: Payload([]) }), Unknown(UnknownExtension { typ: SCT, payload: Payload([]) }), SignatureAlgorithms([RSA_PSS_SHA256, ECDSA_NISTP256_SHA256, ED25519, RSA_PSS_SHA384, RSA_PSS_SHA512, RSA_PKCS1_SHA256, RSA_PKCS1_SHA384, RSA_PKCS1_SHA512, ECDSA_NISTP384_SHA384, ECDSA_NISTP521_SHA512, RSA_PKCS1_SHA1, ECDSA_SHA1_Legacy]), AuthorityNames([PayloadU16([48, 21, 49, 19, 48, 17, 6, 3, 85, 4, 10, 19, 10, 107, 117, 98, 101, 114, 110, 101, 116, 101, 115]), PayloadU16([48, 0])])] }    
2021-10-02T20:04:48.224665Z DEBUG pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::tls13: Client auth requested but no cert selected    
2021-10-02T20:04:48.224739Z TRACE pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::client::tls13: Server cert is [Certificate(b"0\x82\x02\x150\x82\x01\xba\xa0\x03\x02\x01\x02\x02\x11\0\xefSz\xc1Y\x8b\xfe\x98\xdb\xf0EHw\xb9\xca\x100\n\x06\x08*\x86H\xce=\x04\x03\x040\x151\x130\x11\x06\x03U\x04\n\x13\nkubernetes0\x1e\x17\r211002195647Z\x17\r221002195647Z0/1\x140\x12\x06\x03U\x04\n\x13\x0bkube-master1\x170\x15\x06\x03U\x04\x03\x13\x0ekube-apiserver0Y0\x13\x06\x07*\x86H\xce=\x02\x01\x06\x08*\x86H\xce=\x03\x01\x07\x03B\0\x04\xb1\xf7\xd6\xc2Z\xcc\xfc\xb2\xb9\xdd\xfa\x19\xaa[\0\xe0\xdf\x0f\xa70\x9a\x88\x7fC\xdf\xc1\xb4\xc8\xb7\x1b\xf50+PD\x177o\xab\xeaq\x9e\xf9h\xa9\xe9\xb4\x96\x94\x92\\\x1a\x90q\xfe\xe7\x9c\xcc\x82\x9d\xf9\t\xdb\xcd\xa3\x81\xd00\x81\xcd0\x0e\x06\x03U\x1d\x0f\x01\x01\xff\x04\x04\x03\x02\x05\xa00\x13\x06\x03U\x1d%\x04\x0c0\n\x06\x08+\x06\x01\x05\x05\x07\x03\x010\x1f\x06\x03U\x1d#\x04\x180\x16\x80\x14)\xb5Gh]@\n\xa9rL(\xdd\x02\xe6\x8c\x81 B\x88\xa40\x81\x84\x06\x03U\x1d\x11\x04}0{\x82\nkubernetes\x82\x12kubernetes.default\x82\x16kubernetes.default.svc\x82$kubernetes.default.svc.cluster.local\x82\tlocalhost\x87\x04\xc0\xa8z\x8c\x87\x04\xc0\xa8z\x8c\x87\x04\n`\0\x010\n\x06\x08*\x86H\xce=\x04\x03\x04\x03I\00F\x02!\0\xf6^(X\x8c-f\xfe\xdc\xe9\x83\x9a\x11\xbd~\xd9\x85\xa3~\\b=\xc5GFy\xb9\xc5o\x1f=\x0f\x02!\0\xa4\xc0\x15\xa3\x08}\xc7+\x945\x15\xa6\x1b\xf89\xa3w\xa8\xe3\x11\x0b\xc9\x10\xd6\xd43\xf8\xc5\x05g\xb5[")]    
2021-10-02T20:04:48.224840Z  WARN pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client"}: rustls::session: Sending fatal alert BadCertificate    
2021-10-02T20:04:48.225321Z ERROR pods:HTTP{http.method=GET http.url=https://kubernetes.default.svc/api/v1/pods?&labelSelector=linkerd.io%2Fcontrol-plane-ns otel.name="list" otel.kind="client" otel.status_code="ERROR"}: kube::client: failed with error error trying to connect: invalid certificate: UnknownIssuer
2021-10-02T20:04:48.225390Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: invalid certificate: UnknownIssuer

Happy to upload a gist with the full logs if need be.
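
As a reading aid for the trace above: the first AuthorityNames entry in the CertificateRequest is a DER-encoded distinguished name. The following is a minimal Python sketch for decoding it, assuming only short-form DER lengths and the handful of tags that occur in this name. The helper names read_tlv and decode_name are made up for illustration and are not part of rustls or kube-rs.

```python
# Bytes copied from the first AuthorityNames entry in the trace log above.
AUTHORITY_NAME = bytes([48, 21, 49, 19, 48, 17, 6, 3, 85, 4, 10, 19, 10,
                        107, 117, 98, 101, 114, 110, 101, 116, 101, 115])

# DER content bytes of a few X.520 attribute type OIDs and their short names.
ATTRIBUTE_NAMES = {
    bytes([0x55, 0x04, 0x03]): "CN",
    bytes([0x55, 0x04, 0x0A]): "O",
    bytes([0x55, 0x04, 0x0B]): "OU",
}


def read_tlv(data, offset):
    """Return (tag, value, next_offset) for one short-form DER element."""
    tag = data[offset]
    length = data[offset + 1]
    if length & 0x80:
        raise ValueError("long-form lengths are not handled in this sketch")
    start = offset + 2
    return tag, data[start:start + length], start + length


def decode_name(der):
    """Decode a DER Name into a list of (attribute, value) pairs."""
    tag, rdn_sequence, _ = read_tlv(der, 0)
    if tag != 0x30:
        raise ValueError("expected an outer SEQUENCE")
    pairs = []
    offset = 0
    while offset < len(rdn_sequence):
        _, rdn_set, offset = read_tlv(rdn_sequence, offset)   # SET
        inner = 0
        while inner < len(rdn_set):
            _, attribute, inner = read_tlv(rdn_set, inner)    # SEQUENCE
            _, oid, value_offset = read_tlv(attribute, 0)     # OBJECT IDENTIFIER
            _, value, _ = read_tlv(attribute, value_offset)   # string value
            name = ATTRIBUTE_NAMES.get(oid, oid.hex())
            pairs.append((name, value.decode("ascii")))
    return pairs


print(decode_name(AUTHORITY_NAME))  # prints [('O', 'kubernetes')]
```

So the API server advertises a CA whose subject is O=kubernetes, with no CN, and the server certificate dumped in the same trace carries the same bytes as its issuer. The second AuthorityNames entry, [48, 0], is an empty SEQUENCE, for which decode_name returns an empty list.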

@smira

smira commented Oct 4, 2021

I'm able to reproduce this issue even with Talos running in containers and VMs (on amd64).

The only thing to note about the Kubernetes API cert is that Talos provisions an ECDSA Kubernetes CA.

$ talosctl cluster create
...
$ linkerd check --pre
Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

pre-kubernetes-setup
--------------------
√ control plane namespace does not already exist
√ can create non-namespaced resources
√ can create ServiceAccounts
√ can create Services
√ can create Deployments
√ can create CronJobs
√ can create ConfigMaps
√ can create Secrets
√ can read Secrets
√ can read extension-apiserver-authentication configmap
√ no clock skew detected

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

Status check results are √

$ linkerd install | kubectl apply -f -
...
$ linkerd check
Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
\ pod/linkerd-destination-558894b46d-dlt26 container policy is not ready 

^^ hangs forever

@smira

smira commented Oct 4, 2021

talosctl can be downloaded from the release page: https://github.com/talos-systems/talos/releases/tag/v0.13.0-beta.0

(should work on OS X/Linux with Docker)

@smira

smira commented Oct 4, 2021

Same problem as @BobyMCbobs reported:

  Warning  Unhealthy  7s (x4 over 20s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  7s                kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500

@olix0r
Member

olix0r commented Oct 4, 2021

Yeah, the Kubernetes client library we're using appears to support only RSA keys at the moment:

https://github.com/kube-rs/kube-rs/blob/1f31d151f2577bd5bac8b165b82ed4b3629eec27/kube/src/client/tls.rs#L87-L89

We'll take a look at fixing this.

@olix0r
Member

olix0r commented Oct 4, 2021

It looks like there are some upstream issues with parsing certain kinds of PEM-formatted EC private keys; but in general we don't seem to have any issues talking to Kubernetes API servers that use EC certificates. From a quick look at Talos's docs, it seems like all API clients may need to use mTLS to authenticate?

If so, it's probably possible to get this working by building an alternate version of the policy controller that uses the native-tls feature to get OpenSSL bindings -- but we currently use distroless base images by default that don't ship with OpenSSL, so an alternate runtime image would need to be used to provide these libraries. I'm happy to help guide you through the process of building such an image; but I'm not really eager to pull an OpenSSL dependency into Linkerd's control plane as part of our standard distribution.

The proper fix is probably to address rustls/rustls#332

@smira

smira commented Oct 4, 2021

> From a quick look at Talos's docs, it seems like all API clients may need to use mTLS to authenticate?

mTLS is only used for the Talos API itself (which is different from the Kubernetes API).

Talos ships with vanilla upstream Kubernetes, so it's a completely standard distribution. ECDSA is a supported way to provision Kubernetes certificates, and all operations with ECDSA are way faster than with RSA.

@olix0r
Member

olix0r commented Oct 4, 2021

> Talos ships with vanilla upstream Kubernetes, so it's a completely standard distribution. ECDSA is a supported way to provision Kubernetes certificates, and all operations with ECDSA are way faster than with RSA.

Yeah. We use ECDSA keys in Linkerd as well. If I'm reading the issue history correctly, this is specifically a problem with parsing PEM-formatted ECDSA private keys (whereas PKCS#8-formatted keys should be parseable). What I can't understand is where we would be hitting this. I know for a fact that we should be able to talk to TLS services that use ECDSA certificates (for instance, we can run this controller in k3d clusters that use these certs). We'll need to figure out how the Talos case differs from k3d.
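
To make the key-format distinction concrete: both encodings are PEM on disk, and what differs is the PEM label and the structure inside it. The sketch below assumes the distinction being discussed is between SEC1 "EC PRIVATE KEY" blocks and PKCS#8 "PRIVATE KEY" blocks. The helper private_key_format is hypothetical and only looks at the label; it is not how kube-rs or rustls load keys.

```python
import re

# Standard PEM labels for the common unencrypted private key encodings.
PEM_LABEL_TO_FORMAT = {
    "PRIVATE KEY": "pkcs8",      # algorithm-agnostic wrapper
    "EC PRIVATE KEY": "sec1",    # EC-specific encoding
    "RSA PRIVATE KEY": "pkcs1",  # RSA-specific encoding
}

BEGIN_LINE = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----")


def private_key_format(pem_text):
    """Classify the first PEM block in pem_text by its label."""
    match = BEGIN_LINE.search(pem_text)
    if match is None:
        return "not-pem"
    return PEM_LABEL_TO_FORMAT.get(match.group(1), "unknown")


# Placeholder bodies; only the BEGIN line matters for this check.
print(private_key_format("-----BEGIN EC PRIVATE KEY-----\n...\n-----END EC PRIVATE KEY-----"))  # prints sec1
print(private_key_format("-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----"))  # prints pkcs8
```

Note that this only matters when the client has to load a private key of its own. Verifying a server's certificate does not involve parsing any private key.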

@olix0r
Member

olix0r commented Oct 4, 2021

On second thought, I'm not convinced that the problem you're seeing has anything to do with parsing EC private keys -- I'd expect that error to be different. @adleong is going to try to run Talos to compare it with a working k3d instance.

adleong self-assigned this Oct 4, 2021

@adleong
Member

adleong commented Oct 6, 2021

Here is the server certificate for the Kubernetes API in a kind cluster which works with Linkerd:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 1651715351998697625 (0x16ec125c89f11099)
    Signature Algorithm: SHA256-RSA
        Issuer: CN=kubernetes
        Validity
            Not Before: Sep 21 18:32:06 2021 UTC
            Not After : Sep 21 18:32:06 2022 UTC
        Subject: CN=kube-apiserver
        Subject Public Key Info:
            Public Key Algorithm: RSA
                Public-Key: (2048 bit)
                Modulus:
                    c1:64:97:75:08:ca:ae:ab:07:04:21:d7:60:60:a6:
                    24:df:77:34:4d:64:0f:1d:9f:fa:f6:89:2b:82:eb:
                    c4:b8:ce:29:cc:00:e7:bc:e7:60:c3:7a:02:c3:78:
                    bf:c8:38:2e:76:92:a7:7d:2c:f5:27:8c:a1:f1:5a:
                    09:75:57:3e:62:c4:07:b7:fb:a4:64:b7:62:8a:54:
                    d4:7e:26:02:18:96:a6:59:20:a8:b4:4e:f4:8e:26:
                    c3:70:51:55:54:1a:16:0d:0a:59:3b:63:6a:36:0b:
                    86:7b:ed:8f:44:b3:34:bd:7b:69:17:6a:f9:fb:b3:
                    fe:fe:62:07:4b:3b:20:6d:3f:58:0f:96:54:4f:42:
                    63:f8:9d:37:f6:3d:fa:8e:d8:27:f2:92:2d:04:b9:
                    80:15:9c:8a:bf:d8:bd:e5:ee:5c:43:4e:f8:22:b2:
                    6f:e5:72:bb:ac:a5:bd:7c:a5:6e:0a:c1:eb:01:94:
                    17:2a:ea:50:66:b3:d5:24:23:d6:21:05:db:a2:fb:
                    73:5b:3f:8c:ff:40:6b:b5:1c:d9:95:b9:3a:af:3b:
                    f1:58:ea:42:a4:4c:09:6d:b1:ee:79:a6:b1:44:8a:
                    4e:82:7b:e4:71:40:a9:37:04:1e:18:5e:33:dc:f2:
                    5f:9e:ee:b4:31:1f:ff:73:21:58:d0:05:00:ed:ea:
                    b1
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier:
                keyid:3F:46:59:B2:99:EF:7C:E2:F9:5C:D0:58:93:F8:41:2F:0E:A2:1D:4C
            X509v3 Subject Alternative Name:
                DNS:alex-control-plane, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost
                IP Address:10.96.0.1, IP Address:172.18.0.2, IP Address:127.0.0.1
    Signature Algorithm: SHA256-RSA
         53:40:91:c6:13:97:a3:93:6b:da:4e:45:00:6d:f4:66:d5:8a:
         7b:be:65:d7:69:54:e7:8d:2a:1f:f9:23:ff:56:e1:18:71:53:
         b8:f8:68:69:f4:ae:1d:6e:ad:bb:84:f2:da:dd:ce:f4:2a:b0:
         f9:e2:39:34:d6:f0:78:e7:88:56:22:16:ac:d5:f8:44:26:27:
         68:1a:92:03:08:87:02:6f:0f:1e:b0:e3:d3:22:89:0b:63:2e:
         7a:4b:e1:46:54:ad:24:9f:8b:f9:65:86:93:88:71:0b:23:e9:
         79:19:6c:00:6e:ff:93:f8:87:81:be:e4:06:fb:8a:85:b8:61:
         b4:fd:69:da:de:88:94:3f:9b:11:2a:a6:b2:c8:1d:7d:c7:95:
         1e:0b:17:75:99:83:4d:77:53:95:db:1a:10:31:8d:8e:6f:10:
         45:55:16:e3:5a:1c:02:e7:97:20:42:e4:48:d1:c3:7d:d5:9e:
         db:7e:18:56:3a:05:bb:1c:c2:80:75:62:46:22:2e:dd:13:07:
         1b:46:86:c7:64:51:e1:80:19:9e:9d:57:08:df:f6:a2:3e:49:
         92:c8:81:fe:5f:cd:ea:36:74:26:1d:40:b3:cc:4a:73:d9:53:
         25:9d:0f:4f:d5:1e:5d:41:04:14:ea:85:cc:ba:e2:7d:e3:73:
         7a:88:9c:f5

and here is the server cert for the Kubernetes API from a Talos cluster where the policy controller is crashing:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 232438196251277727239562535674581519728 (0xaeddf7de718cc57c0d671fa75ca68d70)
    Signature Algorithm: ECDSA-SHA512
        Issuer: O=kubernetes
        Validity
            Not Before: Oct 6 22:53:27 2021 UTC
            Not After : Oct 6 22:53:27 2022 UTC
        Subject: O=kube-master,CN=kube-apiserver
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)
                X:
                    ba:60:6d:58:0b:10:94:fb:3c:60:f2:b3:d9:d8:01:
                    d4:4f:7b:ec:71:51:f4:88:b7:16:b4:15:3b:c2:e5:
                    05:46
                Y:
                    41:3f:82:59:96:bb:9f:bb:bd:69:21:c1:e0:50:01:
                    f5:75:9c:17:25:39:82:96:27:1a:f5:73:52:02:55:
                    bd:3b
                Curve: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                Server Authentication
            X509v3 Authority Key Identifier:
                keyid:D5:F3:AC:BA:FF:46:2A:30:CB:9A:FD:2E:AF:B8:3B:5B:89:22:4C:3B
            X509v3 Subject Alternative Name:
                DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost
                IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.96.0.1
    Signature Algorithm: ECDSA-SHA512
         30:45:02:21:00:c0:23:cc:c2:a0:48:56:14:52:47:c5:f5:a3:
         65:8f:c8:14:f8:e6:3d:5a:41:04:9e:f3:ac:be:f7:88:a1:9c:
         85:02:20:17:6f:fe:8c:b3:ae:83:1e:36:80:71:9b:6e:25:6c:
         bd:da:3b:fa:10:b8:9f:0d:f9:7c:a5:49:67:f4:0a:c7:38

The most obvious difference between these is that the Talos server certificate uses ECDSA as the public key algorithm and the signature algorithm. (In contrast, Kind uses RSA and SHA256-RSA respectively).

Perhaps kube-rs also has issues handling ECDSA public keys?

@adleong
Member

adleong commented Oct 6, 2021

Never mind, I found an even more obvious difference:

The Talos certificate specifies the issuer as Issuer: O=kubernetes instead of specifying the issuer with a CN. If we're only looking for a CN issuer, that could explain why we throw an invalid certificate: UnknownIssuer error.

@BobyMCbobs
Author

> Here is the server certificate for the Kubernetes API in a kind cluster which works with Linkerd:
>
> Certificate:
>     Data:
>         Version: 3 (0x2)
>         Serial Number: 232438196251277727239562535674581519728 (0xaeddf7de718cc57c0d671fa75ca68d70)
>     Signature Algorithm: ECDSA-SHA512
>         Issuer: O=kubernetes
>         Validity
>             Not Before: Oct 6 22:53:27 2021 UTC
>             Not After : Oct 6 22:53:27 2022 UTC
>         Subject: O=kube-master,CN=kube-apiserver
>         Subject Public Key Info:
>             Public Key Algorithm: ECDSA
>                 Public-Key: (256 bit)
>                 X:
>                     ba:60:6d:58:0b:10:94:fb:3c:60:f2:b3:d9:d8:01:
>                     d4:4f:7b:ec:71:51:f4:88:b7:16:b4:15:3b:c2:e5:
>                     05:46
>                 Y:
>                     41:3f:82:59:96:bb:9f:bb:bd:69:21:c1:e0:50:01:
>                     f5:75:9c:17:25:39:82:96:27:1a:f5:73:52:02:55:
>                     bd:3b
>                 Curve: P-256
>         X509v3 extensions:
>             X509v3 Key Usage: critical
>                 Digital Signature, Key Encipherment
>             X509v3 Extended Key Usage:
>                 Server Authentication
>             X509v3 Authority Key Identifier:
>                 keyid:D5:F3:AC:BA:FF:46:2A:30:CB:9A:FD:2E:AF:B8:3B:5B:89:22:4C:3B
>             X509v3 Subject Alternative Name:
>                 DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost
>                 IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.96.0.1
>     Signature Algorithm: ECDSA-SHA512
>          30:45:02:21:00:c0:23:cc:c2:a0:48:56:14:52:47:c5:f5:a3:
>          65:8f:c8:14:f8:e6:3d:5a:41:04:9e:f3:ac:be:f7:88:a1:9c:
>          85:02:20:17:6f:fe:8c:b3:ae:83:1e:36:80:71:9b:6e:25:6c:
>          bd:da:3b:fa:10:b8:9f:0d:f9:7c:a5:49:67:f4:0a:c7:38
>
> and here is the server cert for the Kubernetes API from a Talos cluster where the policy controller is crashing:
>
> Certificate:
>     Data:
>         Version: 3 (0x2)
>         Serial Number: 232438196251277727239562535674581519728 (0xaeddf7de718cc57c0d671fa75ca68d70)
>     Signature Algorithm: ECDSA-SHA512
>         Issuer: O=kubernetes
>         Validity
>             Not Before: Oct 6 22:53:27 2021 UTC
>             Not After : Oct 6 22:53:27 2022 UTC
>         Subject: O=kube-master,CN=kube-apiserver
>         Subject Public Key Info:
>             Public Key Algorithm: ECDSA
>                 Public-Key: (256 bit)
>                 X:
>                     ba:60:6d:58:0b:10:94:fb:3c:60:f2:b3:d9:d8:01:
>                     d4:4f:7b:ec:71:51:f4:88:b7:16:b4:15:3b:c2:e5:
>                     05:46
>                 Y:
>                     41:3f:82:59:96:bb:9f:bb:bd:69:21:c1:e0:50:01:
>                     f5:75:9c:17:25:39:82:96:27:1a:f5:73:52:02:55:
>                     bd:3b
>                 Curve: P-256
>         X509v3 extensions:
>             X509v3 Key Usage: critical
>                 Digital Signature, Key Encipherment
>             X509v3 Extended Key Usage:
>                 Server Authentication
>             X509v3 Authority Key Identifier:
>                 keyid:D5:F3:AC:BA:FF:46:2A:30:CB:9A:FD:2E:AF:B8:3B:5B:89:22:4C:3B
>             X509v3 Subject Alternative Name:
>                 DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost
>                 IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.5.0.2, IP Address:10.96.0.1
>     Signature Algorithm: ECDSA-SHA512
>          30:45:02:21:00:c0:23:cc:c2:a0:48:56:14:52:47:c5:f5:a3:
>          65:8f:c8:14:f8:e6:3d:5a:41:04:9e:f3:ac:be:f7:88:a1:9c:
>          85:02:20:17:6f:fe:8c:b3:ae:83:1e:36:80:71:9b:6e:25:6c:
>          bd:da:3b:fa:10:b8:9f:0d:f9:7c:a5:49:67:f4:0a:c7:38
>
> The most obvious difference between these is that the Talos server certificate uses ECDSA as the public key algorithm and the signature algorithm. (In contrast, Kind uses RSA and SHA256-RSA respectively).
>
> Perhaps kube-rs also has issues handling ECDSA public keys?

The two certs posted might be the same; I see a lot of the same data.

@adleong
Member

adleong commented Oct 7, 2021

@BobyMCbobs 🤦 you're right, I had a copy-paste fail. I've edited my comment with the actual Kind certificate.

@olix0r
Member

olix0r commented Oct 7, 2021

Here's a k3d CA certificate:

:; kubectl get secret $(kubectl get sa default -o json |jq  -r '.secrets[0].name') -o json |jq -r '.data["ca.crt"] | @base64d' | step certificate inspect -
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
    Signature Algorithm: ECDSA-SHA256
        Issuer: CN=k3s-server-ca@1632414643
        Validity
            Not Before: Sep 23 16:30:43 2021 UTC
            Not After : Sep 21 16:30:43 2031 UTC
        Subject: CN=k3s-server-ca@1632414643
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)
                X:
                    3c:24:c1:b7:60:07:69:d1:c8:d1:6a:cc:8e:de:f5:
                    50:90:75:75:9b:a8:a9:cf:c2:c4:76:ee:47:c5:85:
                    dc:26
                Y:
                    81:48:b5:72:92:5a:36:cc:b5:ca:28:2b:86:c0:45:
                    92:61:ba:8e:7b:61:48:ec:18:7b:f4:33:6c:86:e2:
                    0f:e8
                Curve: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Subject Key Identifier:
                27:89:51:C0:6A:B1:30:4D:19:9E:81:F4:6E:64:C7:2F:3D:D8:3B:69
    Signature Algorithm: ECDSA-SHA256
         30:45:02:20:3f:63:7f:3b:a0:b7:eb:06:09:a2:e9:a3:dd:83:
         ac:5c:6a:ec:2f:88:39:17:1c:7f:5a:9b:a4:c1:26:2b:f6:8e:
         02:21:00:a4:d5:f7:6d:ab:6c:b8:e2:86:88:f8:50:76:48:b1:
         4e:ca:48:cb:46:b2:68:6a:ea:14:d7:13:e1:b1:01:c7:ca

And here's the k3d API server's certificate:

:; echo | openssl s_client -connect 127.0.0.1:39757  2>&1 | sed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' | step certificate inspect -
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 3203983157232513768 (0x2c76d5b4c12db2e8)
    Signature Algorithm: ECDSA-SHA256
        Issuer: CN=k3s-server-ca@1632414643
        Validity
            Not Before: Sep 23 16:30:43 2021 UTC
            Not After : Sep 24 04:45:37 2022 UTC
        Subject: O=k3s,CN=k3s
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)
                X:
                    a0:8d:0b:e1:ca:19:1b:69:f7:6e:22:fe:ae:bc:5e:
                    a7:4d:1c:4b:ad:80:50:85:6b:45:57:cb:b2:29:a2:
                    24:d5
                Y:
                    2b:99:73:34:56:9d:ef:38:4a:19:9b:46:ef:f1:17:
                    c6:2b:f5:3e:f2:3d:cc:05:76:de:5b:cd:36:74:ab:
                    01:e4
                Curve: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                Server Authentication
            X509v3 Authority Key Identifier:
                keyid:27:89:51:C0:6A:B1:30:4D:19:9E:81:F4:6E:64:C7:2F:3D:D8:3B:69
            X509v3 Subject Alternative Name:
                DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:localhost
                IP Address:0.0.0.0, IP Address:10.43.0.1, IP Address:127.0.0.1, IP Address:172.19.0.2, IP Address:172.19.0.3
    Signature Algorithm: ECDSA-SHA256
         30:45:02:20:6b:ec:d2:f6:26:d6:9f:a7:6f:41:41:99:ea:df:
         54:4c:d8:ea:ad:c4:22:65:99:9d:77:d2:c9:4a:fc:b4:4e:42:
         02:21:00:96:03:76:8c:29:c2:4d:1c:3c:f0:b6:bf:5d:9c:2e:
         01:36:6f:c6:38:73:0e:7b:62:f8:e1:cc:7e:4c:9b:c7:be
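
One way to cross-check the two dumps above: the API server certificate's Authority Key Identifier should match the CA's Subject Key Identifier (both read 27:89:51:...:3B:69 here). Below is a minimal Python sketch, assuming the text layout printed by `step certificate inspect` above. The helper name `key_id` is made up for illustration, and a matching key ID is only a hint that the chain links up, not a signature verification.

```python
import re
from typing import Optional


def key_id(inspect_output: str, field: str) -> Optional[str]:
    """Return the hex key identifier printed after `field`, or None if absent.

    Assumes the layout shown above: the field name, a colon, then the
    colon-separated hex bytes (optionally prefixed with "keyid:").
    """
    pattern = re.escape(field) + r":\s*(?:keyid:)?((?:[0-9A-Fa-f]{2}:)+[0-9A-Fa-f]{2})"
    match = re.search(pattern, inspect_output)
    return match.group(1).upper() if match else None


# Excerpts copied from the two dumps above.
ca_excerpt = """
            X509v3 Subject Key Identifier:
                27:89:51:C0:6A:B1:30:4D:19:9E:81:F4:6E:64:C7:2F:3D:D8:3B:69
"""
server_excerpt = """
            X509v3 Authority Key Identifier:
                keyid:27:89:51:C0:6A:B1:30:4D:19:9E:81:F4:6E:64:C7:2F:3D:D8:3B:69
"""

ca_ski = key_id(ca_excerpt, "X509v3 Subject Key Identifier")
server_aki = key_id(server_excerpt, "X509v3 Authority Key Identifier")

# Both identifiers must be present and equal for the hint to hold.
ids_match = ca_ski is not None and ca_ski == server_aki
print(ids_match)
```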

@olix0r
Member

olix0r commented Oct 7, 2021

If I'm reading this correctly, rustls doesn't actually support SHA-512 with the P-256 curve. It supports ECDSA_NISTP256_SHA256, ECDSA_NISTP384_SHA384, and ECDSA_NISTP521_SHA512. I'm not enough of a crypto lawyer to appreciate the nuance here.

@olix0r
Member

olix0r commented Oct 7, 2021

Here's the commit that removed support for P256_SHA512: briansmith/webpki@a830244, and here's the background discussion: https://groups.google.com/a/chromium.org/g/security-dev/c/SlfABuvvQas/m/HXaWVhZkBQAJ

@olix0r
Member

olix0r commented Oct 7, 2021

Looking at RFC8446, it looks like TLSv1.3 only defines support for these algorithms:

          /* ECDSA algorithms */
          ecdsa_secp256r1_sha256(0x0403),
          ecdsa_secp384r1_sha384(0x0503),
          ecdsa_secp521r1_sha512(0x0603),

I think the proper fix here is for Talos to issue certificates that conform with the above TLSv1.3-supported signature algorithms.
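
To make the pairing rule concrete, here is a small Python sketch that encodes only the three ECDSA code points quoted from RFC 8446 above and looks up a curve/hash combination. The table and the function name `tls13_ecdsa_scheme` are illustrative assumptions, not part of rustls, webpki, or Talos.

```python
from typing import Optional

# ECDSA SignatureScheme code points, as quoted from RFC 8446 above.
# Each curve is paired with exactly one hash.
TLS13_ECDSA_SCHEMES = {
    ("P-256", "SHA256"): 0x0403,  # ecdsa_secp256r1_sha256
    ("P-384", "SHA384"): 0x0503,  # ecdsa_secp384r1_sha384
    ("P-521", "SHA512"): 0x0603,  # ecdsa_secp521r1_sha512
}


def tls13_ecdsa_scheme(curve: str, digest: str) -> Optional[int]:
    """Return the TLS 1.3 code point for this curve/hash pair, or None if undefined."""
    return TLS13_ECDSA_SCHEMES.get((curve, digest))


# The k3d certificates above use P-256 with SHA-256, which is defined.
print(tls13_ecdsa_scheme("P-256", "SHA256"))
# The P-256 with SHA-512 pairing discussed above has no code point in this list.
print(tls13_ecdsa_scheme("P-256", "SHA512"))
```

Under this reading, a verifier that only implements the listed pairs would have no entry for a P-256 key combined with SHA-512.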

@smira

smira commented Oct 7, 2021

@olix0r thanks for digging into that. It is a bit surprising, as it works across Go TLS clients/servers, but it makes perfect sense.

@olix0r
Member

olix0r commented Oct 7, 2021

@smira Yeah, I agree it's surprising. I'm not 100% sure this is the problem, but so far it's the most likely issue I can see. In general, Rust's TLS ecosystem tends to be fairly minimal and strict, which is generally a good thing when it comes to TLS, but it can lead to some surprising situations like this.

smira added a commit to smira/crypto that referenced this issue Oct 8, 2021
See linkerd/linkerd2#7011 (comment)

Looks like some implementations follow TLS 1.3 rules and skip
implementing all combinations of elliptic curves and hashing.

This change makes Talos default to issuing ECDSA-P256-SHA256
certificates.

Signed-off-by: Andrey Smirnov
@smira

smira commented Oct 8, 2021

Fix is going to be merged to Talos, and we plan to release version 0.13 with the fix.

@olix0r
Member

olix0r commented Oct 8, 2021

@smira Excellent. I'm going to close this issue out for now, but please let us know if you hit any more issues going forward!
