Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook #1675
we have the exact same behaviour with just 1 operator pod.
me too
Report
Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook
More about the problem
When we deploy the pxc operator in cluster-wide mode (watchAllNamespaces=true) and with more than one replica (replicaCount>1), TLS certificate verification failures appear at random on validating webhook calls.
These errors can be seen when a user tries to apply or edit a CR definition of a pxc cluster, or in the operator logs during reconciliation operations. The logs are the following:
Internal error occurred: failed calling webhook "validationwebhook.pxc.percona.com": failed to call webhook: Post "https://percona-xtradb-cluster-operator.namespace.svc:443/validate-percona-xtradbcluster?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Root CA")
After some investigation, I noticed that the CA bundle configured in the validating webhook changes each time a pxc-operator replica pod takes the lead of the operations, and only this pod has a valid TLS certificate.
This can be checked by recovering the ca-bundle from the validating webhook and the tls.crt from the pxc-operator leader pod and verifying the signature with openssl:
But if we extract the tls.crt from another pxc-operator replica pod, the verification fails:
And if we delete the leader pod, the ca-bundle configured in the validating webhook changes to match the certificate of the new leader.
As the percona-xtradb-cluster-operator k8s service points to any of the pxc-operator replica pods, this explains why the error appears at random, whenever the validation webhook call is redirected to a non-leader pxc-operator replica pod.
This was also confirmed by the fact that the problem disappears when we scale down the operator to only one replica.
Steps to reproduce
Versions
Anything else?
No response
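The following Go program is an added example, written for this page; it is neither the reporter's openssl check nor code from the percona-xtradb-cluster-operator repository. It reproduces the failure mode described in the report offline with the standard crypto/x509 package: two replicas each mint their own self-signed "Root CA" and webhook serving certificate, the webhook caBundle carries only the leader's CA, and the API server's verification step is replayed against every replica's certificate. The service name and the "Root CA" common name are taken from the error message in the report; the pod names, the RSA key size and which pod is leader are assumptions made for the illustration. The second check, with both CAs in the bundle, shows the property any fix has to satisfy: every certificate the Service can hand out must chain to the advertised caBundle.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"sort"
	"time"
)

// Service name the API server dials, taken from the error message in the report.
const webhookDNS = "percona-xtradb-cluster-operator.namespace.svc"

// issue generates a key and signs tmpl with parent (nil parent: self-signed).
// Fixture setup only, so it panics instead of returning errors.
func issue(tmpl, parent *x509.Certificate, parentKey *rsa.PrivateKey) (*x509.Certificate, *rsa.PrivateKey) {
	key, err := rsa.GenerateKey(rand.Reader, 2048) // key size is an assumption
	if err != nil {
		panic(err)
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key
}

// newCA mimics one replica minting its own self-signed "Root CA", the candidate authority named in the report's error.
func newCA() (*x509.Certificate, *rsa.PrivateKey) {
	return issue(&x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Root CA"},
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}, nil, nil)
}

// newServingCert issues a replica's webhook serving certificate from its own CA.
func newServingCert(ca *x509.Certificate, caKey *rsa.PrivateKey) *x509.Certificate {
	leaf, _ := issue(&x509.Certificate{
		SerialNumber: big.NewInt(2),
		DNSNames:     []string{webhookDNS},
		NotAfter:     time.Now().Add(24 * time.Hour),
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}, ca, caKey)
	return leaf
}

// pemCert encodes a certificate the way the webhook's caBundle field carries it.
func pemCert(c *x509.Certificate) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: c.Raw})
}

// checkReplicas replays the API server's step: verify the serving cert of whichever pod the Service picked against caBundle. Returns the rejected pods, sorted.
func checkReplicas(caBundlePEM []byte, serving map[string]*x509.Certificate) ([]string, error) {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caBundlePEM) {
		return nil, fmt.Errorf("caBundle contains no parsable certificate")
	}
	var rejected []string
	for name, cert := range serving {
		if _, err := cert.Verify(x509.VerifyOptions{Roots: roots, DNSName: webhookDNS}); err != nil {
			fmt.Printf("%s: %v\n", name, err) // same x509 error text as in the report
			rejected = append(rejected, name)
		}
	}
	sort.Strings(rejected)
	return rejected, nil
}

// Simulate reproduces the report: two replicas with their own CAs, a caBundle carrying only the leader's CA, then one carrying both. Pod names are assumed.
func Simulate() (leaderOnly, merged []string, err error) {
	leaderCA, leaderKey := newCA()
	otherCA, otherKey := newCA()
	serving := map[string]*x509.Certificate{
		"pxc-operator-0": newServingCert(leaderCA, leaderKey), // leader: its CA is in caBundle
		"pxc-operator-1": newServingCert(otherCA, otherKey),   // non-leader: own CA, not in caBundle
	}
	if leaderOnly, err = checkReplicas(pemCert(leaderCA), serving); err != nil {
		return nil, nil, err
	}
	merged, err = checkReplicas(append(pemCert(leaderCA), pemCert(otherCA)...), serving)
	return leaderOnly, merged, err
}

func main() {
	fmt.Println(Simulate()) // rejected with leader CA only, rejected with both CAs, error
}
```

With `go run`, the first check prints the same `x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" ...)` line for the non-leader pod, because the verifier finds a CA with the right name "Root CA" but the wrong key; the second check rejects nothing. The `checkReplicas` function works unchanged on real material: decode the caBundle from the ValidatingWebhookConfiguration and parse each pod's tls.crt into `*x509.Certificate` values, and it reports which replicas the API server will reject before the next random failure does.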