auto certificate renewal with restartOnTLSSecretUpdate and cert-manager fails #390
Comments
@nosvalds Is there a separate truststore.p12 file in the TLS secret? Maybe the problem here is that the keystore.p12 is being used as the truststore and, since it's a new cert, it's not trusting the leader's cert? This feels like a truststore thing to me, so can you share the env vars that were generated for a Solr pod?
Here it is for 1 pod. They look the same across the 3 pods. This is in a non-broken state. If you need it in the broken state, I'll need to set up a test cluster.
FYI, if you want to see my manifests, they are here: https://github.com/IATI/solr-k8s
Thanks Nik ... Solr is getting configured with the keystore as the truststore, which is likely the cause of the problem here (since once the new cert is loaded, it's not trusted by the other nodes). So there is a bug here: the Solr operator should not set the truststore to the keystore and should instead just have the JVM fall back to using cacerts. The source of this bug is that you want to use the keystore as the truststore when using self-signed certs, and I incorrectly applied that logic in all cases :-(
I think an easy work-around for you for now is to just create a generic secret in your cluster(s) containing the Let's Encrypt Root CA for your truststore (pkcs12 format). You can download the CA pem files from here: https://letsencrypt.org/certificates/
Here's a script that does what you need; change the password. It also imports all the CA certs that come with your JVM, so you'll need to point to a JAVA_HOME for your env before running this script:
Once the
Sorry for the trouble here! Let me know if this work-around works for you for now and I'll get a fix into the next version of the operator that doesn't set the truststore to the keystore.
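The truststore work-around described above can be sketched roughly as follows. This is not the script from the comment; it is a minimal Python sketch that only assembles keytool command lines, under these assumptions: a Java 9+ layout where the JVM's CA store sits at $JAVA_HOME/lib/security/cacerts (Java 8 JDKs use jre/lib/security/cacerts), the default cacerts password "changeit", and a hypothetical output file name and password that you would replace.

```python
from pathlib import PurePosixPath
from typing import List


def build_truststore_commands(
    java_home: str,
    ca_pem_files: List[str],
    truststore_path: str = "truststore.p12",        # hypothetical output name
    truststore_password: str = "replace-me",        # hypothetical; choose your own
    cacerts_password: str = "changeit",             # JVM default for cacerts
) -> List[List[str]]:
    """Return keytool invocations that build a PKCS12 truststore.

    The first command copies every CA cert shipped with the JVM; each
    following command adds one downloaded CA pem file on top.
    """
    keytool = str(PurePosixPath(java_home) / "bin" / "keytool")
    # Assumption: Java 9+ directory layout for the bundled CA store.
    cacerts = str(PurePosixPath(java_home) / "lib" / "security" / "cacerts")

    commands = [[
        keytool, "-importkeystore", "-noprompt",
        "-srckeystore", cacerts, "-srcstorepass", cacerts_password,
        "-destkeystore", truststore_path, "-deststorepass", truststore_password,
        "-deststoretype", "PKCS12",
    ]]
    for pem in ca_pem_files:
        alias = PurePosixPath(pem).stem  # use the file name as the alias
        commands.append([
            keytool, "-importcert", "-trustcacerts", "-noprompt",
            "-alias", alias, "-file", pem,
            "-keystore", truststore_path, "-storepass", truststore_password,
            "-storetype", "PKCS12",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in build_truststore_commands("/opt/jdk", ["isrgrootx1.pem"]):
        print(" ".join(cmd))
```

Each returned list could be passed to subprocess.run(cmd, check=True) once reviewed. The resulting file would then go into a generic secret, for example with kubectl create secret generic and --from-file; the secret name is yours to choose.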
Thanks @thelabdude ! I think this worked. Some modifications to the script to ensure I pulled the
Then I updated my SolrCloud CRD as directed. Looks like the truststore was updated appropriately:
I've only done this on my dev environment so far, which has only 1 SolrCloud pod, so I'm not able to fully test that it's fixed yet.
The fix for this is starting to feel more like it should go into My initial approach for this was to not set the
So Solr would start failing with:
So the current behavior of looking for a So for now, I think the work-around Nik used (create your own truststore and put into a secret) is the best solution until In
For Nik's case, just using the Java default cacerts as the truststore for Solr should fix his issue with Let's Encrypt's certs renewing, as modern Java versions (https://www.oracle.com/java/technologies/javase/8u141-relnotes.html) include Let's Encrypt's root CA cert (see: https://letsencrypt.org/docs/certificate-compatibility/). So in 0.6.0, my plan is to support this merge option, and if there's no explicit user-provided truststore, Solr will just boot with the Java cacerts as the truststore.
I've just got around to implementing the workaround on my production environment with 3 Solr Pods. After updating the truststore I manually triggered a renewal of the certificate
Thank you for following up @nosvalds, glad to hear the work-around works for now.
Asked in https://kubernetes.slack.com/archives/CQSNS615F/p1639651250121300, recommended to bring here.
I’m seeing an issue where my Solr cloud (3 replicas) doesn’t recover when cert-manager renews the certificate. It seems to be tied to my updateStrategy.
What I’m seeing:
The restartOnTLSSecretUpdate setting starts to restart the cluster.
After the 1 Pod restarts, its collections are shown as having a “Down” status. In the logs I see:
2021-12-16 10:15:14.746 ERROR (recoveryExecutor-11-thread-3-processing-n:pod-1.default:443_solr x:transaction_10_shard1_replica_n5 c:transaction_10 s:shard1 r:core_node6) [c:transaction_10 s:shard1 r:core_node6 x:transaction_10_shard1_replica_n5] o.a.s.c.RecoveryStrategy Failed to connect leader https://pod-2.default:443/solr on recovery, try again
The cluster is stuck at this state, as the updateStrategy settings won’t allow the other 2 pods to restart since the 1 pod’s collections are “Down”
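To make the stuck state concrete, here is a toy model in Python. It is not the operator's actual logic; it only assumes that a pod may be restarted when, for every shard it hosts, the number of unavailable replicas stays within a limit. The function name, the data layout, and the limit of 1 are hypothetical and only loosely mirror the max*Unavailable settings.

```python
from typing import Dict


def can_restart(
    pod: str,
    shard_replicas: Dict[str, Dict[str, str]],  # shard -> {pod: "active" | "down"}
    max_shard_replicas_unavailable: int = 1,    # hypothetical limit
) -> bool:
    """Return True if taking `pod` down keeps every shard within the limit."""
    for replicas in shard_replicas.values():
        if pod not in replicas:
            continue
        # Replicas on other pods that are already unavailable for this shard.
        already_down = sum(
            1 for other, state in replicas.items()
            if other != pod and state != "active"
        )
        # Restarting `pod` makes one more replica of this shard unavailable.
        if already_down + 1 > max_shard_replicas_unavailable:
            return False
    return True


# pod-1 has restarted with the new cert and its replica cannot recover,
# while pod-2 and pod-3 still run with the old cert.
state = {
    "shard1": {"pod-1": "down", "pod-2": "active", "pod-3": "active"},
}

blocked = [p for p in ("pod-2", "pod-3") if not can_restart(p, state)]
```

In this model both remaining pods end up in blocked, which matches the behaviour described: the restart cannot proceed until pod-1 recovers, and pod-1 cannot recover until the others restart.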
The error seems to indicate that the pod that has restarted can’t communicate with the pods that haven’t restarted yet, presumably because of the certificates.
A (bad) “workaround” I found is that if I set the below, then all 3 pods will restart and come up healthy. But this obviously has the downside of a short downtime.
I don’t think the active certs would have expired, as Let’s Encrypt certs have a 90-day duration and they were renewed at 60 days (the cert-manager default is to renew at 2/3 of the duration). https://cert-manager.io/docs/usage/certificate/#renewal & https://cert-manager.io/docs/faq/#if-renewbefore-or-duration-is-not-defined-what-will-be-the-default-value
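The renewal arithmetic can be checked with a few lines of Python. This is a sketch of the calculation only, assuming the default described above (renewal at 2/3 of the duration, i.e. a renew-before window of 1/3); the function name and the issue date are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def renewal_time(
    not_before: datetime,
    duration: timedelta,
    renew_before: Optional[timedelta] = None,
) -> datetime:
    """Return when renewal is due for a cert issued at `not_before`."""
    if renew_before is None:
        # Assumed default: renew once 2/3 of the lifetime has elapsed.
        renew_before = duration / 3
    return not_before + duration - renew_before


issued = datetime(2021, 1, 1)                  # hypothetical issue date
lifetime = timedelta(days=90)                  # Let's Encrypt cert duration
due = renewal_time(issued, lifetime)
remaining_validity = issued + lifetime - due   # old cert still valid this long
```

With a 90-day cert, due falls 60 days after issuance and the previous certificate still has 30 days of validity left at that point, so expiry of the old cert is unlikely to be the cause.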
I wonder if the act of cert-manager renewing the certificates invalidates the active one though? I can’t find this specifically in the docs. This would be a problem then. It also would explain why I was still seeing this problem when triggering a manual renewal using the cert-manager cmctl CLI (https://cert-manager.io/docs/usage/cmctl/#renew). If this is the case, we would need restartOnTLSSecretUpdate to be able to ignore the updateStrategy.managed.max*Unavailable settings.
Sort of related issue: cert-manager/cert-manager#1168, but Solr has restartOnTLSSecretUpdate for our pods.