-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OCPBUGS-25341]: perform operator apiService certificate validity checks directly #3217
[OCPBUGS-25341]: perform operator apiService certificate validity checks directly #3217
Conversation
5373cc0
to
1f59431
Compare
if !certs.Active(ca) { | ||
logger.Warnf("CA cert not active") | ||
errs = append(errs, fmt.Errorf("found the CA cert is not active")) | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're no longer checking the certs (here and below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've combined the cert checks with freshness immediately after the APIService checks for successful/failed CSVs. We don't necessarily need to repeat the checks here again.
129df3a
to
1fc69f8
Compare
func (i *StrategyDeploymentInstaller) ShouldRotateCerts(s Strategy) (bool, error) { | ||
strategy, ok := s.(*v1alpha1.StrategyDetailsDeployment) | ||
if !ok { | ||
return false, fmt.Errorf("attempted to install %s strategy with deployment installer", strategy.GetStrategyName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this error a little more descriptive? To me it's just saying "we attempted an install" rather than "install failed because: ..." From what I can tell, looks like a failure due to a missing strategy? But I'm not super familiar with this code.
return false, fmt.Errorf("attempted to install %s strategy with deployment installer", strategy.GetStrategyName()) | ||
} | ||
|
||
hasCerts := map[string]struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick and not required: we're using k8s.io/apimachinery/pkg/util/sets
in v1 and inconsistently in v0. Might make things a little cleaner but it's not a big deal to me.
83a83a6
to
1b113fd
Compare
Signed-off-by: Ankita Thomas <[email protected]>
1b113fd
to
d6c60b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few questions right now.
However, and I realize this is pre-existing code, I don't like creating a certificate before checking it. The process should really be:
- Does the cert exist? No -> create the cert
- Is the cert need to be rotated? Yes -> rotate/create the cert
Right now, a cert is created every time the rotate check is to be done, and that is a BIG waste of CPU.
func HostnamesForService(serviceName, namespace string) []string { | ||
return []string{ | ||
fmt.Sprintf("%s.%s", serviceName, namespace), | ||
fmt.Sprintf("%s.%s.svc", serviceName, namespace), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not adding .cluster
nor .cluster.local
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should resolve even without the cluster domain specified, though there shouldn't be any harm in adding the fqdns for the default cluster domains. We could possibly also lookup the actual domain similar to pkg/package-server/client/util.go, though that may not be necessary
} else if apierrors.IsNotFound(err) { | ||
return true, nil | ||
} else { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's some other error, what happens? This does not rotate the certificate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume it means we enter exponential backoff (at least until we hit the 5m install plan timeout)?
Any other error is likely to be either temporary (in which case it's hopefully resolved next time around), or not (in which case we would likely fail in a similar way trying to rotate the cert).
The only scenario that is coming to mind where Get
secret fails, but a subsequent Create
or Update
would succeed would be if we lack read permission but have write permission.
// Re-sync if kube-apiserver was unavailable | ||
if apierrors.IsServiceUnavailable(installErr) { | ||
logger.WithError(installErr).Info("could not update install status") | ||
syncError = installErr | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this moved lower? If there's an error, shouldn't we check it first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a similar thought. Maybe we shouldn't timeout the install plan if the reason for things taking so long is that KAS is unavailable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick pass and I think this looks good overall, but to be honest I'm a bit lost in the depths of this code.
The one question that I had that came to mind: how do we force ourselves to check if certs need to be rotated? If nothing else in the system is changing, are we doing any of:
- re-reconcile if the cert secret changes
- re-reconcile at a time in the future base on the current secrets' expiration time.
It could totally be that we already handle that part correctly, but it seems like an important detail to understand about this bug.
return false, fmt.Errorf("failed to install %s strategy with deployment installer: unsupported deployment install strategy", strategy.GetStrategyName()) | ||
} | ||
|
||
hasCerts := sets.NewString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is deprecated. The sets
package now supports/prefers the generic type.
hasCerts := sets.NewString() | |
hasCerts := sets.New[string]() |
timeouts after API availability checks, add service FQDN to list of hostnames. Signed-off-by: Ankita Thomas <[email protected]>
d6c60b3
to
5d5112f
Compare
We already do this for all olm managed secrets
With each reconcile, we check all certs for anything that expires in a day or less and rotate all of those, including the ones that are already expired. The problem was that we were checking the cert freshness timestamps on the CSV to make those checks, and those were sometimes being incorrectly updated when the cert rotate hadn't really succeeded. |
But I'm wondering if there's a scenario where:
Is there something that forces a re-reconcile inside the time window where:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Although we may want to look at the order of operations when a cert is to be rotated. As, this could lead to CPU spikes.
908da0c
Cert updates can occasionally fail silently, updating only the timestamps on the CSV without any changes to the underlying cert secret. This PR uses the cert expiry times directly to retry the refresh.