Unregister prom gauges when recycling cluster watcher #11875
Fixes #11839
When in `restartClusterWatcher` we fail to connect to the target cluster for whatever reason, the function gets called again 10s later, and tries to register the same prometheus metrics without unregistering them first, which generates warnings.
The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates the remote kube-api client and registers the metrics, returning a nil object if the client can't connect. `cleanupWorkers` at the beginning of `restartClusterWatcher` won't unregister those metrics because of that nil object.
This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup on that object can be performed.
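The failure mode described above can be reproduced with a small stand-alone sketch. Everything in it is hypothetical and only mimics the shape of the problem: `gaugeRegistry` stands in for a metrics registry that complains about duplicate names, `newWatcherLeaky` stands in for a constructor that registers a gauge before checking connectivity, and `retryLeaky` stands in for the retry loop. None of these names come from the linkerd2 code base.

```go
package main

import (
	"errors"
	"fmt"
)

// gaugeRegistry is a hypothetical stand-in for a metrics registry.
// A second registration under an existing name is recorded as a warning.
type gaugeRegistry struct {
	names    map[string]bool
	warnings []string
}

func newGaugeRegistry() *gaugeRegistry {
	return &gaugeRegistry{names: map[string]bool{}}
}

func (r *gaugeRegistry) register(name string) {
	if r.names[name] {
		r.warnings = append(r.warnings, "duplicate metrics: "+name)
		return
	}
	r.names[name] = true
}

func (r *gaugeRegistry) unregister(name string) {
	delete(r.names, name)
}

// watcher is a hypothetical stand-in for the cluster watcher.
type watcher struct {
	reg   *gaugeRegistry
	gauge string
}

func (w *watcher) cleanup() { w.reg.unregister(w.gauge) }

var errUnreachable = errors.New("cannot connect to api for target cluster")

// newWatcherLeaky registers its gauge and only then checks connectivity.
// On failure it returns nil, so the caller has nothing to clean up.
func newWatcherLeaky(reg *gaugeRegistry, cluster string, reachable bool) (*watcher, error) {
	w := &watcher{reg: reg, gauge: "remote_api_" + cluster}
	reg.register(w.gauge)
	if !reachable {
		return nil, errUnreachable
	}
	return w, nil
}

// retryLeaky mimics the restart loop: clean up the previous watcher if
// there is one, then try to build a new one against an unreachable cluster.
// It returns the number of duplicate-registration warnings collected.
func retryLeaky(reg *gaugeRegistry, attempts int) int {
	var current *watcher
	for i := 0; i < attempts; i++ {
		if current != nil {
			current.cleanup()
		}
		current, _ = newWatcherLeaky(reg, "east", false)
	}
	return len(reg.warnings)
}

func main() {
	// Every attempt after the first hits the gauge left behind earlier.
	fmt.Println("duplicate warnings:", retryLeaky(newGaugeRegistry(), 3))
}
```

Because the constructor hands back nil on failure, the cleanup branch in the loop never runs, and each retry after the first collides with the gauge left behind by the previous attempt.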
Looks good. So, to fix, this basically ensures that any error encountered after building the remote client will still retain a reference to a cluster watcher. e.g.
linkerd2/multicluster/cmd/service-mirror/main.go
Lines 155 to 163 in 999eff3
err = restartClusterWatcher(ctx, link, *namespace, creds, controllerK8sAPI, *requeueLimit, *repairPeriod, metrics, *enableHeadlessSvc)
if err != nil {
	// failed to restart cluster watcher; give a bit of slack
	// and restart the link watch to give it another try
	log.Error(err)
	time.Sleep(linkWatchRestartAfter)
	linkWatch.Stop()
}
case watch.Deleted:
Good way to fix it imo.
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
	return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
It's generally not considered sound to handle a value when err != nil; so we probably ought to omit the value.
- return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
+ return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
However, your description seems to indicate that this is load-bearing:
This fix reorders NewRemoteClusterServiceWatcher so that an object is returned even when there's an error, so cleanup on that object can be performed.
But the return value is not used at the call-site when an error is returned:
linkerd2/multicluster/cmd/service-mirror/main.go
Lines 306 to 319 in cf2999d
clusterWatcher, err = servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
So, how does this change fix the problem exactly? How do we avoid introducing another problem like this? Can you add a comment so we don't easily run into this problem again?
OH! this is setting a global value. I don't think this is a sound pattern. Instead, the caller should use:
cw, err = servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
clusterWatcher = cw
This removes the need for the change to NewRemoteClusterServiceWatcher.
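To make the difference between the two call-site shapes concrete, here is a minimal sketch. It only illustrates the assignment pattern, not gauge cleanup. All names are hypothetical: `current` plays the role of the package-level `clusterWatcher`, and `newWatcher` is a toy constructor that follows the usual convention of returning nil alongside an error.

```go
package main

import (
	"errors"
	"fmt"
)

// watcher is a hypothetical stand-in for the cluster watcher type.
type watcher struct{ cluster string }

// current plays the role of the package-level clusterWatcher variable.
var current *watcher

// newWatcher is a hypothetical constructor that fails when the target
// cluster is unreachable and returns nil alongside the error.
func newWatcher(cluster string, reachable bool) (*watcher, error) {
	if !reachable {
		return nil, errors.New("cannot connect to api for target cluster " + cluster)
	}
	return &watcher{cluster: cluster}, nil
}

// restartDirect assigns the constructor's result straight to the
// package-level variable, so a failed call overwrites it.
func restartDirect(cluster string, reachable bool) error {
	var err error
	current, err = newWatcher(cluster, reachable)
	if err != nil {
		return fmt.Errorf("unable to create cluster watcher: %w", err)
	}
	return nil
}

// restartViaLocal keeps the result in a local variable and only touches
// the package-level variable once construction has succeeded.
func restartViaLocal(cluster string, reachable bool) error {
	w, err := newWatcher(cluster, reachable)
	if err != nil {
		return fmt.Errorf("unable to create cluster watcher: %w", err)
	}
	current = w
	return nil
}

func main() {
	_ = restartViaLocal("east", true)
	_ = restartViaLocal("east", false)
	fmt.Println(current != nil) // the earlier watcher is still referenced

	_ = restartDirect("east", false)
	fmt.Println(current != nil) // the reference was overwritten by the failed call
}
```

With the local-variable form, a failed construction never changes what the package-level variable points at, so the caller is not relying on the value returned next to a non-nil error.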
If we don't modify NewRemoteClusterServiceWatcher to return rcsw on an error, the caller won't be able to perform the gauges cleanup. Actually I've just thought of something else; we should be able to perform the cleanup directly inside NewRemoteClusterServiceWatcher before returning the error. I've just pushed that, LMKWYT.
I think this makes sense. Let's add a comment above this line explaining that the remoteAPI registers gauges and they must be explicitly unregistered on error. https://github.com/linkerd/linkerd2/pull/11875/files#diff-58391f2b0ac5849326792fbaf12a8e4aa8b06886acbe9fda308357d131ed38dcR172
d8392a5 to 43fe71c
…API for creating kube-api clients
This edge release introduces a number of different fixes and improvements. More notably, it introduces a new `cni-repair-controller` binary to the CNI plugin image. The controller will automatically restart pods that have not received their iptables configuration.
* Removed shortnames from Tap API resources to avoid colliding with existing Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to automatically restart misconfigured pods that are missing iptables rules ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json output ([#11889])
* Changed how `Server` updates are handled in the destination service. The change will ensure that during a cluster resync, consumers won't be overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes client cannot be successfully initialised ([#11917])
[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917
Signed-off-by: Matei David <[email protected]>