Only send server updates to listeners when the opaque protocol changes #11907

adleong · 2024-01-10T00:51:17Z

Whenever the destination controller's informer receives an update of a Server resource, it checks every portPublisher in the endpointsWatcher to see whether the Server selects any pods in that servicePort, and updates those pods' opaque protocol field. Regardless of whether any pods were matched or the opaque protocol changed, an update is sent to each listener. This results in an update to every endpointTranslator each time a Server is updated. During a resync, we get an update for every Server in the cluster, which results in N updates to each endpointTranslator, where N is the number of Servers in the cluster.

If N is greater than 100, it becomes possible that these N updates could overflow the endpointTranslator update queue if the queue is not being drained fast enough.
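As a rough illustration of the overflow concern, here is a minimal sketch of a bounded update queue being filled faster than it is drained. The `update` type, the `enqueueAll` helper, the capacity of 100 (inferred from the threshold mentioned above), and the drop-on-full behaviour are all assumptions made for illustration. They are not taken from the destination controller's code.

```go
package main

import "fmt"

// update is a hypothetical stand-in for one message sent to a listener.
type update struct{ server string }

// enqueueAll tries to push n updates onto a bounded queue without blocking
// and returns how many could not be queued because the queue was full.
// Nothing drains the queue here, which models a slow consumer.
func enqueueAll(queue chan update, n int) (dropped int) {
	for i := 0; i < n; i++ {
		select {
		case queue <- update{server: fmt.Sprintf("server-%d", i)}:
			// Queued successfully.
		default:
			// Queue is full: this update overflows.
			dropped++
		}
	}
	return dropped
}

func main() {
	// Assumed capacity; the real queue size is not stated in this PR.
	queue := make(chan update, 100)

	// A resync with 150 Servers produces 150 updates for one listener.
	dropped := enqueueAll(queue, 150)
	fmt.Printf("queued=%d dropped=%d\n", len(queue), dropped)
}
```

With an undrained queue of capacity 100 and 150 Servers, 50 updates do not fit, which is the situation the change is meant to avoid.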

We change this to only send the update for a Server if at least one of the servicePort addresses was selected by that Server AND its opaque protocol field changed.
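The condition described above can be sketched as follows. This is a simplified model, not the actual patch: `address`, `server`, `publisher`, `applyServer`, and `updateServer` are hypothetical names, and label matching is reduced to exact key/value equality.

```go
package main

import "fmt"

// address is a hypothetical stand-in for one endpoint address in a servicePort.
type address struct {
	podLabels      map[string]string
	opaqueProtocol bool
}

// server is a hypothetical, simplified Server resource: a label selector
// plus whether it marks the selected pods' port as opaque.
type server struct {
	selector map[string]string
	opaque   bool
}

// selects reports whether every selector label matches the pod's labels.
func (s server) selects(labels map[string]string) bool {
	for k, v := range s.selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// applyServer updates the opaque protocol field of selected addresses and
// reports whether at least one selected address actually changed.
func applyServer(addrs map[string]*address, s server) bool {
	changed := false
	for _, a := range addrs {
		if s.selects(a.podLabels) && a.opaqueProtocol != s.opaque {
			a.opaqueProtocol = s.opaque
			changed = true
		}
	}
	return changed
}

// publisher is a hypothetical stand-in for a portPublisher and its listeners.
type publisher struct {
	addrs     map[string]*address
	listeners []func()
}

// updateServer notifies listeners only when applyServer reports a change.
func (p *publisher) updateServer(s server) {
	if !applyServer(p.addrs, s) {
		return
	}
	for _, notify := range p.listeners {
		notify()
	}
}

func main() {
	sent := 0
	p := &publisher{
		addrs: map[string]*address{
			"10.0.0.1": {podLabels: map[string]string{"app": "web"}},
		},
		listeners: []func(){func() { sent++ }},
	}
	web := server{selector: map[string]string{"app": "web"}, opaque: true}

	p.updateServer(web) // selects the pod and flips the field: notifies
	p.updateServer(web) // same Server again, as in a resync: no change
	p.updateServer(server{selector: map[string]string{"app": "db"}, opaque: true}) // selects nothing

	fmt.Println("updates sent:", sent)
}
```

In this model, repeated deliveries of an unchanged Server during a resync produce no listener updates, so the number of updates no longer scales with the number of Servers in the cluster.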

Signed-off-by: Alex Leong <[email protected]>

zaharidichev

LGTM good catch!


This edge release introduces a number of different fixes and improvements. More notably, it introduces a new `cni-repair-controller` binary to the CNI plugin image. The controller will automatically restart pods that have not received their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to automatically restart misconfigured pods that are missing iptables rules ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json output ([#11889])
* Changed how `Server` updates are handled in the destination service. The change will ensure that during a cluster resync, consumers won't be overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <[email protected]>


This stable release adds a cni-repair-controller which fixes the issue of injected pods that cannot acquire proper network config because linkerd-cni and/or the cluster's network CNI haven't fully started ([#11699]). It also fixes a bug in the destination controller where having a large number of Server resources could cause the destination controller to use an excessive amount of CPU ([#11907]). Finally, it fixes a conflict with tap resource shortnames which was causing warnings from kubectl v1.29.0+ ([#11816]).

[#11699]: #11699
[#11907]: #11907
[#11816]: #11816

Only send server updates to listeners when the opaque protocol changes

2d467d0

Signed-off-by: Alex Leong <[email protected]>

adleong requested a review from a team as a code owner January 10, 2024 00:51

zaharidichev approved these changes Jan 10, 2024


mateiidavid approved these changes Jan 10, 2024


adleong merged commit 27a1a84 into main Jan 10, 2024
33 checks passed

adleong deleted the alex/too-many-servers-in-the-restaurant branch January 10, 2024 22:18

mateiidavid mentioned this pull request Jan 12, 2024

edge-24.1.1 #11922

Merged

adleong mentioned this pull request Jan 18, 2024

stable-2.14.9 #11949

Merged

mcharriere mentioned this pull request Jan 29, 2024

Update Linkerd to stable-2.14.9 giantswarm/roadmap#3188

Closed

5 tasks
