unable to delete server_port_subscribers metric with labels #10764
Fixes #10764

`GetProfile` streams create a `server_port_subscribers` gauge that tracks the number of listeners interested in a given Server. Because of an oversight, the gauge was only registered once a second listener was added; with just one listener the gauge was absent. But whenever the `GetProfile` stream ended, the gauge was deleted, which resulted in this warning if it had never been registered to begin with:

```
level=warning msg="unable to delete server_port_subscribers metric with labels map[name:voting namespace:emojivoto port:4191]" addr=":8086" component=server
```

One can verify that the gauge wasn't being created by installing viz and emojivoto and checking that the following returns nothing:

```bash
$ linkerd diagnostics controller-metrics | grep server_port_subscribers
```

After this fix, one can see the metric getting populated:

```bash
$ linkerd diagnostics controller-metrics | grep server_port_subscribers
# HELP server_port_subscribers Number of subscribers to Server changes associated with a pod's port.
# TYPE server_port_subscribers gauge
server_port_subscribers{name="emoji",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="4191"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9990"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9995"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9996"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9997"} 1
server_port_subscribers{name="metrics",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="metrics",namespace="linkerd-viz",port="9995"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="9995"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="9998"} 1
server_port_subscribers{name="vote",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="voting",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="web",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="web",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="web",namespace="linkerd-viz",port="9994"} 1
```

When scaling down the voting deployment, one can see the metric with `name="voting"` being removed.
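As a rough sketch of the bookkeeping involved (the type and function names below are assumptions for illustration, not the actual destination controller code), the bug amounts to updating the gauge only on the "already has a listener" path, while the fix keeps it in sync from the first listener:

```go
// Illustrative sketch only: names and structure are assumed, not Linkerd's
// real destination controller code.
package subscribers

import "github.com/prometheus/client_golang/prometheus"

var serverPortSubscribers = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "server_port_subscribers",
	Help: "Number of subscribers to Server changes associated with a pod's port.",
}, []string{"namespace", "name", "port"})

type podPort struct {
	namespace, name, port string
	listeners             []chan struct{} // hypothetical listener handles
}

func (p *podPort) subscribe(l chan struct{}) {
	p.listeners = append(p.listeners, l)
	// Buggy behaviour: the gauge was only touched on the "a listener already
	// exists" path, so a lone subscriber never registered the series:
	//
	//   if len(p.listeners) > 1 {
	//       serverPortSubscribers.WithLabelValues(p.namespace, p.name, p.port).
	//           Set(float64(len(p.listeners)))
	//   }
	//
	// Fix: always keep the gauge in sync, starting with the first listener.
	serverPortSubscribers.WithLabelValues(p.namespace, p.name, p.port).
		Set(float64(len(p.listeners)))
}

func (p *podPort) unsubscribeAll() {
	p.listeners = nil
	// The delete on stream end was unconditional, which is why it warned
	// whenever the series had never been created.
	serverPortSubscribers.DeleteLabelValues(p.namespace, p.name, p.port)
}
```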
I believe this can happen because of the way the pod name label is computed on the `server_port_subscribers` gauge. The gauge is labeled with the first segment of the pod name only, i.e. the pod's workload name. This means that `portPodPublishers` for different pods within the same workload (each with their own distinct list of listeners) will have conflicting gauge labels. As a result, the gauge value is incorrect (it gets set to the number of listeners for an individual podport rather than the total across the entire workload), and each update from a podport overwrites the value from any other podport in the same workload. It also means that there can be two subscriptions to two different podports in the same workload (with one listener each), and when these become unsubscribed, the first unsubscription deletes the gauge and the second then tries to delete a series that no longer exists, producing the `unable to delete server_port_subscribers metric with labels` warning.
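Here is a minimal, self-contained illustration of that double-delete using prometheus/client_golang directly; the label values are taken from the warning above, everything else is hypothetical:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// A gauge keyed only by workload name, namespace, and port, mirroring the
	// label set of server_port_subscribers.
	subscribers := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "server_port_subscribers",
		Help: "Number of subscribers to Server changes associated with a pod's port.",
	}, []string{"name", "namespace", "port"})

	labels := prometheus.Labels{"name": "voting", "namespace": "emojivoto", "port": "4191"}

	// Two different pods of the 'voting' workload collapse onto the same
	// series, so each Set overwrites the other rather than aggregating.
	subscribers.With(labels).Set(1) // listener on pod voting-aaa, port 4191
	subscribers.With(labels).Set(1) // listener on pod voting-bbb, port 4191

	// The first unsubscription deletes the series...
	fmt.Println(subscribers.Delete(labels)) // true
	// ...and the second finds nothing left to delete, which is the condition
	// the controller reports as the warning in this issue.
	fmt.Println(subscribers.Delete(labels)) // false
}
```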
I think there's a fundamental mismatch here between keeping a separate list of listeners for each podport and trying to report a gauge that aggregates over an entire workload. It may be possible to modify the code so that the gauge accurately reflects the total number of listeners summed across all podports within a workload, but a simpler solution may be to replace this gauge with a pair of counters: one for subscriptions and one for unsubscriptions. This composes more easily, since counter increments from different podports in the same workload both apply instead of overwriting each other. And we can still determine the number of active subscriptions by subtracting unsubscriptions from subscriptions (assuming we account for counter resets).
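A minimal sketch of that counter-pair approach (the metric names match what the follow-up change ends up adding, but the subscribe/unsubscribe hooks here are invented for illustration):

```go
// Illustrative sketch only, not the actual destination controller code.
package subscribers

import "github.com/prometheus/client_golang/prometheus"

// Counters compose across pod ports in the same workload: each Inc adds to
// the series instead of overwriting it the way a Set on a shared gauge would.
var (
	subscribes = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "server_port_subscribes",
		Help: "Number of subscribes to changes associated with a pod port.",
	}, []string{"namespace", "name", "port"})

	unsubscribes = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "server_port_unsubscribes",
		Help: "Number of unsubscribes to changes associated with a pod port.",
	}, []string{"namespace", "name", "port"})
)

// subscribe/unsubscribe are hypothetical hooks standing in for the points
// where a GetProfile stream starts and stops watching a pod port.
func subscribe(ns, name, port string)   { subscribes.WithLabelValues(ns, name, port).Inc() }
func unsubscribe(ns, name, port string) { unsubscribes.WithLabelValues(ns, name, port).Inc() }

func init() {
	prometheus.MustRegister(subscribes, unsubscribes)
	// The number of active subscriptions can still be recovered at query time,
	// e.g. server_port_subscribes - server_port_unsubscribes (modulo counter
	// resets), without ever having to delete a series.
}
```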
@adleong here are some logs from one of our destination pods. If you need anything else, let me know :)
Hopefully this helps!
Fixes #10764

Removed the `server_port_subscribers` gauge: it wasn't distinguishing between different pods, so the subscriber counts for pods within the same workload were conflicting with one another when updating the metric (see more details [here](#10764 (comment))). Besides carrying an invalid value, this was generating the warning `unable to delete server_port_subscribers metric with labels`.

The metric was replaced with the `server_port_subscribes` and `server_port_unsubscribes` counters, which track the overall number of subscribes and unsubscribes to a particular pod port.

:taco: to @adleong for the diagnosis and the fix!
This stable release fixes a regression introduced in stable-2.13.0 that caused proxies to shed load too aggressively under moderate request load to a single service ([#11055]). In addition, it updates the base image for the `linkerd-cni` init container to resolve a CVE in `libdb` ([#11196]), fixes a race condition in the Destination controller that could cause it to crash ([#11163]), and fixes a number of other issues.

* Control Plane
  * Fixed a race condition in the destination controller that could cause it to panic ([#11169]; fixes [#11163])
  * Improved the granularity of logging levels in the control plane ([#11147])
  * Replaced the incorrect `server_port_subscribers` gauge in the Destination controller's metrics with `server_port_subscribes` and `server_port_unsubscribes` counters ([#11206]; fixes [#10764])
* Proxy
  * Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests ([#11198]; fixes [#11055])
* CLI
  * Updated extension CLI commands to prefer the `--registry` flag over the `LINKERD_DOCKER_REGISTRY` environment variable, making the precedence more consistent (thanks @harsh020!) (see [#11144])
* CNI
  * Updated the `linkerd-cni` base image to resolve [CVE-2019-8457] in `libdb` ([#11196])
  * Changed the CNI plugin installer to always run in 'chained' mode; the plugin will now wait until another CNI plugin is installed before appending its configuration ([#10849])
  * Removed `hostNetwork: true` from the linkerd-cni Helm chart templates ([#11158]; fixes [#11141]) (thanks @abhijeetgauravm!)
* Multicluster
  * Fixed the `linkerd multicluster check` command failing in the presence of lots of mirrored services ([#10764])

[#10764]: #10764
[#10849]: #10849
[#11055]: #11055
[#11141]: #11141
[#11144]: #11144
[#11147]: #11147
[#11158]: #11158
[#11163]: #11163
[#11169]: #11169
[#11196]: #11196
[#11198]: #11198
[#11206]: #11206
[CVE-2019-8457]: https://avd.aquasec.com/nvd/2019/cve-2019-8457/
## edge-23.8.2

This edge release adds improvements to Linkerd's multi-cluster features as part of the [flat network support] planned for Linkerd stable-2.14.0. In addition, it fixes an issue ([#10764]) where warnings about an invalid metric were logged frequently by the Destination controller.

* Added a new `remoteDiscoverySelector` field to the multicluster `Link` CRD, which enables a service mirroring mode where the control plane performs discovery for the mirrored service from the remote cluster, rather than creating Endpoints for the mirrored service in the source cluster ([#11190], [#11201], [#11220], and [#11224])
* Fixed missing "Services" menu item in the Spanish localization for the `linkerd-viz` web dashboard ([#11229]) (thanks @mclavel!)
* Replaced the `server_port_subscribers` Destination controller gauge metric with `server_port_subscribes` and `server_port_unsubscribes` counter metrics ([#11206]; fixes [#10764])
* Replaced deprecated `failure-domain.beta.kubernetes.io` labels in Helm charts with `topology.kubernetes.io` labels ([#11148]; fixes [#11114]) (thanks @piyushsingariya!)

[#10764]: #10764
[#11114]: #11114
[#11148]: #11148
[#11190]: #11190
[#11201]: #11201
[#11206]: #11206
[#11220]: #11220
[#11224]: #11224
[#11229]: #11229
[flat network support]: https://linkerd.io/2023/07/20/enterprise-multi-cluster-at-scale-supporting-flat-networks-in-linkerd/
What is the issue?
Since upgrading to 2.13.0, our destination logs are spammed with warnings like `unable to delete server_port_subscribers metric with labels ...`.
We don't run the Viz Prometheus, in case that's related at all.
How can it be reproduced?
Unsure. This happened after upgrading.
Logs, error output, etc
output of `linkerd check -o short`
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None