Rework and merge CoreDNS and Node Local DNS dashboards (#10034)

* Adapt scrape configuration to new CoreDNS metric CoreDNS deprecated the `coredns_forward_requests_total` and `coredns_forward_responses_total` metrics. As per documentation, both metrics are now replaced by the `coredns_proxy_request_duration_seconds_count` metric. * Adapt CoreDNS dashboard to new metrics As per documentation, the data comes now from the metric: `coredns_proxy_request_duration_seconds_count{proxy_name="forward"}`: CoreDNS has esentially taken advantage of the 1:1 relationship between requests and responses to create a single metric that measures both. Each sample in the new metric is a request and a response. The request destination and the response code are added as labels (`to` and `rcode`, respectively). The replacement for the old `coredns_forward_requests_total{to}` metric is described in the documentation as the `sum(coredns_proxy_..._{to, proxy_name="forward"})`. In other words, the number of requests with destination `to` is the sum of responses from `to`, regardless of their `rcode`. The dashboard queries that need adaptation already use the sum to aggregate by the `to` label. Since the sum is an associative operation, we can use this sum to count requests. As an example, imagine we have the following metrics with their corresponding labels at both times `t` and `t - offset`. If we take the first adapted query in the change as an example, it calculates the difference in the number of requests within a time interval: {rcode="NOERROR", to="1.2.3.4"} = 100 @ t {rcode="NXDOMAIN", to="1.2.3.4"} = 200 @ t {rcode="SERVFAIL", to="6.7.8.9"} = 50 @ t {rcode="NOERROR", to="1.2.3.4"} = 10 @ t - offset {rcode="NXDOMAIN", to="1.2.3.4"} = 10 @ t - offset {rcode="SERVFAIL", to="6.7.8.9"} = 3 @ t - offset The strategy followed by this commit calculates, first, the difference between samples with the same label values: {rcode="NOERROR", to="1.2.3.4"} = 100 - 10 = 90 {rcode="NXDOMAIN", to="1.2.3.4"} = 200 - 10 = 190 {rcode="SERVFAIL", to="6.7.8.9"} = 50 - 3 = 47 and, second, the sum by the `to` label: {to="1.2.3.4"} = 90 + 190 = 280 {to="6.7.8.9"} = 47 Alternatively, if we wanted to follow the documentation closely and do the sum first, then the sum by the `to` label at times `t` and `t - offset` would be the first step: {to="1.2.3.4"} = 100 + 200 = 300 @ t {to="6.7.8.9"} = 50 @ t {to="1.2.3.4"} = 10 + 10 = 20 @ t - offset {to="6.7.8.9"} = 3 @ t - offset And then difference calculation: {to="1.2.3.4"} = 300 - 20 = 280 {to="6.7.8.9"} = 50 - 3 = 47 Note the former approach is preferrable because it requires only one aggregation (the latter requires two) which keeps the final query simpler. * Revist CoreDNS wording on the NodeLocalDNS dashboard Fix wrong references to CoreDNS in the NodeLocalDNS dashboard to refer to NodeLocalDNS instead. Also, remove parenthesis in some dashboard titles (Per Interval) -> Per Interval to make it more homogeneous to the CoreDNS dashboard. * A helper script to export Plutono dashboards when renovating them Co-authored-by: Istvan Zoltan Ballok <[email protected]> Co-authored-by: Victor Herrero Otal <[email protected]> Co-authored-by: Christoph Kleineweber <[email protected]> * Export the current state of the CoreDNS dashboard Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove the "DNS Requests (Packets per Second)" panel This panel used the coredns_dns_request_duration_seconds metric and assumed that it has type, proto, zone labels but as the example shows below it does not have those labels: coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.00025"} 1037 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.0005"} 1106 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.001"} 1116 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.002"} 1119 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.004"} 1120 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.008"} 1124 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.016"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.032"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.064"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.128"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.256"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.512"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="1.024"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="2.048"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="4.096"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="8.192"} 1125 coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="+Inf"} 1125 coredns_dns_request_duration_seconds_sum{server="dns://:8053",view="",zone="."} 0.22205055899999987 coredns_dns_request_duration_seconds_count{server="dns://:8053",view="",zone="."} 1125 The next panel contains the same information so we decided to drop this panel without a replacement. To avoid changes in the JSON document, we kept the layout the same. The next panel fills the full row now. * Renovate the "DNS Requests" panel The type label is renamed by Prometheus to exported_type. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove the "DNS Lookup Responses" panel Using the offset is not idiomatic. For counter metrics, we should use the rate function. The other panel, which is now using the full row, is using the rate function so we do not need this panel. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Responses" panel Sum responses by rcode to avoid showing time series for individual CoreDNS pods. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Requests Latency" panel The "coredns_dns_request_duration_seconds_bucket" does not have a type label so we use the zone instead. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Requests Latency Heatmap" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "Cache Hits and Misses" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove the "Cache Hits and Misses" panel As described in previous commits, using the offset is not idiomatic on counter metrics. The previous panel shows the same information. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "Cache Size" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove the "Upstream DNS Requests Per Interval" panel using offset Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "Upstream DNS Requests" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove the "Upstream DNS Forward Responses Per Interval" panel using offset Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "Upstream DNS Responses" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Programming Events" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Programming Cores" panel Sum by service_kind instead of using it for filtering. The filtering can still be done by selecting the series in the panel directly. This also allows removing the service_kind dashboard variable. Note that Core DNS exposes this programming duration metric only for the headless services with selector case (coredns/coredns#3171) so the service_kind filtering and variable is really not needed. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate the "DNS Programming Heatmap" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Add TODO to adapt the NodeLocalDNS dashboards in the future The change can't be done yet because node-local-dns does not use core-dns 1.11 at the moment. https://github.com/kubernetes/dns/blob/1.23.1/go.mod#L7 * Fix "DNS Programming Events" panel In previous commits, we missed to delete the service_kind variable from the query. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove usage of the "service_kind" variable Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Introduce new "job" variable The CoreDNS and NodeLocalDNS dashboards are to be merged into one single dashboard. In order to achieve that, a new job variable is necessary to differentiate between the two. Also, the old pod variable needs to capture all values now. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Add job filter on all queries Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Adapt "Upstream DNS Requests" panel Node Local DNS is a thin wrapper of CoreDNS. However, it uses CoreDNS 1.10 for now, while the regular CoreDNS deployed in the shoot clusters is 1.11 at the moment. DNS upstream-related metrics from CoreDNS 1.10 were deprecated in 1.11 in favor of newer ones. In consequence, the DNS Upstream panel need to be adapted to consider both metrics. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Adapt "Upstream DNS Responses" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Adjust "CoreDNS DNS Programming Latency" row title The DNS programming latency makes only sense for the CoreDNS pods and it is expected to show no data for the Node Local DNS. This only captures the headless services with selector. ClusterIP and headless services without selector are not captured. coredns/coredns#3171 Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Introduce new "NodeLocalDNS Node Cache Error" row Node cache errors only make sense for the NodeLocalDNS pods. This commit introduces a row to separate the node cache error panels that come in the following commits from other CoreDNS panels. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Migrate "Node Cache Errors" panel Move this panel from the NodeLocalDNS dashboard to the new DNS common dashboard. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Remove Node Local DNS dashboard This dashboard has now been migrated to the CoreDNS dashboard. This commits also removes code references to the old NodeLocalDNS dashboard. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Rename CoreDNS dashboard to DNS Also rename some references to it. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Split "DNS Requests" panel The original panel contained three queries grouping DNS requests by zone, protocol and type. This commit splits this into three different panels for better visualization. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Change "DNS Requests Latency" panel to use the full row This aligns with the rest of the panels. This commit also improves this panel's legend. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Extend description for the "DNS Programming" panels Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Renovate "Node Cache Errors" panel Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Show both CoreDNS and Node Local DNS metrics by default Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Show the last 3h by default Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Set fill area to 1 to all panels Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Unify height for all panels Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Sort drop-down list options alphabetically Co-authored-by: Victor Herrero Otal <[email protected]> * Export the dashboard one last time This commit exports the new dashboard after applying all the changes one last time to align the dashboard JSON file with the final JSON generated by Plutono. Co-authored-by: Istvan Zoltan Ballok <[email protected]> * Fallthrough at the end of the hosts plugin configuration If you want to pass the request to the rest of the plugin chain if there is no match in the hosts plugin, you must specify the fallthrough option. https://coredns.io/plugins/hosts/ Co-authored-by: Istvan Zoltan Ballok <[email protected]> --------- Co-authored-by: Istvan Zoltan Ballok <[email protected]> Co-authored-by: Christoph Kleineweber <[email protected]>
gardener · Jul 9, 2024 · df84098 · df84098
1 parent efba3ea
commit df84098
Show file tree

Hide file tree

Showing 13 changed files with 499 additions and 2,307 deletions.
diff --git a/charts/gardener/provider-local/templates/coredns/configmap.yaml b/charts/gardener/provider-local/templates/coredns/configmap.yaml
@@ -22,6 +22,7 @@ data:
           {{ .Values.controllers.service.zone0IP }} istio-ingressgateway.istio-ingress--0.svc.cluster.local
           {{ .Values.controllers.service.zone1IP }} istio-ingressgateway.istio-ingress--1.svc.cluster.local
           {{ .Values.controllers.service.zone2IP }} istio-ingressgateway.istio-ingress--2.svc.cluster.local
+          fallthrough
         }
         forward . /etc/resolv.conf {
            max_concurrent 1000

diff --git a/hack/export-dashboard.sh b/hack/export-dashboard.sh
@@ -0,0 +1,64 @@
+#!/bin/sh
+# SPDX-FileCopyrightText: SAP SE or an SAP affiliate company and Gardener contributors
+#
+# SPDX-License-Identifier: Apache-2.0
+
+
+# When "renovating" Plutono dashboards, it is useful to export them
+# and capture the changes in small git commits, otherwise it is hard
+# to review the changes in the dashboard JSON files.
+#
+# To facilitate this workflow, this script exports a given dashboard
+# from Plutono that is port-forwarded to localhost:3000.
+#
+# This script expects that changes to a dashboard that are made in the Plutono UI
+# are saved in the Plutono database and can be exported via the Plutono API.
+# To allow this in a _local development environment_, please do the following:
+#
+# 1. Ignore the Plutono managed resource:
+#
+# ```bash
+# k annotate mr plutono resources.gardener.cloud/ignore=true
+# ```
+#
+# 2. Edit the Plutono deployment and enable the login form:
+#
+# ```bash
+# k edit deployment plutono
+# ```
+#
+# ```yaml
+# - name: PL_AUTH_DISABLE_LOGIN_FORM
+#   value: "false" # was "true
+# ```
+#
+# 3.The admin password for Plutono can be accessed via:
+#
+# ```bash
+# k get secret -l name=plutono-admin -o json | jq '.items[].data.password | @base64d' -r
+# ```
+#
+# 4. Port-forward Plutono to localhost:3000 and login as admin:
+#
+# ```bash
+# k port-forward deployment/plutono 3000
+# ```
+#
+# 5. Make a copy of the provisioned dashboard you plan to edit, keep the
+# suggested title with the " Copy" suffix, get its generated uid (Settings, JSON
+# model or copy it from the URL). Make changes in the Plutono UI, save the dashboard,
+# export it via this script and create meaningful git commits.
+#
+# ```bash
+# hack/export-dashboard.sh QSl_u1wIz dns pkg/component/observability/plutono/dashboards/shoot/owners/worker/dns-dashboard.json
+# ```
+
+cd "$(dirname "$(realpath "$0")")/.." || exit 1
+
+curl -s "http://localhost:3000/api/dashboards/uid/$1" \
+| jq --arg original_uid "$2" '
+  .dashboard |
+  .uid |= $original_uid |
+  .title |= sub(" Copy$"; "") |
+  del(.id, .iteration, .version)
+' > "$3"
diff --git a/imagevector/images.yaml b/imagevector/images.yaml
@@ -395,6 +395,7 @@ images:
     value:
     - type: 'githubTeam'
       teamname: 'gardener/gardener-core-networking-maintainers'
+# TODO(vicwicker): A node-local-dns version that uses coredns 1.11 or higher requires adapting the node-local-dns Plutono dashboards.
 - name: node-local-dns
   sourceRepository: github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns
   repository: registry.k8s.io/dns/k8s-dns-node-cache

diff --git a/pkg/component/networking/coredns/coredns.go b/pkg/component/networking/coredns/coredns.go
@@ -139,8 +139,7 @@ func (c *coreDNS) Deploy(ctx context.Context) error {
 			"coredns_dns_request_duration_seconds_bucket",
 			"coredns_dns_requests_total",
 			"coredns_dns_responses_total",
-			"coredns_forward_requests_total",
-			"coredns_forward_responses_total",
+			"coredns_proxy_request_duration_seconds_count",
 			"coredns_kubernetes_dns_programming_duration_seconds_bucket",
 			"coredns_kubernetes_dns_programming_duration_seconds_count",
 			"coredns_kubernetes_dns_programming_duration_seconds_sum",

diff --git a/pkg/component/networking/coredns/coredns_test.go b/pkg/component/networking/coredns/coredns_test.go
@@ -666,7 +666,7 @@ status: {}
 				MetricRelabelConfigs: []monitoringv1.RelabelConfig{{
 					SourceLabels: []monitoringv1.LabelName{"__name__"},
 					Action:       "keep",
-					Regex:        `^(coredns_build_info|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_count|coredns_dns_request_duration_seconds_bucket|coredns_dns_requests_total|coredns_dns_responses_total|coredns_forward_requests_total|coredns_forward_responses_total|coredns_kubernetes_dns_programming_duration_seconds_bucket|coredns_kubernetes_dns_programming_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_sum|process_max_fds|process_open_fds)$`,
+					Regex:        `^(coredns_build_info|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_count|coredns_dns_request_duration_seconds_bucket|coredns_dns_requests_total|coredns_dns_responses_total|coredns_proxy_request_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_bucket|coredns_kubernetes_dns_programming_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_sum|process_max_fds|process_open_fds)$`,
 				}},
 			},
 		}