Skip to content

Commit

Permalink
Rework and merge CoreDNS and Node Local DNS dashboards (#10034)
Browse files Browse the repository at this point in the history
* Adapt scrape configuration to new CoreDNS metric

CoreDNS deprecated the `coredns_forward_requests_total` and
`coredns_forward_responses_total` metrics. As per documentation,
both metrics are now replaced by the
`coredns_proxy_request_duration_seconds_count` metric.

* Adapt CoreDNS dashboard to new metrics

As per documentation, the data comes now from the metric:
`coredns_proxy_request_duration_seconds_count{proxy_name="forward"}`:

CoreDNS has esentially taken advantage of the 1:1 relationship between
requests and responses to create a single metric that measures both. Each
sample in the new metric is a request and a response. The request
destination and the response code are added as labels (`to` and `rcode`,
respectively). The replacement for the old
`coredns_forward_requests_total{to}` metric is described in the
documentation as the `sum(coredns_proxy_..._{to, proxy_name="forward"})`.
In other words, the number of requests with destination `to` is the sum
of responses from `to`, regardless of their `rcode`.

The dashboard queries that need adaptation already use the sum to
aggregate by the `to` label. Since the sum is an associative operation,
we can use this sum to count requests. As an example, imagine we have the
following metrics with their corresponding labels at both times `t` and
`t - offset`. If we take the first adapted query in the change as an
example, it calculates the difference in the number of requests within a
time interval:

{rcode="NOERROR",  to="1.2.3.4"} = 100 @ t
{rcode="NXDOMAIN", to="1.2.3.4"} = 200 @ t
{rcode="SERVFAIL", to="6.7.8.9"} = 50  @ t

{rcode="NOERROR",  to="1.2.3.4"} = 10  @ t - offset
{rcode="NXDOMAIN", to="1.2.3.4"} = 10  @ t - offset
{rcode="SERVFAIL", to="6.7.8.9"} = 3   @ t - offset

The strategy followed by this commit calculates, first, the difference
between samples with the same label values:

{rcode="NOERROR",  to="1.2.3.4"} = 100 - 10 = 90
{rcode="NXDOMAIN", to="1.2.3.4"} = 200 - 10 = 190
{rcode="SERVFAIL", to="6.7.8.9"} = 50  - 3  = 47

and, second, the sum by the `to` label:

{to="1.2.3.4"} = 90 + 190 = 280
{to="6.7.8.9"}            = 47

Alternatively, if we wanted to follow the documentation closely and
do the sum first, then the sum by the `to` label at times `t` and
`t - offset` would be the first step:

{to="1.2.3.4"} = 100 + 200 = 300       @ t
{to="6.7.8.9"}             = 50        @ t

{to="1.2.3.4"} = 10  + 10  = 20        @ t - offset
{to="6.7.8.9"}             = 3         @ t - offset

And then difference calculation:

{to="1.2.3.4"} = 300 - 20 = 280
{to="6.7.8.9"} = 50  - 3  = 47

Note the former approach is preferrable because it requires only one
aggregation (the latter requires two) which keeps the final query simpler.

* Revist CoreDNS wording on the NodeLocalDNS dashboard

Fix wrong references to CoreDNS in the NodeLocalDNS dashboard to refer
to NodeLocalDNS instead.

Also, remove parenthesis in some dashboard titles
(Per Interval) -> Per Interval to make it more homogeneous to the CoreDNS
dashboard.

* A helper script to export Plutono dashboards when renovating them

Co-authored-by: Istvan Zoltan Ballok <[email protected]>
Co-authored-by: Victor Herrero Otal <[email protected]>
Co-authored-by: Christoph Kleineweber <[email protected]>

* Export the current state of the CoreDNS dashboard

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "DNS Requests (Packets per Second)" panel

This panel used the coredns_dns_request_duration_seconds metric and
assumed that it has type, proto, zone labels but as the example shows
below it does not have those labels:

coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.00025"} 1037
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.0005"} 1106
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.001"} 1116
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.002"} 1119
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.004"} 1120
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.008"} 1124
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.016"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.032"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.064"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.128"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.256"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.512"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="1.024"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="2.048"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="4.096"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="8.192"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="+Inf"} 1125
coredns_dns_request_duration_seconds_sum{server="dns://:8053",view="",zone="."} 0.22205055899999987
coredns_dns_request_duration_seconds_count{server="dns://:8053",view="",zone="."} 1125

The next panel contains the same information so we decided to drop this
panel without a replacement.

To avoid changes in the JSON document, we kept the layout the same.
The next panel fills the full row now.

* Renovate the "DNS Requests" panel

The type label is renamed by Prometheus to exported_type.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "DNS Lookup Responses" panel

Using the offset is not idiomatic. For counter metrics, we should use the
rate function. The other panel, which is now using the full row, is using
the rate function so we do not need this panel.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Responses" panel

Sum responses by rcode to avoid showing time series for individual CoreDNS
pods.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Requests Latency" panel

The "coredns_dns_request_duration_seconds_bucket" does not have a type
label so we use the zone instead.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Requests Latency Heatmap" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Cache Hits and Misses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Cache Hits and Misses" panel

As described in previous commits, using the offset is not idiomatic on
counter metrics. The previous panel shows the same information.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Cache Size" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Upstream DNS Requests Per Interval" panel using offset

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Upstream DNS Requests" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Upstream DNS Forward Responses Per Interval" panel using offset

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Upstream DNS Responses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Events" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Cores" panel

Sum by service_kind instead of using it for filtering. The filtering can
still be done by selecting the series in the panel directly. This also
allows removing the service_kind dashboard variable.

Note that Core DNS exposes this programming duration metric only for the
headless services with selector case
(coredns/coredns#3171)
so the service_kind filtering and variable is really not needed.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Heatmap" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Add TODO to adapt the NodeLocalDNS dashboards in the future

The change can't be done yet because node-local-dns does not use
core-dns 1.11 at the moment.

https://github.com/kubernetes/dns/blob/1.23.1/go.mod#L7

* Fix "DNS Programming Events" panel

In previous commits, we missed to delete the service_kind variable from the
query.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove usage of the "service_kind" variable

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Introduce new "job" variable

The CoreDNS and NodeLocalDNS dashboards are to be merged into one single
dashboard. In order to achieve that, a new job variable is necessary to
differentiate between the two. Also, the old pod variable needs to capture
all values now.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Add job filter on all queries

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adapt "Upstream DNS Requests" panel

Node Local DNS is a thin wrapper of CoreDNS. However, it uses CoreDNS
1.10 for now, while the regular CoreDNS deployed in the shoot clusters
is 1.11 at the moment. DNS upstream-related metrics from CoreDNS 1.10
were deprecated in 1.11 in favor of newer ones. In consequence, the
DNS Upstream panel need to be adapted to consider both metrics.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adapt "Upstream DNS Responses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adjust "CoreDNS DNS Programming Latency" row title

The DNS programming latency makes only sense for the CoreDNS
pods and it is expected to show no data for the Node Local DNS.

This only captures the headless services with selector.
ClusterIP and headless services without selector are not
captured.

coredns/coredns#3171

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Introduce new "NodeLocalDNS Node Cache Error" row

Node cache errors only make sense for the NodeLocalDNS pods. This commit
introduces a row to separate the node cache error panels that come in the
following commits from other CoreDNS panels.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Migrate "Node Cache Errors" panel

Move this panel from the NodeLocalDNS dashboard to the new DNS common
dashboard.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove Node Local DNS dashboard

This dashboard has now been migrated to the CoreDNS dashboard. This
commits also removes code references to the old NodeLocalDNS dashboard.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Rename CoreDNS dashboard to DNS

Also rename some references to it.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Split "DNS Requests" panel

The original panel contained three queries grouping DNS requests by zone,
protocol and type. This commit splits this into three different panels for
better visualization.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Change "DNS Requests Latency" panel to use the full row

This aligns with the rest of the panels. This commit also improves this
panel's legend.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Extend description for the "DNS Programming" panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate "Node Cache Errors" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Show both CoreDNS and Node Local DNS metrics by default

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Show the last 3h by default

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Set fill area to 1 to all panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Unify height for all panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Sort drop-down list options alphabetically

Co-authored-by: Victor Herrero Otal <[email protected]>

* Export the dashboard one last time

This commit exports the new dashboard after applying all the changes one
last time to align the dashboard JSON file with the final JSON generated
by Plutono.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Fallthrough at the end of the hosts plugin configuration

If you want to pass the request to the rest of the plugin chain if there
is no match in the hosts plugin, you must specify the fallthrough option.

https://coredns.io/plugins/hosts/

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

---------

Co-authored-by: Istvan Zoltan Ballok <[email protected]>
Co-authored-by: Christoph Kleineweber <[email protected]>
  • Loading branch information
3 people authored Jul 9, 2024
1 parent efba3ea commit df84098
Show file tree
Hide file tree
Showing 13 changed files with 499 additions and 2,307 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ data:
{{ .Values.controllers.service.zone0IP }} istio-ingressgateway.istio-ingress--0.svc.cluster.local
{{ .Values.controllers.service.zone1IP }} istio-ingressgateway.istio-ingress--1.svc.cluster.local
{{ .Values.controllers.service.zone2IP }} istio-ingressgateway.istio-ingress--2.svc.cluster.local
fallthrough
}
forward . /etc/resolv.conf {
max_concurrent 1000
Expand Down
64 changes: 64 additions & 0 deletions hack/export-dashboard.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
#!/bin/sh
# SPDX-FileCopyrightText: SAP SE or an SAP affiliate company and Gardener contributors
#
# SPDX-License-Identifier: Apache-2.0


# When "renovating" Plutono dashboards, it is useful to export them
# and capture the changes in small git commits, otherwise it is hard
# to review the changes in the dashboard JSON files.
#
# To facilitate this workflow, this script exports a given dashboard
# from Plutono that is port-forwarded to localhost:3000.
#
# This script expects that changes to a dashboard that are made in the Plutono UI
# are saved in the Plutono database and can be exported via the Plutono API.
# To allow this in a _local development environment_, please do the following:
#
# 1. Ignore the Plutono managed resource:
#
# ```bash
# k annotate mr plutono resources.gardener.cloud/ignore=true
# ```
#
# 2. Edit the Plutono deployment and enable the login form:
#
# ```bash
# k edit deployment plutono
# ```
#
# ```yaml
# - name: PL_AUTH_DISABLE_LOGIN_FORM
# value: "false" # was "true
# ```
#
# 3.The admin password for Plutono can be accessed via:
#
# ```bash
# k get secret -l name=plutono-admin -o json | jq '.items[].data.password | @base64d' -r
# ```
#
# 4. Port-forward Plutono to localhost:3000 and login as admin:
#
# ```bash
# k port-forward deployment/plutono 3000
# ```
#
# 5. Make a copy of the provisioned dashboard you plan to edit, keep the
# suggested title with the " Copy" suffix, get its generated uid (Settings, JSON
# model or copy it from the URL). Make changes in the Plutono UI, save the dashboard,
# export it via this script and create meaningful git commits.
#
# ```bash
# hack/export-dashboard.sh QSl_u1wIz dns pkg/component/observability/plutono/dashboards/shoot/owners/worker/dns-dashboard.json
# ```

cd "$(dirname "$(realpath "$0")")/.." || exit 1

curl -s "http://localhost:3000/api/dashboards/uid/$1" \
| jq --arg original_uid "$2" '
.dashboard |
.uid |= $original_uid |
.title |= sub(" Copy$"; "") |
del(.id, .iteration, .version)
' > "$3"
1 change: 1 addition & 0 deletions imagevector/images.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -395,6 +395,7 @@ images:
value:
- type: 'githubTeam'
teamname: 'gardener/gardener-core-networking-maintainers'
# TODO(vicwicker): A node-local-dns version that uses coredns 1.11 or higher requires adapting the node-local-dns Plutono dashboards.
- name: node-local-dns
sourceRepository: github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns
repository: registry.k8s.io/dns/k8s-dns-node-cache
Expand Down
3 changes: 1 addition & 2 deletions pkg/component/networking/coredns/coredns.go
Original file line number Diff line number Diff line change
Expand Up @@ -139,8 +139,7 @@ func (c *coreDNS) Deploy(ctx context.Context) error {
"coredns_dns_request_duration_seconds_bucket",
"coredns_dns_requests_total",
"coredns_dns_responses_total",
"coredns_forward_requests_total",
"coredns_forward_responses_total",
"coredns_proxy_request_duration_seconds_count",
"coredns_kubernetes_dns_programming_duration_seconds_bucket",
"coredns_kubernetes_dns_programming_duration_seconds_count",
"coredns_kubernetes_dns_programming_duration_seconds_sum",
Expand Down
2 changes: 1 addition & 1 deletion pkg/component/networking/coredns/coredns_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -666,7 +666,7 @@ status: {}
MetricRelabelConfigs: []monitoringv1.RelabelConfig{{
SourceLabels: []monitoringv1.LabelName{"__name__"},
Action: "keep",
Regex: `^(coredns_build_info|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_count|coredns_dns_request_duration_seconds_bucket|coredns_dns_requests_total|coredns_dns_responses_total|coredns_forward_requests_total|coredns_forward_responses_total|coredns_kubernetes_dns_programming_duration_seconds_bucket|coredns_kubernetes_dns_programming_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_sum|process_max_fds|process_open_fds)$`,
Regex: `^(coredns_build_info|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_count|coredns_dns_request_duration_seconds_bucket|coredns_dns_requests_total|coredns_dns_responses_total|coredns_proxy_request_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_bucket|coredns_kubernetes_dns_programming_duration_seconds_count|coredns_kubernetes_dns_programming_duration_seconds_sum|process_max_fds|process_open_fds)$`,
}},
},
}
Expand Down
Loading

0 comments on commit df84098

Please sign in to comment.