Measure and expose DNS programming latency from Kubernetes plugin. #3171
Conversation
Thank you for your contribution. I've just checked the OWNERS files to find a suitable reviewer. This search was successful and I've asked pmoroney (via …) for a review. If you have questions or suggestions for this bot, please file an issue against the miekg/dreck repository. The bot understands the commands that are listed here.
Force-pushed from fe12bd6 to 0a71216.
Codecov Report
@@ Coverage Diff @@
## master #3171 +/- ##
==========================================
+ Coverage 55.25% 56.29% +1.03%
==========================================
Files 217 218 +1
Lines 10777 10818 +41
==========================================
+ Hits 5955 6090 +135
+ Misses 4365 4252 -113
- Partials 457 476 +19
Continue to review full report at Codecov.
@johnbelamaric any chance you can have a look at this PR? It follows your suggestions on how the measurement should be done without increasing the memory footprint.
/assign
/assign: johnbelamaric
now := time.Now()
controller := newdnsController(client, dnsControlOpts{
	initEndpointsCache: true,
	// This is needed as otherwise the fake k8s client doesn't work properly.
Can you explain this more? If we can eliminate skipAPIObjectsCleanup this all becomes a lot simpler. I don't remember why we need that as an option, nor do I see anywhere it gets set.
This is needed purely for testing purposes. If we clean up the resources in the test, the cache doesn't work properly and the tests fail. I am not at my workstation at the moment, so I can't check the exact effect.
One option is to remove the tests. However, this is a non-trivial change to the code and I'd rather have it tested.
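For readers following along, a rough sketch of the kind of test being discussed, assuming a swappable clock hook (durationSinceFunc, see the recording sketch further down) and Run/Stop wiring; only initEndpointsCache and skipAPIObjectsCleanup come from the diff itself:

```go
package kubernetes

import (
	"testing"
	"time"

	"k8s.io/client-go/kubernetes/fake"
)

func TestDnsProgrammingLatency(t *testing.T) {
	client := fake.NewSimpleClientset()
	now := time.Now()

	// Pin "now" so observed latencies are deterministic; assumes the
	// controller reads elapsed time through a swappable hook.
	durationSinceFunc = func(ts time.Time) time.Duration { return now.Sub(ts) }

	controller := newdnsController(client, dnsControlOpts{
		initEndpointsCache: true,
		// Keep API objects in the informer store; cleaning them up makes
		// the fake client's cache misbehave and the assertions fail.
		skipAPIObjectsCleanup: true,
	})
	go controller.Run()
	defer controller.Stop()

	// ... create Services and Endpoints through `client` here, then assert
	// on the samples recorded in the latency histogram.
}
```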
lastChangeTriggerTime := getLastChangeTriggerTime(endpoints)

if endpoints == nil || !isEndpointForHeadlessService(svcs, endpoints) || lastChangeTriggerTime.IsZero() {
	return
So we're not recording the latency for programming clusterIP or headless_without_selector? For clusterIP, I would think it's the service creation time vs. now; since clusterIP is immutable, I think that should work.
For headless_without_selector, maybe the endpoints creation time... or the update time, if we have those in metadata. I recall you saying that we need the annotation, though I can't recall the exact reason offhand.
Yes, for now this one works only for headless services. I was planning to follow up with the rest once this PR is (hopefully) merged. I assume adding those two missing types should be relatively easy.
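For reference, a minimal sketch of the headless-only recording path under discussion; the metric name mirrors what this PR exposes but should be treated as an assumption, and getLastChangeTriggerTime is sketched later in the thread:

```go
package kubernetes

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	api "k8s.io/api/core/v1"
)

// durationSinceFunc is swappable so tests can pin "now"; defaults to time.Since.
var durationSinceFunc = time.Since

// dnsProgrammingLatency sketches the histogram this PR adds; the real
// metric name and buckets may differ.
var dnsProgrammingLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "coredns",
	Subsystem: "kubernetes",
	Name:      "dns_programming_duration_seconds",
	Help:      "Time between an Endpoints change in the API and its observation here.",
}, []string{"service_kind"})

// recordDNSProgrammingLatency mirrors the gate in the snippet above: only
// headless-with-selector Endpoints that carry the last-change-trigger-time
// annotation are measured; clusterIP and headless-without-selector are
// skipped for now, as discussed.
func recordDNSProgrammingLatency(endpoints *api.Endpoints, headless bool) {
	lastChangeTriggerTime := getLastChangeTriggerTime(endpoints)
	if endpoints == nil || !headless || lastChangeTriggerTime.IsZero() {
		return
	}
	dnsProgrammingLatency.WithLabelValues("headless_with_selector").
		Observe(durationSinceFunc(lastChangeTriggerTime).Seconds())
}
```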
plugin/kubernetes/metrics.go (Outdated)
}
val, err := time.Parse(time.RFC3339Nano, stringVal)
if err != nil {
	log.Warningf("Error while parsing EndpointsLastChangeTriggerTimeAnnotation: '%s'. Error is %v",
Could this flood the logs?
If we're going to be wordy, we should include enough info to debug and explain the actual effect we are warning about. The error value is not going to be useful, I don't think, but knowing what the value is supposed to look like will help:
log.Warningf("DnsProgrammingLatency cannot be calculated for Endpoints '%s/%s'; invalid %q annotation RFC3339 value of %q",
	endpoints.GetNamespace(), endpoints.GetName(), api.EndpointsLastChangeTriggerTime, stringVal)
Thanks for the snippet!
I doubt it. This annotation is machine-generated and doesn't come from the user. I'd assume it should be formatted correctly unless we have a bug (which should be easy to catch in tests) or a pathological case (the user tampering with it).
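Putting the suggested message in context, a sketch of the complete helper; the nil guard and overall shape are assumptions, though api.EndpointsLastChangeTriggerTime is a real constant in k8s.io/api/core/v1:

```go
package kubernetes

import (
	"time"

	"github.com/coredns/coredns/plugin/pkg/log"
	api "k8s.io/api/core/v1"
)

// getLastChangeTriggerTime returns the timestamp kube-controller-manager
// stores in the endpoints.kubernetes.io/last-change-trigger-time annotation,
// or the zero time if the annotation is absent, malformed, or the object is nil.
func getLastChangeTriggerTime(endpoints *api.Endpoints) time.Time {
	if endpoints == nil {
		return time.Time{}
	}
	stringVal, ok := endpoints.Annotations[api.EndpointsLastChangeTriggerTime]
	if !ok {
		// The annotation is optional; older control planes never set it.
		return time.Time{}
	}
	val, err := time.Parse(time.RFC3339Nano, stringVal)
	if err != nil {
		// The value is machine-generated, so a parse failure indicates a bug
		// or tampering; the warning names the object and the expected format.
		log.Warningf("DnsProgrammingLatency cannot be calculated for Endpoints '%s/%s'; invalid %q annotation RFC3339 value of %q",
			endpoints.GetNamespace(), endpoints.GetName(), api.EndpointsLastChangeTriggerTime, stringVal)
		return time.Time{}
	}
	return val
}
```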
Force-pushed from af8eadc to 71ffed4.
Force-pushed from 7191299 to 9215418.
@johnbelamaric PTAL :)
I think we're OK here, but I'd like to have @chrisohaver take a look as well.
/assign: @chrisohaver
LGTM by johnbelamaric
this misses an explanation of the new metrics in the README.md
Explanation added.
For now the metric is measured only for headless services. The informer has been slightly refactored, so the code can measure latency without storing extra fields on the Endpoint struct.
Signed-off-by: Janek Łukaszewicz <[email protected]>
Suggestions from code review
Co-Authored-By: Chris O'Haver <[email protected]>
No worries. I'll also propose some small changes to bring things more in line with the other READMEs. But let's merge this first.
/lgtm
LGTM by miekg
Thank you all for the review!
The DNS programming latency only makes sense for the CoreDNS pods, and it is expected to show no data for Node Local DNS. This only captures headless services with a selector. ClusterIP and headless services without a selector are not captured. coredns/coredns#3171
Co-authored-by: Istvan Zoltan Ballok <[email protected]>

Sum by service_kind instead of using it for filtering. The filtering can still be done by selecting the series in the panel directly. This also allows removing the service_kind dashboard variable. Note that CoreDNS exposes this programming duration metric only for the headless-services-with-selector case (coredns/coredns#3171), so the service_kind filtering and variable are really not needed.
Co-authored-by: Istvan Zoltan Ballok <[email protected]>
* Adapt scrape configuration to new CoreDNS metric

CoreDNS deprecated the coredns_forward_requests_total and coredns_forward_responses_total metrics. As per the documentation, both metrics are now replaced by the coredns_proxy_request_duration_seconds_count metric.

* Adapt CoreDNS dashboard to new metrics

As per the documentation, the data now comes from the metric coredns_proxy_request_duration_seconds_count{proxy_name="forward"}: CoreDNS has essentially taken advantage of the 1:1 relationship between requests and responses to create a single metric that measures both. Each sample in the new metric is a request and a response. The request destination and the response code are added as labels (to and rcode, respectively).

The replacement for the old coredns_forward_requests_total{to} metric is described in the documentation as sum(coredns_proxy_..._{to, proxy_name="forward"}). In other words, the number of requests with destination to is the sum of responses from to, regardless of their rcode.

The dashboard queries that need adaptation already use the sum to aggregate by the to label. Since the sum is an associative operation, we can use this sum to count requests.

As an example, imagine we have the following metrics with their corresponding labels at both times t and t - offset. If we take the first adapted query in the change as an example, it calculates the difference in the number of requests within a time interval:

{rcode="NOERROR", to="1.2.3.4"} = 100 @ t
{rcode="NXDOMAIN", to="1.2.3.4"} = 200 @ t
{rcode="SERVFAIL", to="6.7.8.9"} = 50 @ t
{rcode="NOERROR", to="1.2.3.4"} = 10 @ t - offset
{rcode="NXDOMAIN", to="1.2.3.4"} = 10 @ t - offset
{rcode="SERVFAIL", to="6.7.8.9"} = 3 @ t - offset

The strategy followed by this commit calculates, first, the difference between samples with the same label values:

{rcode="NOERROR", to="1.2.3.4"} = 100 - 10 = 90
{rcode="NXDOMAIN", to="1.2.3.4"} = 200 - 10 = 190
{rcode="SERVFAIL", to="6.7.8.9"} = 50 - 3 = 47

and, second, the sum by the to label:

{to="1.2.3.4"} = 90 + 190 = 280
{to="6.7.8.9"} = 47

Alternatively, if we wanted to follow the documentation closely and do the sum first, then the sum by the to label at times t and t - offset would be the first step:

{to="1.2.3.4"} = 100 + 200 = 300 @ t
{to="6.7.8.9"} = 50 @ t
{to="1.2.3.4"} = 10 + 10 = 20 @ t - offset
{to="6.7.8.9"} = 3 @ t - offset

And then the difference calculation:

{to="1.2.3.4"} = 300 - 20 = 280
{to="6.7.8.9"} = 50 - 3 = 47

Note the former approach is preferable because it requires only one aggregation (the latter requires two), which keeps the final query simpler.

* Revisit CoreDNS wording on the NodeLocalDNS dashboard

Fix wrong references to CoreDNS in the NodeLocalDNS dashboard to refer to NodeLocalDNS instead. Also, remove parentheses in some dashboard titles ((Per Interval) -> Per Interval) to make them more homogeneous with the CoreDNS dashboard.
* A helper script to export Plutono dashboards when renovating them

Co-authored-by: Istvan Zoltan Ballok <[email protected]>
Co-authored-by: Victor Herrero Otal <[email protected]>
Co-authored-by: Christoph Kleineweber <[email protected]>

* Export the current state of the CoreDNS dashboard

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "DNS Requests (Packets per Second)" panel

This panel used the coredns_dns_request_duration_seconds metric and assumed that it has type, proto, zone labels, but as the example below shows, it does not have those labels:

coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.00025"} 1037
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.0005"} 1106
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.001"} 1116
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.002"} 1119
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.004"} 1120
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.008"} 1124
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.016"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.032"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.064"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.128"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.256"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="0.512"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="1.024"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="2.048"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="4.096"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="8.192"} 1125
coredns_dns_request_duration_seconds_bucket{server="dns://:8053",view="",zone=".",le="+Inf"} 1125
coredns_dns_request_duration_seconds_sum{server="dns://:8053",view="",zone="."} 0.22205055899999987
coredns_dns_request_duration_seconds_count{server="dns://:8053",view="",zone="."} 1125

The next panel contains the same information, so we decided to drop this panel without a replacement. To avoid changes in the JSON document, we kept the layout the same. The next panel fills the full row now.

* Renovate the "DNS Requests" panel

The type label is renamed by Prometheus to exported_type.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "DNS Lookup Responses" panel

Using the offset is not idiomatic. For counter metrics, we should use the rate function. The other panel, which is now using the full row, is using the rate function, so we do not need this panel.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Responses" panel

Sum responses by rcode to avoid showing time series for individual CoreDNS pods.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Requests Latency" panel

The coredns_dns_request_duration_seconds_bucket metric does not have a type label, so we use the zone instead.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Requests Latency Heatmap" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Cache Hits and Misses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Cache Hits and Misses" panel

As described in previous commits, using the offset is not idiomatic on counter metrics. The previous panel shows the same information.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Cache Size" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Upstream DNS Requests Per Interval" panel using offset

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Upstream DNS Requests" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove the "Upstream DNS Forward Responses Per Interval" panel using offset

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "Upstream DNS Responses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Events" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Cores" panel

Sum by service_kind instead of using it for filtering. The filtering can still be done by selecting the series in the panel directly. This also allows removing the service_kind dashboard variable. Note that CoreDNS exposes this programming duration metric only for the headless-services-with-selector case (coredns/coredns#3171), so the service_kind filtering and variable are really not needed.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate the "DNS Programming Heatmap" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Add TODO to adapt the NodeLocalDNS dashboards in the future

The change can't be done yet because node-local-dns does not use core-dns 1.11 at the moment. https://github.com/kubernetes/dns/blob/1.23.1/go.mod#L7

* Fix "DNS Programming Events" panel

In previous commits, we forgot to delete the service_kind variable from the query.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove usage of the "service_kind" variable

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Introduce new "job" variable

The CoreDNS and NodeLocalDNS dashboards are to be merged into one single dashboard. In order to achieve that, a new job variable is necessary to differentiate between the two. Also, the old pod variable needs to capture all values now.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Add job filter on all queries

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adapt "Upstream DNS Requests" panel

Node Local DNS is a thin wrapper around CoreDNS. However, it uses CoreDNS 1.10 for now, while the regular CoreDNS deployed in the shoot clusters is 1.11 at the moment. DNS upstream-related metrics from CoreDNS 1.10 were deprecated in 1.11 in favor of newer ones. In consequence, the DNS Upstream panel needs to be adapted to consider both metrics.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adapt "Upstream DNS Responses" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Adjust "CoreDNS DNS Programming Latency" row title

The DNS programming latency only makes sense for the CoreDNS pods, and it is expected to show no data for Node Local DNS. This only captures headless services with a selector. ClusterIP and headless services without a selector are not captured.

coredns/coredns#3171

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Introduce new "NodeLocalDNS Node Cache Error" row

Node cache errors only make sense for the NodeLocalDNS pods. This commit introduces a row to separate the node cache error panels that come in the following commits from other CoreDNS panels.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Migrate "Node Cache Errors" panel

Move this panel from the NodeLocalDNS dashboard to the new DNS common dashboard.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Remove Node Local DNS dashboard

This dashboard has now been migrated to the CoreDNS dashboard. This commit also removes code references to the old NodeLocalDNS dashboard.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Rename CoreDNS dashboard to DNS

Also rename some references to it.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Split "DNS Requests" panel

The original panel contained three queries grouping DNS requests by zone, protocol and type. This commit splits this into three different panels for better visualization.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Change "DNS Requests Latency" panel to use the full row

This aligns with the rest of the panels. This commit also improves this panel's legend.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Extend description for the "DNS Programming" panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Renovate "Node Cache Errors" panel

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Show both CoreDNS and Node Local DNS metrics by default

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Show the last 3h by default

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Set fill area to 1 on all panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Unify height for all panels

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Sort drop-down list options alphabetically

Co-authored-by: Victor Herrero Otal <[email protected]>

* Export the dashboard one last time

This commit exports the new dashboard after applying all the changes one last time to align the dashboard JSON file with the final JSON generated by Plutono.

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

* Fallthrough at the end of the hosts plugin configuration

If you want to pass the request to the rest of the plugin chain when there is no match in the hosts plugin, you must specify the fallthrough option. https://coredns.io/plugins/hosts/

Co-authored-by: Istvan Zoltan Ballok <[email protected]>

---------

Co-authored-by: Istvan Zoltan Ballok <[email protected]>
Co-authored-by: Christoph Kleineweber <[email protected]>
For now the metric is measured only for headless services. The informer has been
slightly refactored, so the code can measure latency without storing extra fields
on the Endpoint struct.
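To make that concrete, a hedged sketch of the informer-boundary measurement, reusing recordDNSProgrammingLatency from the earlier sketch; the function name and the isHeadless callback are illustrative, not the PR's actual wiring:

```go
package kubernetes

import (
	api "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// addLatencyRecordingHandler illustrates the informer refactoring described
// above: the observation happens in the event handler, before the object
// reaches the store, so no timestamp fields are kept on the cached Endpoint
// struct. isHeadless stands in for the real service lookup.
func addLatencyRecordingHandler(informer cache.SharedInformer, isHeadless func(*api.Endpoints) bool) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if ep, ok := obj.(*api.Endpoints); ok {
				recordDNSProgrammingLatency(ep, isHeadless(ep))
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if ep, ok := newObj.(*api.Endpoints); ok {
				recordDNSProgrammingLatency(ep, isHeadless(ep))
			}
		},
	})
}
```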
1. Why is this pull request needed and what does it do?
Measure and expose DNS programming latency from Kubernetes plugin.
2. Which issues (if any) are related?
ref kubernetes/perf-tests#617
3. Which documentation changes (if any) need to be made?
none
4. Does this introduce a backward incompatible change or deprecation?
no