This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Kubernetes: bring kube-dns implementation up-to-date #3373

Merged · 13 commits · Jul 28, 2018

Conversation

@jackfrancis (Member) commented Jun 26, 2018

What this PR does / why we need it: Bases the kube-dns implementation on the Kubernetes-recommended base configs.

The primary change is replacing exechealthz with the sidecar container. Additionally, the --no-negcache flag is added to the dnsmasq config.

As a reference, here are the base configs:

For v1.11: https://github.com/kubernetes/kubernetes/blob/v1.11.0/cluster/addons/dns/kube-dns/kube-dns.yaml.base
For v1.10: https://github.com/kubernetes/kubernetes/blob/v1.10.0/cluster/addons/dns/kube-dns.yaml.base
For v1.9: https://github.com/kubernetes/kubernetes/blob/v1.9.0/cluster/addons/dns/kube-dns.yaml.base
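
For concreteness, the heart of the change is the sidecar container that replaces exechealthz. A minimal sketch, modeled on the v1.11 base config linked above (re-check the image tag and probe details against that file rather than trusting this sketch; the probe values are consistent with the sidecar startup log later in this thread):

      # replaces the old exechealthz container: probes kubedns and dnsmasq
      # directly over DNS and exposes Prometheus metrics on :10054
      - name: sidecar
        image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10
        livenessProbe:
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,SRV
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,SRV
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP

Each --probe entry is label,server,dns-name,interval-seconds,record-type; the sidecar reports probe results via its metrics and health endpoints instead of shelling out to nslookup.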

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #3534 fixes #2999

Special notes for your reviewer:

If applicable:

  • documentation
  • unit tests
  • tested backward compatibility (i.e., deploy with previous version, upgrade with this branch)

Release note:

bring the kube-dns implementation up-to-date with the upstream base configs for Kubernetes 1.9+: replace exechealthz with the sidecar container and add --no-negcache to dnsmasq

@acs-bot commented Jun 26, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ghost ghost assigned jackfrancis Jun 26, 2018
@ghost ghost added the in progress label Jun 26, 2018
@jackfrancis (Member, Author):

@feiskyer @andyzhangx FYI, testing new exechealthz on v1.11 clusters. Any reasons that you're aware of to backport this update for earlier cluster versions? (Or are there any reasons why not to move forward w/ new exechealthz version?)

@codecov (bot) commented Jun 26, 2018

Codecov Report

Merging #3373 into master will decrease coverage by 0.04%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #3373      +/-   ##
==========================================
- Coverage   55.49%   55.45%   -0.05%     
==========================================
  Files         105      105              
  Lines       16038    16041       +3     
==========================================
- Hits         8900     8895       -5     
- Misses       6386     6394       +8     
  Partials      752      752

@feiskyer (Member):

@jackfrancis kube-dns has used a sidecar container for health checking since kubernetes/kubernetes#38992 (v1.6.0). I suggest we also use it (e.g. k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10) for our clusters.

@acs-bot acs-bot added size/XL and removed size/XS labels Jul 24, 2018
@jackfrancis (Member, Author):

@feiskyer thanks for the nudge here! See my changes in parts/k8s/addons/kubernetesmasteraddons-kube-dns-deployment.yaml; they aren't working. Does anything obvious jump out at you?

@@ -146,6 +146,8 @@ spec:
- mountPath: /kube-dns-config
name: kube-dns-config
- args:
Review comment (Member):

healthz container should be removed, as sidecar container has been added below.
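
For readers following along: the container being removed is the old exec-based checker. A hypothetical sketch of its shape (image tag and args are illustrative of typical pre-sidecar kube-dns manifests, not the exact acs-engine spec):

      # old approach: exechealthz shells out to nslookup and reports health
      # on :8080; delete this container entirely in favor of the sidecar
      - name: healthz
        image: k8s.gcr.io/exechealthz-amd64:1.2  # illustrative tag
        args:
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
        - --url=/healthz-dnsmasq
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - --url=/healthz-kubedns
        - --port=8080
        - --quiet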

Review comment (Member):

I also deployed https://github.com/kubernetes/kubernetes/blob/v1.11.0/cluster/addons/dns/kube-dns/kube-dns.yaml.base, and it works well on an acs-engine-deployed cluster (v1.11.0).

@jackfrancis jackfrancis changed the title [WIP] use new exechealthz w/ k8s 1.11 kube-dns updates for 1.11 Jul 25, 2018
@jackfrancis (Member, Author):

@feiskyer Thanks for the continued guidance! I've converted the 1.11 kube-dns implementation here to follow the upstream Kubernetes base example. Initial tests suggest that HPA is failing, but everything else in our test surface area checks out.

Does anything look suspicious here that would break HPA?

https://github.com/jackfrancis/acs-engine/blob/1bcf1af22bc7cd446c95766fec93b6f05930d921/parts/k8s/addons/kubernetesmasteraddons-kube-dns-deployment.yaml

@feiskyer (Member):

@jackfrancis Did you mean HPA for other pods, or dns-horizontal-autoscaler for kube-dns? The change doesn't seem related to HPA for other pods.

@jackfrancis (Member, Author):

@feiskyer just regular HPA. We test deploying nginx, attaching an HPA config to it, and then hitting it with load.
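
For context, a minimal sketch of that kind of HPA smoke test; names and thresholds here are illustrative, not our actual test harness:

# deploy nginx with a CPU request so the HPA has a baseline to scale against
kubectl run nginx --image=nginx --requests=cpu=200m
# attach an HPA targeting 50% average CPU utilization
kubectl autoscale deployment nginx --cpu-percent=50 --min=1 --max=5
# generate load against the pods, then watch for scale-out
kubectl get hpa nginx --watch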

@feiskyer (Member) commented Jul 26, 2018

Have you checked the HPA events? They should include some hints, e.g.

kubectl describe hpa <hpa-name>
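
When the metrics pipeline itself is broken (e.g. because cluster DNS is down), the TARGETS column of kubectl get hpa also typically shows <unknown>; illustrative output:

$ kubectl get hpa
NAME    REFERENCE          TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
nginx   Deployment/nginx   <unknown>/50%   1         5         1          10m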

@jackfrancis (Member, Author):

@feiskyer so, a follow-up run had different symptoms.

I observed the kube-dns pod go from CrashLoopBackOff to Error.

See:

$ kubectl logs kube-dns-7f9df74d5b-6g7g2 -n kube-system -c dnsmasq
I0726 19:42:19.211202       1 main.go:74] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --no-negcache --log-facility=- --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0726 19:42:19.211373       1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --no-negcache --log-facility=- --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053]
I0726 19:42:19.547769       1 nanny.go:119] 
W0726 19:42:19.547801       1 nanny.go:120] Got EOF from stdout
I0726 19:42:19.547815       1 nanny.go:116] dnsmasq[9]: started, version 2.78 cachesize 1000
I0726 19:42:19.547826       1 nanny.go:116] dnsmasq[9]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0726 19:42:19.547831       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0726 19:42:19.547835       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0726 19:42:19.547945       1 nanny.go:116] dnsmasq[9]: reading /etc/resolv.conf
I0726 19:42:19.547954       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain ip6.arpa 
I0726 19:42:19.547958       1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0726 19:42:19.547961       1 nanny.go:116] dnsmasq[9]: using nameserver 168.63.129.16#53
I0726 19:42:19.548104       1 nanny.go:116] dnsmasq[9]: read /etc/hosts - 7 addresses

and:

$ kubectl logs kube-dns-7f9df74d5b-6g7g2 -n kube-system -c sidecar
I0726 19:40:02.174431       1 main.go:51] Version v1.14.8.3
I0726 19:40:02.174499       1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
I0726 19:40:02.174527       1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33}
I0726 19:40:02.174577       1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33}
W0726 19:40:02.174835       1 server.go:64] Error getting metrics from dnsmasq: read udp 127.0.0.1:56325->127.0.0.1:53: read: connection refused

- -k
- --cache-size=1000
- --no-negcache
- --log-facility=-
Review comment (Member):

missing --server=/cluster.local/127.0.0.1#10053 here?

@feiskyer (Member):

It works fine on my cluster after adding --server=/cluster.local/127.0.0.1#10053.
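
For anyone landing on this thread later, the resulting dnsmasq args block (the two *.arpa --server entries already appear in the opts log above; only the cluster.local entry was missing):

        - -k
        - --cache-size=1000
        - --no-negcache
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/ip6.arpa/127.0.0.1#10053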

@jackfrancis (Member, Author):

That worked @feiskyer, thank you! The changes here are only for 1.11. Do you recommend doing a similar kube-dns conversion for any other k8s versions?

@acs-bot acs-bot added size/XXL and removed size/XL labels Jul 27, 2018
@jackfrancis (Member, Author):

@feiskyer for your review. I audited the k8s codebase and determined that the sidecar implementation has been in place since 1.7 (at least; I didn't check earlier).

To be conservative, and on the prediction that there are more clusters on 1.8 and below in the wild (1.8 is the default when no version is provided), I have converted 1.9 and above to an implementation that follows the upstream base example.

Thoughts? Any reservations about these changes? Thanks again for your eyes!

@jackfrancis jackfrancis changed the title kube-dns updates for 1.11 Kubernetes: bring kube-dns implementation up-to-date Jul 27, 2018
@feiskyer (Member):

> I have converted 1.9 and above to an implementation that follows the upstream base example.

I'm also suggesting this, so we're consistent with upstream.

Successfully merging this pull request may close these issues.

kube-dns-v20 deployment Pods are unable to resolve DNS for services both internally and externally.
3 participants