DNS issues with calico and kops 1.6 #2661

lc4ntoni-centralway · 2017-05-31T08:47:12Z

Hello,
I'm experiencing some DNS issue with kops and calico with kops 1.6 and kubernetes 1.6.2 (AWS cloud)

Steps to reproduce:

1- Create a cluster from scratch. I've used this command:
kops create cluster --zones eu-central-1b --node-size t2.medium --master-size t2.medium --vpc vpc-38094c51 --networking calico $NAME

2- After the cluster is created, kube-dns will fail because the configure-calico jobs fails as well:

kube-system   kube-dns-1321724180-gdfjs                                              0/3       ContainerCreating   0          16m
kube-system   kube-dns-autoscaler-265231812-cc19z                                    0/1       ContainerCreating   0          16m

kubectl get jobs --all-namespaces
NAMESPACE     NAME               DESIRED   SUCCESSFUL   AGE
kube-system   configure-calico   1         0            17m

3 - Executing kops rolling-update cluster --force --yes eventually solves this problem

kube-system   kube-dns-1321724180-nnjwf                                             3/3       Running   0          4m
kube-system   kube-dns-1321724180-s2p7j                                             3/3       Running   0          4m

kubectl get jobs --all-namespaces
NAMESPACE     NAME               DESIRED   SUCCESSFUL   AGE
kube-system   configure-calico   1         1            42m

4- At this point, creating a new pod causes temporary dns resolution failures, that eventually get solved by themselves after some time the pod is created (infact the kube-dns pods are running fine, as well as calico ones):

root@test-803530966-gvhc6:/# apt update
Err:1 http://security.ubuntu.com/ubuntu xenial-security InRelease        
  Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu xenial InRelease                  
  Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done        
Building dependency tree       
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
root@test-803530966-gvhc6:/# apt update
Err:1 http://archive.ubuntu.com/ubuntu xenial InRelease                  
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu xenial-security InRelease        
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu xenial-updates InRelease          
  Temporary failure resolving 'archive.ubuntu.com'
Get:4 http://archive.ubuntu.com/ubuntu xenial-backports InRelease [102 kB]
Get:5 http://archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages [4927 B]
Get:6 http://archive.ubuntu.com/ubuntu xenial-backports/universe amd64 Packages [6022 B]
Fetched 113 kB in 40s (2814 B/s)   
Reading package lists... Done
Building dependency tree       
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/InRelease  Temporary failure resolving 'security.ubuntu.com'

While this happens there's nothing in the kube-dns logs indicating that something went wrong:

kubectl logs kube-dns-1321724180-nnjwf  --namespace=kube-system kubedns
I0531 08:17:31.000131       1 dns.go:49] version: v1.5.2-beta.0+$Format:%h$
I0531 08:17:31.000712       1 server.go:70] Using configuration read from directory: /kube-dns-config%!(EXTRA time.Duration=10s)
I0531 08:17:31.000764       1 server.go:112] FLAG: --alsologtostderr="false"
I0531 08:17:31.000777       1 server.go:112] FLAG: --config-dir="/kube-dns-config"
I0531 08:17:31.000783       1 server.go:112] FLAG: --config-map=""
I0531 08:17:31.000786       1 server.go:112] FLAG: --config-map-namespace="kube-system"
I0531 08:17:31.000790       1 server.go:112] FLAG: --config-period="10s"
I0531 08:17:31.000794       1 server.go:112] FLAG: --dns-bind-address="0.0.0.0"
I0531 08:17:31.000796       1 server.go:112] FLAG: --dns-port="10053"
I0531 08:17:31.000802       1 server.go:112] FLAG: --domain="cluster.local."
I0531 08:17:31.000815       1 server.go:112] FLAG: --federations=""
I0531 08:17:31.000819       1 server.go:112] FLAG: --healthz-port="8081"
I0531 08:17:31.000846       1 server.go:112] FLAG: --initial-sync-timeout="1m0s"
I0531 08:17:31.000850       1 server.go:112] FLAG: --kube-master-url=""
I0531 08:17:31.000856       1 server.go:112] FLAG: --kubecfg-file=""
I0531 08:17:31.000858       1 server.go:112] FLAG: --log-backtrace-at=":0"
I0531 08:17:31.002575       1 server.go:112] FLAG: --log-dir=""
I0531 08:17:31.002595       1 server.go:112] FLAG: --log-flush-frequency="5s"
I0531 08:17:31.002610       1 server.go:112] FLAG: --logtostderr="true"
I0531 08:17:31.002625       1 server.go:112] FLAG: --nameservers=""
I0531 08:17:31.002638       1 server.go:112] FLAG: --stderrthreshold="2"
I0531 08:17:31.002650       1 server.go:112] FLAG: --v="2"
I0531 08:17:31.002683       1 server.go:112] FLAG: --version="false"
I0531 08:17:31.002702       1 server.go:112] FLAG: --vmodule=""
I0531 08:17:31.002766       1 server.go:175] Starting SkyDNS server (0.0.0.0:10053)
I0531 08:17:31.003038       1 server.go:197] Skydns metrics enabled (/metrics:10055)
I0531 08:17:31.003059       1 dns.go:147] Starting endpointsController
I0531 08:17:31.003071       1 dns.go:150] Starting serviceController
I0531 08:17:31.003593       1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.003611       1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0531 08:17:31.030806       1 dns.go:264] New service: kubernetes
I0531 08:17:31.030909       1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.030948       1 dns.go:264] New service: kube-dns
I0531 08:17:31.030974       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.031105       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:17:31.503315       1 dns.go:171] Initialized services and endpoints from apiserver
I0531 08:17:31.503337       1 server.go:128] Setting up Healthz Handler (/readiness)
I0531 08:17:31.503343       1 server.go:133] Setting up cache handler (/cache)
I0531 08:17:31.503348       1 server.go:119] Status HTTP port 8081
I0531 08:22:31.004225       1 dns.go:264] New service: kubernetes
I0531 08:22:31.004457       1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004583       1 dns.go:264] New service: kube-dns
I0531 08:22:31.004634       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:22:31.004673       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004378       1 dns.go:264] New service: kubernetes
I0531 08:27:31.004467       1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004502       1 dns.go:264] New service: kube-dns
I0531 08:27:31.004528       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0531 08:27:31.004549       1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}

kubectl logs kube-dns-1321724180-nnjwf  --namespace=kube-system dnsmasq
I0531 08:17:36.426853       1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0531 08:17:36.426935       1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0531 08:17:36.512105       1 nanny.go:111] 
W0531 08:17:36.512226       1 nanny.go:112] Got EOF from stdout
I0531 08:17:36.512197       1 nanny.go:108] dnsmasq[11]: started, version 2.76 cachesize 1000
I0531 08:17:36.512253       1 nanny.go:108] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0531 08:17:36.512264       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa 
I0531 08:17:36.512268       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0531 08:17:36.512271       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0531 08:17:36.512277       1 nanny.go:108] dnsmasq[11]: reading /etc/resolv.conf
I0531 08:17:36.512281       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa 
I0531 08:17:36.512284       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa 
I0531 08:17:36.512289       1 nanny.go:108] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local 
I0531 08:17:36.512292       1 nanny.go:108] dnsmasq[11]: using nameserver 10.0.0.2#53
I0531 08:17:36.512296       1 nanny.go:108] dnsmasq[11]: read /etc/hosts - 7 addresses

kubectl logs kube-dns-1321724180-nnjwf  --namespace=kube-system sidecar
ERROR: logging before flag.Parse: I0531 08:17:39.156690       1 main.go:49] Version v1.14.0-2-g38b2ca3
ERROR: logging before flag.Parse: I0531 08:17:39.156729       1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0531 08:17:39.156811       1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0531 08:17:39.156881       1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}

The text was updated successfully, but these errors were encountered:

chrislovecnm · 2017-05-31T15:13:53Z

@ottoyiu any ideas?

ottoyiu · 2017-05-31T16:05:33Z

@lc4ntoni-centralway This seems like a service account race condition that was fixed by #2590 and #2641 that's not part of a kops release yet. Just to confirm, are you using the tagged release of Kops 1.6?

lc4ntoni-centralway · 2017-06-01T08:55:33Z

@ottoyiu thanks for replying. I'm using the kops version shipped in brew. sha1:

5a66319cf9f943015f10ae24764e899cbe5d5445  /usr/local/Cellar/kops/1.6.0/bin/kops

ottoyiu · 2017-06-05T21:55:20Z

Not ideal, but try building kops from master, and see if that solves your problem. Pretty sure it's no longer a problem.

chrislovecnm · 2017-06-09T17:28:49Z

@justinsb is working on a 1.6.1 release which will include the fixes.

lc4ntoni-centralway · 2017-06-12T16:00:38Z

Alright, I understand that some fix need to be included within the next release. I'll close then... thanks a lot for your support.

Cisneiros · 2017-08-01T19:59:46Z

Hi guys! I'm running kops 1.6.2 (git-469d82d0b) and created a Kubernetes 1.6.6 cluster with calico and I'm facing the same issues. Has this been fixed yet?

Cisneiros · 2017-08-02T17:34:15Z

Just to add more context: I'm running CentOS 7. I also noticed it seems the issue occurs on recently started nodes (e.g. after cluster-autscaler scaling up).
@lc4ntoni-centralway was your problem solved?

mikesplain · 2017-08-02T18:10:05Z

We're see this too but haven't been able to pin it down. Saw it most recently on a cluster created with kops v1.7.0 and k8s v1.6.2.

ottoyiu · 2017-08-02T18:42:03Z

was it DNS that wasn't working, or were Service IPs also affected? Reason I ask is because DNS is resolved through a service IP of kube-dns'.

Cisneiros · 2017-08-03T14:39:53Z

@ottoyiu Ok, so I tested for that. Apparently Service IPs are not working. I wrote a simple container that tries to wget my-nginx-service-ip over and over until it responds. It failed for a few requests (timeout), and about 2-3 min after starting it succeeded.

This pod was ran on a recently started node. Do these diagnostics give you any ideias?

mikesplain · 2017-08-07T13:37:46Z

@chrislovecnm can we reopen this since a number of us are still seeing this? I'm seeing the same behavior as @Cisneiros.

0x234 · 2017-08-15T13:54:34Z

+1 for this on CentOS 7 with k8s 1.6.7. Services are continually re-registered every 5 minutes like @lc4ntoni-centralway reports. Other containers in the kube-dns pod report no errors.

I0815 13:40:58.212305       1 dns.go:264] New service: kube-dns
I0815 13:40:58.212361       1 dns.go:264] New service: service-alertmanager
I0815 13:40:58.212401       1 dns.go:264] New service: kubernetes-dashboard
I0815 13:40:58.212421       1 dns.go:264] New service: kubernetes
I0815 13:40:58.212471       1 dns.go:264] New service: service-http-backend
I0815 13:40:58.212491       1 dns.go:264] New service: nginx-service
I0815 13:40:58.212511       1 dns.go:264] New service: portal-service
I0815 13:40:58.212532       1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212304       1 dns.go:264] New service: nginx-service
I0815 13:45:58.212396       1 dns.go:264] New service: portal-service
I0815 13:45:58.212421       1 dns.go:264] New service: service-prometheus
I0815 13:45:58.212478       1 dns.go:264] New service: kubernetes
I0815 13:45:58.212537       1 dns.go:264] New service: service-http-backend
I0815 13:45:58.212560       1 dns.go:264] New service: service-alertmanager
I0815 13:45:58.212603       1 dns.go:264] New service: kubernetes-dashboard
I0815 13:45:58.212623       1 dns.go:264] New service: service-grafana
I0815 13:45:58.212662       1 dns.go:264] New service: kube-dns
I0815 13:50:58.212437       1 dns.go:264] New service: kubernetes-dashboard
I0815 13:50:58.212553       1 dns.go:264] New service: service-grafana
I0815 13:50:58.212602       1 dns.go:264] New service: kube-dns
I0815 13:50:58.212657       1 dns.go:264] New service: service-alertmanager
I0815 13:50:58.212697       1 dns.go:264] New service: service-prometheus
I0815 13:50:58.212740       1 dns.go:264] New service: kubernetes
I0815 13:50:58.212778       1 dns.go:264] New service: service-http-backend
I0815 13:50:58.212799       1 dns.go:264] New service: nginx-service
I0815 13:50:58.212818       1 dns.go:264] New service: portal-service

During the periods of adding a new service, DNS fails in the cluster and services cannot communicate. Networking CNI is via Calico.

0x234 · 2017-08-15T15:57:09Z

Scrap that, looks like intended behaviour: https://github.com/kubernetes/dns/blob/master/pkg/dns/dns.go#L52

ottoyiu · 2017-08-15T18:16:25Z

@Cisneiros @mikesplain I'm also seeing this behaviour on new nodes. Are you running calico using cross-subnet mode or just straight-up ipip?

mikesplain · 2017-08-15T18:17:29Z

@ottoyiu We're running cross subnet.

ottoyiu · 2017-08-15T18:19:32Z

@mikesplain and do you observe it on only new nodes that come up? That's what I'm seeing right now.

mikesplain · 2017-08-15T18:24:34Z

@ottoyiu Yes exactly. I haven't tried rebooting a node but it's usually about 3-5 min before things are okay. I setup a daemonset to test (below). The apt-get update almost always has issues for ~5 min before finally working.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dns-test
spec:
  template:
    spec:
      containers:
      - name: dns-test
        image: ubuntu:16.04
        command:
        - sh
        - -c
        - apt-get update && apt-get install -y curl && sleep 3600

ottoyiu · 2017-08-15T18:27:17Z

If it's only solely happening in cross-subnet mode, then it could be AWS taking it's time after the API call to actually disable src/dst checks on the node... or it could be k8s-ec2-srcst controller that I wrote which is taking time to respond to nodes coming up and calling the AWS API. 👎 In those both cases, we'll need to mark the node as NotReady or disable scheduling until it is actually ready.

If it's happening in ipip mode as well, then well... something funky is happening with calico.

mikesplain · 2017-08-15T18:29:45Z

@ottoyiu Ahh yeah good idea, any ideas on how to debug this further? Let me know if there's anything I can do to help. All our clusters are cross-subnet but I can try to test it in ipip.

ottoyiu · 2017-08-15T18:31:44Z

@mikesplain If you can test spinning up a cluster using ipip, that'll be great! I'll also do the same to try and isolate the issue.

mikesplain · 2017-08-15T19:25:31Z

@ottoyiu Ahh so I just setup a ipip cluster (Kops v1.7.0 & k8s v1.6.2). I wasn't able to replicate the issue.

Then I wanted to do one more test, upgrade the cluster and see what happens. After updating to k8s 1.6.8 I see the issue on a ipip cluster.

ottoyiu · 2017-08-15T21:07:10Z

@mikesplain I think we might be seeing this while routes are still being advertised through iBGP using BIRD. Going to check the log timestamps from when the node is marked READY, and when other nodes see the routes for it.

ottoyiu · 2017-08-15T21:27:12Z

Similarly to what I expected, the node is marked as 'READY' when the node's BIRD BGP daemon inside the calico/node has not learnt the routes from its neighbours yet. It is not until 4-5 minutes in, does it establish connection with its peers and learn their routes.

The route table when node is marked as 'READY':

admin@ip-10-67-79-106:~$ ip route
default via 10.67.64.1 dev eth0
10.67.64.0/19 dev eth0  proto kernel  scope link  src 10.67.79.106
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1

It is not until 4-5 minutes in, does it learn routes:

admin@ip-10-67-79-106:~$ ip route                                                                                                                                                                                                                                                                                                                                          │
default via 10.67.64.1 dev eth0                                                                                                                                                                                                                                                                                                                                            │
10.67.64.0/19 dev eth0  proto kernel  scope link  src 10.67.79.106                                                                                                                                                                                                                                                                                                         │
100.100.189.0/26 via 10.67.78.252 dev eth0  proto bird                                                                                                                                                                                                                                                                                                                     │
100.117.226.128/26 via 10.67.68.58 dev eth0  proto bird                                                                                                                                                                                                                                                                                                                    │
100.119.202.128/26 via 10.67.82.42 dev eth0  proto bird                                                                                                                                                                                                                                                                                                                    │
blackhole 100.123.242.64/26  proto bird                                                                                                                                                                                                                                                                                                                                    │
100.125.28.128/26 via 10.67.84.204 dev eth0  proto bird                                                                                                                                                                                                                                                                                                                    │
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1

This timing also lines up quite well with the reports of not being able to resolve hostnames using DNS or connect to anything using a service IP.

@caseydavenport have you seen this before, and if so, do you know of a workaround?

caseydavenport · 2017-08-15T21:42:48Z

@ottoyiu route convergence via BGP should be done in seconds - even at thousand node scale - not minutes, so that seems unlikely.

If there's a timing delay in programming those routes it's more likely because either:

BIRD hasn't started yet on some nodes
No Pods have started on those nodes, and so BIRD won't advertise the routes until one starts.

The only other thing I can think of is that if you've told Calico to peer with a Node that doesn't exist, it will take 90s before the routes will be programmed due to a graceful-restart timer.

I'd check for configured nodes using calicoctl get nodes and see if there are any that shouldn't be there.

You can also run calicoctl node status to get the BGP state for a given node to see if it has converged.

ottoyiu · 2017-08-15T21:52:52Z

@caseydavenport thank you very much for your guidance :) much appreciated!
It does seem like its hitting 90s graceful-restart timer before the routes are programmed!

/ # bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg                                                                                                                                                                                                                                                                                            │
bird: device1: Initializing                                                                                                                                                                                                                                                                                                                                                │bird: direct1: Initializing                                                                                                                                                                                                                                                                                                                                                │bird: Mesh_10_67_64_205: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_68_58: Initializing                                                                                                                                                                                                                                                                                                                                       │bird: Mesh_10_67_76_48: Initializing                                                                                                                                                                                                                                                                                                                                       │bird: Mesh_10_67_78_229: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_78_252: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_79_106: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_82_133: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_82_42: Initializing                                                                                                                                                                                                                                                                                                                                       │bird: Mesh_10_67_84_204: Initializing                                                                                                                                                                                                                                                                                                                                      │bird: Mesh_10_67_88_27: Initializing                                                                                                                                                                                                                                                                                                                                       │
bird: Mesh_10_67_91_228: Initializing                                                                                                                                                                                                                                                                                                                                      │
bird: device1: Starting                                                                                                                                                                                                                                                                                                                                                    │
bird: device1: Connected to table master                                                                                                                                                                                                                                                                                                                                   │
bird: device1: State changed to feed                                                                                                                                                                                                                                                                                                                                       │
bird: direct1: Starting                                                                                                                                                                                                                                                                                                                                                    │
bird: direct1: Connected to table master                                                                                                                                                                                                                                                                                                                                   │
bird: direct1: State changed to feed                                                                                                                                                                                                                                                                                                                                       │
bird: Mesh_10_67_64_205: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_64_205: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_68_58: Starting                                                                                                                                                                                                                                                                                                                                           │
bird: Mesh_10_67_68_58: State changed to start                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_76_48: Starting                                                                                                                                                                                                                                                                                                                                           │
bird: Mesh_10_67_76_48: State changed to start                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_78_229: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_78_229: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_78_252: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_78_252: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_79_106: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_79_106: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_82_133: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_82_133: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_82_42: Starting                                                                                                                                                                                                                                                                                                                                           │
bird: Mesh_10_67_82_42: State changed to start                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_84_204: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_84_204: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Mesh_10_67_88_27: Starting                                                                                                                                                                                                                                                                                                                                           │
bird: Mesh_10_67_88_27: State changed to start                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_91_228: Starting                                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_91_228: State changed to start                                                                                                                                                                                                                                                                                                                            │
bird: Graceful restart started                                                                                                                                                                                                                                                                                                                                             │
bird: Started                                                                                                                                                                                                                                                                                                                                                              │
bird: device1: State changed to up                                                                                                                                                                                                                                                                                                                                         │
bird: direct1: State changed to up                                                                                                                                                                                                                                                                                                                                         │
bird: Mesh_10_67_82_42: Connected to table master                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_82_42: State changed to wait                                                                                                                                                                                                                                                                                                                              │
bird: Mesh_10_67_79_106: Connected to table master                                                                                                                                                                                                                                                                                                                         │
bird: Mesh_10_67_79_106: State changed to wait                                                                                                                                                                                                                                                                                                                             │bird: Mesh_10_67_68_58: Connected to table master                                                                                                                                                                                                                                                                                                                          │
bird: Mesh_10_67_68_58: State changed to wait                                                                                                                                                                                                                                                                                                                              │bird: Mesh_10_67_78_252: Connected to table master                                                                                                                                                                                                                                                                                                                         │
bird: Mesh_10_67_78_252: State changed to wait                                                                                                                                                                                                                                                                                                                             │
                                                                                                                                                                                                                                                                                                                                                                           │
                                                                                                                                                                                                                                                                                                                                                                           │
                                                                                                                                                                                                                                                                                                                                                                           │
bird: Graceful restart done                                                                                                                                                                                                                                                                                                                                                │
bird: Mesh_10_67_68_58: State changed to feed                                                                                                                                                                                                                                                                                                                              │
bird: Mesh_10_67_78_252: State changed to feed                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_79_106: State changed to feed                                                                                                                                                                                                                                                                                                                             │
bird: Mesh_10_67_82_42: State changed to feed                                                                                                                                                                                                                                                                                                                              │
bird: Mesh_10_67_68_58: State changed to up                                                                                                                                                                                                                                                                                                                                │
bird: Mesh_10_67_78_252: State changed to up                                                                                                                                                                                                                                                                                                                               │
bird: Mesh_10_67_79_106: State changed to up                                                                                                                                                                                                                                                                                                                               │
bird: Mesh_10_67_82_42: State changed to up

Because of the ephemeral nature of AWS instances, new hosts with new hostnames come up while the old hosts still stay around in calico. This makes BIRD unhappy as it tries to peer with a node that no longer exists. :(

@mikesplain This is why you were able to replicate this behaviour after you did a rolling-update.

oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl run -it --image=calico/ctl:v1.2.1 calicoctl --env CALICO_ETCD_ENDPOINTS=http://etcd-1.internal.i0:4001 get node                                                                                                                                                                                 │
If you don't see a command prompt, try pressing enter.                                                                                                                                                                                                                                                                                                                     │
                                                      Error attaching, falling back to logs: unable to upgrade connection: container calicoctl3 not found in pod calicoctl3-326955638-bt2cr_default                                                                                                                                                                        │
NAME                                                                                                                                                                                                                                                                                                                                                                       │
ip-10-67-64-205                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-68-58                                                                                                                                                                                                                                                                                                                                                             │
ip-10-67-70-86                                                                                                                                                                                                                                                                                                                                                             │
ip-10-67-76-48                                                                                                                                                                                                                                                                                                                                                             │
ip-10-67-78-229                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-78-252                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-79-106                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-82-133                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-82-42                                                                                                                                                                                                                                                                                                                                                             │
ip-10-67-84-204                                                                                                                                                                                                                                                                                                                                                            │
ip-10-67-88-27                                                                                                                                                                                                                                                                                                                                                             │
ip-10-67-91-228     

oyiu@cayvroyiuworkstn [~][(unknown)*] kubectl get nodes                                                                                                                                                                                                                                                                                                                    │
NAME                                         STATUS    AGE       VERSION                                                                                                                                                                                                                                                                                                   │
ip-10-67-68-58.us-west-2.compute.internal    Ready     5d        v1.7.3                                                                                                                                                                                                                                                                                                    │
ip-10-67-70-86.us-west-2.compute.internal    Ready     46m       v1.7.3                                                                                                                                                                                                                                                                                                    │
ip-10-67-78-252.us-west-2.compute.internal   Ready     5d        v1.7.3                                                                                                                                                                                                                                                                                                    │
ip-10-67-79-106.us-west-2.compute.internal   Ready     1h        v1.7.3                                                                                                                                                                                                                                                                                                    │
ip-10-67-82-42.us-west-2.compute.internal    Ready     5d        v1.7.3

I'm trying to figure out what the best way is to tackle this problem. Static BGP route reflectors could help in this case. Also, a controller which listens into the kubernetes api, can watch for nodes and delete nodes from calico as they become offline.

ottoyiu · 2017-08-15T22:33:03Z

Oh hey! I found this gem from @caseydavenport 👯‍♂️ 👯‍♀️
https://github.com/caseydavenport/calico-node-controller

I'll see if I have some time to test it out and production-ize it.

Relevant Issue:
projectcalico/canal#11

mikesplain · 2017-08-16T13:13:24Z

@ottoyiu ahh yeah great point! Great job digging into this!

mikesplain · 2017-08-16T16:59:31Z

I also tested this against #3162 to make sure recent change + those in master wouldn't make a different. As expected the issue persisted. Ill take a look at the node-controller in a bit.

lc4ntoni-centralway closed this as completed Jun 12, 2017

ottoyiu mentioned this issue Aug 17, 2017

Calico: Route Programming Delay with new nodes (symptoms: DNS resolution issues) #3224

Closed

cdenneen mentioned this issue Nov 16, 2017

Node cleanup on Kubernetes projectcalico/canal#11

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DNS issues with calico and kops 1.6 #2661

DNS issues with calico and kops 1.6 #2661

lc4ntoni-centralway commented May 31, 2017

chrislovecnm commented May 31, 2017

ottoyiu commented May 31, 2017 •

edited

Loading

lc4ntoni-centralway commented Jun 1, 2017

ottoyiu commented Jun 5, 2017

chrislovecnm commented Jun 9, 2017

lc4ntoni-centralway commented Jun 12, 2017

Cisneiros commented Aug 1, 2017 •

edited

Loading

Cisneiros commented Aug 2, 2017

mikesplain commented Aug 2, 2017 •

edited

Loading

ottoyiu commented Aug 2, 2017

Cisneiros commented Aug 3, 2017

mikesplain commented Aug 7, 2017

0x234 commented Aug 15, 2017 •

edited

Loading

0x234 commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 •

edited

Loading

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 •

edited

Loading

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

caseydavenport commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

mikesplain commented Aug 16, 2017

mikesplain commented Aug 16, 2017

DNS issues with calico and kops 1.6 #2661

DNS issues with calico and kops 1.6 #2661

Comments

lc4ntoni-centralway commented May 31, 2017

chrislovecnm commented May 31, 2017

ottoyiu commented May 31, 2017 • edited Loading

lc4ntoni-centralway commented Jun 1, 2017

ottoyiu commented Jun 5, 2017

chrislovecnm commented Jun 9, 2017

lc4ntoni-centralway commented Jun 12, 2017

Cisneiros commented Aug 1, 2017 • edited Loading

Cisneiros commented Aug 2, 2017

mikesplain commented Aug 2, 2017 • edited Loading

ottoyiu commented Aug 2, 2017

Cisneiros commented Aug 3, 2017

mikesplain commented Aug 7, 2017

0x234 commented Aug 15, 2017 • edited Loading

0x234 commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 • edited Loading

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 • edited Loading

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017

mikesplain commented Aug 15, 2017

ottoyiu commented Aug 15, 2017 • edited Loading

ottoyiu commented Aug 15, 2017 • edited Loading

caseydavenport commented Aug 15, 2017 • edited Loading

ottoyiu commented Aug 15, 2017 • edited Loading

ottoyiu commented Aug 15, 2017 • edited Loading

mikesplain commented Aug 16, 2017

mikesplain commented Aug 16, 2017

ottoyiu commented May 31, 2017 •

edited

Loading

Cisneiros commented Aug 1, 2017 •

edited

Loading

mikesplain commented Aug 2, 2017 •

edited

Loading

0x234 commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

caseydavenport commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading

ottoyiu commented Aug 15, 2017 •

edited

Loading