DNS issues with calico and kops 1.6 #2661
@ottoyiu any ideas?
@lc4ntoni-centralway This looks like the service account race condition that was fixed by #2590 and #2641, which are not yet part of a kops release. Just to confirm, are you using the tagged release of kops 1.6?
@ottoyiu thanks for replying. I'm using the kops version shipped in brew. sha1:
Not ideal, but try building kops from master and see if that solves your problem. Pretty sure it's no longer an issue there.
@justinsb is working on a 1.6.1 release which will include the fixes.
Alright, I understand the fix needs to be included in the next release. I'll close this, then. Thanks a lot for your support.
Hi guys! I'm running kops 1.6.2 (git-469d82d0b) and created a Kubernetes 1.6.6 cluster with calico, and I'm facing the same issues. Has this been fixed yet?
Just to add more context: I'm running CentOS 7. I also noticed that the issue seems to occur on recently started nodes (e.g. after cluster-autoscaler scales up).
We're seeing this too but haven't been able to pin it down. Saw it most recently on a cluster created with kops v1.7.0 and k8s v1.6.2.
Was it DNS that wasn't working, or were Service IPs also affected? The reason I ask is that DNS is resolved through kube-dns' service IP.
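A quick way to confirm that pods really do resolve DNS through the kube-dns service IP (a sketch; the pod name is a placeholder, and 100.64.0.10 is only the typical default on a kops cluster):

# Cluster IP of the kube-dns service
kubectl -n kube-system get svc kube-dns

# The same IP should show up as the nameserver inside any pod
kubectl exec -it <SOME_POD> -- cat /etc/resolv.conf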
@ottoyiu Ok, so I tested for that. Apparently Service IPs are not working. I wrote a simple container to test this; the pod was run on a recently started node. Do these diagnostics give you any ideas?
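A minimal probe in that spirit (not the original test container; the IP, port, and pod name are placeholders) is a throwaway busybox pod that hits a service by its cluster IP, bypassing DNS entirely:

# Run a one-shot pod, ideally scheduled on the freshly started node, and try
# to reach a known service by its cluster IP rather than by name
kubectl run svc-probe --rm -it --restart=Never --image=busybox -- \
  wget -qO- -T 5 http://<SERVICE_IP>:<PORT>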
@chrislovecnm can we reopen this since a number of us are still seeing this? I'm seeing the same behavior as @Cisneiros.
+1 for this on CentOS 7 with k8s 1.6.7. Services are continually re-registered every 5 minutes, as @lc4ntoni-centralway reports. Other containers in the kube-dns pod report no errors.
During the periods when a new service is being added, DNS fails in the cluster and services cannot communicate. The networking CNI is Calico.
Scrap that, it looks like intended behaviour: https://github.com/kubernetes/dns/blob/master/pkg/dns/dns.go#L52
@Cisneiros @mikesplain I'm also seeing this behaviour on new nodes. Are you running calico in cross-subnet mode or just straight-up ipip?
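If you're unsure which mode a cluster is in, the IP pool shows it (a sketch, assuming calicoctl is configured to talk to the cluster's datastore):

# "cross-subnet" vs always-on ipip is visible in the pool's ipip settings
calicoctl get ippool -o yaml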
@ottoyiu We're running cross-subnet.
@mikesplain and do you observe it only on new nodes that come up? That's what I'm seeing right now.
@ottoyiu Yes, exactly. I haven't tried rebooting a node, but it's usually about 3-5 min before things are okay. I set up a daemonset to test (below).
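The daemonset itself isn't shown above; a minimal DNS-probing daemonset along those lines might look like this (purely illustrative; the name and image are assumptions, and the API group matches the k8s 1.6/1.7 era):

cat <<'EOF' | kubectl apply -f -
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dns-probe
spec:
  template:
    metadata:
      labels:
        app: dns-probe
    spec:
      containers:
      - name: probe
        image: busybox
        command: ["sh", "-c", "while true; do nslookup kubernetes.default.svc.cluster.local || echo DNS FAIL; sleep 10; done"]
EOF

Tailing the pods with kubectl logs then shows how long each new node keeps failing lookups.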
If it's only happening in cross-subnet mode, then it could be AWS taking its time, after the API call, to actually disable src/dst checks on the node... or it could be the k8s-ec2-srcdst controller that I wrote taking time to respond to nodes coming up and calling the AWS API. 👎 In both of those cases, we'll need to mark the node as NotReady or disable scheduling until it is actually ready. If it's happening in ipip mode as well, then well... something funky is happening with calico.
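One way to rule the src/dst-check theory in or out is to inspect the attribute on a freshly launched instance while the symptom is occurring (the instance ID is a placeholder):

# Should report "Value": false once the check has actually been disabled
aws ec2 describe-instance-attribute \
  --instance-id <INSTANCE_ID> \
  --attribute sourceDestCheck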
@ottoyiu Ahh yeah, good idea. Any ideas on how to debug this further? Let me know if there's anything I can do to help. All our clusters are cross-subnet but I can try to test it in ipip.
@mikesplain If you can test spinning up a cluster using ipip, that'll be great! I'll also do the same to try and isolate the issue.
@ottoyiu Ahh, so I just set up an ipip cluster (kops v1.7.0 & k8s v1.6.2). I wasn't able to replicate the issue. Then I wanted to do one more test: upgrade the cluster and see what happens. After updating to k8s 1.6.8, I see the issue on an ipip cluster.
@mikesplain I think we might be seeing this while routes are still being advertised through iBGP using BIRD. Going to check the log timestamps from when the node is marked READY against when other nodes see the routes for it.
Much as I expected, the node is marked 'READY' while the BIRD BGP daemon inside its calico/node container has not yet learnt routes from its neighbours. It is not until 4-5 minutes in that it establishes connections with its peers and learns their routes. The route table when the node is marked 'READY':
It is not until 4-5 minutes in that it learns the routes:
This timing also lines up quite well with the reports of not being able to resolve hostnames using DNS or connect to anything using a service IP. @caseydavenport have you seen this before, and if so, do you know of a workaround?
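For anyone trying to reproduce the observation above, the convergence can be watched directly on a node (a sketch; calicoctl is assumed to be installed on the host):

# Routes programmed by BIRD appear with "proto bird"; on a healthy node the
# other nodes' pod CIDRs show up within seconds of calico/node starting
watch -n 5 "ip route | grep bird"

# BGP session state for this node's BIRD instance
sudo calicoctl node status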
@ottoyiu route convergence via BGP should be done in seconds - even at thousand node scale - not minutes, so that seems unlikely. If there's a timing delay in programming those routes it's more likely because either:
The only other thing I can think of is that if you've told Calico to peer with a Node that doesn't exist, it will take 90s before the routes will be programmed due to a graceful-restart timer. I'd check for configured nodes using [...]. You can also run [...]
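The usual check of that kind (not necessarily the exact commands meant above, which aren't preserved here) is to compare what Calico thinks exists with what the cluster actually has:

calicoctl get nodes          # nodes Calico will try to peer with
kubectl get nodes -o name    # nodes that actually exist right now

Anything in the first list but not the second is a candidate for triggering the 90s graceful-restart delay described above.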
@caseydavenport thank you very much for your guidance :) much appreciated!
Because of the ephemeral nature of AWS instances, new hosts with new hostnames come up while the old hosts still stay around in calico. This makes BIRD unhappy as it tries to peer with a node that no longer exists. :( @mikesplain This is why you were able to replicate this behaviour after you did a rolling-update.
I'm trying to figure out the best way to tackle this problem. Static BGP route reflectors could help in this case. Alternatively, a controller that watches the Kubernetes API for node deletions could remove the corresponding nodes from calico as they go offline.
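For the manual version of that cleanup, removing a stale node entry is a single calicoctl call (the hostname is a placeholder; an automated controller would do the same thing on node deletion events):

# Drop the Calico node record left behind by a terminated EC2 instance,
# so BIRD stops trying to peer with it
calicoctl delete node <OLD_NODE_HOSTNAME>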
Oh hey! I found this gem from @caseydavenport 👯♂️ 👯♀️ I'll see if I have some time to test it out and production-ize it. Relevant Issue: |
@ottoyiu ahh yeah, great point! Great job digging into this!
I also tested this against #3162 to make sure the recent changes, plus those in master, wouldn't make a difference. As expected, the issue persisted. I'll take a look at the node-controller in a bit.
Hello,
I'm experiencing DNS issues with calico on kops 1.6 and Kubernetes 1.6.2 (AWS cloud).
Steps to reproduce:
1- Create a cluster from scratch. I've used this command:
kops create cluster --zones eu-central-1b --node-size t2.medium --master-size t2.medium --vpc vpc-38094c51 --networking calico $NAME
2- After the cluster is created, kube-dns will fail because the configure-calico job fails as well:
3- Executing kops rolling-update cluster --force --yes eventually solves this problem
4- At this point, creating a new pod causes temporary DNS resolution failures that eventually resolve themselves some time after the pod is created; in fact the kube-dns pods are running fine, as well as the calico ones (a quick check is sketched at the end of this report):
While this happens there's nothing in the kube-dns logs indicating that something went wrong:
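One way to observe the window described in step 4 is a one-shot lookup from a fresh pod (the pod name is illustrative; the failing output itself is not reproduced here):

# Fails during the first few minutes after the pod's node comes up,
# then starts succeeding once routing and DNS have settled
kubectl run dns-check --rm -it --restart=Never --image=busybox -- \
  nslookup kubernetes.default.svc.cluster.local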