DNS Container Fails - Kops 1.6 #2529

Closed
odehsemreen opened this issue May 8, 2017 · 20 comments · Fixed by #2590

Comments

@odehsemreen

I currently have an issue after creating a cluster with kops 1.6 alpha2 git-d57ceda and Kubernetes 1.6.2. The cluster is created successfully, but the DNS containers fail to start: they stay in the "Creating Container" state or fail with an rpc error. I am using Calico for my networking. Here is the command I am using to create the cluster.

kops create cluster \
    --channel alpha \
    --node-count 3 \
    --zones eu-west-1a,eu-west-1b,eu-west-1c \
    --master-zones eu-west-1a,eu-west-1b,eu-west-1c \
    --dns-zone cluster.k8s.domain.com \
    --node-size c3.large \
    --master-size c3.large \
    --topology private \
    --networking calico \
    --ssh-public-key ~/.ssh/id_rsa \
    --vpc=vpc-123456 \
    --bastion \
    cluster.k8s.domain.com

When I describe the pods in kube-system, I see errors like "message":"cannot join network of a non running container" and "network: No configured Calico pools".
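
For reference, here is how I pull those errors (standard kubectl; the pod name below is a placeholder):

$ kubectl get pods --namespace=kube-system
$ kubectl describe pod --namespace=kube-system kube-dns-<pod-id>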

I tried the latest kops release as well, and the same applies.

Thanks a lot for your support.

@igorcanadi
Contributor

We're seeing the same issue.

@georgebuckerfield
Contributor

georgebuckerfield commented May 11, 2017

Having the same issue with 1.6.3. I was able to work around it by deleting the failed pod and letting the ReplicaSet re-launch it.

Pod Failures in kube-system
NAME
configure-calico-wpskv
kube-dns-autoscaler-387649234-x9d1d

Validation Failed
Ready Master(s) 1 out of 1.
Ready Node(s) 0 out of 2.

your nodes are NOT ready kubernetes.example.com

$ kubectl get deployment --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   calico-policy-controller   1         1         1            1           1m
kube-system   dns-controller             1         1         1            1           1m
kube-system   kube-dns                   1         1         1            0           1m
kube-system   kube-dns-autoscaler        1         1         1            0           1m

$ kubectl delete pod --namespace=kube-system kube-dns-autoscaler-387649234-x9d1d
pod "kube-dns-autoscaler-387649234-x9d1d" deleted

$ kubectl get deployments --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   calico-policy-controller   1         1         1            1           14m
kube-system   dns-controller             1         1         1            1           14m
kube-system   kube-dns                   2         2         2            2           14m
kube-system   kube-dns-autoscaler        1         1         1            1           14m

It appears to be related to this issue?

@a-chernykh
Contributor

Having the same problem with kops 1.6.0-beta.1, Kubernetes 1.6.2, and Calico; deleting pods does not help.

@dolftax
Contributor

dolftax commented May 14, 2017

@georgebuckerfield Could you post the logs of configure-calico-wpskv and kube-dns-autoscaler-387649234-x9d1d?
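
For example (standard kubectl, using the pod names above):

$ kubectl logs --namespace=kube-system configure-calico-wpskv
$ kubectl logs --namespace=kube-system kube-dns-autoscaler-387649234-x9d1d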

@deleonjavier

I'm having the same issue. I'm using version 1.6.0-beta.1 (git-77f222d) of kops. My reason for using this version was to get Kubernetes 1.6.2 working. Is it possible to have kops 1.5.3 (brew install) create a v1.6.2 Kubernetes cluster?

@chrislovecnm
Contributor

@deleonjavier kops 1.5.3 does not support kubernetes 1.6.x. Master includes an updated version of Calico, which may help with this problem.

@deleonjavier

deleonjavier commented May 15, 2017

@chrislovecnm Actually, I just used kops 1.5.3 to create a Kubernetes v1.6.2 cluster. It looks like the broken parameter for me was --networking cni. I was using that setting so that I could switch to Weave in the future while staying on kubenet networking until then.

@odehsemreen
Author

@georgebuckerfield I have done that a few times; it was the first thing I tried, and it still didn't work. I did not delete the cluster, and after 52 automatic restarts of the kube-dns service it managed to recover! I could not troubleshoot the networking, though. I hope we get feedback on this.

@chrislovecnm
Contributor

Can we confirm this with the 1.6.0 kops release? Calico has been upgraded.

@mikesplain
Contributor

@chrislovecnm I'm seeing this right now as well, with 1.6.0. I upgraded a 1.5.7 cluster to 1.6.2 with flannel, and I'm seeing DNS pods stuck in ContainerCreating for the new DNS ReplicaSet. Trying to get more info.

@cbuckley01

I just attempted a new cluster build with 1.6.0/Calico and am seeing these errors.

@ottoyiu
Contributor

ottoyiu commented May 17, 2017

I'm also getting the same error. Restarting kube-dns does not alleviate the issue.

@willtrking

willtrking commented May 18, 2017

Seeing the same issue here with the kops 1.6.0 release with Calico on a fresh cluster; no issues at all when using Canal, however. Restarting kube-dns after ensuring Calico is running on the nodes doesn't seem to have an effect.

Configured with

kops create cluster \
    --admin-access XXX.XXX.XX.X/32 \
    --node-count 4 \
    --encrypt-etcd-storage \
    --zones us-west-2a,us-west-2b,us-west-2c \
    --master-zones us-west-2a,us-west-2b,us-west-2c \
    --node-size m4.xlarge \
    --master-size m4.large \
    --topology private \
    --networking calico \
    --ssh-public-key=XXXXX.pub \
    --bastion \
    ${NAME}

Some hopefully useful logs:

  14m		13m		6	default-scheduler							Warning		FailedScheduling	no nodes available to schedule pods
  13m		13m		2	default-scheduler							Warning		FailedScheduling	No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (2).
  13m		13m		2	default-scheduler							Warning		FailedScheduling	No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (3).
  12m		12m		1	default-scheduler							Normal		Scheduled		Successfully assigned kube-dns-1321724180-1w9ds to ip-172-20-123-54.us-west-2.compute.internal
  11m		11m		1	kubelet, ip-172-20-123-54.us-west-2.compute.internal			Warning		FailedSync		Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: c5511c8981cb4613b6ee8a9879163f5887e9f0aa4dbbf13408430f0bdbcc435f
  9m		9m		1	kubelet, ip-172-20-123-54.us-west-2.compute.internal			Warning		FailedSync		Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: db3d2d9632d5072126215d0770c3695f46266fd01640b863a1019d69b46e5e35
  7m		7m		1	kubelet, ip-172-20-123-54.us-west-2.compute.internal			Warning		FailedSync		Error syncing pod, skipping: failed to "KillPodSandbox" for "fd7933fd-3b5a-11e7-ba84-06eb9ab37f5e" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-1321724180-1w9ds_kube-system\" network: CNI failed to retrieve network namespace path: Error: No such container: ae7f9802e57a5311a147aeb4cd847f1304572ec0772d456a1928aaddfd7fbf5e"

  7m	7m	1	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: 0257fbc853134d66dddd19080deabb70a9bde5f36ab1bc5c7dd61f81a605052d
  6m	6m	1	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: a77d4afbaf03a376b43b1c50c7d8e74486a6b63efc2dd107e68e6792dc50417a
  5m	5m	1	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: ecf421f2311dcd2023ace6eb8eda1d7582f1fbe43e47fa6c6b31c1c417c7243e
  4m	4m	1	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: ef0e93f29f39163838d34684bc1b070add7846cf85a86e744c5542439e671f27
  3m	3m	1	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: failed to "KillPodSandbox" for "fd7933fd-3b5a-11e7-ba84-06eb9ab37f5e" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"kube-dns-1321724180-1w9ds_kube-system\" network: CNI failed to retrieve network namespace path: Error: No such container: 092978763e32577f6a5570fb1e25ba59cddf7a3f087455012d0aaee6a03e7778"

  2m	29s	4	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	(events with common reason combined)
  12m	4s	196	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Warning	FailedSync	Error syncing pod, skipping: failed to "CreatePodSandbox" for "kube-dns-1321724180-1w9ds_kube-system(fd7933fd-3b5a-11e7-ba84-06eb9ab37f5e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-dns-1321724180-1w9ds_kube-system(fd7933fd-3b5a-11e7-ba84-06eb9ab37f5e)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"kube-dns-1321724180-1w9ds_kube-system\" network: No configured Calico pools"

  11m	3s	200	kubelet, ip-172-20-123-54.us-west-2.compute.internal		Normal	SandboxChanged	Pod sandbox changed, it will be killed and re-created.

@willtrking

willtrking commented May 18, 2017

OK, I found a surprisingly simple workaround (on a fresh cluster, kops 1.6.0 with Calico).

It seems the original configure-calico job didn't succeed, so I just ran that job again under a new name. To do so, I found the relevant kops YAML, renamed the job, and ran kubectl create -f on that YAML file.

Williams-MBP-2:kops willtrking$ kubectl get jobs --namespace=kube-system
NAME                     DESIRED   SUCCESSFUL   AGE
configure-calico         1         0            4m
configure-calico-again   1         1            1m

The YAML for the job was pulled from here:

https://github.com/kubernetes/kops/blob/master/upup/models/cloudup/resources/addons/networking.projectcalico.org/k8s-1.6.yaml.template#L222
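
In other words, something like this (a sketch: the local file name is made up, and the job was renamed by hand inside the downloaded manifest before creating it):

$ kubectl create -f configure-calico-again.yaml
$ kubectl get jobs --namespace=kube-system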

After which kube-dns started running automatically:

Williams-MBP-2:kops willtrking$ kubectl get pods --namespace=kube-system
NAME                                                                  READY     STATUS    RESTARTS   AGE
calico-node-4lhvm                                                     2/2       Running   0          3m
calico-node-5k6hj                                                     2/2       Running   0          2m
calico-node-d8sql                                                     2/2       Running   0          2m
calico-node-h13zw                                                     2/2       Running   0          2m
calico-node-qk30q                                                     2/2       Running   0          3m
calico-node-ttl7b                                                     2/2       Running   0          3m
calico-node-xgqb9                                                     2/2       Running   0          2m
calico-policy-controller-811246363-bp4r7                              1/1       Running   0          3m
dns-controller-116990191-cp8xt                                        1/1       Running   0          3m
etcd-server-events-ip-172-20-114-94.us-west-2.compute.internal        1/1       Running   0          2m
etcd-server-events-ip-172-20-46-208.us-west-2.compute.internal        1/1       Running   0          2m
etcd-server-events-ip-172-20-71-14.us-west-2.compute.internal         1/1       Running   0          3m
etcd-server-ip-172-20-114-94.us-west-2.compute.internal               1/1       Running   0          2m
etcd-server-ip-172-20-46-208.us-west-2.compute.internal               1/1       Running   0          2m
etcd-server-ip-172-20-71-14.us-west-2.compute.internal                1/1       Running   0          2m
kube-apiserver-ip-172-20-114-94.us-west-2.compute.internal            1/1       Running   0          3m
kube-apiserver-ip-172-20-46-208.us-west-2.compute.internal            1/1       Running   1          2m
kube-apiserver-ip-172-20-71-14.us-west-2.compute.internal             1/1       Running   0          3m
kube-controller-manager-ip-172-20-114-94.us-west-2.compute.internal   1/1       Running   0          2m
kube-controller-manager-ip-172-20-46-208.us-west-2.compute.internal   1/1       Running   0          2m
kube-controller-manager-ip-172-20-71-14.us-west-2.compute.internal    1/1       Running   0          3m
kube-dns-1321724180-7pm55                                             3/3       Running   0          3m
kube-dns-1321724180-8g01w                                             3/3       Running   0          16s
kube-dns-autoscaler-265231812-21kc7                                   1/1       Running   0          3m
kube-proxy-ip-172-20-111-46.us-west-2.compute.internal                1/1       Running   0          2m
kube-proxy-ip-172-20-114-94.us-west-2.compute.internal                1/1       Running   0          2m
kube-proxy-ip-172-20-37-76.us-west-2.compute.internal                 1/1       Running   0          1m
kube-proxy-ip-172-20-46-208.us-west-2.compute.internal                1/1       Running   0          2m
kube-proxy-ip-172-20-46-72.us-west-2.compute.internal                 1/1       Running   0          2m
kube-proxy-ip-172-20-71-14.us-west-2.compute.internal                 1/1       Running   0          2m
kube-proxy-ip-172-20-86-222.us-west-2.compute.internal                1/1       Running   0          1m
kube-scheduler-ip-172-20-114-94.us-west-2.compute.internal            1/1       Running   0          3m
kube-scheduler-ip-172-20-46-208.us-west-2.compute.internal            1/1       Running   0          2m
kube-scheduler-ip-172-20-71-14.us-west-2.compute.internal             1/1       Running   0          2m

And kops validate cluster is happy!

A note here: my master nodes are consistently starting before any of my regular nodes. Here's the log line from the original configure-calico run denoting the failure:

10m		10m		1	job-controller			Warning		FailedCreate	Error creating: pods "configure-calico-" is forbidden: service account kube-system/calico was not found, retry after the service account is created

@odehsemreen
Author

@chrislovecnm I have tried all kops 1.6 releases (alpha 1 and 2, beta 1) with all Kubernetes 1.6 releases (1.6.0, 1.6.1, 1.6.2).

@jhuntoo

jhuntoo commented May 18, 2017

I've also experienced this issue - kops 1.6 & k8s 1.6.3.

@chrislovecnm
Contributor

cc @caseydavenport, @shadoi

Casey, any ideas on how we can diagnose what is going on?

@caseydavenport
Member

network: No configured Calico pools

Yeah, definitely a configure-calico problem. The logs from the failed Pod would hopefully indicate what went wrong.

That said, the latest release of Calico doesn't require that Job - we should update the manifest to remove it and use the CALICO_IPV4POOL_CIDR configuration option in the DaemonSet instead.

From the latest upstream manifest:

            # Configure the IP Pool from which Pod IPs will be chosen.
            - name: CALICO_IPV4POOL_CIDR
              value: "192.168.0.0/16"

@ozdanborne @heschlie

@ozdanborne
Contributor

I'll open up a PR that implements the change @caseydavenport is describing.

@blakebarnett

I'm assuming it's a race condition, with the calico-node DaemonSet getting scheduled before some part of the kops bootstrap process has put the service accounts in place? I'm not sure whether that's done after things would start getting scheduled. Maybe the master nodes could be cordoned until these steps have been verified first.
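
A quick way to check for that ordering (standard kubectl; the service-account and job names come from the FailedCreate event quoted above):

$ kubectl get serviceaccount calico --namespace=kube-system
$ kubectl describe job configure-calico --namespace=kube-system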

ottoyiu added a commit to ottoyiu/kops that referenced this issue May 26, 2017
…nt first

This fixes the behaviour described in kubernetes#2529, which was fixed by kubernetes#2590, by avoiding the configure-calico job altogether.