New cluster with calico v3 fails to become ready #6085

Closed
Smirl opened this issue Nov 16, 2018 · 1 comment · Fixed by #6086

@Smirl (Contributor) commented Nov 16, 2018

1. What kops version are you running? The command kops version will display this information.

Version 1.11.0-alpha.1 (git-bac89b8)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create cluster \
	--name mycluster.example.com \
	--admin-access '10.0.0.0/8' \
	--api-loadbalancer-type=internal \
	--associate-public-ip=false \
	--bastion="true" \
	--cloud-labels "Project=k8s" \
	--disable-subnet-tags \
	--dns private \
	--dns-zone example.com \
	--master-zones=eu-west-1a,eu-west-1b,eu-west-1c \
	--networking=calico \
	--node-count=2 \
	--ssh-access '10.0.0.0/8' \
	--subnets=subnet-0000000000,subnet-0000000000,subnet-0000000000 \
	--topology=private \
	--utility-subnets=subnet-0000000000,subnet-0000000000,subnet-0000000000 \
	--vpc vpc-11111111111111111 \
	--zones eu-west-1a,eu-west-1b,eu-west-1c

Then added crossSubnet: true to the calico networking config in the cluster spec before applying, as sketched below.
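
For reference, a minimal sketch of how that change can be applied, assuming the cluster name from the create command above (the YAML keys in the comments are the standard kops calico options, not a transcript of the exact edit):

# Open the cluster spec for editing (cluster name is the placeholder used above)
kops edit cluster mycluster.example.com

# Under spec.networking, the calico block becomes:
#   networking:
#     calico:
#       crossSubnet: true

# Re-render and apply the configuration
kops update cluster mycluster.example.com --yes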

5. What happened after the commands executed?
The cluster resources are provisioned, but the masters never become ready and the nodes never join the cluster.

6. What did you expect to happen?
The new cluster to come up with calico running and ready, so I could start putting deployments onto it. You know, the good stuff.

7. Please provide your cluster manifest.
See create output from above.

8. Debugging.

When trying to debug the issue I looked at the kubelet logs with journalctl -u kubelet.service, which showed the following lines repeated many times.

Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: W1115 23:43:45.250786    7224 cni.go:120] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: invalid character '}' looking for beginning of object key string                                                                                                               
Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: W1115 23:43:45.250810    7224 cni.go:172] Unable to update cni config: No valid networks found in /etc/cni/net.d/                                                
Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: E1115 23:43:45.250910    7224 kubelet.go:2110] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized                                                                                                                                                                        
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: W1115 23:43:50.251870    7224 cni.go:120] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: invalid character '}' looking for beginning of object key string                                                                                                               
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: W1115 23:43:50.251894    7224 cni.go:172] Unable to update cni config: No valid networks found in /etc/cni/net.d/                                                
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: E1115 23:43:50.252130    7224 kubelet.go:2110] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized                                                        
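
As a quick sanity check, the parse error can be reproduced outside the kubelet by feeding the file named in the log messages to a strict JSON parser; a sketch assuming python3 is available on the node:

# Validate the CNI config list the kubelet is complaining about
sudo python3 -m json.tool /etc/cni/net.d/10-calico.conflist
# A strict JSON parser rejects the file for the same underlying reason the
# kubelet does: an unexpected '}' where an object key was expected.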

I then looked at the output of ls /etc/cni/net.d/

-rw-r--r-- 1 root root  713 Nov 15 23:45 10-calico.conflist
-rw------- 1 root root 2590 Nov 15 23:18 calico-kubeconfig

Then I looked at the file that had the error: cat /etc/cni/net.d/10-calico.conflist

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "etcd_endpoints": "http://etcd-a.internal.cluster.example.com:4001,http://etcd-b.internal.cluster.example.com:4001,http://etcd-c.internal.cluster.example.com:4001",
      "log_level": "info",
      "ipam": {
        "type": "calico-ipam"
      },
      "policy": {
        "type": "k8s",
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}

The issue seems to be the trailing comma after "type": "k8s". I edited this file locally with nano on a master node, which then made that master node healthy.
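
For reference, the manual nano edit boils down to stripping that single trailing comma; a hedged sketch of the equivalent on-node command:

# Remove the trailing comma after the "k8s" policy type in place
sudo sed -i 's/"type": "k8s",/"type": "k8s"/' /etc/cni/net.d/10-calico.conflist
# The kubelet retries loading the CNI config every few seconds (see the log
# lines above), so the node recovers without a restart.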

I then edited the ConfigMap with kubectl -n kube-system edit cm calico-config to remove the comma there as well.

Then I did a rolling update of the cluster, which fixed everything.
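
Roughly, the cluster-wide part of that workaround looks like the following (the cluster name is the placeholder from the create command; treat it as a sketch rather than a transcript):

# Remove the trailing comma from the CNI network config embedded in the
# calico-config ConfigMap
kubectl -n kube-system edit cm calico-config

# Roll the nodes so they pick up the corrected config
kops rolling-update cluster mycluster.example.com --yes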

9. Anything else we need to know?

fix: I believe that removing the comma here will do the trick: https://github.com/kubernetes/kops/blob/master/upup/models/cloudup/resources/addons/networking.projectcalico.org/k8s-1.7-v3.yaml.template#L39

workaround: after the cluster has started, follow the steps above to fix the issue.

@KashifSaadat (Contributor) commented:

Hey @Smirl, thanks for the great detail in the issue and good job on finding the cause :)

Would you be able to raise a PR with the fix to the template file? Additionally, the following will need updating to kops.2, so that the manifest change is picked up and rolled out: https://github.com/kubernetes/kops/blob/master/upup/pkg/fi/cloudup/bootstrapchannelbuilder.go#L648

"k8s-1.7-v3":  "3.3.1-kops.2",

Thanks!
Kash
