New cluster with calico v3 fails to become ready #6085

Closed
Smirl opened this issue Nov 16, 2018 · 1 comment · Fixed by #6086

@Smirl (Contributor) commented Nov 16, 2018

1. What kops version are you running? The command kops version will display this information.

Version 1.11.0-alpha.1 (git-bac89b8)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create cluster \
	--name mycluster.example.com \
	--admin-access '10.0.0.0/8' \
	--api-loadbalancer-type=internal \
	--associate-public-ip=false \
	--bastion="true" \
	--cloud-labels "Project=k8s" \
	--disable-subnet-tags \
	--dns private \
	--dns-zone example.com \
	--master-zones=eu-west-1a,eu-west-1b,eu-west-1c \
	--networking=calico \
	--node-count=2 \
	--ssh-access '10.0.0.0/8' \
	--subnets=subnet-0000000000,subnet-0000000000,subnet-0000000000 \
	--topology=private \
	--utility-subnets=subnet-0000000000,subnet-0000000000,subnet-0000000000 \
	--vpc vpc-11111111111111111 \
	--zones eu-west-1a,eu-west-1b,eu-west-1c

Then added crossSubnet: true to the calico networking config in the cluster spec before applying, as sketched below.
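
For reference, a minimal sketch of how that change can be applied, assuming the cluster name from the create command above (the YAML keys in the comments are the standard kops calico options, not a transcript of the exact edit):

# Open the cluster spec for editing (cluster name is the placeholder used above)
kops edit cluster mycluster.example.com

# Under spec.networking, the calico block becomes:
#   networking:
#     calico:
#       crossSubnet: true

# Re-render and apply the configuration
kops update cluster mycluster.example.com --yes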

5. What happened after the commands executed?
The cluster resources are provisioned, but the masters never become ready and the nodes never join the cluster.

6. What did you expect to happen?
The new cluster to come up with calico running and ready, so I could start putting deployments onto it. You know, the good stuff.

7. Please provide your cluster manifest.
See create output from above.

8. Debugging.

When trying to debug the issue I looked at the kubelet logs with journalctl -u kubelet.service, which showed the following lines repeated many times.

Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: W1115 23:43:45.250786    7224 cni.go:120] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: invalid character '}' looking for beginning of object key string                                                                                                               
Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: W1115 23:43:45.250810    7224 cni.go:172] Unable to update cni config: No valid networks found in /etc/cni/net.d/                                                
Nov 15 23:43:45 ip-10-41-108-45 kubelet[7224]: E1115 23:43:45.250910    7224 kubelet.go:2110] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized                                                                                                                                                                        
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: W1115 23:43:50.251870    7224 cni.go:120] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: invalid character '}' looking for beginning of object key string                                                                                                               
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: W1115 23:43:50.251894    7224 cni.go:172] Unable to update cni config: No valid networks found in /etc/cni/net.d/                                                
Nov 15 23:43:50 ip-10-41-108-45 kubelet[7224]: E1115 23:43:50.252130    7224 kubelet.go:2110] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized                                                        
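
As a quick sanity check, the parse error can be reproduced outside the kubelet by feeding the file named in the log messages to a strict JSON parser; a sketch assuming python3 is available on the node:

# Validate the CNI config list the kubelet is complaining about
sudo python3 -m json.tool /etc/cni/net.d/10-calico.conflist
# A strict JSON parser rejects the file for the same underlying reason the
# kubelet does: an unexpected '}' where an object key was expected.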

I then looked at the output of ls /etc/cni/net.d/

-rw-r--r-- 1 root root  713 Nov 15 23:45 10-calico.conflist
-rw------- 1 root root 2590 Nov 15 23:18 calico-kubeconfig

Then I looked at the file that had the error: cat /etc/cni/net.d/10-calico.conflist

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "etcd_endpoints": "http://etcd-a.internal.cluster.example.com:4001,http://etcd-b.internal.cluster.example.com:4001,http://etcd-c.internal.cluster.example.com:4001",
      "log_level": "info",
      "ipam": {
        "type": "calico-ipam"
      },
      "policy": {
        "type": "k8s",
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}

The issue seems to be the trailing comma after "type": "k8s". I edited this file locally with nano on a master node, which then made that master node healthy.
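
For reference, the manual nano edit boils down to stripping that single trailing comma; a hedged sketch of the equivalent on-node command:

# Remove the trailing comma after the "k8s" policy type in place
sudo sed -i 's/"type": "k8s",/"type": "k8s"/' /etc/cni/net.d/10-calico.conflist
# The kubelet retries loading the CNI config every few seconds (see the log
# lines above), so the node recovers without a restart.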

I then edited the ConfigMap with kubectl -n kube-system edit cm calico-config to remove the comma there as well.

Then I did a rolling update of the cluster, which fixed everything.
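
Roughly, the cluster-wide part of that workaround looks like the following (the cluster name is the placeholder from the create command; treat it as a sketch rather than a transcript):

# Remove the trailing comma from the CNI network config embedded in the
# calico-config ConfigMap
kubectl -n kube-system edit cm calico-config

# Roll the nodes so they pick up the corrected config
kops rolling-update cluster mycluster.example.com --yes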

9. Anything else we need to know?

fix: I believe that removing the comma here will do the trick: https://github.com/kubernetes/kops/blob/master/upup/models/cloudup/resources/addons/networking.projectcalico.org/k8s-1.7-v3.yaml.template#L39

workaround: after the cluster has started, follow the steps above to fix the issue.

@KashifSaadat (Contributor) commented:

Hey @Smirl, thanks for the great detail in the issue and good job on finding the cause :)

Would you be able to raise a PR with the fix to the template file? Additionally, the following will need updating to kops.2, so that the manifest change is picked up and rolled out: https://github.com/kubernetes/kops/blob/master/upup/pkg/fi/cloudup/bootstrapchannelbuilder.go#L648

"k8s-1.7-v3":  "3.3.1-kops.2",

Thanks!
Kash
