This repository has been archived by the owner on Oct 21, 2020. It is now read-only.

Digital Ocean Provisioning not working (provisioner unable to connect to internet) #527

Closed
tianhuil opened this issue Dec 25, 2017 · 28 comments

Comments

@tianhuil
Contributor

tianhuil commented Dec 25, 2017

Hi, I'm trying to use the new DO provisioner. I'm running into this problem:

  1. I have followed the instructions in the README.md. Yet when I run the test commands
kubectl create -f examples/pvc.yaml
kubectl create -f examples/pod-application.yaml

I get the error below, which suggests that the provisioner is not allowed to speak with DO. I know that the security token is working because, with the token loaded on my dev machine, I am able to run this command and view all my droplets:

$ curl -X GET "https://api.digitalocean.com/v2/droplets" -H "Authorization: Bearer $TOKEN" > /tmp/droplets.json

Error Message

I1225 01:10:29.946936       1 main.go:51] Provisioner external/digitalocean specified
I1225 01:10:29.947095       1 main.go:65] Building kube configs for running in cluster...
I1225 01:10:30.263427       1 controller.go:407] Starting provisioner controller 6644ad75-e910-11e7-a729-0a580af40110!
I1225 01:11:42.192272       1 controller.go:1080] scheduleOperation[lock-provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:11:42.245201       1 controller.go:1080] scheduleOperation[lock-provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:11:42.357717       1 leaderelection.go:156] attempting to acquire leader lease...
I1225 01:11:42.445066       1 leaderelection.go:178] successfully acquired lease to provision for pvc default/pv1
I1225 01:11:42.445604       1 controller.go:1080] scheduleOperation[provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:11:45.269755       1 controller.go:1080] scheduleOperation[provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:12:00.270052       1 controller.go:1080] scheduleOperation[provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
E1225 01:12:12.544855       1 provision.go:145] Failed to create volume {Delete pvc-9123185d-e910-11e7-901c-a2fa2817e02b &PersistentVolumeClaim{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:pv1,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/persistentvolumeclaims/pv1,UID:9123185d-e910-11e7-901c-a2fa2817e02b,ResourceVersion:863536,Generation:0,CreationTimestamp:2017-12-25 01:11:42 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1048576 0} {<nil>} 1Mi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*default,},Status:PersistentVolumeClaimStatus{Phase:Pending,AccessModes:[],Capacity:ResourceList{},Conditions:[],},} map[zone:nyc1]}, error: Post https://api.digitalocean.com/v2/volumes: dial tcp: i/o timeout
E1225 01:12:12.545468       1 controller.go:808] Failed to provision volume for claim "default/pv1" with StorageClass "default": Post https://api.digitalocean.com/v2/volumes: dial tcp: i/o timeout
E1225 01:12:12.545561       1 goroutinemap.go:165] Operation for "provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]" failed. No retries permitted until 2017-12-25 01:12:13.045507602 +0000 UTC m=+105.497482293 (durationBeforeRetry 500ms). Error: Post https://api.digitalocean.com/v2/volumes: dial tcp: i/o timeout
I1225 01:12:14.145504       1 leaderelection.go:198] stopped trying to renew lease to provision for pvc default/pv1, task failed
I1225 01:12:15.270247       1 controller.go:1080] scheduleOperation[provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:12:30.270540       1 controller.go:1080] scheduleOperation[provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
I1225 01:12:45.439176       1 controller.go:1080] scheduleOperation[lock-provision-default/pv1[9123185d-e910-11e7-901c-a2fa2817e02b]]
E1225 01:12:45.439237       1 provision.go:145] Failed to create volume {Delete pvc-9123185d-e910-11e7-901c-a2fa2817e02b &PersistentVolumeClaim{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:pv1,GenerateName:,Namespace:default,SelfLink:/api/v1/namespaces/default/persistentvolumeclaims/pv1,UID:9123185d-e910-11e7-901c-a2fa2817e02b,ResourceVersion:863623,Generation:0,CreationTimestamp:2017-12-25 01:11:42 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{control-plane.alpha.kubernetes.io/leader: {"holderIdentity":"6644ad75-e910-11e7-a729-0a580af40110","leaseDurationSeconds":15,"acquireTime":"2017-12-25T01:11:42Z","renewTime":"2017-12-25T01:12:14Z","leaderTransitions":0},volume.beta.kubernetes.io/storage-provisioner: external/digitalocean,},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PersistentVolumeClaimSpec{AccessModes:[ReadWriteOnce],Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{storage: {{1048576 0} {<nil>} 1Mi BinarySI},},},VolumeName:,Selector:nil,StorageClassName:*default,},Status:PersistentVolumeClaimStatus{Phase:Pending,AccessModes:[],Capacity:ResourceList{},Conditions:[],},} map[zone:nyc1]}, error: Post https://api.digitalocean.com/v2/volumes: dial tcp: i/o timeout
E1225 01:12:45.439404       1 controller.go:808] Failed to provision volume for claim "default/pv1" with StorageClass "default": Post https://api.digitalocean.com/v2/volumes: dial tcp: i/o timeout

DNS Error?:

It appears that this is a network issue. When I exec into the pod's shell, I'm unable to resolve either k8s-internal or external DNS queries:

$ kubectl -it exec --namespace=kube-system digitalocean-provisioner-6c7bbf4ccc-s2g8n -- /bin/sh
/ # nslookup kubernetes.default 100.64.0.10
Server:    100.64.0.10
Address 1: 100.64.0.10

nslookup: can't resolve 'kubernetes.default': Try again
/ # nslookup kubernetes.default.svc.cluster.local 100.64.0.10
Server:    100.64.0.10
Address 1: 100.64.0.10

nslookup: can't resolve 'kubernetes.default.svc.cluster.local': Try again
/ # cat /etc/resolv.conf
nameserver 10.96.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
/ # nslookup nodejs.org 8.8.8.8
Server:    8.8.8.8
Address 1: 8.8.8.8

nslookup: can't resolve 'nodejs.org': Try again
/ # nslookup kubernetes.default.svc.cluster.local
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'kubernetes.default.svc.cluster.local': Try again

Any clue? Particularly @klausenbusk

@tianhuil tianhuil changed the title Digital Ocean Provisioning not working Digital Ocean Provisioning not working (provisioner unable to connect to internet) Dec 25, 2017
@klausenbusk
Contributor

Any clue? Particularly @klausenbusk

This sounds like a cluster issue.

Some questions:

  • Does networking work for regular pods?
  • How was the cluster created?
  • What do you use for pod networking? (flannel, calico, something else?)

@tianhuil
Contributor Author

tianhuil commented Dec 26, 2017

Thanks @klausenbusk. This was a DNS issue.

  1. In the provisioner pod, IP addresses work, but DNS does not.
  2. In the flannel pod, both IP addresses and DNS work.

The difference seems to be the nameservers:

  1. In the provisioner pod, only the k8s nameserver is there
  2. In the flannel pod, the DO public nameservers are there:
$ kubectl -it exec --namespace=kube-system digitalocean-provisioner-6c7bbf4ccc-s2g8n -- more /etc/resolv.conf
nameserver 10.96.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
$ kubectl -it exec --namespace=kube-system kube-flannel-ds-29cb5 -- more /etc/resolv.conf
Defaulting container name to kube-flannel.
Use 'kubectl describe pod/kube-flannel-ds-29cb5' to see all of the containers in this pod.
nameserver 67.207.67.2
nameserver 67.207.67.3

Once I manually added nameserver 67.207.67.2 to the provisioner's /etc/resolv.conf, DNS worked. To answer your remaining questions, the clusters were created on DO following these tutorials:

Any idea how to solve this more automatically @klausenbusk?
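For anyone following along, the manual workaround described above can be sketched as a one-liner. The pod name and resolver IP below are the ones from this thread; substitute your own:

```shell
# Append DigitalOcean's public resolver inside the running provisioner pod.
# Pod name and resolver IP are taken from this thread; adjust for your cluster.
# Note: this edit does not survive a pod restart.
kubectl -n kube-system exec digitalocean-provisioner-6c7bbf4ccc-s2g8n -- \
  sh -c 'echo "nameserver 67.207.67.2" >> /etc/resolv.conf'
```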

@klausenbusk
Contributor

Any idea how to solve this more automatically @klausenbusk?

Is kube-dns running? It could be this issue: kubernetes/kubeadm#587

@tianhuil
Contributor Author

tianhuil commented Dec 26, 2017

Apparently, the pod is supposed to inherit the node's nameservers (as the flannel pod does), but this is not happening in the provisioner. See this note. Nor does it seem to happen in the other pods I create.
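For what it's worth, pods only inherit the node's /etc/resolv.conf when their dnsPolicy is Default (or when they run with hostNetwork, as the flannel DaemonSet does); regular pods get ClusterFirst, which points them at kube-dns. A hedged sketch of forcing the provisioner onto the node's resolvers (deployment name assumed from this thread):

```shell
# Switch the provisioner's dnsPolicy from the ClusterFirst default to
# "Default", so its pods inherit the node's /etc/resolv.conf.
# This is a workaround, not a fix for the broken cluster DNS.
kubectl -n kube-system patch deployment digitalocean-provisioner --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/dnsPolicy","value":"Default"}]'
```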

@tianhuil
Contributor Author

I believe kube-dns is running:

$ kubectl get deployment --namespace=kube-system
NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
digitalocean-provisioner   2         2         2            2           1d
kube-dns                   1         1         1            1           8d
kubernetes-dashboard       1         1         1            1           8d
tiller-deploy              1         1         1            1           8d

@tianhuil
Contributor Author

tianhuil commented Dec 26, 2017

To document: appending the nameservers to /etc/resolv.conf appears to solve this problem. Closing issue.

@klausenbusk
Contributor

To document, appending to nameservers to the /etc/hosts appears to solve this problem. Closing issue.

Did you mean /etc/resolv.conf? That sounds like a workaround; you should fix the underlying issue.

@tianhuil
Contributor Author

Yes, sorry, /etc/resolv.conf. And I fixed the earlier comment to avoid future confusion. It's definitely a workaround. According to this, pods are supposed to inherit their node's nameservers. For some reason, that's not happening here. I'll have to continue investigating.

@klausenbusk
Contributor

klausenbusk commented Dec 26, 2017

According to this, pods are supposed to inherit their node's nameservers. For some reason, that's not happening here. I'll have to continue investigating.

The DNS pod is exposed as a Kubernetes Service with a static IP. Once assigned the kubelet passes DNS configured using the --cluster-dns=<dns-service-ip> flag to each container.

Probably set by /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, as the scripts you used rely on kubeadm. Note: removing --cluster-dns=<dns-service-ip> is also a workaround; you should fix the underlying issue, which is probably related to flannel/kube-dns/kube-proxy.

@tianhuil
Contributor Author

tianhuil commented Dec 27, 2017

Documenting in case people care: this was fixed by upgrading to flannel v0.9.1, which contains a fix for the DNS issue: flannel-io/flannel#872. Thanks to @klausenbusk for pointing out the solution!
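For later readers, the upgrade amounts to reapplying the flannel manifest at the fixed tag. A hedged sketch; the URL follows flannel's usual release layout, so verify it against your setup before applying:

```shell
# Reapply flannel at v0.9.1, which adds the FORWARD-chain iptables rules
# (flannel-io/flannel#872). Manifest URL assumed from flannel's release layout.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
# Restart the flannel pods so they pick up the new image:
kubectl -n kube-system delete pod -l app=flannel
```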

@mattinsler

@tianhuil did you have to make changes to your kube-controller-manager to make this work properly?

@klausenbusk
Contributor

@mattinsler correct, please see #529 for the final solution. I'm happy to help if you have any issues.

@tianhuil
Contributor Author

tianhuil commented Jan 2, 2018

@mattinsler: to be specific:

  1. upgrading to flannel v0.9.1 fixed this specific DNS issue: flannel-io/flannel#872 (network/iptables: Add iptables rules to FORWARD chain)
  2. I also had to apply #529 (Digital Ocean external volume test failing) to finally get it working (because of a volume visibility issue)

@mattinsler

Ahh OK. I'm stuck on how to execute #529 -- I'm still new to k8s. If you wouldn't mind: after using kubeadm to get a working cluster, how would I add the correct config to the kube-controller-manager and restart it? And what exactly is the correct config? I've tried a bunch of things and different ways to update or restart kube-controller-manager, and I've ended up tearing down and re-creating the cluster each time because I get stuck.

@tianhuil
Contributor Author

tianhuil commented Jan 2, 2018 via email

@mattinsler

Sorry, I'm still really new. So I edited my /etc/kubernetes/manifests/kube-controller-manager.yaml file and the pod is now gone and has not come back again. How do I figure out what happened? How do I make it start up again?

@tianhuil
Contributor Author

tianhuil commented Jan 3, 2018

No worries -- edits to the file automatically propagate to the pod. If you made a mistake, I'm not sure exactly what happens, but it might kill the pod.

If you save a backup copy, will the pod re-appear? You might also try kubectl create -f /etc/kubernetes/manifests/kube-controller-manager.yaml or just restart the cluster.
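One caveat on that suggestion: kube-controller-manager is a static pod, so the kubelet (not the API server) owns it, and kubectl create on its manifest can behave surprisingly. A hedged sketch of how to poke it from the master instead, assuming the kubeadm default paths:

```shell
# Static pods are re-created by the kubelet whenever their manifest appears
# under /etc/kubernetes/manifests. Moving the file out and back forces a
# restart; kubelet logs explain any failure to come back up.
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sleep 10   # give the kubelet time to tear the old pod down
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
journalctl -u kubelet --since "5 minutes ago" | tail -n 20
```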

@mattinsler

Hmm, OK. I tried to create it and the mounts are successful, but the logs say:

I0103 04:43:29.595193       1 controllermanager.go:108] Version: v1.7.12
stat /etc/kubernetes/controller-manager.conf: no such file or directory

Describing the pod shows:

    Mounts:
      /etc/kubernetes from k8s (ro)
      /etc/pki from pki (rw)
      /etc/ssl/certs from certs (rw)
      /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ from flexvolume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-smw5m (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  k8s:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/kubernetes
  certs:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/ssl/certs
  pki:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/pki
  flexvolume:
    Type:  HostPath (bare host directory volume)
    Path:  /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  default-token-smw5m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-smw5m
    Optional:    false

There is definitely a file at /etc/kubernetes/controller-manager.conf on the master. Maybe the pod needs to be explicitly run on the master? Is it easy to do that?

@tianhuil
Contributor Author

tianhuil commented Jan 3, 2018

What version of k8s are you using? I've been using 1.9.0. Are you on 1.7.12?

@mattinsler

mattinsler commented Jan 3, 2018 via email

@mattinsler

Got it working! I created a new cluster on 1.9 and made the changes to the kube-controller-manager.yaml file before adding any nodes. Things just worked that way.

@klausenbusk
Contributor

Got it working! I created a new cluster on 1.9 and made the changes to the kube-controller-manager.yaml file before adding any nodes. Things just worked that way.

FWIW, if /usr/libexec/kubernetes/kubelet-plugins/volume/exec exists, kubeadm automatically adds the required hostPath volume.

@mattinsler

Ahh, good to know!

@klausenbusk
Contributor

/area digitalocean

Just for the record.

@hcabnettek

@klausenbusk @tianhuil I think I'm having this same issue. I installed a fresh 1.10.2 cluster on a Digital Ocean droplet, following the tutorials, which eventually led me here. The provisioner pods don't have the nameserver records either; I had to manually add them to /etc/resolv.conf. Even after doing that, no volumes are ever created. When I run kubectl describe pvc/pv1 I see Failed to provision volume with StorageClass "default": invalid character 'U' looking for beginning of value. How can I fix this? I've verified that I am able to create a volume with doctl.
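A hedged guess at that error: invalid character 'U' looking for beginning of value is Go's JSON decoder choking on a plain-text response body, and a body starting with 'U' is often the literal string Unauthorized, which would point at the token not reaching the provisioner. Inspecting the raw response from your machine would confirm (TOKEN is your DO API token):

```shell
# Fetch the raw body the provisioner would try to parse; if the API returns
# plain text like "Unauthorized" instead of JSON, a Go client fails with
# "invalid character 'U' looking for beginning of value".
curl -s -H "Content-Type: application/json" \
     -H "Authorization: Bearer $TOKEN" \
     https://api.digitalocean.com/v2/volumes
```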

@klausenbusk
Contributor

I installed a fresh 1.10.2 cluster on a Digital Ocean droplet, following the tutorials...

Which tool did you use?

@hcabnettek

@klausenbusk I used kubeadm. Sorry I just saw your reply. I can't get past this error no matter what I try. =(

@klausenbusk
Contributor

@klausenbusk I used kubeadm. Sorry I just saw your reply. I can't get past this error no matter what I try. =(

Let's continue the discussion in #761.

prateekpandey14 added a commit to prateekpandey14/external-storage that referenced this issue Oct 5, 2018
The snapshot workflow is being changed to use the CAS templates.
(openebs/maya kubernetes-retired#602 kubernetes-retired#527)

Until it's updated to use the CAS templates way of creating the
snapshots, disabling them from CI.

Signed-off-by: prateekpandey14 <[email protected]>