
etcd operator doesn't bring etcd nodes up after k8s node restart, vault crashes #179

Closed
eLco opened this issue Jan 17, 2019 · 4 comments
Labels
bug, etcd (issues related to etcd)

Comments

@eLco
Contributor

eLco commented Jan 17, 2019

What happened:

A k8s worker node was restarted; since then, no etcd pods are running.

What you expected to happen:

The etcd operator should bring the etcd nodes back up.

How to reproduce it (as minimally and precisely as possible):

Deploy the etcd component and restart the k8s worker node.
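
A rough sketch of the reproduction (the node name is a placeholder, and the pod labels assume the etcd-operator defaults app=etcd,etcd_cluster=<cluster-name>):

kubectl drain <worker-node> --ignore-daemonsets --delete-local-data
# reboot the node out-of-band (cloud console, SSH, ...), then:
kubectl uncordon <worker-node>
# watch whether the operator recreates the etcd members:
kubectl -n automation-hub get pods -l app=etcd,etcd_cluster=etcd -w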

Anything else we need to know?:

  • Pods status (use kubectl -n namespace get pods):

kubectl get po --all-namespaces

NAMESPACE        NAME                                                               READY   STATUS             RESTARTS   AGE
automation-hub   api-7995f95f56-x99rp                                               1/1     Running            0          2d9h
automation-hub   auth-service-cache-redis-df49ffdbc-5pbxd                           1/1     Running            0          2d9h
automation-hub   auth-service-f44997f6f-bpbn6                                       1/1     Running            0          2d9h
automation-hub   automation-hub-5b659d8b4b-8h7sf                                    1/1     Running            8          2d9h
automation-hub   automation-hub-cache-redis-6969bc4cbc-r5sxc                        1/1     Running            0          2d9h
automation-hub   bubbles-64887554cf-66dbb                                           1/1     Running            0          2d9h
automation-hub   etcd-etcd-operator-etcd-backup-operator-79874f77cd-gkfgz           1/1     Running            1          2d9h
automation-hub   etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v                  1/1     Running            1          2d9h
automation-hub   etcd-etcd-operator-etcd-restore-operator-dcc679df-4drbl            1/1     Running            3          2d9h
automation-hub   git-7ddf67d46f-j5hfw                                               1/1     Running            0          2d9h
automation-hub   secrets-service-5ddb9cf8b4-22njd                                   1/1     Running            0          2d9h
automation-hub   slack-notifier-f4ff8c774-qvj8x                                     1/1     Running            0          2d9h
automation-hub   subscriptions-service-6fb784d585-zkjtd                             1/1     Running            0          2d9h
automation-hub   subscriptions-service-cache-redis-7c85b9764c-mskhq                 1/1     Running            0          2d9h
automation-hub   vault-vault-5b4dcdb785-6xv2v                                       0/1     CrashLoopBackOff   1109       2d9h
dex              auth-operator-controller-manager-0                                 1/1     Running            0          2d8h
dex              dex-7d9f5fd667-ttzds                                               1/1     Running            6          2d9h
ingress          traefik-7bfc6d9d7b-k8k82                                           1/1     Running            0          2d9h
ingress          traefik-dashboard-auth-7c758d4857-k76jv                            1/1     Running            2          2d9h
kube-system      coredns-677f858775-xf5dt                                           1/1     Running            0          9d
kube-system      kube-apiserver-7pc4d                                               1/1     Running            2          9d
kube-system      kube-controller-manager-595bf67ccb-rvjm4                           1/1     Running            8          9d
kube-system      kube-flannel-c5nss                                                 2/2     Running            0          9d
kube-system      kube-flannel-s6fnd                                                 2/2     Running            0          2d8h
kube-system      kube-proxy-fkszz                                                   1/1     Running            0          9d
kube-system      kube-proxy-lv27q                                                   1/1     Running            0          2d8h
kube-system      kube-scheduler-dd4494dcd-xkpcj                                     1/1     Running            3          9d
kube-system      pod-checkpointer-kws6r                                             1/1     Running            0          9d
kube-system      pod-checkpointer-kws6r-ip-10-0-20-231.us-east-2.compute.internal   1/1     Running            0          9d
kube-system      tiller-deploy-776b5cb874-snfhj                                     1/1     Running            0          2d9h
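
The EtcdCluster custom resource status may be worth checking too (the etcdcluster resource kind comes from the operator's CRD; the cluster name etcd is inferred from the operator logs below):

kubectl -n automation-hub get etcdclusters
kubectl -n automation-hub describe etcdcluster etcd
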
  • Pods logs (use kubectl -n namespace logs podname):

kubectl logs etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v -n automation-hub

time="2019-01-16T19:37:45Z" level=info msg="etcd-operator Version: 0.9.2"
time="2019-01-16T19:37:45Z" level=info msg="Git SHA: a0032c1f"
time="2019-01-16T19:37:45Z" level=info msg="Go Version: go1.10"
time="2019-01-16T19:37:45Z" level=info msg="Go OS/Arch: linux/amd64"
E0116 19:38:45.358931       1 leaderelection.go:224] error retrieving resource lock automation-hub/etcd-operator: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints etcd-operator)
time="2019-01-16T19:39:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"automation-hub\", Name:\"etcd-operator\", UID:\"fa0f075d-126a-11e9-a3b1-0aa36742a15e\", APIVersion:\"v1\", ResourceVersion:\"2393793\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v became leader"
E0116 19:39:27.649265       1 leaderelection.go:258] Failed to update lock: etcdserver: request timed out
time="2019-01-16T19:39:48Z" level=info msg="start running..." cluster-name=etcd pkg=cluster
time="2019-01-16T19:39:57Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:05Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:13Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:21Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:29Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster

kubectl logs vault-vault-5b4dcdb785-6xv2v -n automation-hub

Error initializing storage of type etcd: failed to get etcd API version: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.3.17.172:2379: i/o timeout
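
10.3.17.172 is presumably the ClusterIP of the etcd client service Vault is configured against. A quick way to confirm the service exists and whether etcd answers, assuming the operator's <cluster-name>-client service naming:

kubectl -n automation-hub get svc etcd-client
kubectl -n automation-hub run etcd-probe --rm -it --restart=Never \
  --image=quay.io/coreos/etcd:v3.2.13 -- etcdctl --endpoints=http://etcd-client:2379 cluster-health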

Environment:

  • Hub CLI version (use hub version):
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:44:19Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T06:59:37Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  • Toolbox docker tag:
  • Others:
@eLco eLco added this to the Sprint 28 milestone Jan 17, 2019
@eLco eLco self-assigned this Jan 17, 2019
@eLco eLco added the bug and etcd labels Jan 17, 2019
@arkadijs
Contributor

IIRC that's the discussion that ended with no resolution:
coreos/etcd-operator#1323

@eLco
Contributor · Author

eLco commented Jan 17, 2019

It doesn't solve our problem with the etcd and Vault combination, though. We need to abandon the etcd operator then and migrate to a simple etcd component.
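
For the record, a minimal sketch of what such a plain etcd component could look like, as a single-member StatefulSet plus a client Service (all names, the image tag, and the volume size are illustrative placeholders, not our actual manifests):

kubectl -n automation-hub apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: etcd-client
spec:
  selector:
    app: etcd
  ports:
  - name: client
    port: 2379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd-client
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.2.13
        command:
        - /usr/local/bin/etcd
        - --data-dir=/var/etcd/data
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://etcd-client:2379
        ports:
        - containerPort: 2379
        volumeMounts:
        - name: data
          mountPath: /var/etcd/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF

A single member backed by a PersistentVolume survives a node restart by simply being rescheduled, which is exactly what the operator-managed cluster failed to do here.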

@arkadijs
Contributor

Also backup and restore.
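
For reference, backup is currently driven by an EtcdBackup CR along these lines (per the upstream etcd-operator examples; the S3 path and secret are placeholders), which would need replacing as well:

kubectl -n automation-hub apply -f - <<EOF
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: etcd-backup
spec:
  etcdEndpoints:
  - http://etcd-client:2379
  storageType: S3
  s3:
    path: <bucket>/<backup-file>
    awsSecret: <aws-credentials-secret>
EOF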

@eLco eLco modified the milestones: Sprint 28, Sprint 29 Jan 21, 2019
@eLco eLco removed this from the Sprint 29 milestone Feb 8, 2019
@eLco eLco removed their assignment Oct 29, 2019
@arkadijs
Contributor

arkadijs commented Nov 7, 2019

Vault's etcd storage backend has been superseded by S3.
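
I.e. Vault no longer stores data in etcd. For reference, the relevant Vault configuration stanza looks roughly like this (bucket, region, and file name are placeholders):

cat > vault-config.hcl <<EOF
storage "s3" {
  bucket = "<vault-bucket>"
  region = "us-east-2"
}
EOF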
