
etcd operator doesn't bring etcd nodes up after k8s node restart, vault crashes #179

Closed
eLco opened this issue Jan 17, 2019 · 4 comments
Labels
bug, etcd (issues related to etcd)

Comments

@eLco
Contributor

eLco commented Jan 17, 2019

What happened:

A k8s worker node was restarted; since then, no etcd pods are running.

What you expected to happen:

The etcd operator should bring the etcd nodes back up.

How to reproduce it (as minimally and precisely as possible):

Deploy the etcd component and restart the k8s worker node.
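
A rough sketch of the reproduction (the node name is a placeholder, and the pod labels assume the etcd-operator defaults app=etcd,etcd_cluster=<cluster-name>):

kubectl drain <worker-node> --ignore-daemonsets --delete-local-data
# reboot the node out-of-band (cloud console, SSH, ...), then:
kubectl uncordon <worker-node>
# watch whether the operator recreates the etcd members:
kubectl -n automation-hub get pods -l app=etcd,etcd_cluster=etcd -w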

Anything else we need to know?:

  • Pods status (use kubectl -n namespace get pods):

kubectl get po --all-namespaces

NAMESPACE        NAME                                                               READY   STATUS             RESTARTS   AGE
automation-hub   api-7995f95f56-x99rp                                               1/1     Running            0          2d9h
automation-hub   auth-service-cache-redis-df49ffdbc-5pbxd                           1/1     Running            0          2d9h
automation-hub   auth-service-f44997f6f-bpbn6                                       1/1     Running            0          2d9h
automation-hub   automation-hub-5b659d8b4b-8h7sf                                    1/1     Running            8          2d9h
automation-hub   automation-hub-cache-redis-6969bc4cbc-r5sxc                        1/1     Running            0          2d9h
automation-hub   bubbles-64887554cf-66dbb                                           1/1     Running            0          2d9h
automation-hub   etcd-etcd-operator-etcd-backup-operator-79874f77cd-gkfgz           1/1     Running            1          2d9h
automation-hub   etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v                  1/1     Running            1          2d9h
automation-hub   etcd-etcd-operator-etcd-restore-operator-dcc679df-4drbl            1/1     Running            3          2d9h
automation-hub   git-7ddf67d46f-j5hfw                                               1/1     Running            0          2d9h
automation-hub   secrets-service-5ddb9cf8b4-22njd                                   1/1     Running            0          2d9h
automation-hub   slack-notifier-f4ff8c774-qvj8x                                     1/1     Running            0          2d9h
automation-hub   subscriptions-service-6fb784d585-zkjtd                             1/1     Running            0          2d9h
automation-hub   subscriptions-service-cache-redis-7c85b9764c-mskhq                 1/1     Running            0          2d9h
automation-hub   vault-vault-5b4dcdb785-6xv2v                                       0/1     CrashLoopBackOff   1109       2d9h
dex              auth-operator-controller-manager-0                                 1/1     Running            0          2d8h
dex              dex-7d9f5fd667-ttzds                                               1/1     Running            6          2d9h
ingress          traefik-7bfc6d9d7b-k8k82                                           1/1     Running            0          2d9h
ingress          traefik-dashboard-auth-7c758d4857-k76jv                            1/1     Running            2          2d9h
kube-system      coredns-677f858775-xf5dt                                           1/1     Running            0          9d
kube-system      kube-apiserver-7pc4d                                               1/1     Running            2          9d
kube-system      kube-controller-manager-595bf67ccb-rvjm4                           1/1     Running            8          9d
kube-system      kube-flannel-c5nss                                                 2/2     Running            0          9d
kube-system      kube-flannel-s6fnd                                                 2/2     Running            0          2d8h
kube-system      kube-proxy-fkszz                                                   1/1     Running            0          9d
kube-system      kube-proxy-lv27q                                                   1/1     Running            0          2d8h
kube-system      kube-scheduler-dd4494dcd-xkpcj                                     1/1     Running            3          9d
kube-system      pod-checkpointer-kws6r                                             1/1     Running            0          9d
kube-system      pod-checkpointer-kws6r-ip-10-0-20-231.us-east-2.compute.internal   1/1     Running            0          9d
kube-system      tiller-deploy-776b5cb874-snfhj                                     1/1     Running            0          2d9h
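
The EtcdCluster custom resource status may be worth checking too (the etcdcluster resource kind comes from the operator's CRD; the cluster name etcd is inferred from the operator logs below):

kubectl -n automation-hub get etcdclusters
kubectl -n automation-hub describe etcdcluster etcd
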
  • Pods logs (use kubectl -n namespace logs podname):

kubectl logs etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v -n automation-hub

time="2019-01-16T19:37:45Z" level=info msg="etcd-operator Version: 0.9.2"
time="2019-01-16T19:37:45Z" level=info msg="Git SHA: a0032c1f"
time="2019-01-16T19:37:45Z" level=info msg="Go Version: go1.10"
time="2019-01-16T19:37:45Z" level=info msg="Go OS/Arch: linux/amd64"
E0116 19:38:45.358931       1 leaderelection.go:224] error retrieving resource lock automation-hub/etcd-operator: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints etcd-operator)
time="2019-01-16T19:39:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"automation-hub\", Name:\"etcd-operator\", UID:\"fa0f075d-126a-11e9-a3b1-0aa36742a15e\", APIVersion:\"v1\", ResourceVersion:\"2393793\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' etcd-etcd-operator-etcd-operator-5ff4457cc9-nn54v became leader"
E0116 19:39:27.649265       1 leaderelection.go:258] Failed to update lock: etcdserver: request timed out
time="2019-01-16T19:39:48Z" level=info msg="start running..." cluster-name=etcd pkg=cluster
time="2019-01-16T19:39:57Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:05Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:13Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:21Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2019-01-16T19:40:29Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster

kubectl logs vault-vault-5b4dcdb785-6xv2v -n automation-hub

Error initializing storage of type etcd: failed to get etcd API version: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.3.17.172:2379: i/o timeout
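
10.3.17.172 is presumably the ClusterIP of the etcd client service Vault is configured against. A quick way to confirm the service exists and whether etcd answers, assuming the operator's <cluster-name>-client service naming:

kubectl -n automation-hub get svc etcd-client
kubectl -n automation-hub run etcd-probe --rm -it --restart=Never \
  --image=quay.io/coreos/etcd:v3.2.13 -- etcdctl --endpoints=http://etcd-client:2379 cluster-health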

Environment:

  • Hub CLI version (use hub version):
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:44:19Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T06:59:37Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
  • Toolbox docker tag:
  • Others:
@eLco eLco added this to the Sprint 28 milestone Jan 17, 2019
@eLco eLco self-assigned this Jan 17, 2019
@eLco eLco added the bug and etcd labels Jan 17, 2019
@arkadijs
Contributor

IIRC that's the discussion that ended with no resolution:
coreos/etcd-operator#1323

@eLco
Contributor · Author

eLco commented Jan 17, 2019

It doesn't solve our problem with the etcd and Vault combination, though. We need to abandon the etcd operator then and migrate to a simple etcd component.
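
For the record, a minimal sketch of what such a plain etcd component could look like, as a single-member StatefulSet plus a client Service (all names, the image tag, and the volume size are illustrative placeholders, not our actual manifests):

kubectl -n automation-hub apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: etcd-client
spec:
  selector:
    app: etcd
  ports:
  - name: client
    port: 2379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd-client
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.2.13
        command:
        - /usr/local/bin/etcd
        - --data-dir=/var/etcd/data
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://etcd-client:2379
        ports:
        - containerPort: 2379
        volumeMounts:
        - name: data
          mountPath: /var/etcd/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF

A single member backed by a PersistentVolume survives a node restart by simply being rescheduled, which is exactly what the operator-managed cluster failed to do here.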

@arkadijs
Contributor

Also backup and restore.
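
For reference, backup is currently driven by an EtcdBackup CR along these lines (per the upstream etcd-operator examples; the S3 path and secret are placeholders), which would need replacing as well:

kubectl -n automation-hub apply -f - <<EOF
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: etcd-backup
spec:
  etcdEndpoints:
  - http://etcd-client:2379
  storageType: S3
  s3:
    path: <bucket>/<backup-file>
    awsSecret: <aws-credentials-secret>
EOF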

@eLco eLco modified the milestones: Sprint 28, Sprint 29 Jan 21, 2019
@eLco eLco removed this from the Sprint 29 milestone Feb 8, 2019
@eLco eLco removed their assignment Oct 29, 2019
@arkadijs
Contributor

arkadijs commented Nov 7, 2019

Vault's etcd storage backend has been superseded by S3.
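
I.e. Vault no longer stores data in etcd. For reference, the relevant Vault configuration stanza looks roughly like this (bucket, region, and file name are placeholders):

cat > vault-config.hcl <<EOF
storage "s3" {
  bucket = "<vault-bucket>"
  region = "us-east-2"
}
EOF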
