Panic: snapshot file not found after power failure. Is there any solution? #12492

Closed
mamiapatrick opened this issue Nov 25, 2020 · 6 comments

mamiapatrick commented Nov 25, 2020

What kind of request is this (question/bug/enhancement/feature request):
BUG

Steps to reproduce (least amount of steps as possible):
Power failure of rancher node (RancherOS)

Result:
By inspecting the log I got an error

etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path 

Other details that may be helpful:

Environment information

  • Rancher version (`rancher/rancher`/`rancher/server` image tag or shown bottom left in the UI): rancher/rancher:v2.3.2
  • Installation option (single install/HA): HA (3 nodes of etcd and 2 master)

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM, 8 CPU, 8 GB RAM
  • Kubernetes version (use `kubectl version`): v1.15.5
  • Docker version (use `docker version`):
Rancher 2.3.1
RKE 0.3.1
Kubernetes v1.15.5
docker: 18.06.3-ce
Contributor

tedyu commented Nov 28, 2020

Can you `ls` the contents of the snap directory?

Was there a SIGSEGV stack trace similar to the one shown in #12237?

Author

mamiapatrick commented Dec 1, 2020

Yes, I got the same SIGSEGV stack trace. The content of the folder is:
[root@rancheros etcd]# tree
.
`-- member
    |-- snap
    |   |-- 000000000000007c-0000000002a271fa.snap
    |   |-- 000000000000007c-0000000002a3f89b.snap
    |   |-- 000000000000007e-0000000002a57f3c.snap
    |   |-- 000000000000007f-0000000002a705dd.snap
    |   |-- 0000000000000080-0000000002a88c7e.snap
    |   `-- db
    `-- wal
        |-- 0000000000000566-0000000002a77322.wal
        |-- 0000000000000567-0000000002a8010a.wal
        |-- 0000000000000568-0000000002a888f5.wal
        |-- 0000000000000569-0000000002a90594.wal
        |-- 000000000000056a-0000000002a985c6.wal
        `-- 1.tmp

3 directories, 12 files

and the stack trace is

rancher/rancher:v2.3.1.log

{"log":"2020/11/14 10:31:52 [INFO] Rancher version v2.3.1 is starting\n","stream":"stdout","time":"2020-11-14T10:31:52.467448468Z"}
{"log":"2020/11/14 10:31:52 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:auto Embedded:false KubeConfig: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false NoCACerts:false ListenConfig:\u003cnil\u003e AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features:}\n","stream":"stdout","time":"2020-11-14T10:31:52.467563717Z"}
{"log":"2020/11/14 10:31:52 [INFO] Listening on /tmp/log.sock\n","stream":"stdout","time":"2020-11-14T10:31:52.467628423Z"}
{"log":"2020/11/14 10:31:52 [INFO] Running etcd --data-dir=management-state/etcd\n","stream":"stdout","time":"2020-11-14T10:31:52.468826968Z"}
{"log":"2020-11-14 10:31:52.497291 W | pkg/flags: unrecognized environment variable ETCD_URL_arm64=https://github.com/etcd-io/etcd/releases/download/v3.3.14/etcd-v3.3.14-linux-arm64.tar.gz\n","stream":"stderr","time":"2020-11-14T10:31:52.497585627Z"}
{"log":"2020-11-14 10:31:52.497716 W | pkg/flags: unrecognized environment variable ETCD_URL_amd64=https://github.com/etcd-io/etcd/releases/download/v3.3.14/etcd-v3.3.14-linux-amd64.tar.gz\n","stream":"stderr","time":"2020-11-14T10:31:52.497944845Z"}
{"log":"2020-11-14 10:31:52.497998 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=amd64\n","stream":"stderr","time":"2020-11-14T10:31:52.498213733Z"}
{"log":"2020-11-14 10:31:52.498322 W | pkg/flags: unrecognized environment variable ETCD_URL=ETCD_URL_amd64\n","stream":"stderr","time":"2020-11-14T10:31:52.498554465Z"}
{"log":"2020-11-14 10:31:52.498627 I | etcdmain: etcd Version: 3.3.14\n","stream":"stderr","time":"2020-11-14T10:31:52.498833307Z"}
{"log":"2020-11-14 10:31:52.498954 I | etcdmain: Git SHA: 5cf5d88a1\n","stream":"stderr","time":"2020-11-14T10:31:52.499254549Z"}
{"log":"2020-11-14 10:31:52.499118 I | etcdmain: Go Version: go1.12.9\n","stream":"stderr","time":"2020-11-14T10:31:52.499637835Z"}
{"log":"2020-11-14 10:31:52.499454 I | etcdmain: Go OS/Arch: linux/amd64\n","stream":"stderr","time":"2020-11-14T10:31:52.499681696Z"}
{"log":"2020-11-14 10:31:52.499539 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4\n","stream":"stderr","time":"2020-11-14T10:31:52.499960028Z"}
{"log":"2020-11-14 10:31:52.500200 N | etcdmain: the server is already initialized as member before, starting as etcd member...\n","stream":"stderr","time":"2020-11-14T10:31:52.500393874Z"}
{"log":"2020-11-14 10:31:52.501685 I | embed: listening for peers on http://localhost:2380\n","stream":"stderr","time":"2020-11-14T10:31:52.501900448Z"}
{"log":"2020-11-14 10:31:52.502430 I | embed: listening for client requests on localhost:2379\n","stream":"stderr","time":"2020-11-14T10:31:52.502596206Z"}
{"log":"2020-11-14 10:31:52.571703 I | etcdserver: recovered store from snapshot at index 44600446\n","stream":"stderr","time":"2020-11-14T10:31:52.572105412Z"}
{"log":"2020-11-14 10:31:52.595485 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist\n","stream":"stderr","time":"2020-11-14T10:31:52.595732324Z"}
{"log":"panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist\n","stream":"stderr","time":"2020-11-14T10:31:52.599286335Z"}
{"log":"\u0009panic: runtime error: invalid memory address or nil pointer dereference\n","stream":"stderr","time":"2020-11-14T10:31:52.599322369Z"}
{"log":"[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xbc425e]\n","stream":"stderr","time":"2020-11-14T10:31:52.599528396Z"}
{"log":"\n","stream":"stderr","time":"2020-11-14T10:31:52.599556917Z"}
{"log":"goroutine 1 [running]:\n","stream":"stderr","time":"2020-11-14T10:31:52.599572197Z"}
{"log":"github.com/coreos/etcd/etcdserver.NewServer.func1(0xc000287d38, 0xc000286ab8)\n","stream":"stderr","time":"2020-11-14T10:31:52.599587522Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/etcdserver/server.go:293 +0x3e\n","stream":"stderr","time":"2020-11-14T10:31:52.599603075Z"}
{"log":"panic(0xdfac80, 0xc0001900f0)\n","stream":"stderr","time":"2020-11-14T10:31:52.599618729Z"}
{"log":"\u0009/usr/local/go/src/runtime/panic.go:522 +0x1b5\n","stream":"stderr","time":"2020-11-14T10:31:52.59984813Z"}
{"log":"github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc00020a740, 0xfbb334, 0x2a, 0xc0002a6b88, 0x1, 0x1)\n","stream":"stderr","time":"2020-11-14T10:31:52.599888332Z"}
{"log":"\u0009/Users/leegyuho/go/pkg/mod/github.com/coreos/[email protected]/capnslog/pkg_logger.go:75 +0x135\n","stream":"stderr","time":"2020-11-14T10:31:52.599906669Z"}
{"log":"github.com/coreos/etcd/etcdserver.NewServer(0xf976d8, 0x7, 0x0, 0x0, 0x0, 0x0, 0xc000272780, 0x1, 0x1, 0xc000272680, ...)\n","stream":"stderr","time":"2020-11-14T10:31:52.599955756Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/etcdserver/server.go:388 +0x2c7b\n","stream":"stderr","time":"2020-11-14T10:31:52.59997304Z"}
{"log":"github.com/coreos/etcd/embed.StartEtcd(0xc000278000, 0xc000278480, 0x0, 0x0)\n","stream":"stderr","time":"2020-11-14T10:31:52.599988177Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/embed/etcd.go:179 +0x7da\n","stream":"stderr","time":"2020-11-14T10:31:52.60000345Z"}
{"log":"github.com/coreos/etcd/etcdmain.startEtcd(0xc000278000, 0xf968d7, 0x6, 0x1, 0xc00021ee00)\n","stream":"stderr","time":"2020-11-14T10:31:52.60014219Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/etcdmain/etcd.go:181 +0x40\n","stream":"stderr","time":"2020-11-14T10:31:52.600203522Z"}
{"log":"github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()\n","stream":"stderr","time":"2020-11-14T10:31:52.600221782Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/etcdmain/etcd.go:102 +0x13fb\n","stream":"stderr","time":"2020-11-14T10:31:52.600236869Z"}
{"log":"github.com/coreos/etcd/etcdmain.Main()\n","stream":"stderr","time":"2020-11-14T10:31:52.600252073Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/etcdmain/main.go:46 +0x38\n","stream":"stderr","time":"2020-11-14T10:31:52.600513301Z"}
{"log":"main.main()\n","stream":"stderr","time":"2020-11-14T10:31:52.600546156Z"}
{"log":"\u0009/tmp/etcd-release-3.3.14/etcd/release/etcd/main.go:28 +0x20\n","stream":"stderr","time":"2020-11-14T10:31:52.600561853Z"}
{"log":"2020/11/14 10:31:52 [FATAL] etcd exited\n","stream":"stdout","time":"2020-11-14T10:31:52.602190562Z"}

Contributor

ptabor commented Jan 8, 2021

{"log":"2020-11-14 10:31:52.571703 I | etcdserver: recovered store from snapshot at index 44600446\n","stream":"stderr","time":"2020-11-14T10:31:52.572105412Z"}
{"log":"2020-11-14 10:31:52.595485 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist\n","stream":"stderr","time":"2020-11-14T10:31:52.595732324Z"}

44600446 = 0x2a88c7e

So the file seems to exist:
0000000000000080-0000000002a88c7e.snap
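
For anyone repeating this check, here is a minimal sketch of the conversion. It assumes, based only on the directory listing earlier in this thread, that `.snap` names follow a `<term>-<index>.snap` pattern with both fields as 16-digit zero-padded hex. `snap_file_name` is a hypothetical helper written for illustration, not an etcd function.

```python
def snap_file_name(term: int, index: int) -> str:
    """Build a name in the <term>-<index>.snap pattern seen in the listing.

    Assumption: both fields are 16-digit, zero-padded, lowercase hex.
    """
    return f"{term:016x}-{index:016x}.snap"


if __name__ == "__main__":
    # Index taken from "recovered store from snapshot at index 44600446".
    index = 44600446
    print(hex(index))
    # Term 0x80 is read off the newest .snap file in the listing.
    print(snap_file_name(0x80, index))
```

The same conversion can be applied to any index reported in a "recovered store from snapshot at index N" line to see which file name to look for.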

Does it have a regular size and proper permissions?
From the perspective of this code (https://chromium.googlesource.com/external/github.com/coreos/etcd/+/v3.2.22/snap/db.go#69), the process cannot locate the 'existing' file.
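
One way to answer the size and permission question is a small sketch like the one below. It only uses the Python standard library; `describe_file` is a hypothetical helper, and the path in the example is an assumption pieced together from the `--data-dir` value in the log and the file name above, so adjust it to the real data directory.

```python
import os
import stat


def describe_file(path: str) -> dict:
    """Return existence, type, size, mode, and ownership for one file."""
    if not os.path.exists(path):
        return {"exists": False}
    st = os.stat(path)
    return {
        "exists": True,
        "regular": stat.S_ISREG(st.st_mode),   # True for a plain file
        "size": st.st_size,                    # bytes; 0 would be suspicious
        "mode": stat.filemode(st.st_mode),     # ls-style permission string
        "uid": st.st_uid,
        "gid": st.st_gid,
    }


if __name__ == "__main__":
    # Assumed path: data dir from the log plus the .snap name from the listing.
    print(describe_file(
        "management-state/etcd/member/snap/0000000000000080-0000000002a88c7e.snap"
    ))
```

Comparing the reported uid and gid with the user the etcd process runs as would show whether ownership is the problem.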

If the file exists and has proper permissions and ownership, can you inspect it using this tool?

https://docs.google.com/document/d/1O2o1IApHWmSioXG3fez4eVlUHOrXICYGNVIzaqNS0IQ/edit?resourcekey=0-e6Iywgdkol0uiVBAaV1oww#heading=h.wmsrr7l6rw5k

BTW: There is no good reason for etcd to depend on the existence of these files, as they contain V2 store content that is redundant with the snap/db file content. I think we should develop a flag that allows etcd to ignore the files and rely on the snap/db and WAL files alone.


stale bot commented Apr 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 9, 2021
@stale stale bot closed this as completed Apr 30, 2021
@veerendra2

I had a similar issue. The etcd pod failed to start after an unexpected Azure VM reboot.

2021-06-28 16:12:32.246822 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt
2021-06-28 16:12:32.246900 I | etcdmain: etcd Version: 3.2.22
2021-06-28 16:12:32.246910 I | etcdmain: Git SHA: 1674e682f
2021-06-28 16:12:32.246915 I | etcdmain: Go Version: go1.8.7
2021-06-28 16:12:32.246922 I | etcdmain: Go OS/Arch: linux/amd64
2021-06-28 16:12:32.246928 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2021-06-28 16:12:32.246984 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2021-06-28 16:12:32.247035 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2021-06-28 16:12:32.248344 I | embed: listening for peers on https://10.0.1.21:2380 
2021-06-28 16:12:32.248912 I | embed: listening for client requests on 10.0.1.21:2379
2021-06-28 16:12:32.281703 I | etcdserver: recovered store from snapshot at index 444604492
2021-06-28 16:12:32.303933 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
 panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb7bbdc]
goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201cc398, 0xc4201cc170)
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:284 +0x3c
panic(0xdb1280, 0xc42000e230)
 /usr/local/google/home/jpbetz/.gvm/gos/go1.8.7/src/runtime/panic.go:489 +0x2cf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc4201c2020, 0xf98878, 0x2a, 0xc4201cc1e0, 0x1, 0x1)
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc420056ea0, 0x0, 0x14b0580, 0xc42000e0b0)
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:379 +0x2e75
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc42004c800, 0xc420074a80, 0x0, 0x0)
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:157 +0x76a
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc42004c800, 0x6, 0xf73bf3, 0x6, 0x1)
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x1579
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
 /tmp/etcd-release-3.2.22/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

In my case, 2 other members were up and running. Below are the things I did:

## 1. Start the failed etcd pod (etcd-1) in debug mode in the OpenShift dashboard

## 2. Take a backup snapshot on etcd-0 (etcd-0 is running fine)
$ oc exec -it master-etcd-ocp-master-prd-0 -c etcd -- /bin/sh -c "ETCDCTL_API=3 etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints 10.0.1.20:2379 snapshot save /var/lib/etcd/snapshot-node0-$(date +%m%d).db";

## 3. Copy the snapshot file from etcd-0 to your local machine
$ oc rsync master-etcd-ocp-master-prd-0:/var/lib/etcd/snapshot-node0-$(date +%m%d).db .

## 4. Sync the directory containing the snapshot file to the debug pod
$ oc rsync ./snapshot/ master-etcd-ocp-master-prd-1-debug:/var/lib/etcd/

## 5. Copy the snapshot db file to /var/lib/etcd/member/snap/
## This command should be run in the debug pod
$ cp /var/lib/etcd/snapshot-node0-0630.db /var/lib/etcd/member/snap/db

## 6. Wait for the pod to restart; etcd-1 then came up and rejoined the cluster


csidiro commented Dec 3, 2021

Thanks @veerendra2 for this. Had the same issue and the steps you mentioned here worked for me as well to recover a failing etcd node.
