
Add cluster member fail when restore etcd data #7615
Closed · luweijie007 opened this issue Mar 28, 2017 · 18 comments

luweijie007 (Author) commented Mar 28, 2017

I used etcdctl backup to restore my etcd data:
1> etcdctl backup --data-dir /opt/dzhyun/etcd-cluster-1/data/ --wal-dir /opt/dzhyun/etcd-cluster-1/data/ --backup-dir /home/wwf/etcd_back/ --backup-wal-dir home/wwf/etcd_back/

2> etcd -data-dir=/home/wwf/etcd_back/ -force-new-cluster --name infra0 --initial-advertise-peer-urls http://10.15.209.165:2480 --listen-peer-urls http://10.15.209.165:2480 --listen-client-urls http://10.15.209.165:2479,http://127.0.0.1:2479 --advertise-client-urls http://10.15.209.165:2479 --initial-cluster-token etcd-cluster-1 --initial-cluster infra0=http://10.15.209.165:2480 --initial-cluster-state new

I cannot find any data in this newly restored etcd server. Why?
I guessed that to read the data from this restored etcd, I might need to create 2 new etcd members and form a three-node cluster. So I ran 2 new etcd servers on other machines, as follows:
//on 10.15.107.143:
./etcd --initial-advertise-peer-urls http://10.15.107.143:2380 --listen-peer-urls http://10.15.107.143:2380 --listen-client-urls http://10.15.107.143:2379,http://127.0.0.1:2379 --advertise-client-urls http://10.15.107.143:2379
//on 10.15.107.141:
./etcd --name infra1 --initial-advertise-peer-urls http://10.15.107.141:2381 --listen-peer-urls http://10.15.107.141:2381 --listen-client-urls http://10.15.107.141:2379,http://127.0.0.1:2379 --advertise-client-urls http://10.15.107.141:2379

On 10.15.209.165 I tried to add these 2 new etcd servers as cluster members:
[root@10 member]# etcdctl --endpoint 10.15.209.165:2479 member add infra0 http://10.15.107.141:2379
Added member named infra0 with ID 3d6bf7d7459a39cb to cluster

ETCD_NAME="infra0"
ETCD_INITIAL_CLUSTER="infra0=http://10.15.209.165:2480,infra0=http://10.15.107.141:2379"
ETCD_INITIAL_CLUSTER_STATE="existing"

But adding the other one failed:

[root@10 member]# etcdctl --endpoint 10.15.209.165:2479 member add infra1 http://10.15.107.143:2379
client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://10.15.209.165:2479 has no leader

The etcd server on 10.15.209.165 printed logs as follows:
2017-03-28 14:44:06.063336 I | raft: 30c2969a5d0e09f0 is starting a new election at term 297
2017-03-28 14:44:06.063395 I | raft: 30c2969a5d0e09f0 became candidate at term 298
2017-03-28 14:44:06.063416 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 298
2017-03-28 14:44:06.063437 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 298
2017-03-28 14:44:07.763282 I | raft: 30c2969a5d0e09f0 is starting a new election at term 298
2017-03-28 14:44:07.763341 I | raft: 30c2969a5d0e09f0 became candidate at term 299
2017-03-28 14:44:07.763362 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 299
2017-03-28 14:44:07.763381 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 299
2017-03-28 14:44:09.063349 I | raft: 30c2969a5d0e09f0 is starting a new election at term 299
2017-03-28 14:44:09.063412 I | raft: 30c2969a5d0e09f0 became candidate at term 300
2017-03-28 14:44:09.063434 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 300
2017-03-28 14:44:09.063452 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 300
2017-03-28 14:44:09.970830 W | rafthttp: health check for peer 3d6bf7d7459a39cb could not connect: json: cannot unmarshal number into Go value of type probing.Health
2017-03-28 14:44:10.363266 I | raft: 30c2969a5d0e09f0 is starting a new election at term 300
2017-03-28 14:44:10.363310 I | raft: 30c2969a5d0e09f0 became candidate at term 301
2017-03-28 14:44:10.363330 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 301
2017-03-28 14:44:10.363350 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 301
2017-03-28 14:44:12.063308 I | raft: 30c2969a5d0e09f0 is starting a new election at term 301
2017-03-28 14:44:12.063382 I | raft: 30c2969a5d0e09f0 became candidate at term 302
2017-03-28 14:44:12.063405 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 302
2017-03-28 14:44:12.063426 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 302
2017-03-28 14:44:13.963319 I | raft: 30c2969a5d0e09f0 is starting a new election at term 302
2017-03-28 14:44:13.963374 I | raft: 30c2969a5d0e09f0 became candidate at term 303
2017-03-28 14:44:13.963412 I | raft: 30c2969a5d0e09f0 received MsgVoteResp from 30c2969a5d0e09f0 at term 303
2017-03-28 14:44:13.963433 I | raft: 30c2969a5d0e09f0 [logterm: 2, index: 21] sent MsgVote request to 3d6bf7d7459a39cb at term 303
2017-03-28 14:44:14.971181 W | rafthttp: health check for peer 3d6bf7d7459a39cb could not connect: json: cannot unmarshal number into Go value of type probing.Health
// ... the same messages repeat ...

Can someone tell me how to do an etcd restore? I have read:
https://github.com/coreos/etcd/blob/40ae83beab6ecc55ed64825bac59db21a7e0c2c2/Documentation/op-guide/recovery.md
and
https://github.com/coreos/etcd/blob/40ae83beab6ecc55ed64825bac59db21a7e0c2c2/Documentation/v2/admin_guide.md#disaster-recovery

And my final question: I have an etcd cluster that holds both v2 and v3 data. How can I restore this cluster's data?

fanminshi (Member) commented:

Taking a look.

fanminshi (Member) commented Mar 28, 2017

"I cannot find any data in this newly restored etcd server. Why?"

From your command:
etcdctl backup --data-dir /opt/dzhyun/etcd-cluster-1/data/ --wal-dir /opt/dzhyun/etcd-cluster-1/data/ --backup-dir /home/wwf/etcd_back/ --backup-wal-dir home/wwf/etcd_back/

It seems to me that your WAL files are inside your data dir, so there is no need to specify the --wal-dir flag.

Try:
etcdctl backup --data-dir /opt/dzhyun/etcd-cluster-1/data/ --backup-dir /home/wwf/etcd_back/

Then starting etcd with the new backup dir should work.
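
For example, something like this, reusing the flags from your first post (a sketch, not verified here):

# start a single-member cluster from the backup dir
# (flags taken from the original report in this thread)
etcd --data-dir /home/wwf/etcd_back/ --force-new-cluster \
  --name infra0 \
  --initial-advertise-peer-urls http://10.15.209.165:2480 \
  --listen-peer-urls http://10.15.209.165:2480 \
  --listen-client-urls http://10.15.209.165:2479,http://127.0.0.1:2479 \
  --advertise-client-urls http://10.15.209.165:2479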

luweijie007 (Author) commented:

@fanminshi thanks for your suggestion!
I tried:
etcdctl backup --data-dir /opt/dzhyun/etcd-cluster-1/data/ --backup-dir /home/wwf/etcd_back/
and then:
[screenshot of the etcd startup error]
It still doesn't work, and I have checked that the file /home/wwf/etcd_back/member/snap/db does not exist.

fanminshi (Member) commented Mar 29, 2017

I was able to reproduce the same error (/home/wwf/etcd_back/member/snap/db does not exist).

Setup:
etcd Version: 3.2.0+git
Git SHA: 123b258
Go Version: go1.8
Go OS/Arch: darwin/amd64

Steps:

$ bin/etcd --snapshot-count 5
...
2017-03-29 15:59:28.008316 I | etcdserver: start to snapshot (applied: 6, lastsnap: 0)
2017-03-29 15:59:28.027225 I | etcdserver: saved snapshot at index 6
2017-03-29 15:59:28.027254 I | etcdserver: compacted raft log at 1
...

// another window
// trigger etcd to snapshot
$ ETCDCTL_API=2 bin/etcdctl set foo bar1
bar1
$ ETCDCTL_API=2 bin/etcdctl set foo bar2
bar2
$ ETCDCTL_API=2 bin/etcdctl set foo bar3
bar3
$ ETCDCTL_API=2 bin/etcdctl set foo bar4
bar4
$ ETCDCTL_API=2 bin/etcdctl set foo bar5
bar5
$ ETCDCTL_API=2 bin/etcdctl set foo bar6
bar6

$ tree default.etcd/
default.etcd/
└── member
    ├── snap
    │   ├── 0000000000000002-0000000000000006.snap
    │   └── db
    └── wal
        └── 0000000000000000-0000000000000000.wal

// backup
$ ETCDCTL_API=2 bin/etcdctl backup --data-dir default.etcd/ --backup-dir backup/
// backup doesn't contain a db file
$ tree backup
backup
└── member
    ├── snap
    │   └── 0000000000000002-0000000000000006.snap
    └── wal
        └── 0000000000000000-0000000000000000.wal

// kill the old etcd process
// start new one with backup
$ bin/etcd -data-dir backup -force-new-cluster
2017-03-29 16:06:53.180176 I | etcdserver: recovered store from snapshot at index 6
2017-03-29 16:06:53.180185 I | etcdserver: name = default
2017-03-29 16:06:53.180198 I | etcdserver: force new cluster
2017-03-29 16:06:53.180200 I | etcdserver: data dir = backup
2017-03-29 16:06:53.180203 I | etcdserver: member dir = backup/member
2017-03-29 16:06:53.180208 I | etcdserver: heartbeat = 100ms
2017-03-29 16:06:53.180210 I | etcdserver: election = 1000ms
2017-03-29 16:06:53.180212 I | etcdserver: snapshot count = 100000
2017-03-29 16:06:53.180217 I | etcdserver: advertise client URLs = http://localhost:2379
2017-03-29 16:06:53.237776 I | etcdserver: forcing restart of member 5b1c4f256b01 in cluster 5b1c4f256b02 at commit index 12
2017-03-29 16:06:53.237886 I | raft: 5b1c4f256b01 became follower at term 2
2017-03-29 16:06:53.237915 I | raft: newRaft 5b1c4f256b01 [peers: [8e9e05c52164694d], term: 2, commit: 12, applied: 6, lastindex: 12, lastterm: 2]
2017-03-29 16:06:53.238150 I | etcdserver/api: enabled capabilities for version 3.2
2017-03-29 16:06:53.238175 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster 5b1c4f256b02 from store
2017-03-29 16:06:53.238183 I | etcdserver/membership: set the cluster version to 3.2 from store
2017-03-29 16:06:53.243627 C | etcdmain: database file (backup/member/snap/db) of the backend is missing

The issue is that the db file is not present in the backup folder; etcd fails at this check: https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L391

@luweijie007 I am investigating this issue. I'll let you know my progress.

heyitsanthony (Contributor) commented:

3.1 expects a db file since restoring from backup is expected to come from an etcdctl snapshot restore. The simplest workaround would probably be to add an empty db file when creating the backup with etcdctl backup.
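
A minimal sketch of that workaround, assuming the db path shown in the startup error above (backup/member/snap/db):

# hypothetical sketch: create an empty backend db file where the
# 3.1+ server expects to find it
mkdir -p backup/member/snap
touch backup/member/snap/db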

fanminshi (Member) commented:

@heyitsanthony I was also able to get around this issue by copying the db file from the original data-dir to the backup-dir after running etcdctl backup. However, I wasn't sure if that's correct. It also seems to me that etcdctl snapshot save only saves v3 key-value pairs, not v2 key-value pairs.
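
The copy step was roughly this (a sketch using the paths from this thread, with the db location implied by the startup error; see the caveat below):

# copy the backend db from the original data dir into the backup
cp /opt/dzhyun/etcd-cluster-1/data/member/snap/db /home/wwf/etcd_back/member/snap/db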

heyitsanthony (Contributor) commented:

@fanminshi there is already an open issue on this subject, #7002. I don't think copying the db file is safe when doing a v2 restore, because the WAL's membership data will not match the membership data in the db.

fanminshi (Member) commented:

@heyitsanthony agreed.

heyitsanthony (Contributor) commented:

@luweijie007 is the cluster only storing v3 keys? If so, that would explain why the data isn't showing up after etcdctl backup / restore. Try etcdctl snapshot save and restore; there's an example in the etcd3 recovery guide.
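
The v3 flow looks roughly like this (a sketch based on the recovery guide linked earlier; the member name and URLs are placeholders, not values from this thread):

# take a v3 snapshot from a live member
ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:2379 snapshot save snapshot.db

# restore it into a fresh data dir for a new single-member cluster
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir m1.etcd \
  --name m1 \
  --initial-cluster m1=http://host1:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host1:2380

# then start etcd from the restored data dir
etcd --name m1 --data-dir m1.etcd \
  --initial-advertise-peer-urls http://host1:2380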

fanminshi (Member) commented Mar 29, 2017

@luweijie007

After backing up with
etcdctl backup --data-dir /opt/dzhyun/etcd-cluster-1/data/ --backup-dir /home/wwf/etcd_back/

create an empty db file with
touch /home/wwf/etcd_back/db

Then starting etcd with the new backup dir should work; I tested that with a cluster storing both v2 and v3 keys.

edit: this doesn't work as intended. see #7615 (comment) below.

luweijie007 (Author) commented Mar 30, 2017

@heyitsanthony @fanminshi
The cluster I want to restore keeps both v2 and v3 keys. I did the restore following this doc:
https://github.com/coreos/etcd/blob/40ae83beab6ecc55ed64825bac59db21a7e0c2c2/Documentation/v2/admin_guide.md#disaster-recovery
OK, I will try your way, @fanminshi.
Thanks!

heyitsanthony (Contributor) commented:

@luweijie007 it's not possible to restore both v2 and v3 keys; hence issue #7002. etcdctl backup will only save v2 keys.

luweijie007 (Author) commented:

@heyitsanthony, you mean that there is no way to restore a cluster that keeps both v2 and v3 keys?
But from this comment from @fanminshi:
[screenshot of fanminshi's workaround comment above]
I thought it could restore both v2 and v3 keys.
Given issue #7002, maybe I cannot do this.

heyitsanthony (Contributor) commented:

@luweijie007 that's about storing keys, not retrieving old keys. The v3 keys are held in the db; creating an empty db file won't restore them.

luweijie007 (Author) commented:

OK, I tried @fanminshi's instructions. The new cluster can get the v2 keys, but only a few of the v3 keys, not all of them.
So the result is that a cluster which stores both v2 and v3 keys has no way to restore all of its keys at this time?
@fanminshi, so can I store both v2 and v3 keys on the same etcd cluster?

heyitsanthony (Contributor) commented:

@luweijie007 what is happening in this case is that the backed-up WAL still has some v3 proposals in it. However, there's no guarantee it will have all the v3 keys, since the WAL is periodically pruned and the v3 keys are saved into the db. It's not a reliable backup method for v3 keys.

It's possible to store both v2 and v3 keys in an etcd cluster, but there's no official way to restore both into a new cluster.

luweijie007 (Author) commented Mar 30, 2017

@heyitsanthony @fanminshi
OK, I understand the restore situation from your comments.
Thanks a lot!
