raft: state.commit is out of range #5664

Closed
heyitsanthony opened this issue Jun 14, 2016 · 21 comments · Fixed by #5690

Comments

@heyitsanthony
Contributor

via local-tester with reordering:

2016-06-13 17:47:28.006281 I | etcdmain: etcd Version: 3.0.0-beta.0+git
2016-06-13 17:47:28.006332 I | etcdmain: Git SHA: 65e19a1
2016-06-13 17:47:28.006337 I | etcdmain: Go Version: go1.6
2016-06-13 17:47:28.006348 I | etcdmain: Go OS/Arch: linux/amd64
2016-06-13 17:47:28.006353 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2016-06-13 17:47:28.006359 W | etcdmain: no data-dir provided, using default data-dir ./infra1.etcd
2016-06-13 17:47:28.006391 N | etcdmain: the server is already initialized as member before, starting as etcd member... 
2016-06-13 17:47:28.006451 I | etcdmain: listening for peers on http://127.0.0.1:12380
2016-06-13 17:47:28.006474 I | etcdmain: listening for client requests on 127.0.0.1:11119 
2016-06-13 17:47:28.013070 I | etcdserver: recovered store from snapshot at index 38626 
2016-06-13 17:47:28.013081 I | etcdserver: name = infra1
2016-06-13 17:47:28.013087 I | etcdserver: data dir = infra1.etcd 
2016-06-13 17:47:28.013093 I | etcdserver: member dir = infra1.etcd/member
2016-06-13 17:47:28.013098 I | etcdserver: heartbeat = 100ms
2016-06-13 17:47:28.013104 I | etcdserver: election = 1000ms
2016-06-13 17:47:28.013109 I | etcdserver: snapshot count = 1000
2016-06-13 17:47:28.013118 I | etcdserver: advertise client URLs = http://127.0.0.1:2379
2016-06-13 17:47:28.065753 I | etcdserver: restarting member 5da0b1f0ade347d1 in cluster ea3db81f3897e3ad at commit index 38040
2016-06-13 17:47:28.065854 C | raft: 5da0b1f0ade347d1 state.commit 38040 is out of range [38626, 38626]
panic: 5da0b1f0ade347d1 state.commit 38040 is out of range [38626, 38626]
goroutine 1 [running]:
panic(0xcd2fa0, 0xc8205ff170) 
        /usr/lib/go/src/runtime/panic.go:464 +0x3e6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc82010d100, 0x119e200, 0x2b, 0xc8204a5d40, 0x4, 0x4)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:73 +0x191
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc820204340, 0x67a, 0xd2643f51f16cc22b, 0x9498, 0x0, 0x0, 0x0)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc8201f7b30, 0x451b20)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc8201f7b30, 0x0, 0x0)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:212 +0x45
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc8202163c0, 0xc820208990, 0x29, 0xc8201f7f68, 0x0, 0x0, 0x0, 0x0)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:361 +0x7c7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc8202163c0, 0x0, 0x0, 0x0)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:350 +0x3cf2
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc820160800, 0x0, 0x0, 0x0)
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:366 +0x23ea
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x213d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
        /home/anthony/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
main.main()
        /home/anthony/go/src/github.com/coreos/etcd/cmd/main.go:28 +0x14
Terminating etcd1
@xiang90
Contributor

xiang90 commented Jun 15, 2016

Seems like a bad assumption we made in the raft library.
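
For context, the check that fires here lives in raft.(*raft).loadState (raft/raft.go in the traces above). Paraphrased from memory rather than copied from the vendored source, it asserts that a restored HardState's commit index must fall inside the bounds the recovered log already covers:

```go
// Paraphrase of the loadState referenced in the stack traces above; not an
// exact copy of the vendored code.
func (r *raft) loadState(state pb.HardState) {
	if state.Commit < r.raftLog.committed || state.Commit > r.raftLog.lastIndex() {
		r.logger.Panicf("%x state.commit %d is out of range [%d, %d]",
			r.id, state.Commit, r.raftLog.committed, r.raftLog.lastIndex())
	}
	r.raftLog.committed = state.Commit
	r.Term = state.Term
	r.Vote = state.Vote
}
```

In the first trace, the snapshot restores the log to [38626, 38626] while the WAL's HardState still records commit 38040, which trips the lower bound.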

@siddontang
Contributor

Can reordering the raft messages in the test reproduce this? @xiang90

@xiang90
Contributor

xiang90 commented Jun 15, 2016

@siddontang Probably. You need a full raft restart plus an out-of-order message from the previous connection (or a sender that holds on to it for a really long time). I do not expect this to happen often in practice, but yes, we need to fix it.
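
For anyone trying to reproduce the panic path outside local-tester, here is a minimal, hypothetical sketch against the public raft API of that era (github.com/coreos/etcd/raft). It hand-builds a MemoryStorage whose snapshot index is ahead of the persisted HardState commit, which is the shape of state shown in the trace above; it illustrates the invariant, not the exact message-reordering scenario:

```go
package main

import (
	"github.com/coreos/etcd/raft"
	pb "github.com/coreos/etcd/raft/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()

	// Storage recovered from a snapshot taken at index 38626...
	ms.ApplySnapshot(pb.Snapshot{Metadata: pb.SnapshotMetadata{Index: 38626, Term: 2}})

	// ...while the persisted HardState still carries an older commit index.
	ms.SetHardState(pb.HardState{Term: 2, Commit: 38040})

	// RestartNode -> newRaft -> loadState panics with
	// "state.commit 38040 is out of range [38626, 38626]".
	raft.RestartNode(&raft.Config{
		ID:              0x5da0b1f0ade347d1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         ms,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	})
}
```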

@marclennox

I have a node that won't start because of this error. I believe the other members have a bad commit, and the node is therefore failing with this error on recovery.

Any suggestions on how to recover this node and/or the cluster?

@xiang90
Contributor

xiang90 commented Jun 30, 2016

@marclennox Can you please provide the full startup log?

@marclennox

Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989579 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://jrprd-db01.justreply.co:2379
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989739 I | flags: recognized and used environment variable ETCD_CERT_FILE=/parasite-config/conf/etcd/client-cert.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989765 I | flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=1
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989788 I | flags: recognized and used environment variable ETCD_DATA_DIR=/parasite-data/etcd
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989818 I | flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=5000
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989839 I | flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=1000
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989864 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://jrprd-db01.justreply.co:2380
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989973 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=jrprd-db01=https://jrprd-db01.justreply.co:2380,jrprd-web01=https://jrprd-web01.justreply.co:2380,jrstg-db01=https://jrstg-db01.justreply.co:2380,jrstg-web01=https://jrstg-web01.justreply.co:2380,jrprd-ops01=https://jrprd-ops01.justreply.co:2380
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.989996 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990013 I | flags: recognized and used environment variable ETCD_KEY_FILE=/parasite-config/conf/etcd/client-key.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990044 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990062 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990084 I | flags: recognized and used environment variable ETCD_NAME=jrprd-db01
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990138 I | flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/parasite-config/conf/etcd/peer-cert.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990176 I | flags: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=1
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990195 I | flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/parasite-config/conf/etcd/peer-key.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990209 I | flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/parasite-config/conf/etcd/peer-ca.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990241 I | flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/parasite-config/conf/etcd/client-ca.pem
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990407 I | etcdmain: etcd Version: 3.0.0
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990425 I | etcdmain: Git SHA: 6f48bda
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990436 I | etcdmain: Go Version: go1.6.2
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990447 I | etcdmain: Go OS/Arch: linux/amd64
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990459 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990560 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.990609 I | etcdmain: peerTLS: cert = /parasite-config/conf/etcd/peer-cert.pem, key = /parasite-config/conf/etcd/peer-key.pem, ca = , trusted-ca = /parasite-config/conf/etcd/peer-ca.pem, client-cert-auth = true
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.992868 I | etcdmain: listening for peers on https://0.0.0.0:2380
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.992935 I | etcdmain: clientTLS: cert = /parasite-config/conf/etcd/client-cert.pem, key = /parasite-config/conf/etcd/client-key.pem, ca = , trusted-ca = /parasite-config/conf/etcd/client-ca.pem, client-cert-auth = true
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.995093 I | etcdmain: listening for client requests on 0.0.0.0:2379
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998423 I | etcdserver: name = jrprd-db01
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998480 I | etcdserver: data dir = /parasite-data/etcd
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998498 I | etcdserver: member dir = /parasite-data/etcd/member
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998510 I | etcdserver: heartbeat = 1000ms
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998521 I | etcdserver: election = 5000ms
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998533 I | etcdserver: snapshot count = 10000
Jun 30 23:32:15 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:15.998549 I | etcdserver: advertise client URLs = https://jrprd-db01.justreply.co:2379
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:16.000775 I | etcdserver: restarting member 39246a319e218d4a in cluster 7b622c05bd899518 at commit index 51644801
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: 2016-06-30 23:32:16.001104 C | raft: 39246a319e218d4a state.commit 51644801 is out of range [0, 0]
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: panic: 39246a319e218d4a state.commit 51644801 is out of range [0, 0]
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: goroutine 1 [running]:
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: panic(0xd44e00, 0xc82012e430)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /usr/local/go/src/runtime/panic.go:481 +0x3e6
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc8201ba380, 0x1235f80, 0x2b, 0xc820136600, 0x4, 0x4)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).loadState(0xc8204ec0d0, 0x856, 0x0, 0x3140981, 0x0, 0x0, 0x0)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:942 +0x2a2
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.newRaft(0xc820157a88, 0x0)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:225 +0x8ff
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode(0xc820157a88, 0x0, 0x0)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:213 +0x45
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.restartNode(0xc820299680, 0x0, 0x7f22acf5b028, 0xc82019e730, 0x0, 0x0, 0xc82001611a, 0x26)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/raft.go:369 +0x7c7
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc820299680, 0x0, 0x7f22acf5b028, 0xc82019e730)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:353 +0x411d
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc8201cc400, 0x0, 0x0, 0x0)
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:366 +0x23ea
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:116 +0x213d
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:36 +0x21e
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: main.main()
Jun 30 23:32:16 jrprd-db01.justreply.co docker[25549]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/main.go:28 +0x14

@heyitsanthony
Contributor Author

@marclennox this looks like a different problem from the one in the issue (although it panics on the same path). I'll see if I can reproduce this behavior.

In the meantime, I think the easiest fix (but not 100% sure this will work) is to delete the broken node's etcd data directory (back up the directory somewhere first just in case) so that the node can rebuild its raft state on joining the cluster. /cc @xiang90

@marclennox

Thanks @heyitsanthony. I already tried deleting the data directory; I get the same error when it tries to rebuild its raft state.

@xiang90
Contributor

xiang90 commented Jul 1, 2016

@marclennox From the log, it looks like the raft node somehow lost its previous state (is the snapshot file broken or missing?).

Is your cluster still running? The easiest way to recover is to treat the node as a failed one: remove the bad member using the etcd members API, then add it back.

Check: https://github.com/coreos/etcd/blob/release-2.3/Documentation/runtime-configuration.md#replace-a-failed-machine
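
For reference, the remove-then-add flow can also be driven from the v3 client's cluster API. A hedged sketch follows; the endpoint, member ID, and peer URL are placeholders taken from the log above, the TLS setup is omitted, and `etcdctl member remove` / `member add` accomplish the same thing:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Point the client at a healthy member, not the broken one.
	// (This cluster uses client-cert auth, so a real run would also set
	// clientv3.Config.TLS; omitted here for brevity.)
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://jrprd-web01.justreply.co:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Remove the failed member by ID (39246a319e218d4a in the log above)...
	if _, err := cli.MemberRemove(ctx, 0x39246a319e218d4a); err != nil {
		log.Fatal(err)
	}

	// ...then add it back with its peer URL. Wipe its data dir before
	// restarting it with --initial-cluster-state=existing.
	if _, err := cli.MemberAdd(ctx, []string{"https://jrprd-db01.justreply.co:2380"}); err != nil {
		log.Fatal(err)
	}
}
```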

@marclennox

Thanks @xiang90

Now I'm getting the following error:

Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322319 I | raft: db49ed2084c203b4 [commit: 0, lastindex: 0, lastterm: 0] starts to restore snapshot [index: 51674804, term: 2134]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322375 I | raft: log [committed=0, applied=0, unstable.offset=1, len(unstable.Entries)=0] starts to restore snapshot [index: 51674804, term: 2134]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322399 I | raft: db49ed2084c203b4 restored progress of 39246a319e218d4a [next = 51674805, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322423 I | raft: db49ed2084c203b4 restored progress of 4bb64a6466927376 [next = 51674805, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322438 I | raft: db49ed2084c203b4 restored progress of 77d8cbcb900dd306 [next = 51674805, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322452 I | raft: db49ed2084c203b4 restored progress of aee0eec5f4946784 [next = 51674805, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322466 I | raft: db49ed2084c203b4 restored progress of f48c6b505d0dc072 [next = 51674805, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.322899 I | raft: db49ed2084c203b4 [commit: 51674804] restored snapshot [index: 51674804, term: 2134]
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.341853 I | etcdserver: applying snapshot at index 0...
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.347874 C | etcdserver: get database snapshot file path error: snap: snapshot file doesn't exist
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: 2016-07-01 03:16:03.347889 I | etcdserver: finished applying incoming snapshot at index 0
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: panic: get database snapshot file path error: snap: snapshot file doesn't exist
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: goroutine 192 [running]:
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: panic(0xd44e00, 0xc822ece9b0)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /usr/local/go/src/runtime/panic.go:481 +0x3e6
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc820135980, 0x125b5c0, 0x29, 0xc8204d16b8, 0x1, 0x1)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x191
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).applySnapshot(0xc820365200, 0xc8203e2d80, 0xc8216f1ce0)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:650 +0x5a1
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).applyAll(0xc820365200, 0xc8203e2d80, 0xc8216f1ce0)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:611 +0x60
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).run.func2(0x7fe98a741590, 0xc8203e2d40)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:592 +0x32
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.(*fifo).run(0xc8203dfc20)
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:160 +0x323
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule.NewFIFOScheduler
Jul 01 03:16:03 jrprd-db01.justreply.co docker[46726]: /home/gyuho/go/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/schedule/schedule.go:71 +0x27d

@xiang90
Contributor

xiang90 commented Jul 1, 2016

@marclennox What version of etcdserver are you running? Did you clean out the data-dir entirely before rejoining?

@marclennox

I'm running it using the published Docker image quay.io/coreos/etcd:latest, and yes, I definitely cleaned out the data directory.

@marclennox

Oh I see, so this is now version 3.0.0.

@marclennox

I'll revert to 2.3.7 and see how that goes.

@xiang90
Contributor

xiang90 commented Jul 1, 2016

@marclennox OK. Thanks!

@marclennox

Yep, that fixed it. Thanks @xiang90 and @heyitsanthony for helping me work through the problem. :) Sorry for the noise.

@tbchj

tbchj commented Mar 17, 2017

@heyitsanthony how did you solve the problem you mentioned?
Is cleaning the data directory the only way?
My cluster has only one node. What can I do, wipe all of its data?
I don't think that is the best way to solve this.

@heyitsanthony
Contributor Author

heyitsanthony commented Mar 17, 2017

@tbchj if the raft state is corrupted for a single node, then disaster recovery is usually the only way out of it, if possible.

@tbchj

tbchj commented Apr 14, 2017

@heyitsanthony sorry, I've been busy.
I solved the problem I mentioned, but not by wiping all the data (although that might also work).
In the data directory, I removed the broken files from both wal and snap.
After several retries I found that the wal and snap had become inconsistent: the newest wal entries were missing, so I deleted the newest snap file, expecting the state to be rebuilt from the wal into a new snapshot.
That worked; etcd started up again.

@cwx559275

Hi. Does anybody know the root cause of this problem, e.g. "panic: bda4ffc1bc48207d state.commit 472372997 is out of range [472308405, 472310039]"? And in which version has it been fixed? Please let me know.

@Queetinliu

(Quoting @tbchj's recovery steps from the comment above.)

I did what you said: removed the broken files under wal, deleted the latest snap file, and restarted etcd. That solved my problem, thank you.
