Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boltdb panic while removing member from cluster #7322

Closed
dennwc opened this issue Feb 14, 2017 · 8 comments · Fixed by #7743
Closed

boltdb panic while removing member from cluster #7322

dennwc opened this issue Feb 14, 2017 · 8 comments · Fixed by #7743
Assignees
Labels
Milestone

Comments

@dennwc
Copy link
Contributor

dennwc commented Feb 14, 2017

We have a cluster of 3 nodes in embed mode. If we try to remove node 2 from cluster (request from node 1) we see a panic in logs of node 2:

017-02-14 12:38:55.399045 I | rafthttp: stopped peer ef67b54f03db7ce0
panic: assertion failed: tx closed

goroutine 1082253 [running]:
panic(0x139ae00, 0xc42bff1170)
        /goroot/src/runtime/panic.go:500 +0x1a1
vendor/github.com/boltdb/bolt._assert(0x14e1c00, 0x163c179, 0x9, 0x0, 0x0, 0x0)
        vendor/github.com/boltdb/bolt/db.go:1029 +0xff
vendor/github.com/boltdb/bolt.(*Cursor).seek(0xc4329f0e88, 0x20e2656, 0x3, 0x3, 0x0, 0x0, 0xc42bcac5c0, 0xc421114400, 0x0, 0x2, ...)
        vendor/github.com/boltdb/bolt/cursor.go:155 +0x6f
vendor/github.com/boltdb/bolt.(*Bucket).Bucket(0xc4215c2638, 0x20e2656, 0x3, 0x3, 0x100000000000000)
        vendor/github.com/boltdb/bolt/bucket.go:112 +0x108
vendor/github.com/boltdb/bolt.(*Tx).Bucket(0xc4215c2620, 0x20e2656, 0x3, 0x3, 0x7f743588e000)
        vendor/github.com/boltdb/bolt/tx.go:101 +0x4f
vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeRange(0xc420e61ac0, 0x20e2656, 0x3, 0x3, 0xc42121fea0, 0x11, 0x12, 0xc42121fec0, 0x11, 0x12, ...)
        vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:88 +0x79
vendor/github.com/coreos/etcd/mvcc.(*store).rangeKeys(0xc420067ee0, 0xc432967b90, 0x69, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42121fe00, ...)
        vendor/github.com/coreos/etcd/mvcc/kvstore.go:524 +0x216
vendor/github.com/coreos/etcd/mvcc.(*store).TxnRange(0xc420067ee0, 0x2552232b93996851, 0xc432967b90, 0x69, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        vendor/github.com/coreos/etcd/mvcc/kvstore.go:230 +0xc9
vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Range(0xc420118350, 0x2552232b93996851, 0xc432967b20, 0x0, 0x2552232b93996851, 0xc4201af000)
        vendor/github.com/coreos/etcd/etcdserver/apply.go:279 +0x1a8
vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).applyUnion(0xc420118350, 0x2552232b93996851, 0xc42bff10e0, 0xc42f3c8400)
        vendor/github.com/coreos/etcd/etcdserver/apply.go:474 +0xeb
vendor/github.com/coreos/etcd/etcdserver.(*applierV3backend).Txn(0xc420118350, 0xc4264f7e50, 0xeca9e0, 0x2102a00, 0xc420e55650)
        vendor/github.com/coreos/etcd/etcdserver/apply.go:380 +0x211
vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Txn.func2()
        vendor/github.com/coreos/etcd/etcdserver/v3_server.go:184 +0x5b
vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).doSerialize(0xc4200c5b80, 0x7f74357b8908, 0xc43295c750, 0xc42121fe20, 0xc4329f1780, 0x0, 0x0)
        vendor/github.com/coreos/etcd/etcdserver/v3_server.go:680 +0xfe
vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Txn(0xc4200c5b80, 0x7f74357b8908, 0xc43295c750, 0xc4264f7e50, 0xf14d5b, 0xc420e08180, 0x156b0c0)
        vendor/github.com/coreos/etcd/etcdserver/v3_server.go:185 +0x287
vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*kvServer).Txn(0xc4202ef300, 0x7f74357b8908, 0xc43295c750, 0xc4264f7e50, 0xc4264f7e50, 0x0, 0x0)
        vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/key.go:101 +0x88
vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc4202ef340, 0x7f74357b8908, 0xc43295c750, 0xc4264f7e50, 0x5, 0x164e79c, 0xf)
        vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/quota.go:69 +0xa2
vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Txn_Handler.func1(0x7f74357b8908, 0xc43295c750, 0x156b0c0, 0xc4264f7e50, 0xc4264f7ea0, 0xc4329f1a08, 0xb0c8d6, 0xc43295c750)
        vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2341 +0xa2
vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor(0x7f74357b8908, 0xc43295c750, 0x156b0c0, 0xc4264f7e50, 0xc42121fd60, 0xc42121fd80, 0x419938, 0x20, 0x1491500, 0x1)
        vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xa4
vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x7f74357b8908, 0xc43295c750, 0x156b0c0, 0xc4264f7e50, 0xc42121fd60, 0xc42121fd80, 0x71, 0xc420eb6b28, 0x0, 0xc4215bbb08)
        vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:57 +0xc3
vendor/github.com/coreos/etcd/etcdserver/etcdserverpb._KV_Txn_Handler(0x14c2660, 0xc4202ef340, 0x7f74357b8908, 0xc43295c750, 0xc423563920, 0xc420e080b0, 0x0, 0x0, 0x4, 0xc420eb6e10)
        vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:2343 +0x166
vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc420e060c0, 0x20fe8a0, 0xc423563740, 0xc42bf7e100, 0xc4211bcf60, 0x20daae8, 0x0, 0x0, 0x0)
        vendor/google.golang.org/grpc/server.go:708 +0xbd8
vendor/google.golang.org/grpc.(*Server).handleStream(0xc420e060c0, 0x20fe8a0, 0xc423563740, 0xc42bf7e100, 0x0)
        vendor/google.golang.org/grpc/server.go:892 +0x6a4
vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc42bff1060, 0xc420e060c0, 0x20fe8a0, 0xc423563740, 0xc42bf7e100)
        vendor/google.golang.org/grpc/server.go:468 +0xab
created by vendor/google.golang.org/grpc.(*Server).serveStreams.func1
        vendor/google.golang.org/grpc/server.go:469 +0xa3

Probably node 2 is doing some requests to KV during the process (from our side).

etcd: v3.1.0, 8ba2897a21e4fc51b298ca553d251318425f93ae
bolt: e9cf4fae01b5a8ff89d0ec6b32f0d9c9f79aefdd

@xiang90
Copy link
Contributor

xiang90 commented Feb 14, 2017

Can you easily reproduce this?

@dennwc
Copy link
Contributor Author

dennwc commented Feb 14, 2017

@xiang90 Yes, but we don't have a small reproducer for now

@xiang90
Copy link
Contributor

xiang90 commented Feb 14, 2017

OK. It seems like we closed the db without waiting for all tx to finish.

We wil take a look. Thank you for reporting.

@jonsyu1
Copy link
Contributor

jonsyu1 commented Feb 28, 2017

We ran into this without removing a member from the cluster. Unfortunately, I do not have concrete repro steps, but we got the same panic after AWS restarted our machines hosting our etcd cluster. We will post more information if we are able to reproduce this failure mode.

@gyuho
Copy link
Contributor

gyuho commented Mar 7, 2017

@heyitsanthony @xiang90

If there's an inflight request to backend, t.tx.DB() == nil could happen

e.g. here https://github.com/coreos/etcd/blob/master/mvcc/backend/batch_tx.go#L87-L88

// UnsafeRange must be called holding the lock on the tx.
func (t *batchTx) UnsafeRange(bucketName []byte, key, endKey []byte, limit int64) (keys [][]byte, vs [][]byte) {
        // inflight db close
        // t.tx.DB() == nil, so it will panic with
        // panic: assertion failed: tx closed
	bucket := t.tx.Bucket(bucketName)

Simple test case in etcd/integration

func TestV3KVInflightRangeOnClosedBackend(t *testing.T) {
	defer testutil.AfterTest(t)
	clus := NewClusterV3(t, &ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	cli := clus.RandClient()
	kvc := toGRPC(cli).KV

	if _, err := kvc.Put(context.Background(), &pb.PutRequest{Key: []byte("foo"), Value: []byte("bar")}); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		kvc.Range(ctx, &pb.RangeRequest{Key: []byte("foo"), Serializable: true}, grpc.FailFast(false))
	}()

	// instead of clus.Members[0].s.HardStop()
	// to simulate inflight db closing
	clus.Members[0].s.Backend().Close()
	cancel()

	<-donec
}

Any thoughts?

@heyitsanthony
Copy link
Contributor

This is the same problem from #6662; it never really fixed the underlying issue of crashing on in-flight backend operations.

@xiang90
Copy link
Contributor

xiang90 commented Mar 16, 2017

@gyuho @heyitsanthony

Is there a way to stop gRPC and drain pending requests?

@xiang90
Copy link
Contributor

xiang90 commented Mar 16, 2017

@gyuho gyuho changed the title Panic while removing member from cluster boltdb panic while removing member from cluster Mar 17, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 14, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <[email protected]>
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <[email protected]>
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <[email protected]>
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 15, 2017
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <[email protected]>
gyuho added a commit to gyuho/etcd that referenced this issue Apr 17, 2017
gyuho added a commit to gyuho/etcd that referenced this issue Apr 17, 2017
- Test etcd-io#7322.
- Remove test case added in etcd-io#6662.

Signed-off-by: Gyu-Ho Lee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

5 participants