Severe performance degradation when shutting one of 3 nodes down. #19236

Closed
arctica opened this issue Oct 13, 2017 · 13 comments
Labels: C-question (A question rather than an issue. No code/spec/doc change needed.), O-community (Originated from the community)

@arctica commented Oct 13, 2017

I'm running a 3-node cluster of the recent v1.1 release with the load generator from #18657 running on node "master" (the other two nodes are called node1 and node2).

When I shut down node2, queries per second drop from around 1600 to a very low 50-250.
This is unexpected to me, since I'd expect the two remaining healthy nodes to still be able to form a raft quorum and make normal progress.

Node2 was shut down gracefully and the admin UI doesn't show any unavailable ranges.

[screenshot: admin UI, 2017-10-13 15:33]

After bringing node2 back up, QPS returns to the same level as before the shutdown.
At the time of testing, the table had over 6.6k ranges.

Log entries from node "master":

W171013 13:22:51.465186 229 storage/node_liveness.go:400  [n1,hb] slow heartbeat took 2.3s
W171013 13:22:51.465200 229 storage/node_liveness.go:343  [n1,hb] failed node liveness heartbeat: context deadline exceeded
I171013 13:22:58.291636 74 storage/replica_proposal.go:453  [n1,s1,r5385/1:/Table/51/1/"EP{UJ"-WY"}] new range lease repl=(n1,s1):1 start=1507896297.518459938,0 epo=3 pro=1507900978.219057027,21 following repl=(n1,s1):1 start=1507896297.518459938,0 epo=3 pro=1507896604.493999035,0
I171013 13:22:58.291687 110 storage/replica_proposal.go:453  [n1,s1,r1760/1:/Table/51/1/"AY{GM"-JI"}] new range lease repl=(n1,s1):1 start=1507890075.674923969,0 epo=3 pro=1507900978.220069898,6 following repl=(n1,s1):1 start=1507890075.674923969,0 epo=3 pro=1507896601.862517599,0
I171013 13:22:58.291892 113 storage/replica_proposal.go:453  [n1,s1,r5305/1:/Table/51/1/"OZ{VB"-XP"}] new range lease repl=(n1,s1):1 start=1507890204.289175109,0 epo=3 pro=1507900978.217880919,0 following repl=(n1,s1):1 start=1507890204.289175109,0 epo=3 pro=1507896602.855056336,0
... tons more of these
E171013 13:23:55.562359 47913703 storage/replica_command.go:2263  [replica consistency checker,n1,s1,r4648/1:/Table/51/1/"ZC{OQ"-QP"}] could not CollectChecksum from replica (n2,s2):2: rpc error: code = Unavailable desc = grpc: the connection is unavailable
I171013 13:23:58.713177 142 server/status/runtime.go:223  [n1] runtime stats: 16 GiB RSS, 418 goroutines, 315 MiB/61 MiB/496 MiB GO alloc/idle/total, 13 GiB/15 GiB CGO alloc/total, 2856.56cgo/sec, 1.27/0.12 %(u/s)time, 0.00 %gc (3x)
W171013 13:24:02.553239 229 storage/node_liveness.go:400  [n1,hb] slow heartbeat took 1.3s
E171013 13:24:07.836866 47923321 storage/replica_command.go:2263  [replica consistency checker,n1,s1,r6400/1:/Table/51/1/"UM{UQ"-WX"}] could not CollectChecksum from replica (n2,s2):2: rpc error: code = Unavailable desc = grpc: the connection is unavailable
I171013 13:24:08.710601 140 gossip/gossip.go:478  [n1] gossip status (ok, 3 nodes)
gossip client (1/3 cur/max conns)
  3: node1.example.com:26257 (3h1m44s: infos 7036/17366 sent/received, bytes 1501477B/2397808B sent/received)
gossip server (0/3 cur/max conns, infos 33426/22727 sent/received, bytes 5384879B/3485053B sent/received)
I171013 13:24:08.713265 142 server/status/runtime.go:223  [n1] runtime stats: 16 GiB RSS, 396 goroutines, 300 MiB/70 MiB/496 MiB GO alloc/idle/total, 13 GiB/15 GiB CGO alloc/total, 1923.64cgo/sec, 1.12/0.09 %(u/s)time, 0.00 %gc (3x)
I171013 13:24:12.451042 68 storage/replica_proposal.go:453  [n1,s1,r5862/1:/Table/51/1/"IN{RZ"-UH"}] new range lease repl=(n1,s1):1 start=1507901050.537382719,0 epo=3 pro=1507901050.537388241,0 following repl=(n3,s3):3 start=1507890364.479579070,0 epo=3 pro=1507890364.479581520,0
I171013 13:24:16.044786 292 vendor/google.golang.org/grpc/grpclog/grpclog.go:75  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 88.99.192.219:26257: getsockopt: connection refused"; Reconnecting to {node2.example.com:26257 <nil>}
I171013 13:24:18.713002 142 server/status/runtime.go:223  [n1] runtime stats: 16 GiB RSS, 408 goroutines, 230 MiB/117 MiB/496 MiB GO alloc/idle/total, 13 GiB/15 GiB CGO alloc/total, 1650.31cgo/sec, 1.01/0.10 %(u/s)time, 0.00 %gc (3x)
W171013 13:24:20.251922 229 storage/node_liveness.go:400  [n1,hb] slow heartbeat took 1.0s
E171013 13:24:21.670268 47929080 storage/replica_command.go:2263  [replica consistency checker,n1,s1,r4541/1:/Table/51/1/"GD{VX"-YB"}] could not CollectChecksum from replica (n2,s2):2: rpc error: code = Unavailable desc = grpc: the connection is unavailable
W171013 13:24:22.687896 69 storage/store.go:3581  [n1,s1] handle raft ready: 0.7s [processed=0]
...
W171013 13:24:24.850171 229 storage/node_liveness.go:400  [n1,hb] slow heartbeat took 1.1s
W171013 13:24:28.144420 124 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r6654/1:/Table/51/1/"GM{AI"-CH"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.154713 87 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r6132/1:/Table/51/1/"RT{VW"-YH"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.173957 92 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r1233/1:/Table/51/1/"KL{CO"-FJ"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.339199 104 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r2525/1:/Table/51/1/"FI{VO"-YL"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.339357 108 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r6656/1:/Table/51/1/"GJ{KW"-MS"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.407124 92 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r2791/1:/Table/51/1/"WK{FI"-ID"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.407199 90 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r1097/1:/Table/51/1/"W{KXM"-LAI"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.485848 110 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r2505/1:/Table/51/1/"K{OYN"-PBL"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.537472 87 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r3794/1:/Table/51/1/"KL{SX"-VI"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.537560 69 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r4444/1:/Table/51/1/"UK{RX"-UM"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.537705 128 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r1010/1:/Table/51/1/"L{LZW"-MCO"}] 1 stepped down to follower since quorum is not active
W171013 13:24:28.568405 119 vendor/github.com/coreos/etcd/raft/raft.go:825  [n1,s1,r5608/1:/Table/51/1/"JN{BN"-EA"}] 1 stepped down to follower since quorum is not active
I171013 13:24:28.712992 142 server/status/runtime.go:223  [n1] runtime stats: 16 GiB RSS, 434 goroutines, 302 MiB/75 MiB/502 MiB GO alloc/idle/total, 13 GiB/15 GiB CGO alloc/total, 1180.39cgo/sec, 0.49/0.05 %(u/s)time, 0.00 %gc (1x)
@a-robinson (Contributor)

Thanks once again, @arctica :)

This is indeed not expected. We'll take a close look at this soon (next week, most likely), but if we aren't able to easily reproduce it then the following information would be useful:

  • The logs from node 3
  • The graphs from the Replication, Distributed, and Queues pages
  • The graphs from the hidden Raft page (at path #/raft/messages/all/)

Assigning to myself to look early next week, but cc @cuongdo in case you have someone else in mind.

a-robinson self-assigned this Oct 13, 2017
@arctica (Author) commented Oct 13, 2017

Hi @a-robinson, thanks for taking an interest in this :)

Good to hear that this is indeed unexpected. Please find attached the screenshots of the 4 pages you asked for.

Logs from the node that was shut down:

I171013 13:23:12.079770 1 cli/start.go:617  received signal 'interrupt'
I171013 13:23:12.118399 1 cli/start.go:644  initiating graceful shutdown of server
I171013 13:23:12.318046 4361324 util/stop/stopper.go:297  [drain,n2,s2,r4811/2:/Table/51/1/"RU{DZ"-GJ"}] stopper throttling task from storage.Store: draining replica due to semaphore
I171013 13:23:12.345576 4361324 util/stop/stopper.go:297  [drain,n2,s2,r6217/2:/Table/51/1/"RN{KA"-MI"}] stopper throttling task from storage.Store: draining replica due to semaphore
I171013 13:23:12.379660 4361324 util/stop/stopper.go:297  [drain,n2,s2,r3608/2:/Table/51/1/"HG{ML"-PA"}] stopper throttling task from storage.Store: draining replica due to semaphore
... lots more like this...
I171013 13:23:13.578675 4361324 util/stop/stopper.go:297  [drain,n2,s2,r3655/2:/Table/51/1/"FM{FA"-HP"}] stopper throttling task from storage.Store: draining replica due to semaphore
I171013 13:23:13.585677 148 storage/replica_proposal.go:453  [n2,s2,r574/2:/Table/51/1/"KL{NJ"-QD"}] new range lease repl=(n2,s2):2 start=1507900993.475835622,0 epo=7 pro=1507900993.475843442,0 following repl=(n3,s3):3 start=1507890225.221899355,0 epo=3 pro=1507893815.524783496,0
I171013 13:23:13.588193 4361324 util/stop/stopper.go:297  [drain,n2,s2,r991/2:/Table/51/1/"H{UYJ"-VBH"}] stopper throttling task from storage.Store: draining replica due to semaphore
.... more more more...
I171013 13:23:14.698511 4361324 util/stop/stopper.go:297  [drain,n2,s2,r767/2:/Table/51/1/"EX{FC"-IB"}] stopper throttling task from storage.Store: draining replica due to semaphore
I171013 13:23:14.703680 156 storage/replica_proposal.go:453  [n2,s2,r3955/2:/Table/51/1/"X{SXX"-TAJ"}] new range lease repl=(n2,s2):2 start=1507900994.562998312,0 epo=7 pro=1507900994.563000270,0 following repl=(n3,s3):3 start=1507890454.138951606,0 epo=3 pro=1507890454.138954188,0
I171013 13:23:14.706852 4361324 util/stop/stopper.go:297  [drain,n2,s2,r2430/2:/Table/51/1/"UL{UL"-XJ"}] stopper throttling task from storage.Store: draining replica due to semaphore
... more of the same...
I171013 13:23:15.382318 4361324 util/stop/stopper.go:476  quiescing; tasks left:
W171013 13:23:15.382481 293 storage/raft_transport.go:461  [n2] raft transport stream to node 3 failed: rpc error: code = Unavailable desc = transport is closing
W171013 13:23:15.382501 338 storage/raft_transport.go:461  [n2] raft transport stream to node 1 failed: rpc error: code = Unavailable desc = transport is closing
W171013 13:23:15.382521 42 gossip/gossip.go:1241  [n2] no incoming or outgoing connections
E171013 13:23:15.382782 198 storage/queue.go:656  [raftlog,n2,s2,r1344/2:/Table/51/1/"QE{QS"-TP"}] failed to send RPC: sending to all 3 replicas failed; last error: {<nil> rpc error: code = FailedPrecondition desc = grpc: the client connection is closing}
I171013 13:23:15.422095 4361324 util/stop/stopper.go:476  quiescing; tasks left:
W171013 13:23:15.431484 4375878 gossip/infostore.go:303  [n2] node unavailable; try another peer
I171013 13:23:15.436456 205 vendor/google.golang.org/grpc/grpclog/grpclog.go:75  transport: http2Server.HandleStreams failed to read frame: read tcp 88.99.192.219:26257->88.99.193.108:52168: use of closed network connection
I171013 13:23:15.436533 206 vendor/google.golang.org/grpc/grpclog/grpclog.go:75  transport: http2Server.HandleStreams failed to read frame: read tcp 88.99.192.219:26257->88.99.192.217:40222: use of closed network connection
I171013 13:23:15.436685 4361324 storage/engine/rocksdb.go:526  closing rocksdb instance at "/root/cockroach-data/local"
I171013 13:23:15.451496 4361324 storage/engine/rocksdb.go:526  closing rocksdb instance at "/root/cockroach-data"
I171013 13:23:17.168533 1 cli/start.go:676  server drained and shutdown completed
E171013 13:23:17.179554 1 cli/error.go:68  interrupted

The only weird thing that jumps out at me is that the node still seems to have acquired some range leases during shutdown.

[screenshots: Replication, Distributed, Queues, and Raft console pages]

@tbg (Member) commented Oct 13, 2017

@a-robinson I assigned you to #17611, which seems largely aligned with this issue.

@a-robinson (Contributor)

As mentioned on #19262, the load generator as written doesn't handle errors well: in my first attempt to reproduce this, QPS dropped to 0 because the load generator received an ambiguous error and called log.Fatal. In this specific case, I assume the load generator process never crashed, and when you restarted the third node it just started working again without you having to restart the generator?

Also, I assume the three nodes had very low-latency network connections between them? We've seen this before in somewhat similar situations in widely distributed clusters, but I'm having trouble reproducing it with 3 VMs in the same GCE zone after replacing the log.Fatal(err) in the load generator with log.Print(err). I only ever get very brief dips, like this:

[screenshot: graph with brief QPS dips, 2017-10-16 3:24 pm]

That probably just means I need to try harder to get a repro case; I just want to make sure I know exactly what was happening when you ran into it.
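
For reference, the change described above amounts to something like the following. This is a minimal sketch of a generic query-driving loop, not the actual generator from #18657; the connection string and statement are placeholders.

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func worker(db *sql.DB, stmt string) {
	for {
		if _, err := db.Exec(stmt); err != nil {
			// Previously log.Fatal(err): a single ambiguous/retryable error while a
			// node restarts would kill the whole generator and drop QPS to zero.
			log.Print(err) // log and keep driving load instead
		}
	}
}

func main() {
	// Placeholder connection string; point it at any live node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	worker(db, "SELECT 1")
}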

@arctica (Author) commented Oct 16, 2017

When this degradation happened in my latest test (the one the graphs are from), the load generator did not quit, which is why there were still queries running. The latency between the machines is around 0.2ms; they are dedicated machines in the same datacenter.

Yes, when I restarted the third node, speed came back up without my doing anything to the load generator. Note that my table has very small range min/max sizes set (64 KB) in order to make it split into many ranges. That might be a deciding factor. You can see in the graphs that there were over 5k ranges.

@arctica (Author) commented Oct 17, 2017

I've just tested this with UPDATE and SELECT queries; the same degradation happens in all cases.
Here is the slightly updated load generator: https://gist.github.com/arctica/7cbd23fd8873a157ef94af5fef4e7808

I'm running it with -strlength 4, by the way. My original test some time ago used 3.
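
For illustration only (the gist above is not reproduced here): assuming the generator builds random fixed-length uppercase keys, which is consistent with the range boundaries in the logs (e.g. /Table/51/1/"EPUJ"), a -strlength style knob would control something like this hypothetical helper:

package main

import (
	"fmt"
	"math/rand"
)

// randomKey returns a random uppercase string of length n, the kind of primary
// key suggested by the range boundaries above. Illustrative only; not the
// gist's actual code.
func randomKey(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = byte('A' + rand.Intn(26))
	}
	return string(b)
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomKey(4)) // e.g. -strlength 4
	}
}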

@a-robinson (Contributor)

Alright, so now that I have splits working at much smaller sizes than before (#19339), I've been able to reproduce the performance degradation you reported and I know what's going on.

The core issue is #9446. To summarize, we have an optimization that disables the periodic network heartbeats for ranges that aren't receiving traffic. That optimization is itself disabled whenever any replica of a range is behind, which is the case when one of the replicas is offline.
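
As a rough illustration of that rule (this is not CockroachDB's actual implementation, just a minimal Go sketch of the behavior described above, with hypothetical types):

package main

import "fmt"

// replicaState is a simplified, hypothetical view of one follower of a range.
type replicaState struct {
	nodeID  int
	applied uint64 // highest raft index this replica has applied
	alive   bool
}

// canQuiesce mirrors the rule described above: stop periodic raft heartbeats
// for a range only when there is no pending work and every follower is fully
// caught up.
func canQuiesce(leaderApplied uint64, pendingProposals int, followers []replicaState) bool {
	if pendingProposals > 0 {
		return false
	}
	for _, f := range followers {
		if !f.alive || f.applied < leaderApplied {
			return false
		}
	}
	return true
}

func main() {
	healthy := []replicaState{{nodeID: 2, applied: 100, alive: true}, {nodeID: 3, applied: 100, alive: true}}
	oneDown := []replicaState{{nodeID: 2, applied: 80, alive: false}, {nodeID: 3, applied: 100, alive: true}}
	fmt.Println("all nodes up: quiesce =", canQuiesce(100, 0, healthy)) // true
	fmt.Println("node 2 down:  quiesce =", canQuiesce(100, 0, oneDown)) // false
}

With one node down, the second case applies to every range that had a replica on it, so none of those ranges can quiesce and they all keep heartbeating.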

It's worth noting that this issue is at its absolute worst in a 3-node cluster. We won't remove the replicas that are on the dead node in a 3-node cluster because there's no other node to move them to, which means the cluster gets stuck in this state indefinitely. On a larger cluster, the problem would go away after 5+ minutes, since we'd start creating new copies of the dead node's data after 5 minutes.

The reason I wasn't seeing this before doing all the splits is that the amount of heartbeat traffic scales with the number of ranges in the cluster.
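
To put rough numbers on that scaling (purely illustrative; the one-heartbeat-per-second-per-range cadence below is an assumed round figure, not CockroachDB's documented default):

package main

import "fmt"

func main() {
	// Assumption for illustration: each un-quiesced range sends roughly one
	// raft heartbeat per second. The real cadence depends on raft tick settings.
	const assumedHeartbeatsPerRangePerSec = 1.0
	for _, ranges := range []int{32, 850, 6600} {
		fmt.Printf("%5d un-quiesced ranges -> ~%.0f heartbeats/sec of background traffic\n",
			ranges, float64(ranges)*assumedHeartbeatsPerRangePerSec)
	}
}

At 6.6k un-quiesced ranges that background load is substantial, which matches why the degradation here was so much worse than with a few dozen ranges.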

I've copied the most relevant graphs below, but to prove that this was really the issue, I built a custom binary that never quiesces replicas. With that custom binary, I see the exact same behavior with all 3 nodes running as I was seeing with the standard binary when one of the nodes is down.

I'd say we'll need to bump the priority of #9446, at least to avoid permanent situations like this in 3-node clusters. cc @petermattis

[screenshots: nine graphs from 2017-10-18, 2:37-2:49 pm]

@arctica (Author) commented Oct 19, 2017

@a-robinson Great to see that you were able to reproduce this and, on top of that, figure out what the issue is.

Also good to hear that a cluster with more than 3 nodes wouldn't have this issue. I can afford more than 3 nodes when going into production, but I like to test some extreme conditions first. It makes sense that a 3-node cluster with 3 replicas runs into tough situations when one node goes down. But it sounds like that's fixable, so all good.

@a-robinson (Contributor)

> Also good to hear that a cluster with more than 3 nodes wouldn't have this issue

It would still see performance degradation; it would just be less intense and would go away after the node is considered dead (which happens 5 minutes after it goes down).

Note that the reason it's so bad in this case is that you split your data into so many small ranges. If you had let it split more naturally, or not reduced the range size by quite so much, you wouldn't see nearly as bad a performance hit. For example, when I was running with 850 ranges (with my max_range_size set to 2 MiB, I believe), I didn't see any issue from taking a node down.

@arctica (Author) commented Oct 19, 2017

Rare performance degradation over a timeframe of 5 minutes is fine for this use case, though I would expect a clean, graceful shutdown to avoid that delay.

I'll try to rerun the test with bigger ranges. The reason I set such a low value was that initially, when the table had very few ranges, performance was very bad and only picked up once there were a couple dozen ranges or so; due to the very small row sizes, it would take a long time to split. In this specific case (and I'm sure many others as well), it would actually be great if one could specify a number of ranges to create instead of range sizes.

@arctica (Author) commented Oct 19, 2017

Early results of a re-test with a fresh table and a range size of 2 MiB indicate that performance is much improved over the v1.0 and v1.1 runs I tested before. QPS now starts right away at nearly the level I saw before with many ranges, and is consistently between 1k and 1.5k, with the average maybe around 1.2k.

I'm seeing one weird thing right now: the Replication page still shows 150k ranges even though the old table has been dropped and recreated. The Databases page shows 8 ranges for the system db and currently 32 for the test db. Disk storage also still seems to be in use, and the Replicas graph shows 150k as well. This is probably because I ctrl-c'd the DROP statement, as it was taking a long time due to the load generator still running; I then stopped the generator and saw the table get dropped. I'll probably have to open another issue for this.

@a-robinson (Contributor)

> In this specific case (and I'm sure many others as well), it would actually be great if one could specify a number of ranges to create instead of range sizes.

Your best bet as of now (until #19154 gets done) is to use ALTER TABLE ... SPLIT AT to manually create some split points right after you create the table. I'd recommend doing this instead of using an artificially low range size. For example, you might run:

CREATE TABLE test.bench (s string primary key, n int);
ALTER TABLE test.bench SPLIT AT VALUES ('D'), ('I'), ('N'), ('S'), ('X');

to get the table split into 6 roughly evenly sized ranges before you even start loading data. Based on my testing of your load in #18657, you shouldn't need very many ranges to max out performance, and going into the thousands of ranges appeared to make performance noticeably worse.
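
If you want more pre-splits than you'd care to type by hand, the same idea can be scripted. A hedged sketch (the connection string is a placeholder, the table name matches the example above, and error handling is minimal):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pre-split test.bench at every uppercase letter, yielding ~27 roughly even
	// ranges for random uppercase keys before any data is loaded.
	for c := 'A'; c <= 'Z'; c++ {
		if _, err := db.Exec(fmt.Sprintf(`ALTER TABLE test.bench SPLIT AT VALUES ('%c')`, c)); err != nil {
			log.Printf("split at %c: %v", c, err)
		}
	}
}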

@a-robinson (Contributor)

Well, unfortunately this didn't make the cut for 2.0. Our roadmap has us fixing this before the 2.1 release in the fall.

I'm going to close this as a dupe of #9446, but please reopen if there's anything separate to do here.

jordanlewis added the C-question and O-community labels and removed the O-deprecated-community-questions label Apr 24, 2018