acceptance: qps drops to 0 under block_writer load #8192

Closed
cuongdo opened this issue Aug 1, 2016 · 48 comments
@cuongdo (Contributor) commented Aug 1, 2016

Using this command:

make test \
  PKG=./acceptance \
  TESTTIMEOUT=6h \
  TESTS=ContinuousLoad_BlockWriter \
  TESTFLAGS='-v -remote -key-name google_compute_engine -cwd allocator_terraform -nodes 4 -tf.keep-cluster=failed'

block_writer qps drops to 0 and stays there after 1-3 minutes of load.

There are some stuck tasks:

[screenshot 2016-08-01 13 42 01]

Some Raft requests are taking a while:
[screenshot 2016-08-01 13 40 15]

I see some interesting log entries after about 10 seconds of load:

I160801 18:41:51.593554 kv/dist_sender.go:953  application error: key range /Local/Range/"\x96"/RangeDescriptor-/Min outside of bounds of range /Table/13-/Table/14
W160801 18:41:51.593601 kv/dist_sender.go:748  failed to invoke Get [/Local/Range/"\x96"/RangeDescriptor,/Min): key range /Local/Range/"\x96"/RangeDescriptor-/Min outside of bounds of range /Table/13-/Table/14

Then I see a lot of these entries:

I160801 18:41:56.881117 storage/replica.go:1660  store=1:1 range=7 [/Table/51-/Max): pending commands: refurbished 0, reproposing 7 (at 894.883); reasonTicks
I160801 18:41:56.895153 storage/replica.go:1881  store=1:1 range=7 [/Table/51-/Max): applying command with forced error: storage/replica.go:1833: command observed at lease index 890, but required < 884
E160801 18:41:56.895405 storage/replica.go:1923  store=1:1 range=7 [/Table/51-/Max): error executing raft command: storage/replica.go:1833: command observed at lease index 890, but required < 884
I160801 18:41:56.895476 storage/replica.go:1881  store=1:1 range=7 [/Table/51-/Max): applying command with forced error: storage/replica.go:1833: command observed at lease index 890, but required < 885

then lots of:

I160801 18:42:12.683543 storage/replica.go:1660  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): pending commands: refurbished 0, reproposing 3 (at 785.757); reasonTicks
I160801 18:42:12.755140 storage/replica.go:1881  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): applying command with forced error: storage/replica.go:1833: command observed at lease index 760, but required < 758
E160801 18:42:12.758939 storage/replica.go:1923  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): error executing raft command: storage/replica.go:1833: command observed at lease index 760, but required < 758

and some of these:

E160801 18:44:51.872548 storage/replica.go:1923  store=1:1 range=13 [/Table/51/1/1119228282337153470/"bd678e46-19a3-4e21-817d-802b14e56bd1"/390-/Table/51/1/1682660493493272475/"b6bf6e54-43b8-403a-b092-f6b44b60263a"/480): error executing raft command: cannot replace lease replica {2 2 2} 2016-08-01 18:43:42.39559981 +0000 UTC 18.934678912s with replica {3 3 3} 2016-08-01 18:43:02.077579421 +0000 UTC 4.75s: requested lease overlaps previous lease

Full logs (node.0 has the most errors):
node.0.cockroach.stderr.txt
node.1.cockroach.stderr.txt
node.2.cockroach.stderr.txt
node.3.cockroach.stderr.txt
writer.0.block_writer.stdout.txt

@tamird (Contributor) commented Aug 1, 2016

Just jumping in to point out that cockroach.storage.MultiRaft/RaftMessage is expected to be long-lived, since it is the entire life of the stream.
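For concreteness, a minimal sketch of that lifecycle (the stream and message types here are hypothetical stand-ins for the generated gRPC code): the stream is opened once per peer connection and every outgoing Raft message reuses it, so its recorded duration is the connection lifetime rather than the latency of any single message.

package sketch

import "context"

// Hypothetical stand-ins for the generated gRPC stream and message types.
type raftMessageRequest struct{}

type raftMessageClientStream interface {
	Send(*raftMessageRequest) error
}

// sendLoop holds a single RaftMessage stream open and pushes every outgoing
// message over it; the RPC only "ends" when the connection does.
func sendLoop(ctx context.Context, stream raftMessageClientStream, outgoing <-chan *raftMessageRequest) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case msg := <-outgoing:
			if err := stream.Send(msg); err != nil {
				return err // stream broken; the caller re-dials and reopens it
			}
		}
	}
}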

cuongdo pushed a commit to cuongdo/cockroach that referenced this issue Aug 2, 2016
While we were testing whether the various processes were alive, we weren't
testing whether queries were being processed. Now, we ensure that
QPS is non-zero. This would've caught cockroachdb#8192 automatically (I just
happened to notice the 0 QPS manually).

Also added a load test for photos.
cuongdo self-assigned this Aug 4, 2016
@cuongdo (Contributor, Author) commented Aug 8, 2016

After enabling SQL tracing, I've caught several traces for SQL queries that take longer than 2 minutes (!). Here's one of those traces.

This error happens upon every retry:

 82826.744ms      1.466ms    error, trying next peer: range 55: replica {0 0 0} not lease holder; node_id:4 store_id:4 replica_id:3  is

Somehow we're repeatedly directing the query to a replica that's not the range leaseholder?

cc @bdarnell

@cuongdo (Contributor, Author) commented Aug 8, 2016

This block of errors seems to happen with every retry (with some variation):

  9059.429ms      0.065ms    sending RPC to [/Table/51/1/2735983279382156089/"4dfdba17-db83-43e2-ba96-c4fea2c400a4"/4068, /Table/51/1/2768883119070632199/"091b14ee-72fd-4ab8-81b5-77d1ad012714"/13819)
 12059.653ms   3000.225ms    error, trying next peer: rpc error: code = 4 desc = context deadline exceeded
 15060.064ms   3000.411ms    error, trying next peer: range 55: replica {1 1 1} not lease holder; <nil> is
 15061.564ms      1.500ms    error, trying next peer: range 55: replica {0 0 0} not lease holder; node_id:4 store_id:4 replica_id:3  is
 18061.964ms   3000.401ms    reply error BeginTransaction [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min), ConditionalPut [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min), EndTransaction [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min): failed to send RPC: sending to all 4 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded

@petermattis (Collaborator)

We preferentially send RPCs to the healthy replicas. I saw that problem once during some of my testing but never investigated further. Perhaps that is fouling things up here. I think you'll need to add some more logging/tracing to test this hypothesis.

@bdarnell (Contributor) commented Aug 8, 2016

We're trying the lease holder (or there is no lease holder), but it's not responding within the 3 second timeout, so we never officially notice that that node is the lease holder (it wouldn't matter anyway, since 3s would be enough for a lease to expire). We're cycling through all the replicas on each attempt. So the main question is why the lease holder is timing out all the time. (we could probably also reorganize our retry loops here to hold the original request to the lease holder open instead of cancelling it and restarting every cycle).
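For concreteness, a minimal sketch of the retry behavior described above (hypothetical names, not the actual DistSender code): each replica gets its own bounded attempt, and a timeout or NotLeaseHolderError simply advances to the next peer, so a lease holder that is slow but alive never gets waited on long enough to answer.

package sketch

import (
	"context"
	"fmt"
	"time"
)

type replica struct{ NodeID, StoreID, ReplicaID int }

// sendToReplicas cycles through the replicas, giving each a 3s attempt and
// moving on as soon as that attempt times out or is redirected elsewhere.
func sendToReplicas(ctx context.Context, replicas []replica,
	send func(context.Context, replica) error) error {
	var lastErr error
	for _, r := range replicas {
		attemptCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
		err := send(attemptCtx, r)
		cancel()
		if err == nil {
			return nil
		}
		// Timeout or NotLeaseHolderError: try the next peer.
		lastErr = err
	}
	return fmt.Errorf("sending to all %d replicas failed; last error: %v", len(replicas), lastErr)
}

The reorganization suggested above would amount to keeping the attempt to the presumed lease holder open across cycles instead of cancelling it and restarting it on every pass.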

@cuongdo (Contributor, Author) commented Aug 9, 2016

This issue seems to correlate with two other issues:

  1. storage: range for RangeID <ID> already exists on store #8408: panic: range for Range ID 96 already exists on store
  2. a burst of these errors (also mentioned in storage: Re-added replica can't recover #8111):
W160809 14:33:40.678747 storage/replica.go:407  store=2:2 range=18 [/Table/51/1/614236447022740779/"df08cbcb-7e82-425d-b0eb-7c309e150371"/503-/Table/51/1/1213563446305371573/"90b953b6-332d-4aa2-9a75-393f250a7ed8"/538): outgoing raft transport stream to {3 3 0} closed by the remote: rpc error: code = 2 desc = replicaID cannot move backwards from 4 to 0

I'll take a look at #1 first unless someone has a better idea.

@petermattis (Collaborator)

The second issue is related to preemptive snapshots (replica ID == 0).

@cuongdo (Contributor, Author) commented Aug 9, 2016

It also turns out that at least one node in the test cluster reliably panics or slows to a halt (~0 QPS) when I kill -9 any of the CockroachDB instances after the load has started.

@petermattis (Collaborator)

Really? My stress tool performs that exact action (kill -9) and I do see blips in QPS when a node is killed, but the cluster does recover.

@cuongdo (Contributor, Author) commented Aug 9, 2016

The first 3 times I performed the kill -9, at least one other node also went down. Now, I can't reproduce any more.

That said, the QPS blip is more than a blip. block_writer doesn't recover even after 5 minutes. It looks like it's contending with all the replica repair activity.


@petermattis (Collaborator)

The blips I see are for 2-10 seconds. What you're describing is significantly more severe.

@petermattis (Collaborator)

The original message says that this performance problem occurs after 1-3min. Is that still happening? I just tried to reproduce locally using a 3-node roachdemo and I'm not seeing the problem after running for 10min.

[screen shot 2016-08-10 at 8 55 08 pm]

@petermattis (Collaborator)

PS Mentioning this because it would be an interesting clue if the problem occurs on real networks but not when the nodes are running on the loopback network.

@petermattis (Collaborator)

The continuous load test also has a block_writer per node while my brief testing was only using a single block_writer. I wonder if that makes a difference. More specifically, does it make a difference for all nodes in the cluster to be receiving load vs only a single one?

@cuongdo (Contributor, Author) commented Aug 11, 2016

The context deadline exceeded errors on the gRPC sender side seem to correlate with these messages on the receiving end of the gRPC call:

I160811 02:03:18.022271 /go/src/google.golang.org/grpc/server.go:559  grpc: Server.processUnaryRPC failed to write status stream error: code = 4 desc = "context deadline exceeded"

So, it seems both sides are timing out. Why though?

Still digging...
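One note on why both ends report the same error: gRPC propagates the caller's deadline to the server, so the handler's context expires at the same instant as the client's. A minimal, CockroachDB-agnostic sketch of that coupling:

package sketch

import (
	"context"
	"time"
)

// client imposes a 3s deadline; in real gRPC that deadline travels with the
// request, so the server-side handler receives a context that expires at the
// same moment.
func client(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	return handler(ctx)
}

// handler stands in for the server side: if the work (say, a slow Raft
// proposal) outlives the propagated deadline, it too observes "context
// deadline exceeded", which is what Server.processUnaryRPC then reports.
func handler(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

So the server-side message looks like a symptom of the same 3s client deadline rather than an independent failure; the open question is why the work behind it takes longer than 3s.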

@cuongdo (Contributor, Author) commented Aug 11, 2016

Also preceding these context deadline exceeded errors is this error on the host that's timing out:

I160811 02:20:54.061891 storage/replica.go:394  store=2:2 range=158 [/Table/51/1/7634898010108526638/"23e5cb19-fe98-41e4-bfdd-204f01848841"/1690-/Table/51/1/7654371180957890736/"23e5cb19-fe98-41e4-bfdd-204f01848841"/11008): replica 2 too old, adding to replica GC queue

This same range is involved in the INSERTs that are taking 3-4 minutes to finish.

@cuongdo (Contributor, Author) commented Aug 11, 2016

True, though having multiple writers was causing enough issues that I've been using just one writer for the last couple of days.


@petermattis (Collaborator)

Ah, the difference is running 4 nodes vs running 3. I switched to running a 4 node cluster and it hit the 0 QPS problem in 1m57s starting from a clean slate.

@cuongdo (Contributor, Author) commented Aug 15, 2016

After just rebasing, when queries take >2 minutes to execute, we get bursts of these messages:

 31677.237ms      0.071ms    sending request to bwriter6h-furiouspare-cockroach-1:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 34677.785ms   3000.548ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 34677.794ms      0.009ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 34677.852ms      0.058ms    sending request to bwriter6h-furiouspare-cockroach-2:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 34679.394ms      1.542ms    error, trying next peer: range 39: replica {0 0 0} not lease holder; node_id:1 store_id:1 replica_id:1  is
 34679.409ms      0.015ms    error, trying next peer: range 39: replica {0 0 0} not lease holder; node_id:1 store_id:1 replica_id:1  is
 34679.457ms      0.048ms    sending request to bwriter6h-furiouspare-cockroach-3:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 37680.166ms   3000.710ms    reply error BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min): failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
 37680.184ms      0.018ms    evicting cached range descriptor
 39385.417ms   1705.233ms    meta descriptor lookup
 39385.429ms      0.012ms    looked up range descriptor from cache
 39385.474ms      0.045ms    sending RPC to [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433, /Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366)
 39385.528ms      0.054ms    sending request to bwriter6h-furiouspare-cockroach-1:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 42385.972ms   3000.443ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 42385.980ms      0.008ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is

So, it looks like the range doesn't have a leader for a long time?

@cuongdo (Contributor, Author) commented Aug 15, 2016

At the same time as the above transaction, there are replica changes going on with range 39:

I160815 18:22:11.074447 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: BeginTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), ConditionalPut [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), Noop [/Min,/Min), pErr=txn aborted "change-replica" id=5b90b056 key=/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor rw=false pri=0.04147142 iso=SERIALIZABLE stat=PENDING epo=0 ts=1471285191.874227845,0 orig=1471285191.874227845,0 max=1471285191.874227845,0 wto=false rop=false
I160815 18:22:11.129756 storage/replica_command.go:3059  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): change replicas of 39: read existing descriptor range_id:39 start_key:"\273\211\375B\n\032\370\375\202E\254\02205edafd5-7606-4b27-890e-912cf875fe30\000\001\367\001\261" end_key:"\273\211\375D(\233*\036\247\356\353\022294d0a31-2d4f-4819-8001-aef8805f4abf\000\001\367\001n" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:4 store_id:4 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:2 store_id:2 replica_id:4 > next_replica_id:5
I160815 18:22:11.135628 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: BeginTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), ConditionalPut [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), Noop [/Min,/Min), pErr=<nil>
I160815 18:22:11.144642 storage/replica.go:1497  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): proposing REMOVE_REPLICA {NodeID:4 StoreID:4 ReplicaID:2} for range 39: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:3 StoreID:3 ReplicaID:3}]
I160815 18:22:11.146836 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: EndTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), pErr=<nil>

Is some kind of cache or gossip info not being invalidated?

@petermattis (Collaborator)

The trace is showing that we're timing out trying to figure out who the lease holder is when talking to replica {1 1 1}. We then talk to replica {0 0 0} (which I believe means a replica which isn't initialized) and that replica sends us back to replica {1 1 1}. Seems like we've entered some funky retry loop. I'm surprised that the uninitialized replica ({0 0 0}) would even get the request, but perhaps that can happen when a replica is removed and then re-added to a node fairly quickly. @bdarnell?

@bdarnell (Contributor)

If a replica is removed and then re-added, clients may have a stale copy of the range descriptor that still points to the replica under its old identity, so that's one way for an uninitialized replica to get traffic.

If it's still {0 0 0} then it must have just gotten a preemptive snapshot and not a full raft message yet. Replicas in this state have a creatingReplica which they use as a guess about the new leader, although this is never updated or invalidated after creation, so it could be stale.

So one scenario that I think is possible is that replica {1 1 1} (the original leader) adds another node (seen here as {0 0 0}) to the range, sends it a preemptive snapshot, then is itself removed before it can complete the ChangeReplicas. Then the new node still believes that {1 1 1} is the leader, but node {1 1 1} doesn't know anything.

The NotLeaseHolderErrors returned here are considered per-replica errors by DistSender, so if we get one from every replica in our cached range descriptor we invalidate our cache and try again. Since there was a failed ChangeReplicas call above, there is likely to be an intent on the range descriptor. The kv.rangeCache has some special logic to flip back and forth between pre- and post-intent versions of a range descriptor. I recall being suspicious of this code and thinking that it was possible to get stuck in a pathological case in which only one of the two versions would actually get used. I don't remember the details now, though.
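A rough sketch of that per-replica accounting (hypothetical names, not the real kv.rangeCache/DistSender code): a NotLeaseHolderError is charged to a single replica, and only a full sweep of them evicts the cached descriptor, after which the next lookup may hand back either the pre- or post-intent version.

package sketch

import "errors"

var errNotLeaseHolder = errors.New("not lease holder")

type rangeDescriptor struct{ replicas []int }

type descriptorCache interface {
	Lookup(key string) rangeDescriptor
	Evict(key string)
}

// trySend illustrates the shape of the loop: if every replica in the cached
// descriptor denies holding the lease, the descriptor is treated as stale,
// evicted, and looked up again. If the fresh lookup keeps returning a
// descriptor whose replicas all answer NotLeaseHolderError, this spins
// indefinitely, which is what the traces above appear to be stuck in.
func trySend(cache descriptorCache, key string, send func(replica int) error) error {
	for {
		desc := cache.Lookup(key)
		allNotLeaseHolder := true
		var lastErr error
		for _, r := range desc.replicas {
			err := send(r)
			if err == nil {
				return nil
			}
			lastErr = err
			if !errors.Is(err, errNotLeaseHolder) {
				allNotLeaseHolder = false
			}
		}
		if !allNotLeaseHolder {
			return lastErr
		}
		cache.Evict(key)
	}
}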

@bdarnell (Contributor)

Speaking of rangeCache, @spencerkimball just pointed out that it is not safe to mutate an OrderedCache from inside a DoRange callback. rangeCache does exactly this in clearOverlappingCachedRangeDescriptors.
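For readers outside the storage package, the hazard is the generic one of mutating an ordered container while a range iteration over it is in flight. A stripped-down sketch with a plain slice standing in for the OrderedCache (not the real cache API):

package sketch

// clearOverlapping shows the bug pattern: deleting entries from the ordered
// store while walking the same range over it. With a slice the indices shift
// under the loop, so entries get skipped or revisited; with a tree-backed
// cache like the one DoRange walks, deleting during the walk can invalidate
// the iterator itself, which is why mutating inside the callback is unsafe.
func clearOverlapping(keys []string, overlaps func(string) bool) []string {
	for i, k := range keys { // iterating...
		if overlaps(k) {
			keys = append(keys[:i], keys[i+1:]...) // ...while mutating: wrong
		}
	}
	return keys
}

// The safe shape: decide what to keep during the walk and mutate afterwards
// (or collect the doomed keys and delete them once the walk is done).
func clearOverlappingSafe(keys []string, overlaps func(string) bool) []string {
	kept := keys[:0]
	for _, k := range keys {
		if !overlaps(k) {
			kept = append(kept, k)
		}
	}
	return kept
}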

@petermattis (Collaborator)

@cuongdo I see the logic @bdarnell is referring to in DistSender.getDescriptors. I think it is worthwhile to add some additional tracing to that method and around the range cache in general.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Yes, the range cache is a black box in terms of tracing. I've already added tracing to my branch for debugging. I'll clean up my tracing and include it in a PR.


@cuongdo (Contributor, Author) commented Aug 16, 2016

With some additional logging, it looks like somehow, some lease requests are taking a very long time.

Here's how successful lease requests tend to look:

160816 14:43:59.776660 storage/replica_range_lease.go:133  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/4636959246303910691/"63ddd778-e765-40b2-aac8-4296c55bb29c"/39): proposing lease request
E160816 14:43:59.783776 storage/replica_range_lease.go:149  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/4636959246303910691/"63ddd778-e765-40b2-aac8-4296c55bb29c"/39): lease request succeeded

Here's an unhealthy lease renewal request for that same range:

E160816 14:47:46.905070 storage/replica_range_lease.go:133  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): proposing lease request
E160816 14:48:42.330343 storage/replica_range_lease.go:149  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): lease request succeeded

Note that this lease request took 56 seconds to complete. This period overlaps with queries involving this range that took a bit over 60 seconds to finish. During this lease-less time, queries to range 10 get stuck in a vicious retry loop because there's no lease holder to execute the BatchRequest.

Now, the question is: why did this lease request take 56 seconds??

@cuongdo (Contributor, Author) commented Aug 16, 2016

Interestingly, an ADD_REPLICA for range 10 overlaps with this lengthy lease renewal process:

I160816 14:47:46.528284 storage/replica.go:1497  store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:6} for range 10: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:3 StoreID:3 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:6}]

followed by a 3 second timeout:

E160816 14:47:49.540360 storage/queue.go:497  [replicate] on store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac84296c55bb29c"/2235): change replicas of range 10 failed: context deadline exceeded

Why and where are we timing out this replica change operation? I've noticed that some successful Raft operations take close to 3 seconds under load, so it seems easy to trigger change replica failures.

@petermattis (Collaborator)

Lease requests are raft commands. So there needs to be a raft leader for the range in question and there needs to be a quorum of nodes available to execute the command. If you turn on raft logging (--vmodule=raft=5), you should see what raft activity is taking place for the problematic range. Are we unable to elect a leader for some reason?

@petermattis (Collaborator)

Another possibility is that we do have a leader for the range, but are not quickly re-proposing the lease request command to it. See Replica.refreshPendingCommandsLocked.

@cuongdo (Contributor, Author) commented Aug 16, 2016

vmodule=raft=5 is very helpful.

In the span of a minute, the term for a problematic range is increasing by 32. There seems to be a rapid-fire succession of Raft elections for the same range. So, we're throwing away a lot of unstable log entries during that time and shifting the lease holder a ton.

@bdarnell (Contributor)

Can you post that vmodule=raft=5 log?

@cuongdo (Contributor, Author) commented Aug 16, 2016

Here are some logs for a test run that had this issue:

logs.zip

SQL tracing is on:

grep -H " done" cockroach.stderr*  | sort -nk 2 | tail -10
cockroach.stderr: 103351.836ms      0.126ms    INSERT 1 done
cockroach.stderr: 103468.860ms      0.162ms    INSERT 1 done
cockroach.stderr: 103700.492ms      0.114ms    INSERT 1 done
cockroach.stderr: 103878.283ms      0.108ms    INSERT 1 done
cockroach.stderr: 104028.905ms      0.222ms    INSERT 1 done
cockroach.stderr: 104071.897ms      0.104ms    INSERT 1 done
cockroach.stderr: 104097.830ms      0.104ms    INSERT 1 done
cockroach.stderr: 104119.329ms      0.208ms    INSERT 1 done
cockroach.stderr: 104253.536ms      0.098ms    INSERT 1 done
cockroach.stderr: 104373.214ms      0.122ms    INSERT 1 done

In these logs, range 11 is a problematic range that has frequent elections, sometimes multiple within a second.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Running with a --raft-tick-interval of 1s on three test runs. I'll let them go for a while and see what happens. So far, things are better than before.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Here are logs with a 1 second Raft interval:

raft_tick_1s.zip

I don't see election storms any more, but there are still instances of leaderless ranges resulting in dropped Raft proposals, and there are still lease holder changes (albeit much slower).

@cuongdo (Contributor, Author) commented Aug 16, 2016

It's worth noting that the 0 QPS periods are far rarer with a 1s raft tick interval. However, the 3-4 minute pauses, though rarer, still happen after 1-2 hours of load.

So, it's possible that queued-up heartbeats are behind the election storms I was seeing with the default raft tick interval.

Definitely open to suggestions on what to look at next. Was thinking of going back to the default tick interval and digging further.

@petermattis (Collaborator)

Let's try throttling rebalancing. I'm suspicious about how we rebalance by adding and then immediately removing a replica without waiting for the new replica to be brought up to date. Is there some scenario where a range can't achieve quorum until the new replica comes up to date?

diff --git a/storage/replicate_queue.go b/storage/replicate_queue.go
index ee0311d..139423c 100644
--- a/storage/replicate_queue.go
+++ b/storage/replicate_queue.go
@@ -205,6 +205,7 @@ func (rq *replicateQueue) process(
                if err = repl.ChangeReplicas(ctx, roachpb.ADD_REPLICA, rebalanceReplica, desc); err != nil {
                        return err
                }
+               time.Sleep(10 * time.Second)
        }

        // Enqueue this replica again to see if there are more changes to be made.

@petermattis (Collaborator)

Here is a better hack to try:

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 3b7634d..6da9cba 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -3029,6 +3029,17 @@ func (r *Replica) ChangeReplicas(
                        },
                })

+               // TODO(peter): Wait for the preemptive snapshot to be received and
+               // applied. We estimate that a 64 MB snapshot can be sent/applied in
+               // 20s. Remove this hack.
+               waitTime := time.Duration(len(snap.Data)) * 20 * time.Second / (64 << 20)
+               if waitTime < time.Second {
+                       waitTime = time.Second
+               }
+               log.Infof(ctx, "%s: waiting for preemptive snapshot: %d %.1fs",
+                       r, len(snap.Data), waitTime.Seconds())
+               time.Sleep(waitTime)
+
                repDesc.ReplicaID = updatedDesc.NextReplicaID
                updatedDesc.NextReplicaID++
                updatedDesc.Replicas = append(updatedDesc.Replicas, repDesc)

This will simulate what will happen if we waited for preemptive snapshots to apply. I've run a test for 5m with no blips using this patch.
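To make the hack's arithmetic concrete, here is the same formula as in the diff above, evaluated for a few snapshot sizes (the sizes are just examples):

package main

import (
	"fmt"
	"time"
)

// snapshotWait mirrors the formula in the patch above: 20s per 64 MB of
// snapshot data, with a 1s floor.
func snapshotWait(snapBytes int) time.Duration {
	wait := time.Duration(snapBytes) * 20 * time.Second / (64 << 20)
	if wait < time.Second {
		wait = time.Second
	}
	return wait
}

func main() {
	for _, size := range []int{1 << 20, 16 << 20, 64 << 20} {
		// 1 MB -> 1s (floor), 16 MB -> 5s, 64 MB -> 20s
		fmt.Printf("%d MB: %s\n", size>>20, snapshotWait(size))
	}
}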

@cuongdo (Contributor, Author) commented Aug 17, 2016

Will try this today.

As I mentioned in person to you, I believe that we need end-to-end tracing. That includes distributed tracing. An intermediate and still important step would be to extend tracing into the Raft layer. Diagnosing this issue thus far has required an inordinate amount of log deciphering and vmodule twiddling.

@petermattis (Collaborator)

Agreed. I'm not suggesting my patch as a fix, just another means of narrowing in on the problem. If excessive rebalancing is causing the issue it could focus the diagnostic efforts.

@cuongdo (Contributor, Author) commented Aug 17, 2016

I tried a build with #8604 applied, and the election storm & 2-3 minute queries still happen after 4-5 minutes on two separate runs.

@cuongdo (Contributor, Author) commented Aug 17, 2016

Next task is to dig into the voting mechanism and understand what's triggering all these elections, and why we get so many of these messages, which overlap with the issues:

I160817 15:16:14.184639 raft/raft.go:777  store=1:1 range=33 [/Table/51/1/7805692795270769010/"74f112bc-0758-453c-a573-4f02b88b4bdd"/1764-/Table/51/1/8087082408498019354/"d579a506-c6ef-4d4a-9af7-b2e26740a59f"/141): 1 no leader at term 11; dropping proposal

@petermattis (Collaborator)

Perhaps related to what you're looking into, but when sending a large snapshot I'm not seeing any activity (msgApp, msgHeartbeat, etc) for a range for over a second which then triggers an election. On the node sending the snapshot the logs contain:

I160817 12:50:26.273763 raft/raft.go:310  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 1 [firstindex: 95449, commit: 95525] sent snapshot[index: 95426, term: 6] to 4 [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160817 12:50:26.273801 raft/raft.go:312  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 1 paused sending replication messages to 4 [next = 1, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 95426]
I160817 12:50:30.639417 storage/raft.go:145  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830) raft ready

On another node I see:

I160817 12:50:26.274779 storage/raft.go:145  store=3:3 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830) raft ready
I160817 12:50:28.340271 raft/raft.go:549  store=3:3 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 2 is starting a new election at term 6

Hmm, this is curious. I see a 3 second gap in the logs for that first node. Not only is there no activity for range 7, there is no logging activity at all. On other nodes I still see activity during that 3 second gap. I wonder what is happening there.
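That would line up with how a tick-driven Raft follower decides to campaign: any stall long enough to swallow an election timeout's worth of ticks, whether in message delivery or in the processing goroutine, looks to a follower like a dead leader. A rough model (tick counts illustrative, not CockroachDB's actual configuration):

package sketch

// follower models the election timer in a tick-driven Raft (etcd/raft style):
// every append/heartbeat from the leader resets the counter, and if
// electionTimeoutTicks ticks pass without one, the follower campaigns.
// etcd/raft additionally randomizes the timeout to reduce split votes.
type follower struct {
	elapsedTicks         int
	electionTimeoutTicks int
	campaign             func()
}

func (f *follower) onLeaderMessage() { f.elapsedTicks = 0 }

func (f *follower) tick() {
	f.elapsedTicks++
	if f.elapsedTicks >= f.electionTimeoutTicks {
		f.elapsedTicks = 0
		f.campaign() // a multi-second stall on the leader's side ends up here
	}
}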

@cuongdo (Contributor, Author) commented Aug 17, 2016

Interesting. I could only find one instance of that in the logs for my latest run.


petermattis added a commit to petermattis/cockroach that referenced this issue Aug 17, 2016
Raft ready logging (enabled via -vmodule=raft=5) is extraordinarily
expensive when logging a large snapshot. In particular, if an outgoing
message contained a snapshot or raft.Ready.Snapshot was non-empty we
would see multiple seconds being spent formatting the log message only
for most of it to be thrown away. These delays horked up the raft
processing goroutine.

Added warning logs when raft ready processing and raft ticking take
longer than a second. When that happens something bad is usually going
on elsewhere.

See cockroachdb#8192.
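A minimal sketch of the warning the commit describes (names illustrative, not the actual storage code): time the Raft ready handling and complain when it crosses a second, since a stall of that length is enough for followers to start elections.

package sketch

import (
	"log"
	"time"
)

// processRaftReady wraps the real work with the timing check described above:
// anything at or over a second gets a warning in the logs.
func processRaftReady(rangeID int64, handle func()) {
	start := time.Now()
	handle()
	if elapsed := time.Since(start); elapsed >= time.Second {
		log.Printf("warning: handling raft ready for range %d took %.1fs", rangeID, elapsed.Seconds())
	}
}

The other half of the change is simply not formatting multi-megabyte snapshot payloads into -vmodule=raft=5 log messages in the first place.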
@cuongdo (Contributor, Author) commented Aug 17, 2016

So, it looks like Pete's earlier theory about the quorum being 3 is correct:

I160817 19:51:42.700488 raft/raft.go:763  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): 1 [quorum:3] has received 2 votes and 0 vote rejections

This is combined with some kind of blip:

I160817 19:51:39.298374 raft/raft.go:588  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): CDO: 1 (leader) received Raft message type MsgUnreachable
I160817 19:51:39.298419 raft/raft.go:712  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): 1 failed to send message to 4 because it is unreachable [next = 930, match = 929, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 1202]

This is enough to prevent forward progress with the election. I lost some logs from supervisord's default log truncation, so I'll continue to dig further to figure out what was going on during this blip.
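For the record, the arithmetic behind that quorum:3 line: quorum is a strict majority of every replica in the descriptor, including the newly added one that is still waiting on its snapshot.

package sketch

// quorum is the usual strict-majority rule: floor(n/2) + 1.
func quorum(replicas int) int { return replicas/2 + 1 }

// With 4 replicas (3 established plus the one still receiving its preemptive
// snapshot), quorum(4) == 3. So if the new replica's vote is unavailable, a
// single additional missing vote leaves only 2 of the required 3, exactly the
// "received 2 votes and 0 vote rejections" situation above.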

@bdarnell (Contributor)

#8613 should help with this, since it delays changing the quorum requirement until the preemptive snapshot has been applied.

@cuongdo (Contributor, Author) commented Aug 18, 2016

There's still an election storm with the latest master (0d11f31), though it does seem quite a bit better now. I'll verify more tomorrow.

It looks like ADD_REPLICA is sometimes quickly followed by the Raft leader stepping down (in this case ~2 seconds):

I160818 02:10:54.816511 storage/replica.go:1505  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:5} for range 7: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:5}]
I160818 02:10:54.819044 storage/replica.go:2155  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): applied part of replica change txn: EndTransaction [/Local/Range/"\xbb"/RangeDescriptor,/Min), pErr=<nil>
I160818 02:10:54.820230 storage/replica_command.go:3062  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): change replicas of 7: read existing descriptor range_id:7 start_key:"\273" end_key:"\273\211\375?\331\222@\342\271\350\215\02275ba1de4-58d6-4cf8-bce4-bf58e509eff6\000\001\367\023u" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:4 > replicas:<node_id:4 store_id:4 replica_id:5 > next_replica_id:6
I160818 02:10:54.822376 raft/raft.go:642  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 received msgApp rejection(lastindex: 116472) from 5 for index 117037
I160818 02:10:54.822423 raft/raft.go:644  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 decreased progress of 5 to [next = 116473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160818 02:10:55.211326 raft/raft.go:310  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 [firstindex: 117027, commit: 117036] sent snapshot[index: 117036, term: 6] to 5 [next = 116473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160818 02:10:55.211388 raft/raft.go:312  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 paused sending replication messages to 5 [next = 116473, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 117036]
I160818 02:10:55.477217 raft/raft.go:693  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 snapshot succeeded, resumed sending replication messages to 5 [next = 117037, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
W160818 02:10:56.710243 raft/raft.go:593  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 stepped down to follower since quorum is not active

After that, there's a series of 18 contested elections for the same range over 10 seconds (it seems like it can take a while for elections to converge when quorum=3). After the dust settles, a range lease command must be committed before write operations can succeed for that range. All this can take a while and causes BatchRequest timeouts (and retries) for that range during this period of election drama.

@cuongdo (Contributor, Author) commented Aug 18, 2016

These anomalies seem to have greatly reduced given the recent burst of Raft improvements. Will continue to monitor and update later.

@cuongdo (Contributor, Author) commented Aug 22, 2016

cuongdo closed this as completed Aug 22, 2016