acceptance: qps drops to 0 under block_writer load #8192

Closed
cuongdo opened this issue Aug 1, 2016 · 48 comments
@cuongdo (Contributor) commented Aug 1, 2016

Using this command:

make test \
  PKG=./acceptance \
  TESTTIMEOUT=6h \
  TESTS=ContinuousLoad_BlockWriter \
  TESTFLAGS='-v -remote -key-name google_compute_engine -cwd allocator_terraform -nodes 4 -tf.keep-cluster=failed'

block_writer qps drops to 0 and stays there after 1-3 minutes of load.

There are some stuck tasks:

[screenshot 2016-08-01 13 42 01]

Some Raft requests are taking a while:
[screenshot 2016-08-01 13 40 15]

I see some interesting log entries after about 10 seconds of load:

I160801 18:41:51.593554 kv/dist_sender.go:953  application error: key range /Local/Range/"\x96"/RangeDescriptor-/Min outside of bounds of range /Table/13-/Table/14
W160801 18:41:51.593601 kv/dist_sender.go:748  failed to invoke Get [/Local/Range/"\x96"/RangeDescriptor,/Min): key range /Local/Range/"\x96"/RangeDescriptor-/Min outside of bounds of range /Table/13-/Table/14

Then I see a lot of these entries:

I160801 18:41:56.881117 storage/replica.go:1660  store=1:1 range=7 [/Table/51-/Max): pending commands: refurbished 0, reproposing 7 (at 894.883); reasonTicks
I160801 18:41:56.895153 storage/replica.go:1881  store=1:1 range=7 [/Table/51-/Max): applying command with forced error: storage/replica.go:1833: command observed at lease index 890, but required < 884
E160801 18:41:56.895405 storage/replica.go:1923  store=1:1 range=7 [/Table/51-/Max): error executing raft command: storage/replica.go:1833: command observed at lease index 890, but required < 884
I160801 18:41:56.895476 storage/replica.go:1881  store=1:1 range=7 [/Table/51-/Max): applying command with forced error: storage/replica.go:1833: command observed at lease index 890, but required < 885

then lots of:

I160801 18:42:12.683543 storage/replica.go:1660  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): pending commands: refurbished 0, reproposing 3 (at 785.757); reasonTicks
I160801 18:42:12.755140 storage/replica.go:1881  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): applying command with forced error: storage/replica.go:1833: command observed at lease index 760, but required < 758
E160801 18:42:12.758939 storage/replica.go:1923  store=1:1 range=10 [/Table/51/1/2250817352697598038/"cbc9866f-4304-4d00-bd6b-4559a202b6f7"/200-/Table/51/1/4513707353089095697/"8e8d8c40-dd3c-4b96-9161-7ae75dc43bb4"/117): error executing raft command: storage/replica.go:1833: command observed at lease index 760, but required < 758

and some of these:

E160801 18:44:51.872548 storage/replica.go:1923  store=1:1 range=13 [/Table/51/1/1119228282337153470/"bd678e46-19a3-4e21-817d-802b14e56bd1"/390-/Table/51/1/1682660493493272475/"b6bf6e54-43b8-403a-b092-f6b44b60263a"/480): error executing raft command: cannot replace lease replica {2 2 2} 2016-08-01 18:43:42.39559981 +0000 UTC 18.934678912s with replica {3 3 3} 2016-08-01 18:43:02.077579421 +0000 UTC 4.75s: requested lease overlaps previous lease

Full logs (node.0 has the most errors):
node.0.cockroach.stderr.txt
node.1.cockroach.stderr.txt
node.2.cockroach.stderr.txt
node.3.cockroach.stderr.txt
writer.0.block_writer.stdout.txt

@tamird (Contributor) commented Aug 1, 2016

Just jumping in to point out that cockroach.storage.MultiRaft/RaftMessage is expected to be long-lived, since it is the entire life of the stream.
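For concreteness, a minimal sketch of that lifecycle (the stream and message types here are hypothetical stand-ins for the generated gRPC code): the stream is opened once per peer connection and every outgoing Raft message reuses it, so its recorded duration is the connection lifetime rather than the latency of any single message.

package sketch

import "context"

// Hypothetical stand-ins for the generated gRPC stream and message types.
type raftMessageRequest struct{}

type raftMessageClientStream interface {
	Send(*raftMessageRequest) error
}

// sendLoop holds a single RaftMessage stream open and pushes every outgoing
// message over it; the RPC only "ends" when the connection does.
func sendLoop(ctx context.Context, stream raftMessageClientStream, outgoing <-chan *raftMessageRequest) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case msg := <-outgoing:
			if err := stream.Send(msg); err != nil {
				return err // stream broken; the caller re-dials and reopens it
			}
		}
	}
}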

cuongdo pushed a commit to cuongdo/cockroach that referenced this issue Aug 2, 2016
While we were testing whether the various processes were alive, we weren't
testing whether queries were being processed. Now, we ensure that
QPS is non-zero. This would've caught cockroachdb#8192 automatically (I just
happened to notice the 0 QPS manually).

Also added a load test for photos.
cuongdo self-assigned this Aug 4, 2016
@cuongdo (Contributor, Author) commented Aug 8, 2016

After enabling SQL tracing, I've caught several traces for SQL queries that take longer than 2 minutes (!). Here's one of those traces.

This error happens upon every retry:

 82826.744ms      1.466ms    error, trying next peer: range 55: replica {0 0 0} not lease holder; node_id:4 store_id:4 replica_id:3  is

Somehow we're repeatedly directing the query to a replica that's not the range leaseholder?

cc @bdarnell

@cuongdo (Contributor, Author) commented Aug 8, 2016

This block of errors seems to happen with every retry (with some variation):

  9059.429ms      0.065ms    sending RPC to [/Table/51/1/2735983279382156089/"4dfdba17-db83-43e2-ba96-c4fea2c400a4"/4068, /Table/51/1/2768883119070632199/"091b14ee-72fd-4ab8-81b5-77d1ad012714"/13819)
 12059.653ms   3000.225ms    error, trying next peer: rpc error: code = 4 desc = context deadline exceeded
 15060.064ms   3000.411ms    error, trying next peer: range 55: replica {1 1 1} not lease holder; <nil> is
 15061.564ms      1.500ms    error, trying next peer: range 55: replica {0 0 0} not lease holder; node_id:4 store_id:4 replica_id:3  is
 18061.964ms   3000.401ms    reply error BeginTransaction [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min), ConditionalPut [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min), EndTransaction [/Table/51/1/2760458626219534663/"41957c6d-c074-4d73-a316-0017e4455073"/22818/0,/Min): failed to send RPC: sending to all 4 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded

@petermattis (Collaborator)

We preferentially send RPCs to the healthy replicas. I saw that problem once during some of my testing but never investigated further. Perhaps that is fouling things up here. I think you'll need to add some more logging/tracing to test this hypothesis.

@bdarnell (Contributor) commented Aug 8, 2016

We're trying the lease holder (or there is no lease holder), but it's not responding within the 3 second timeout, so we never officially notice that that node is the lease holder (it wouldn't matter anyway, since 3s would be enough for a lease to expire). We're cycling through all the replicas on each attempt. So the main question is why the lease holder is timing out all the time. (we could probably also reorganize our retry loops here to hold the original request to the lease holder open instead of cancelling it and restarting every cycle).
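For concreteness, a minimal sketch of the retry behavior described above (hypothetical names, not the actual DistSender code): each replica gets its own bounded attempt, and a timeout or NotLeaseHolderError simply advances to the next peer, so a lease holder that is slow but alive never gets waited on long enough to answer.

package sketch

import (
	"context"
	"fmt"
	"time"
)

type replica struct{ NodeID, StoreID, ReplicaID int }

// sendToReplicas cycles through the replicas, giving each a 3s attempt and
// moving on as soon as that attempt times out or is redirected elsewhere.
func sendToReplicas(ctx context.Context, replicas []replica,
	send func(context.Context, replica) error) error {
	var lastErr error
	for _, r := range replicas {
		attemptCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
		err := send(attemptCtx, r)
		cancel()
		if err == nil {
			return nil
		}
		// Timeout or NotLeaseHolderError: try the next peer.
		lastErr = err
	}
	return fmt.Errorf("sending to all %d replicas failed; last error: %v", len(replicas), lastErr)
}

The reorganization suggested above would amount to keeping the attempt to the presumed lease holder open across cycles instead of cancelling it and restarting it on every pass.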

@cuongdo (Contributor, Author) commented Aug 9, 2016

This issue seems to correlate with two other issues:

  1. storage: range for RangeID <ID> already exists on store #8408: panic: range for Range ID 96 already exists on store
  2. a burst of these errors (also mentioned in storage: Re-added replica can't recover #8111):
W160809 14:33:40.678747 storage/replica.go:407  store=2:2 range=18 [/Table/51/1/614236447022740779/"df08cbcb-7e82-425d-b0eb-7c309e150371"/503-/Table/51/1/1213563446305371573/"90b953b6-332d-4aa2-9a75-393f250a7ed8"/538): outgoing raft transport stream to {3 3 0} closed by the remote: rpc error: code = 2 desc = replicaID cannot move backwards from 4 to 0

I'll take a look at #1 first unless someone has a better idea.

@petermattis (Collaborator)

The second issue is related to preemptive snapshots (replica ID == 0).

@cuongdo (Contributor, Author) commented Aug 9, 2016

It also turns out that at least one node in the test cluster reliably panics or slows to a halt (~0 QPS) when I kill -9 any of the CockroachDB instances after the load has started.

@petermattis (Collaborator)

Really? My stress tool performs that exact action (kill -9) and I do see blips in QPS when a node is killed, but the cluster does recover.

@cuongdo (Contributor, Author) commented Aug 9, 2016

The first 3 times I performed the kill -9, at least one other node also went down. Now, I can't reproduce any more.

That said, the QPS blip is more than a blip. block_writer doesn't recover even after 5 minutes. It looks like it's contending with all the replica repair activity.


@petermattis (Collaborator)

The blips I see are for 2-10 seconds. What you're describing is significantly more severe.

@petermattis (Collaborator)

The original message says that this performance problem occurs after 1-3min. Is that still happening? I just tried to reproduce locally using a 3-node roachdemo and I'm not seeing the problem after running for 10min.

[screen shot 2016-08-10 at 8 55 08 pm]

@petermattis (Collaborator)

PS Mentioning this because it would be an interesting clue if the problem occurs on real networks but not when the nodes are running on the loopback network.

@petermattis (Collaborator)

The continuous load test also has a block_writer per node while my brief testing was only using a single block_writer. I wonder if that makes a difference. More specifically, does it make a difference for all nodes in the cluster to be receiving load vs only a single one?

@cuongdo (Contributor, Author) commented Aug 11, 2016

The context deadline exceeded errors on the gRPC sender side seem to correlate with these messages on the receiving end of the gRPC call:

I160811 02:03:18.022271 /go/src/google.golang.org/grpc/server.go:559  grpc: Server.processUnaryRPC failed to write status stream error: code = 4 desc = "context deadline exceeded"

So, it seems both sides are timing out. Why though?

Still digging...
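One note on why both ends report the same error: gRPC propagates the caller's deadline to the server, so the handler's context expires at the same instant as the client's. A minimal, CockroachDB-agnostic sketch of that coupling:

package sketch

import (
	"context"
	"time"
)

// client imposes a 3s deadline; in real gRPC that deadline travels with the
// request, so the server-side handler receives a context that expires at the
// same moment.
func client(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	return handler(ctx)
}

// handler stands in for the server side: if the work (say, a slow Raft
// proposal) outlives the propagated deadline, it too observes "context
// deadline exceeded", which is what Server.processUnaryRPC then reports.
func handler(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

So the server-side message looks like a symptom of the same 3s client deadline rather than an independent failure; the open question is why the work behind it takes longer than 3s.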

@cuongdo (Contributor, Author) commented Aug 11, 2016

Also preceding these context deadline exceeded errors is this error on the host that's timing out:

I160811 02:20:54.061891 storage/replica.go:394  store=2:2 range=158 [/Table/51/1/7634898010108526638/"23e5cb19-fe98-41e4-bfdd-204f01848841"/1690-/Table/51/1/7654371180957890736/"23e5cb19-fe98-41e4-bfdd-204f01848841"/11008): replica 2 too old, adding to replica GC queue

This same range is involved in the INSERTs that are taking 3-4 minutes to finish.

@cuongdo (Contributor, Author) commented Aug 11, 2016

True, though having multiple writers was causing enough issues that I've been using just one writer for the last couple of days.


@petermattis (Collaborator)

Ah, the difference is running 4 nodes vs running 3. I switched to running a 4 node cluster and it hit the 0 QPS problem in 1m57s starting from a clean slate.

@cuongdo (Contributor, Author) commented Aug 15, 2016

After just rebasing, when queries take >2 minutes to execute, we get bursts of these messages:

 31677.237ms      0.071ms    sending request to bwriter6h-furiouspare-cockroach-1:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 34677.785ms   3000.548ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 34677.794ms      0.009ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 34677.852ms      0.058ms    sending request to bwriter6h-furiouspare-cockroach-2:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 34679.394ms      1.542ms    error, trying next peer: range 39: replica {0 0 0} not lease holder; node_id:1 store_id:1 replica_id:1  is
 34679.409ms      0.015ms    error, trying next peer: range 39: replica {0 0 0} not lease holder; node_id:1 store_id:1 replica_id:1  is
 34679.457ms      0.048ms    sending request to bwriter6h-furiouspare-cockroach-3:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 37680.166ms   3000.710ms    reply error BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min): failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
 37680.184ms      0.018ms    evicting cached range descriptor
 39385.417ms   1705.233ms    meta descriptor lookup
 39385.429ms      0.012ms    looked up range descriptor from cache
 39385.474ms      0.045ms    sending RPC to [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433, /Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366)
 39385.528ms      0.054ms    sending request to bwriter6h-furiouspare-cockroach-1:26257: BeginTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), ConditionalPut [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min), EndTransaction [/Table/51/1/4839052094214986014/"dc9f6fa3-1779-4e4c-a13b-dbd2b28f9364"/6589/0,/Min)
 42385.972ms   3000.443ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is
 42385.980ms      0.008ms    error, trying next peer: range 39: replica {1 1 1} not lease holder; <nil> is

So, it looks like the range doesn't have a leader for a long time?

@cuongdo (Contributor, Author) commented Aug 15, 2016

At the same time as the above transaction, there are replica changes going on with range 39:

I160815 18:22:11.074447 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: BeginTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), ConditionalPut [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), Noop [/Min,/Min), pErr=txn aborted "change-replica" id=5b90b056 key=/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor rw=false pri=0.04147142 iso=SERIALIZABLE stat=PENDING epo=0 ts=1471285191.874227845,0 orig=1471285191.874227845,0 max=1471285191.874227845,0 wto=false rop=false
I160815 18:22:11.129756 storage/replica_command.go:3059  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): change replicas of 39: read existing descriptor range_id:39 start_key:"\273\211\375B\n\032\370\375\202E\254\02205edafd5-7606-4b27-890e-912cf875fe30\000\001\367\001\261" end_key:"\273\211\375D(\233*\036\247\356\353\022294d0a31-2d4f-4819-8001-aef8805f4abf\000\001\367\001n" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:4 store_id:4 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > replicas:<node_id:2 store_id:2 replica_id:4 > next_replica_id:5
I160815 18:22:11.135628 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: BeginTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), ConditionalPut [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), Noop [/Min,/Min), pErr=<nil>
I160815 18:22:11.144642 storage/replica.go:1497  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): proposing REMOVE_REPLICA {NodeID:4 StoreID:4 ReplicaID:2} for range 39: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:3 StoreID:3 ReplicaID:3}]
I160815 18:22:11.146836 storage/replica.go:2121  store=1:1 range=39 [/Table/51/1/4758645612977735084/"05edafd5-7606-4b27-890e-912cf875fe30"/433-/Table/51/1/4911345998852779755/"294d0a31-2d4f-4819-8001-aef8805f4abf"/366): applied part of replica change txn: EndTransaction [/Local/Range/"\xbb\x89\xfdB\n\x1a\xf8\xfd\x82E\xac\x1205edafd5-7606-4b27-890e-912cf875fe30\x00\x01\xf7\x01\xb1"/RangeDescriptor,/Min), pErr=<nil>

Is some kind of cache or gossip info not being invalidated?

@petermattis (Collaborator)

The trace is showing that we're timing out trying to figure out who the lease holder is when talking to replica {1 1 1}. We then talk to replica {0 0 0} (which I believe means a replica which isn't initialized) and that replica sends us back to replica {1 1 1}. Seems like we've entered some funky retry loop. I'm surprised that the uninitialized replica ({0 0 0}) would even get the request, but perhaps that can happen when a replica is removed and then re-added to a node fairly quickly. @bdarnell?

@bdarnell (Contributor)

If a replica is removed and then re-added, clients may have a stale copy of the range descriptor that still points to the replica under its old identity, so that's one way for an uninitialized replica to get traffic.

If it's still {0 0 0} then it must have just gotten a preemptive snapshot and not a full raft message yet. Replicas in this state have a creatingReplica which they use as a guess about the new leader, although this is never updated or invalidated after creation, so it could be stale.

So one scenario that I think is possible is that replica {1 1 1} (the original leader) adds another node (seen here as {0 0 0}) to the range, sends it a preemptive snapshot, then is itself removed before it can complete the ChangeReplicas. Then the new node still believes that {1 1 1} is the leader, but node {1 1 1} doesn't know anything.

The NotLeaseHolderErrors returned here are considered per-replica errors by DistSender, so if we get one from every replica in our cached range descriptor we invalidate our cache and try again. Since there was a failed ChangeReplicas call above, there is likely to be an intent on the range descriptor. The kv.rangeCache has some special logic to flip back and forth between pre- and post-intent versions of a range descriptor. I recall being suspicious of this code and thinking that it was possible to get stuck in a pathological case in which only one of the two versions would actually get used. I don't remember the details now, though.
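A rough sketch of that per-replica accounting (hypothetical names, not the real kv.rangeCache/DistSender code): a NotLeaseHolderError is charged to a single replica, and only a full sweep of them evicts the cached descriptor, after which the next lookup may hand back either the pre- or post-intent version.

package sketch

import "errors"

var errNotLeaseHolder = errors.New("not lease holder")

type rangeDescriptor struct{ replicas []int }

type descriptorCache interface {
	Lookup(key string) rangeDescriptor
	Evict(key string)
}

// trySend illustrates the shape of the loop: if every replica in the cached
// descriptor denies holding the lease, the descriptor is treated as stale,
// evicted, and looked up again. If the fresh lookup keeps returning a
// descriptor whose replicas all answer NotLeaseHolderError, this spins
// indefinitely, which is what the traces above appear to be stuck in.
func trySend(cache descriptorCache, key string, send func(replica int) error) error {
	for {
		desc := cache.Lookup(key)
		allNotLeaseHolder := true
		var lastErr error
		for _, r := range desc.replicas {
			err := send(r)
			if err == nil {
				return nil
			}
			lastErr = err
			if !errors.Is(err, errNotLeaseHolder) {
				allNotLeaseHolder = false
			}
		}
		if !allNotLeaseHolder {
			return lastErr
		}
		cache.Evict(key)
	}
}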

@bdarnell (Contributor)

Speaking of rangeCache, @spencerkimball just pointed out that it is not safe to mutate an OrderedCache from inside a DoRange callback. rangeCache does exactly this in clearOverlappingCachedRangeDescriptors.
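For readers outside the storage package, the hazard is the generic one of mutating an ordered container while a range iteration over it is in flight. A stripped-down sketch with a plain slice standing in for the OrderedCache (not the real cache API):

package sketch

// clearOverlapping shows the bug pattern: deleting entries from the ordered
// store while walking the same range over it. With a slice the indices shift
// under the loop, so entries get skipped or revisited; with a tree-backed
// cache like the one DoRange walks, deleting during the walk can invalidate
// the iterator itself, which is why mutating inside the callback is unsafe.
func clearOverlapping(keys []string, overlaps func(string) bool) []string {
	for i, k := range keys { // iterating...
		if overlaps(k) {
			keys = append(keys[:i], keys[i+1:]...) // ...while mutating: wrong
		}
	}
	return keys
}

// The safe shape: decide what to keep during the walk and mutate afterwards
// (or collect the doomed keys and delete them once the walk is done).
func clearOverlappingSafe(keys []string, overlaps func(string) bool) []string {
	kept := keys[:0]
	for _, k := range keys {
		if !overlaps(k) {
			kept = append(kept, k)
		}
	}
	return kept
}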

@petermattis (Collaborator)

@cuongdo I see the logic @bdarnell is referring to in DistSender.getDescriptors. I think it is worthwhile to add some additional tracing to that method and around the range cache in general.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Yes, the range cache is a black box in terms of tracing. I've already added tracing to my branch for debugging. I'll clean up my tracing and include it in a PR.


@cuongdo (Contributor, Author) commented Aug 16, 2016

With some additional logging, it looks like somehow, some lease requests are taking a very long time.

Here's how successful lease requests tend to look:

160816 14:43:59.776660 storage/replica_range_lease.go:133  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/4636959246303910691/"63ddd778-e765-40b2-aac8-4296c55bb29c"/39): proposing lease request
E160816 14:43:59.783776 storage/replica_range_lease.go:149  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/4636959246303910691/"63ddd778-e765-40b2-aac8-4296c55bb29c"/39): lease request succeeded

Here's an unhealthy lease renewal request for that same range:

E160816 14:47:46.905070 storage/replica_range_lease.go:133  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): proposing lease request
E160816 14:48:42.330343 storage/replica_range_lease.go:149  CDO: store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9-ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): lease request succeeded

Note that this lease request took 56 seconds to complete. This period overlaps with queries involving this range that took a bit over 60 seconds to finish. During this lease-less time, queries to range 10 get stuck in a vicious retry loop because there's no lease holder to execute the BatchRequest.

Now, the question is: why did this lease request take 56 seconds??

@cuongdo (Contributor, Author) commented Aug 16, 2016

Interestingly, an ADD_REPLICA for range 10 overlaps with this lengthy lease renewal process:

I160816 14:47:46.528284 storage/replica.go:1497  store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac8-4296c55bb29c"/2235): proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:6} for range 10: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:3 StoreID:3 ReplicaID:5} {NodeID:2 StoreID:2 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:6}]

followed by a 3 second timeout:

E160816 14:47:49.540360 storage/queue.go:497  [replicate] on store=1:1 range=10 [/Table/51/1/2187990796066310151/"05baf108-62ca-4625-abb9ab8323c81fe4"/222-/Table/51/1/2340439139283957407/"63ddd778-e765-40b2-aac84296c55bb29c"/2235): change replicas of range 10 failed: context deadline exceeded

Why and where are we timing out this replica change operation? I've noticed that some successful Raft operations take close to 3 seconds under load, so it seems easy to trigger change replica failures.

@petermattis (Collaborator)

Lease requests are raft commands. So there needs to be a raft leader for the range in question and there needs to be a quorum of nodes available to execute the command. If you turn on raft logging (--vmodule=raft=5), you should see what raft activity is taking place for the problematic range. Are we unable to elect a leader for some reason?

@petermattis (Collaborator)

Another possibility is that we do have a leader for the range, but are not quickly re-proposing the lease request command to it. See Replica.refreshPendingCommandsLocked.

@cuongdo (Contributor, Author) commented Aug 16, 2016

vmodule=raft=5 is very helpful.

In the span of a minute, the term for a problematic range is increasing by 32. There seems to be a rapid-fire succession of Raft elections for the same range. So, we're throwing away a lot of unstable log entries during that time and shifting the lease holder a ton.

@bdarnell (Contributor)

Can you post that vmodule=raft=5 log?

@cuongdo (Contributor, Author) commented Aug 16, 2016

Here are some logs for a test run that had this issue:

logs.zip

SQL tracing is on:

grep -H " done" cockroach.stderr*  | sort -nk 2 | tail -10
cockroach.stderr: 103351.836ms      0.126ms    INSERT 1 done
cockroach.stderr: 103468.860ms      0.162ms    INSERT 1 done
cockroach.stderr: 103700.492ms      0.114ms    INSERT 1 done
cockroach.stderr: 103878.283ms      0.108ms    INSERT 1 done
cockroach.stderr: 104028.905ms      0.222ms    INSERT 1 done
cockroach.stderr: 104071.897ms      0.104ms    INSERT 1 done
cockroach.stderr: 104097.830ms      0.104ms    INSERT 1 done
cockroach.stderr: 104119.329ms      0.208ms    INSERT 1 done
cockroach.stderr: 104253.536ms      0.098ms    INSERT 1 done
cockroach.stderr: 104373.214ms      0.122ms    INSERT 1 done

In these logs, range 11 is a problematic range that has frequent elections, sometimes multiple within a second.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Running with a --raft-tick-interval of 1s on three test runs. I'll let them go for a while and see what happens. So far, things are better than before.

@cuongdo (Contributor, Author) commented Aug 16, 2016

Here are logs with a 1 second Raft interval:

raft_tick_1s.zip

I don't see election storms any more, but there are still instances of leaderless ranges resulting in dropped Raft proposals, and there are still lease holder changes (albeit much slower).

@cuongdo (Contributor, Author) commented Aug 16, 2016

It's worth noting that the 0 QPS periods are far rarer with a 1s raft tick interval. However, the 3-4 minute pauses, though rarer, still happen after 1-2 hours of load.

So, it's possible that queued-up heartbeats are behind the election storms I was seeing with the default raft tick interval.

Definitely open to suggestions on what to look at next. Was thinking of going back to the default tick interval and digging further.

@petermattis (Collaborator)

Let's try throttling rebalancing. I'm suspicious about how we rebalance by adding and then immediately removing a replica without waiting for the new replica to be brought up to date. Is there some scenario where a range can't achieve quorum until the new replica comes up to date?

diff --git a/storage/replicate_queue.go b/storage/replicate_queue.go
index ee0311d..139423c 100644
--- a/storage/replicate_queue.go
+++ b/storage/replicate_queue.go
@@ -205,6 +205,7 @@ func (rq *replicateQueue) process(
                if err = repl.ChangeReplicas(ctx, roachpb.ADD_REPLICA, rebalanceReplica, desc); err != nil {
                        return err
                }
+               time.Sleep(10 * time.Second)
        }

        // Enqueue this replica again to see if there are more changes to be made.

@petermattis (Collaborator)

Here is a better hack to try:

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 3b7634d..6da9cba 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -3029,6 +3029,17 @@ func (r *Replica) ChangeReplicas(
                        },
                })

+               // TODO(peter): Wait for the preemptive snapshot to be received and
+               // applied. We estimate that a 64 MB snapshot can be sent/applied in
+               // 20s. Remove this hack.
+               waitTime := time.Duration(len(snap.Data)) * 20 * time.Second / (64 << 20)
+               if waitTime < time.Second {
+                       waitTime = time.Second
+               }
+               log.Infof(ctx, "%s: waiting for preemptive snapshot: %d %.1fs",
+                       r, len(snap.Data), waitTime.Seconds())
+               time.Sleep(waitTime)
+
                repDesc.ReplicaID = updatedDesc.NextReplicaID
                updatedDesc.NextReplicaID++
                updatedDesc.Replicas = append(updatedDesc.Replicas, repDesc)

This will simulate what will happen if we waited for preemptive snapshots to apply. I've run a test for 5m with no blips using this patch.
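To make the hack's arithmetic concrete, here is the same formula as in the diff above, evaluated for a few snapshot sizes (the sizes are just examples):

package main

import (
	"fmt"
	"time"
)

// snapshotWait mirrors the formula in the patch above: 20s per 64 MB of
// snapshot data, with a 1s floor.
func snapshotWait(snapBytes int) time.Duration {
	wait := time.Duration(snapBytes) * 20 * time.Second / (64 << 20)
	if wait < time.Second {
		wait = time.Second
	}
	return wait
}

func main() {
	for _, size := range []int{1 << 20, 16 << 20, 64 << 20} {
		// 1 MB -> 1s (floor), 16 MB -> 5s, 64 MB -> 20s
		fmt.Printf("%d MB: %s\n", size>>20, snapshotWait(size))
	}
}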

@cuongdo (Contributor, Author) commented Aug 17, 2016

Will try this today.

As I mentioned in person to you, I believe that we need end-to-end tracing. That includes distributed tracing. An intermediate and still important step would be to extend tracing into the Raft layer. Diagnosing this issue thus far has required an inordinate amount of log deciphering and vmodule twiddling.

@petermattis (Collaborator)

Agreed. I'm not suggesting my patch as a fix, just another means of narrowing in on the problem. If excessive rebalancing is causing the issue it could focus the diagnostic efforts.

@cuongdo (Contributor, Author) commented Aug 17, 2016

I tried a build with #8604 applied, and the election storm & 2-3 minute queries still happen after 4-5 minutes on two separate runs.

@cuongdo (Contributor, Author) commented Aug 17, 2016

Next task is to dig into the voting mechanism and understand what's triggering all these elections, and why we get so many of these messages, which overlap with the issues:

I160817 15:16:14.184639 raft/raft.go:777  store=1:1 range=33 [/Table/51/1/7805692795270769010/"74f112bc-0758-453c-a573-4f02b88b4bdd"/1764-/Table/51/1/8087082408498019354/"d579a506-c6ef-4d4a-9af7-b2e26740a59f"/141): 1 no leader at term 11; dropping proposal

@petermattis (Collaborator)

Perhaps related to what you're looking into, but when sending a large snapshot I'm not seeing any activity (msgApp, msgHeartbeat, etc) for a range for over a second which then triggers an election. On the node sending the snapshot the logs contain:

I160817 12:50:26.273763 raft/raft.go:310  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 1 [firstindex: 95449, commit: 95525] sent snapshot[index: 95426, term: 6] to 4 [next = 1, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160817 12:50:26.273801 raft/raft.go:312  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 1 paused sending replication messages to 4 [next = 1, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 95426]
I160817 12:50:30.639417 storage/raft.go:145  store=1:1 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830) raft ready

On another node I see:

I160817 12:50:26.274779 storage/raft.go:145  store=3:3 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830) raft ready
I160817 12:50:28.340271 raft/raft.go:549  store=3:3 range=7 [/Table/51-/Table/51/1/4590058141872891762/"01117443-cd00-4980-b0c5-33399b34d9b7"/2830): 2 is starting a new election at term 6

Hmm, this is curious. I see a 3 second gap in the logs for that first node. Not only is there no activity for range 7, there is no logging activity at all. On other nodes I still see activity during that 3 second gap. I wonder what is happening there.
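That would line up with how a tick-driven Raft follower decides to campaign: any stall long enough to swallow an election timeout's worth of ticks, whether in message delivery or in the processing goroutine, looks to a follower like a dead leader. A rough model (tick counts illustrative, not CockroachDB's actual configuration):

package sketch

// follower models the election timer in a tick-driven Raft (etcd/raft style):
// every append/heartbeat from the leader resets the counter, and if
// electionTimeoutTicks ticks pass without one, the follower campaigns.
// etcd/raft additionally randomizes the timeout to reduce split votes.
type follower struct {
	elapsedTicks         int
	electionTimeoutTicks int
	campaign             func()
}

func (f *follower) onLeaderMessage() { f.elapsedTicks = 0 }

func (f *follower) tick() {
	f.elapsedTicks++
	if f.elapsedTicks >= f.electionTimeoutTicks {
		f.elapsedTicks = 0
		f.campaign() // a multi-second stall on the leader's side ends up here
	}
}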

@cuongdo (Contributor, Author) commented Aug 17, 2016

Interesting. I could only find one instance of that in the logs for my latest run.


petermattis added a commit to petermattis/cockroach that referenced this issue Aug 17, 2016
Raft ready logging (enabled via -vmodule=raft=5) is extraordinarily
expensive when logging a large snapshot. In particular, if an outgoing
message contained a snapshot or raft.Ready.Snapshot was non-empty we
would see multiple seconds being spent formatting the log message only
for most of it to be thrown away. These delays horked up the raft
processing goroutine.

Added warning logs when raft ready processing and raft ticking take
longer than a second. When that happens something bad is usually going
on elsewhere.

See cockroachdb#8192.
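A minimal sketch of the warning the commit describes (names illustrative, not the actual storage code): time the Raft ready handling and complain when it crosses a second, since a stall of that length is enough for followers to start elections.

package sketch

import (
	"log"
	"time"
)

// processRaftReady wraps the real work with the timing check described above:
// anything at or over a second gets a warning in the logs.
func processRaftReady(rangeID int64, handle func()) {
	start := time.Now()
	handle()
	if elapsed := time.Since(start); elapsed >= time.Second {
		log.Printf("warning: handling raft ready for range %d took %.1fs", rangeID, elapsed.Seconds())
	}
}

The other half of the change is simply not formatting multi-megabyte snapshot payloads into -vmodule=raft=5 log messages in the first place.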
@cuongdo (Contributor, Author) commented Aug 17, 2016

So, it looks like Pete's earlier theory about the quorum being 3 is correct:

I160817 19:51:42.700488 raft/raft.go:763  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): 1 [quorum:3] has received 2 votes and 0 vote rejections

This is combined with some kind of blip:

I160817 19:51:39.298374 raft/raft.go:588  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): CDO: 1 (leader) received Raft message type MsgUnreachable
I160817 19:51:39.298419 raft/raft.go:712  store=1:1 range=16 [/Table/51/1/3999096291460540449/"cfb33dd5-8228-4be2-a2fa-990241850660"/44-/Table/51/1/4286884457940182869/"09e03c83-704e-4726-b404-ad424d3ad7e6"/466): 1 failed to send message to 4 because it is unreachable [next = 930, match = 929, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 1202]

This is enough to prevent forward progress with the election. I lost some logs from supervisord's default log truncation, so I'll continue to dig further to figure out what was going on during this blip.
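For the record, the arithmetic behind that quorum:3 line: quorum is a strict majority of every replica in the descriptor, including the newly added one that is still waiting on its snapshot.

package sketch

// quorum is the usual strict-majority rule: floor(n/2) + 1.
func quorum(replicas int) int { return replicas/2 + 1 }

// With 4 replicas (3 established plus the one still receiving its preemptive
// snapshot), quorum(4) == 3. So if the new replica's vote is unavailable, a
// single additional missing vote leaves only 2 of the required 3, exactly the
// "received 2 votes and 0 vote rejections" situation above.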

@bdarnell (Contributor)

#8613 should help with this, since it delays changing the quorum requirement until the preemptive snapshot has been applied.

@cuongdo (Contributor, Author) commented Aug 18, 2016

There's still an election storm with the latest master (0d11f31), though it does seem quite a bit better now. I'll verify more tomorrow.

It looks like ADD_REPLICA is sometimes quickly followed by the Raft leader stepping down (in this case ~2 seconds):

I160818 02:10:54.816511 storage/replica.go:1505  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:5} for range 7: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:4} {NodeID:4 StoreID:4 ReplicaID:5}]
I160818 02:10:54.819044 storage/replica.go:2155  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): applied part of replica change txn: EndTransaction [/Local/Range/"\xbb"/RangeDescriptor,/Min), pErr=<nil>
I160818 02:10:54.820230 storage/replica_command.go:3062  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): change replicas of 7: read existing descriptor range_id:7 start_key:"\273" end_key:"\273\211\375?\331\222@\342\271\350\215\02275ba1de4-58d6-4cf8-bce4-bf58e509eff6\000\001\367\023u" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:4 > replicas:<node_id:4 store_id:4 replica_id:5 > next_replica_id:6
I160818 02:10:54.822376 raft/raft.go:642  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 received msgApp rejection(lastindex: 116472) from 5 for index 117037
I160818 02:10:54.822423 raft/raft.go:644  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 decreased progress of 5 to [next = 116473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160818 02:10:55.211326 raft/raft.go:310  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 [firstindex: 117027, commit: 117036] sent snapshot[index: 117036, term: 6] to 5 [next = 116473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I160818 02:10:55.211388 raft/raft.go:312  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 paused sending replication messages to 5 [next = 116473, match = 0, state = ProgressStateSnapshot, waiting = true, pendingSnapshot = 117036]
I160818 02:10:55.477217 raft/raft.go:693  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 snapshot succeeded, resumed sending replication messages to 5 [next = 117037, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
W160818 02:10:56.710243 raft/raft.go:593  store=1:1 range=7 [/Table/51-/Table/51/1/4600869301715069069/"75ba1de4-58d6-4cf8-bce4-bf58e509eff6"/4981): 1 stepped down to follower since quorum is not active

After that, there's a series of 18 contested elections for the same range over 10 seconds (it seems like it can take a while for elections to converge when quorum=3). After the dust settles, a range lease command must be committed before write operations can succeed for that range. All this can take a while and causes BatchRequest timeouts (and retries) for that range during this period of election drama.

@cuongdo (Contributor, Author) commented Aug 18, 2016

These anomalies seem to have greatly reduced given the recent burst of Raft improvements. Will continue to monitor and update later.

@cuongdo (Contributor, Author) commented Aug 22, 2016

cuongdo closed this as completed Aug 22, 2016