
stability: periods of 0 QPS under chaos #15163

Closed
petermattis opened this issue Apr 19, 2017 · 38 comments

@petermattis
Collaborator

blue, the chaos test cluster, is experiencing periods of 0 QPS:

screen shot 2017-04-19 at 12 34 36 pm

These seem to correspond to a spike in the raft-leader-not-leaseholder metric:

screen shot 2017-04-19 at 12 35 02 pm

Note that we're not seeing unavailable ranges, which have been the cause of previous 0 QPS issues on blue. My suspicion is that the raft-leader-not-leaseholder spikes indicate that we're not able to acquire the lease for certain ranges despite a quorum of replicas being available. Without the lease, no traffic. I haven't dug in any further and don't have any other evidence to validate this suspicion.

@petermattis petermattis added this to the 1.0 milestone Apr 19, 2017
@petermattis petermattis self-assigned this Apr 19, 2017
@petermattis
Collaborator Author

Another incident on blue:

screen shot 2017-04-21 at 9 15 24 am

Note that the queries didn't drop completely to 0, probably because photos was running on a handful of nodes. I believe block_writer traffic was completely wedged. The query drop happened 5min after the start of a chaos event:

screen shot 2017-04-21 at 9 15 55 am

Once again, raft-leader-not-leaseholder spikes during this event:

screen shot 2017-04-21 at 9 15 43 am

KV traffic dropped to near zero for most nodes:

screen shot 2017-04-21 at 9 16 06 am

The slow dist-sender shows growth right at the start of the chaos event and then another jump during the QPS valley:

screen shot 2017-04-21 at 9 24 02 am

The slow raft proposal metric shows activity only during the QPS valley:

screen shot 2017-04-21 at 9 24 13 am

Seems like Raft proposals aren't going through during this time period. But why?

@spencerkimball Any chance this is fixed by #15199? The binary on blue did not have that SHA when this event occurred.

@spencerkimball
Member

No, that would almost certainly not fix this problem. These are exactly the sorts of problems we are worst at debugging. This is why I was advocating for a page that gathers up all of the problems a node is experiencing from the top down.

@spencerkimball
Member

Damn mobile experience. I was trying to rewrite that as "from the bottom up", because it seems increasingly evident that top-down is usually a waste of time for diagnosing underlying problems:

  • query every range on every store:
    • have raft leader?
    • attempting to get lease?
    • last lease request error if yes
    • raft log size
    • command queue size / wait times
    • distribution of recent errors from range
    • raft reproposals

I think our big problem with diagnosis is that we have only store-aggregated details, while the trouble usually comes down to an individual range that's behaving badly.
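
For concreteness, a rough sketch of what such a per-range report might carry (a hypothetical type; the field names are illustrative, not existing CockroachDB code):

package debuginfo

import "time"

// rangeDebugInfo is an illustrative per-range report mirroring the list
// above; none of these names correspond to actual CockroachDB types.
type rangeDebugInfo struct {
	RangeID          int64
	HasRaftLeader    bool
	AttemptingLease  bool
	LastLeaseError   string         // last error from a lease request, if any
	RaftLogSizeBytes int64
	CommandQueueLen  int            // commands waiting in the command queue
	CommandQueueWait time.Duration  // recent max wait in the command queue
	RecentErrors     map[string]int // error type -> count over a recent window
	RaftReproposals  int
}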

@petermattis
Collaborator Author

petermattis commented Apr 21, 2017

Let's leave the brainstorming for better troubleshooting mechanisms to a separate issue.

Lots of slow requests recorded by Lightstep during the 0 QPS event. Looks like we're not electing a Raft leader:


0s node.Batch
0s [n10] 1 CPut, 1 BeginTxn, 1 EndTxn
0s [n10,s10] executing 3 requests
0s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] read-write path
0s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] command queue
4.73s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] raft
11.3s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] re-submitting command 27c87dff4c6e144d to Raft: reasonNewLeader
35.1s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] re-submitting command 27c87dff4c6e144d to Raft: reasonTicks
41.5s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] re-submitting command 27c87dff4c6e144d to Raft: reasonTicks
43.6s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] re-submitting command 27c87dff4c6e144d to Raft: reasonNewLeader
43.9s [n10,s10,r301/8:/Table/51/1/20{42945…-51916…}] applying command

Strange that we're not re-proposing more frequently for reasonTicks. I wonder if we're starving ranges of Raft processing time somehow.

Cc @bdarnell

@bdarnell
Contributor

Most ranges became quiescent during this incident:
image

And time tracked in the raft scheduler correspondingly dropped:
image

So it doesn't look like we're starving any ranges for processing time, unless perhaps there's high contention on Store.mu (which is the only place the raft scheduler could spend time that's not getting tracked, although even if that were the case I don't think the starved ranges could become quiescent).

I suspect we're getting starved for GRPC resources, not processing time. What build was this running? Did it have #14939 ?

@petermattis
Collaborator Author

I suspect we're getting starved for GRPC resources, not processing time. What build was this running? Did it have #14939 ?

It was running a development build from yesterday. Yes, it had #14939.

@bdarnell
Contributor

image

OK, let's zoom in on the raft messages being sent. This cluster is sending snapshots (or at least MsgSnap messages) fairly regularly even before the chaos event starts at 10:24. That's surprising. MsgSnap messages indicate raft-initiated snapshots (I think; I'm a little fuzzy on the interactions of these metrics), but the snapshot-applied graph shows only preemptive snapshots at this time.

When the node dies, we see a burst of Vote, VoteResp, Prop, and TimeoutNow messages. This is the expected failover process as ranges whose leader was on the dead node elect a new leader and transfer leases. The initial burst finishes after a minute, but we have continued MsgVote traffic without MsgVoteResp. This could mean a partition, but more likely it's a range that quiesced without all followers agreeing on this fact.

At 10:29, the repair process starts, but MsgSnap (raft-initiated snapshot) traffic decreases. We get a lot more elections at this point, which is unexpected, and VoteResp traffic lags significantly behind Vote traffic.

image

There is an increase in snapshots being reserved and applied, so we are sending preemptive snapshots (without accompanying MsgSnap messages this time).

@petermattis
Collaborator Author

That's surprising. MsgSnap messages indicate raft-initiated snapshots (I think; I'm a little fuzzy on the interactions of these metrics), but the snapshot-applied graph shows only preemptive snapshots at this time.

MsgSnap indicates any type of snapshot. See Store.HandleSnapshot.

There are also Raft dropped messages during this time, which is surprising given the QPS.

@bdarnell
Contributor

OK, paging back in exactly how the metrics work: MsgSnap is used for all snapshots, not just raft-initiated ones. It's recorded on the receiver side at the start of the process. The generated and applied metrics are recorded at the end. The difference in MsgSnap (almost 30 qps in the pre-chaos steady state) and generated/applied (0.05 qps) indicates that a lot of attempted snapshots are being declined.
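
As a rough sanity check on those numbers: ~30 MsgSnap/sec received versus ~0.05/sec generated and applied means on the order of 99.8% of attempted snapshots are being declined (presumably at the reservation stage).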

The fact that we were seeing a steady rate of snapshot activity before the chaos event makes me think that some range has been in a bad state (underreplicated) since the previous chaos event, and then this event knocks out a second replica of some critical range.

image

Snapshots show a similar pattern in previous events: snapshot activity hits a plateau at a low level for a long time before reaching zero. This sometimes, but not always, extends to the start of the next chaos event. If we're still doing repair work from the previous chaos event when the next one hits, we have a significant chance of replicas becoming under-replicated.

So here's a theory: After a long chaos event, the restarted node will have shed most of its replicas and be nearly empty. It will gradually fill back up, but if it doesn't reach equilibrium before the next chaos event, it will be the target of a disproportionate number of repaired replicas. The various levels of snapshot throttling will prevent this from finishing before the next chaos event.

image

Here we can see the recovery of one node overlapping with the death of the next. On the other hand, the ranges are not considered under-replicated for the duration of this rebalancing:

image

@bdarnell
Contributor

There are also Raft dropped messages during this time, which is surprising given the QPS.

Yeah. There's a large spike of dropped raft messages just after the part I displayed here; I cropped it out so it wouldn't dominate the y axis.

@petermattis
Collaborator Author

Snapshots show a similar pattern in previous events: snapshot activity hits a plateau at a low level for a long time before reaching zero.

Yep, the "plateau" is rebalancing back onto the previously dead node.

This sometimes, but not always, extends to the start of the next chaos event. If we're still doing repair work from the previous chaos event when the next one hits, we have a significant chance of replicas becoming under-replicated.

Under-replicated isn't a problem, though. Unavailable is. And we're not seeing unavailable ranges. It does seem suspicious that rebalancing was occurring when the chaos event happened. Perhaps the under-replicated/unavailable metrics are buggy.

So here's a theory: After a long chaos event, the restarted node will have shed most of its replicas and be nearly empty. It will gradually fill back up, but if it doesn't reach equilibrium before the next chaos event, it will be the target of a disproportionate number of repaired replicas. The various levels of snapshot throttling will prevent this from finishing before the next chaos event.

This definitely appears to be happening, but I'm not seeing how this explains the 0 QPS episodes given the evidence so far.

@petermattis
Collaborator Author

Perhaps the under-replicated/unavailable metrics are buggy.

I have to step out for a moment, but a quick inspection of calcReplicaMetrics shows it is using len(desc.Replicas) to compute quorum. Perhaps this is #10506.
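
To illustrate the concern (a hypothetical sketch, not the actual calcReplicaMetrics code): deriving quorum from the descriptor's current replica count means a range that is temporarily at 4 replicas mid-rebalance is judged against a quorum of 3, even though the configured replication factor is 3.

package metrics

// Hypothetical sketch of the suspect computation; the real calcReplicaMetrics
// differs. Quorum here is derived from however many replicas happen to be in
// the descriptor right now.
func quorumFromDescriptor(numReplicasInDesc int) int {
	return numReplicasInDesc/2 + 1
}

// Example: quorumFromDescriptor(3) == 2, but a range that is mid-rebalance at
// 4 replicas gets quorumFromDescriptor(4) == 3, so the under-replicated and
// unavailable metrics can flip based on a transient descriptor state.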

@bdarnell
Contributor

Yeah, I'm suspicious of those metrics too.

image

Here, the orange line is the 10:24 chaos event that caused an outage; the others were fine. We can see that the preceding outage was relatively long, causing the node to lose more of its ranges before coming back up and to take longer to recover. It was also recovering more slowly for a time.

The number of ranges lost is proportional to the outage duration, which suggests that we haven't been able to move everything away from the dead node and those replicas should have remained in the under-replicated state.

@petermattis
Collaborator Author

The number of ranges lost is proportional to the outage duration, which suggests that we haven't been able to move everything away from the dead node and those replicas should have remained in the under-replicated state.

I'm not sure I'm following you here. The orange line dips 5 min into the outage, which is when recovery starts (and when Grafana decides to zero that metric). For the preceding 5 min of the chaos event, everything was copacetic.

The preceding outage was longer, but recovery is prioritized ahead of rebalancing (we accomplish this by disallowing rebalances while a node is too far behind in its Raft logs), and the fact that we started rebalancing onto blue-0003 again is a good signal that recovery had completed. Also notice that the under-replicated metric was steady at 0 before the outage on blue-0004.

@bdarnell
Contributor

I'm talking about when the node that was killed comes back online. The orange line spikes back up near its previous value, but rapidly drops as the replicas that had been rebalanced away from this node are removed. If the node was down for long enough, we would expect this to drop all the way to zero, since all of the replicas would have been rebalanced away and the node would have to build back up from scratch. But it doesn't, which means that the node is still a member of some of the ranges it had been a part of before the restart.

But I think I had my graphs misaligned earlier - the under-replicated ranges metric does not go to zero during the outage; it decreases slowly throughout the outage and then drops to zero after the dead node comes back online. This is concerning (it's taking a long time for this node without much data to be recovered - we need to make preemptive snapshots for an under-replicated range use the higher "raft" snapshot rate limit), but wouldn't explain the 0 qps.
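
A minimal sketch of that idea, with hypothetical constants and signature (not CockroachDB's actual snapshot code): pick the higher rate limit whenever the snapshot is repairing an under-replicated range rather than rebalancing.

package allocator

// Hypothetical sketch of choosing a snapshot rate limit based on why the
// snapshot is being sent; the constants and names are illustrative only.
const (
	rebalanceSnapshotRate = 2 << 20 // 2 MiB/s, the slower rebalance limit
	recoverySnapshotRate  = 8 << 20 // assumed higher limit for repair work
)

func snapshotRateLimit(underReplicated bool) int64 {
	if underReplicated {
		// Repairing an under-replicated range is time-critical, so use the
		// higher ("raft") limit instead of the rebalance limit.
		return recoverySnapshotRate
	}
	return rebalanceSnapshotRate
}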

@petermattis
Collaborator Author

This is concerning (it's taking a long time for this node without much data to be recovered - we need to make preemptive snapshots for an under-replicated range use the higher "raft" snapshot rate limit), but wouldn't explain the 0 qps.

And the rate of recovery during the outage is much slower during the 0 QPS event than the preceding chaos event. Compare the under-replicated metric on the left and right:

screen shot 2017-04-22 at 1 39 26 pm

I'm guessing this is a symptom rather than the cause.

@petermattis
Collaborator Author

Another incident just occurred:

screen shot 2017-04-24 at 2 41 48 pm

In this case, we also see range unavailability:

screen shot 2017-04-24 at 2 42 04 pm

screen shot 2017-04-24 at 2 42 18 pm

The associated chaos event started at 18:24. This also coincided with growth in the number of ranges on the cluster. It's interesting that the cluster recovered even before the down node restarted. I have no explanation for that. Still investigating.

@petermattis
Collaborator Author

petermattis commented Apr 24, 2017

In this event, everything seemed to be horked up on a single range:

W170424 18:27:04.133661 6135093 storage/replica.go:2260  [n6,s6,r22/3:/Table/51{-/1/14348…}] have been waiting 1m0s for proposing command [txn: c2b2550e], BeginTransaction [/Table/51/1/109251574751109300/"a1f9f684-59ae-41ca-9c6c-2af5bad92c82"/48243/0,/Min), ConditionalPut [/Table/51/1/109251574751109300/"a1f9f684-59ae-41ca-9c6c-2af5bad92c82"/48243/0,/Min), EndTransaction [/Table/51/1/109251574751109300/"a1f9f684-59ae-41ca-9c6c-2af5bad92c82"/48243/0,/Min)

Immediately before this, I see a ChangeReplicas command timed out:

W170424 18:27:04.132865 6135083 storage/replica.go:1719  [n6,s6,r22/3:/Table/51{-/1/14348…}] context deadline exceeded while in command queue: [txn: 68364bdb], BeginTransaction [/Local/Range/"\xbb"/RangeDescriptor,/Min), ConditionalPut [/Local/Range/"\xbb"/RangeDescriptor,/Min)
E170424 18:27:04.133117 250 storage/queue.go:634  [replicate,n6,s6,r22/3:/Table/51{-/1/14348…}] change replicas of r22 failed: context deadline exceeded

About a minute before that:

I170424 18:26:02.755103 250 storage/store.go:3338  [replicate,n6,s6,r22/3:/Table/51{-/1/14348…}] streamed snapshot: kv pairs: 469551, log entries: 4069, rate-limit: 2.0 MiB, 31575ms
...
I170424 18:26:04.109628 250 storage/replica_command.go:3531  [replicate,n6,s6,r22/3:/Table/51{-/1/14348…}] change replicas (ADD_REPLICA {9 9 4}): read existing descriptor r22:/Table/51{-/1/143484301803991845/"15569dd5-02bd-4929-a58b-d830062f97f9"/841} [r1(n8,s8), r2(n4,s4), r3(n6,s6), next=4]
I170424 18:26:04.130010 6135068 storage/replica.go:2582  [n6,s6,r22/3:/Table/51{-/1/14348…}] proposing ADD_REPLICA {NodeID:9 StoreID:9 ReplicaID:4}: [{NodeID:8 StoreID:8 ReplicaID:1} {NodeID:4 StoreID:4 ReplicaID:2} {NodeID:6 StoreID:6 ReplicaID:3} {NodeID:9 StoreID:9 ReplicaID:4}]
I170424 18:26:04.133099 250 storage/replica_command.go:3531  [replicate,n6,s6,r22/3:/Table/51{-/1/14348…}] change replicas (REMOVE_REPLICA {8 8 1}): read existing descriptor r22:/Table/51{-/1/143484301803991845/"15569dd5-02bd-4929-a58b-d830062f97f9"/841} [r1(n8,s8), r2(n4,s4), r3(n6,s6), r4(n9,s9), next=5]

So it seems like we were trying to rebalance a range. We successfully added a new replica and then were trying to remove a replica and that removal failed. Note that the chaos event was for n4 which had one of the replicas for the range.

@bdarnell
Contributor

This looks like it might be #10506. We initially had replicas on nodes 4, 6, and 8, while 4 was down. We added a replica on node 9 (with votes from 6 and 8), then tried to remove the replica on 8 (we should try to choose node 4 in this case instead). This would require votes from all three of the live nodes (6, 8, 9). This should work (a node can vote for its own removal), but an ill-timed log truncation could have caused this range to get blocked until a raft snapshot could be processed (and we do see some raft snapshots in this time period).
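
(For reference, the 4-replica group needs a majority of 4/2 + 1 = 3 votes; with n4 down, that majority can only come from all three live replicas.)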

Of course, just after a chaos event a raft snapshot would also block any range that was reduced to 2/3 nodes, so this doesn't really explain things unless we can also figure out why the rebalance might cause a range to fall behind. Maybe setPendingSnapshotIndex isn't doing its job?

@petermattis
Collaborator Author

We see raft snapshots during this outage, but nothing for r22 (unless my grep skills are failing me). The preemptive snapshot to n9 took ~30s to send. When n4 was removed from r22 at 18:26:04, n9 wouldn't be able to vote until it caught up, right? That might have taken a while given all of the other activity on the system (the splits were causing rebalance traffic).

I'm trying to think of whether there is some logging I can add that would verify this scenario in advance of implementing #10506. I suppose when we're about to remove a replica we can have the leader output the Raft log state of the followers.

@bdarnell
Contributor

Right, n9 would not be able to vote until it caught up. The most severe form of this would be if the log was truncated and it needed a new snapshot, but I suppose even without that it would need the 30s of logs that were added during the time it took to send the snapshot (which would make it worse off than all the other ranges which also need 100% participation to make progress).

#10506 wouldn't really help here. If that were fixed, we'd remove node 4 instead of node 8, but we'd still need node 9 to catch up and vote. Fixing this appears to require more prioritization of "foreground" raft messages compared to snapshots, although we're already limiting snapshots pretty severely.

@petermattis
Collaborator Author

There is a flip side to #10506: we could avoid adding a replica to a range if the range is currently under-replicated. When n9 is added to r22, n4 has been dead for ~2m and is presumably somewhat behind. Currently we're taking a range with 2 of 3 good nodes to 2 of 4 good nodes, which is no bueno. We could also look at node liveness here and not perform rebalancing if one of the replicas is on a dead node.

@petermattis
Collaborator Author

I can reproduce some sort of related badness locally:

  • Start a new local 3 node cluster.
  • Run kv against the cluster (default args are fine). Let load run for ~1m.
  • Kill the 3rd node. Wait for ~30s.
  • Add a new node.

Boom! 0 QPS.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 25, 2017
Tweak Allocator.RebalanceTarget to not generate a rebalance target if
adding a new replica would cause quorum to be violated. For example,
rebalancing a 3 replica range requires temporarily up-replicating to 4
replicas which would violate quorum if one of the replicas is on a down
node. Not rebalancing means we'll have to wait for the node to be
declared dead and then go through the dead replica removal process
before adding a new replica (3->2->3) for which we'll be able to
maintain quorum at all times.

Fixes cockroachdb#15163 (at least part of it)
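
A minimal sketch of the guard described in that commit message (hypothetical names, not the actual Allocator.RebalanceTarget code): skip generating a rebalance target when growing the group would raise quorum beyond what the live replicas can supply.

package allocator

// Hypothetical sketch of the quorum check described above; the real
// Allocator code differs. Adding a replica grows the range from n to n+1
// members and raises the quorum from n/2+1 to (n+1)/2+1. Until the new
// replica catches up, only the currently live replicas can vote.
func rebalanceWouldViolateQuorum(liveReplicas, totalReplicas int) bool {
	newQuorum := (totalReplicas+1)/2 + 1
	return liveReplicas < newQuorum
}

// Example: a 3-replica range with one replica on a down node has
// liveReplicas=2, and up-replicating to 4 raises the quorum to 3, so the
// rebalance should be skipped (2 < 3).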
@petermattis
Collaborator Author

Gah, the 0 QPS episodes documented at the start of this issue are likely not fixed. Will get a new binary pushed to blue to confirm.

@petermattis petermattis reopened this Apr 25, 2017
@petermattis
Collaborator Author

Here is a 0 QPS event that occurred this morning:

screen shot 2017-04-25 at 12 30 10 pm

Interestingly, the start doesn't line up with a chaos event:

screen shot 2017-04-25 at 12 30 42 pm

Similar to the events described at the start of this issue, the leader-not-leaseholder metric is non-zero during this event:

screen shot 2017-04-25 at 12 34 00 pm

The slow raft proposal metric shows something is horked up at the Raft level:

screen shot 2017-04-25 at 12 30 19 pm

Those slow raft proposals are all from a single node: blue 7.

Looking at the logs on that node shows that a single range is having problems, r58:

W170425 04:56:28.466620 795267 storage/replica.go:2260  [n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] have been waiting 1m0s for proposing command [txn: 1a6e5339], BeginTransaction [/Table/51/1/433388562844729408/"e3c34960-b616-4d3d-b13c-fd94282ce30a"/705278/0,/Min), ConditionalPut [/Table/51/1/433388562844729408/"e3c34960-b616-4d3d-b13c-fd94282ce30a"/705278/0,/Min), EndTransaction [/Table/51/1/433388562844729408/"e3c34960-b616-4d3d-b13c-fd94282ce30a"/705278/0,/Min)

Preceding that warning are errors about failing to send a snapshot for that range:

I170425 04:54:45.073877 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot 6eb4aff2 at index 1022243
I170425 04:54:45.077384 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
I170425 04:54:45.138744 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot 4e92c946 at index 1022244
I170425 04:54:45.140112 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
I170425 04:54:45.233081 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot 2729c07e at index 1022244
I170425 04:54:45.237757 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
I170425 04:54:45.477449 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot b873d09a at index 1022245
I170425 04:54:45.486076 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
I170425 04:54:45.874128 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot 621bd186 at index 1022248
I170425 04:54:45.933565 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
I170425 04:54:46.681988 59 storage/replica_raftstorage.go:415  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] generated preemptive snapshot 9a4614a0 at index 1022251
I170425 04:54:46.688164 59 storage/replicate_queue.go:231  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] snapshot failed: r58: remote declined snapshot: reservation rejected
E170425 04:54:46.688420 59 storage/queue.go:634  [replicate,n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] failed to replicate [n7,s7,r58/12:/Table/51/1/43{04524…-94369…}] after 5 retries

The retries are likely a bit silly. We're probably trying to send the snapshot to the same node over and over. But that is only a performance issue, and a small one, because generating a snapshot primarily occurs after the remote has accepted the reservation.

I'm not seeing anything so far that would indicate why r58 is experiencing slowness. My guess is that something is going on with lease acquisition. The lease history stuff might help debug, but the binary currently on blue predates it. I'm going to push a new binary with the lease history set to a moderate size.

@petermattis
Collaborator Author

@bdarnell Mind taking a look at this as well? Perhaps I'm missing something about what happened to r58.

@petermattis
Collaborator Author

Interestingly, the blip in node liveness at 05:10:19 does not correspond to any actual node deaths. That was a node liveness hiccup, but at the end of that hiccup r58 was OK again.

@bdarnell
Contributor

image

Time accounted for in the raft scheduler drops to ~zero. There are two things that we could be waiting on in that loop that are not accounted for: either the store mutex or Store.processRequestQueue. The fact that there's no accounting for the latter looks like an oversight; it could be taking a significant amount of time. This includes sending any error responses on the raft stream, which might have to wait for a while if there's a sudden burst of replica too old errors (which is the main type of error sent here).
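
A sketch of the kind of accounting that seems to be missing (hypothetical names; the real scheduler loop differs): time the request-queue processing the same way the already-tracked raft work is timed.

package scheduler

import "time"

// Hypothetical sketch of accounting for both kinds of work done in the raft
// scheduler loop; the names do not correspond to the actual Store code.
type workTimings struct {
	RaftReady    time.Duration // already tracked today
	RequestQueue time.Duration // the currently untracked portion
}

func (t *workTimings) processOne(processRequestQueue, handleRaftReady func()) {
	start := time.Now()
	processRequestQueue() // may block sending error responses on the raft stream
	t.RequestQueue += time.Since(start)

	start = time.Now()
	handleRaftReady()
	t.RaftReady += time.Since(start)
}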

We see some changes around 4:48, 5 minutes before the full outage (could the store pool be incorrectly considering something dead at the 5 minute mark? Or maybe it's just taking 5 minutes for all the client threads to hit r58). MsgVotes start then, with a higher number of MsgVotes than MsgVoteResps, which is unexpected (but consistent with past instances of this problem); I'm not sure what it might mean.

@petermattis
Collaborator Author

I wonder what is causing the "dropped" Raft messages, and whether we're dropping MsgVote or MsgVoteResp messages.

@petermattis
Collaborator Author

Another 0 QPS episode just occurred on blue, so this wasn't fixed by #15355.

screen shot 2017-04-27 at 9 36 07 pm

screen shot 2017-04-27 at 9 36 15 pm

screen shot 2017-04-27 at 9 36 28 pm

Notice that the incident occurred outside of a chaos event. There were a handful of slow Raft proposals, but nothing persistent:

screen shot 2017-04-27 at 9 36 41 pm

screen shot 2017-04-27 at 9 37 12 pm

I'm not actually clear on what changed at 01:28 to allow the system to recover.

@a-robinson
Contributor

I don't yet know why everything got blocked up, but node 1 seems to have been the problem child. There are two things pointing at that:

  1. All the other nodes (at least the four I checked - n2, n7, n8, and n10) only really have two things in their logs for the duration of the outage -- gossip thrashing and snapshot attempts to n1 (almost all of which got rejected at the reservation stage).
  2. Azure's monitoring shows disk activity dropping to practically zero from 8:55-9:28 for all nodes other than n1 (all of which look similar to the first screenshot below), while n1 still had disk activity during the outage (as shown in the second screenshot):

screen shot 2017-04-28 at 12 37 33 am

screen shot 2017-04-28 at 12 37 02 am

The gossip thrashing is really bad, and I'm not sure why or whether it's related to the lack of forward progress. I'll prioritize looking into #9819.

@a-robinson
Contributor

Other interesting tidbits that don't necessarily explain anything by themselves:

  • Node 1 has way more data on it than any of the other nodes:

screen shot 2017-04-28 at 12 50 47 am

  • Nodes 2-10 had very little change in their number of SSTables during the outage, whereas node 1 appears to have been in the middle of some significant compactions that started flattening out around when the outage stopped:

screen shot 2017-04-28 at 12 54 07 am

@petermattis
Collaborator Author

blue 1 once again has a ton of .fuse_hidden files. I think that explains the data discrepancy.

I'm not sure what to make of the RocksDB SSTables metric. blue 8 also shows a remarkably sharp dip, though somewhat before the outage:

screen shot 2017-04-28 at 8 44 12 am

@tamird
Contributor

tamird commented Apr 30, 2017

Another zero QPS event on blue:

SQL:

cockroach-sql

Runtime:

cockroach-runtime

I'm not sure it's related, but poking around the admin UI I discovered that blue 7's liveness gossip was last updated at 2017-04-28 15:22:35.085695583 +0000 UTC (50+ hours ago) according to blue 7's own gossip. blue 7 also doesn't have a node descriptor in gossip.

/debug/request on blue 7 contains a few wedged requests that have been running for over 16 hours:

2017/04/30 19:43:17.558052	1049.859796	sql txn
19:43:17.558089	 .    37	... [client=127.0.0.1:33792,user=root,n11] executing 1/1: INSERT INTO blocks(block_id, writer_id, block_num, raw_bytes) VALUES (4058608034129001455, '36588123-f7ba-4b5e-9011-6e2e912bb530', 7335216, $1)
19:43:17.558164	 .    75	... [client=127.0.0.1:33792,user=root,n11] query not supported for distSQL: unsupported node *sql.insertNode
19:43:17.558216	 .    53	... txnID:e016abc1-0b3a-4473-9292-9c35448d2b6c
19:43:17.558254	 .    37	... [client=127.0.0.1:33792,user=root,n11] querying next range at /Table/51/1/4058608034129001455/"36588123-f7ba-4b5e-9011-6e2e912bb530"/7335216/0
19:43:17.558298	 .    44	... [client=127.0.0.1:33792,user=root,n11] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to r3187
19:43:17.558511	 .   213	... [client=127.0.0.1:33792,user=root,n11] application error: [NotLeaseHolderError] range 3187: replica (n11,s11):8 not lease holder; current lease is repl=(n6,s6):11 start=1493581019.136971047,0 epo=37 pro=1493581019.136989947,0
19:43:17.558515	 .     4	... [client=127.0.0.1:33792,user=root,n11] error, trying next peer
19:43:18.060190	 .501675	... [client=127.0.0.1:33792,user=root,n11] timeout, trying next peer
19:43:18.063872	 .  3682	... [client=127.0.0.1:33792,user=root,n11] application error: range 3187 was not found
19:43:18.063881	 .     9	... [client=127.0.0.1:33792,user=root,n11] error, trying next peer
19:43:18.064086	 .   205	... [client=127.0.0.1:33792,user=root,n11] application error: [NotLeaseHolderError] range 3187: replica (n11,s11):8 not lease holder; current lease is repl=(n6,s6):11 start=1493581019.136971047,0 epo=37 pro=1493581019.136989947,0
19:43:18.564253	 .500167	... [client=127.0.0.1:33792,user=root,n11] timeout, trying next peer
19:43:18.564588	 .   335	... [client=127.0.0.1:33792,user=root,n11] application error: [NotLeaseHolderError] range 3187: replica (n11,s11):8 not lease holder; current lease is repl=(n6,s6):11 start=1493581019.136971047,0 epo=37 pro=1493581019.136989947,0
...
<NotLeaseHolderError lines repeat>

@a-robinson
Contributor

There's no gossip entry for node 7 because node 7 is now node 11 as of a couple of days ago, which was caused by @arjunravinarayan recreating the instance:

CockroachDB node starting at 2017-04-28 15:22:35.5121194 +0000 UTC
build:      CCL 582a336 @ 2017/04/28 13:22:05 (go1.8.1)
admin:      https://cockroach-blue-07:8080
sql:        postgresql://root@cockroach-blue-07:26257?sslcert=certs%2Fclient.root.crt&sslkey=certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=certs%2Fca.crt
logs:       /home/cockroach/logs
store[0]:   path=/mnt/data1
status:     initialized new node, joined pre-existing cluster
clusterID:  7cef33d5-896a-48e3-94db-0c1909926054
nodeID:     11

@tamird
Contributor

tamird commented May 1, 2017

Something else has happened now; node 1 has exactly 1 replica, while all the others have ~1400.

cockroach-replicas

https://monitoring.crdb.io/dashboard/db/cockroach-replicas?from=1493556494496&to=1493602790063&var-cluster=blue&var-node=All&var-rate_interval=1m&panelId=28&fullscreen

@a-robinson
Contributor

Node 1 had a full disk due to a buildup of .fuse_hidden files. I've removed them, and it now appears to be reacquiring replicas.

@petermattis
Collaborator Author

Possibly fixed by #15573. I'm going to optimistically close this. Will re-open if another 0 QPS episode with similar symptoms recurs.
