stability: periods of 0 QPS under chaos #15163
Another incident on

Note that the queries didn't drop completely to 0, probably because

Once again, raft-leader-not-leaseholder spikes during this event:

KV traffic dropped to near zero for most nodes:

The slow dist-sender shows growth right at the start of the chaos event and then another jump during the QPS valley:

The slow raft proposal metric shows activity only during the QPS valley:

Seems like Raft proposals aren't going through during this time period. But why?

@spencerkimball Any chance this is fixed by #15199? The binary on
No, that would almost certainly not fix this problem. These are exactly the sorts of problems we are worst at debugging. This is why I was advocating for a page that gathers up all of the problems a node is experiencing from the top down.
Damn mobile experience. I was trying to rewrite that as "from bottom up" because it seems increasingly evident that top down is usually a waste of time in terms of diagnosing underlying problems:
I think our big problem with diagnoses is that we only have store-aggregated details, and it seems to come down to an individual range that's behaving badly.
Let's leave the brainstorming for better troubleshooting mechanisms to a separate issue.

Lots of slow requests recorded by Lightstep during the 0 QPS event. Looks like we're not electing a Raft leader:
Strange that we're not re-proposing more frequently for

Cc @bdarnell
Most ranges became quiescent during this incident:

And time tracked in the raft scheduler correspondingly dropped:

So it doesn't look like we're starving any ranges for processing time, unless perhaps there's high contention on Store.mu (which is the only place the raft scheduler could spend time that's not getting tracked, although even if that were the case I don't think the starved ranges could become quiescent). I suspect we're getting starved for gRPC resources, not processing time.

What build was this running? Did it have #14939?
OK, let's zoom in on the raft messages being sent.

This cluster is sending snapshots (or at least MsgSnap messages) fairly regularly even before the chaos event starts at 10:24. That's surprising. MsgSnap messages indicate raft-initiated snapshots (I think; I'm a little fuzzy on the interactions of these metrics), but the snapshot-applied graph shows only preemptive snapshots at this time.

When the node dies, we see a burst of Vote, VoteResp, Prop, and TimeoutNow messages. This is the expected failover process as ranges whose leader was on the dead node elect a new leader and transfer leases. The initial burst finishes after a minute, but we have continued MsgVote traffic without MsgVoteResp. This could mean a partition, but more likely it's a range that quiesced without all followers agreeing on this fact.

At 10:29, the repair process starts, but MsgSnap (raft-initiated snapshot) traffic decreases. We get a lot more elections at this point, which is unexpected, and VoteResp traffic lags significantly behind Vote traffic. There is an increase in snapshots being reserved and applied, so we are sending preemptive snapshots (without accompanying MsgSnap messages this time).
There are also dropped Raft messages during this time, which is surprising given the QPS.
OK, paging back in exactly how the metrics work: MsgSnap is used for all snapshots, not just raft-initiated ones. It's recorded on the receiver side at the start of the process. The generated and applied metrics are recorded at the end. The difference between MsgSnap (almost 30 qps in the pre-chaos steady state) and generated/applied (0.05 qps) indicates that a lot of attempted snapshots are being declined.

The fact that we were seeing a steady rate of snapshot activity before the chaos event makes me think that some range has been in a bad state (under-replicated) since the previous chaos event, and then this event knocks out a second replica of some critical range.

Snapshots show a similar pattern in previous events: snapshot activity hits a plateau at a low level for a long time before reaching zero. This sometimes, but not always, extends to the start of the next chaos event. If we're still doing repair work from the previous chaos event when the next one hits, we have a significant chance of replicas becoming under-replicated.

So here's a theory: After a long chaos event, the restarted node will have shed most of its replicas and be nearly empty. It will gradually fill back up, but if it doesn't reach equilibrium before the next chaos event, it will be the target of a disproportionate number of repaired replicas. The various levels of snapshot throttling will prevent this from finishing before the next chaos event. Here we can see the recovery of one node overlapping with the death of the next.

On the other hand, the ranges are not considered under-replicated for the duration of this rebalancing:
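For a rough sense of scale, here's the back-of-the-envelope arithmetic those readings imply (a sketch only; the rates are approximate values read off the graphs, not exact metrics):

```go
package main

import "fmt"

func main() {
	// Approximate rates read off the graphs above (assumptions, not exact values).
	msgSnapRate := 30.0 // MsgSnap messages/sec on the receiver side (snapshot attempts)
	appliedRate := 0.05 // snapshots generated/applied per second (snapshots that completed)

	declinedFraction := 1 - appliedRate/msgSnapRate
	fmt.Printf("~%.1f%% of attempted snapshots appear to be declined\n", declinedFraction*100)
}
```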
Yeah. There's a large spike of dropped raft messages just after the part I displayed here; I cropped it out so it wouldn't dominate the y-axis.
Yep, the "plateau" is rebalancing back onto the previously dead node.
Under-replicated isn't a problem, though. Unavailable is. And we're not seeing unavailable ranges. It does seem suspicious that rebalancing was occurring when the chaos event happened. Perhaps the under-replicated/unavailable metrics are buggy.
This definitely appears to be happening, but I'm not seeing how this explains the 0 QPS episodes given the evidence so far.
I have to step out for a moment, but a quick inspection of
Yeah, I'm suspicious of those metrics too.

Here, the orange line is the 10:24 chaos event that caused an outage; the others were fine. We can see that the preceding outage was relatively long, so the node lost more of its ranges before coming back up and took longer to recover. It was also recovering more slowly for a time. The number of ranges lost is proportional to the outage duration, which suggests that we haven't been able to move everything away from the dead node and those replicas should have remained in the under-replicated state.
I'm not sure I'm following you here. The orange line dips 5 minutes into the outage, which is when recovery starts (and when grafana decides to zero that metric). For the preceding 5 minutes of the chaos event, everything was copacetic. The preceding outage was longer, but recovery is prioritized over rebalancing (we accomplish this by disallowing rebalances while a node is too far behind in its Raft logs), and the fact that we started rebalancing onto
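As a rough illustration of the "disallow rebalances while behind" idea mentioned in the parenthetical above, here is a minimal sketch; the function name, field names, and threshold are assumptions, not the actual allocator check:

```go
package sketch

// tooFarBehind reports whether a follower's matched log index lags the
// leader's committed index by more than maxLag entries. Candidates for which
// this returns true would be skipped when choosing rebalance targets.
func tooFarBehind(followerMatchIndex, leaderCommitIndex, maxLag uint64) bool {
	return leaderCommitIndex > followerMatchIndex &&
		leaderCommitIndex-followerMatchIndex > maxLag
}
```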
I'm talking about when the node that was killed comes back online. The orange line spikes back up near its previous value, but rapidly drops as the replicas which had been rebalanced away from this node are removed. If the node was down for long enough, we would expect this to drop all the way to zero, since all of the replicas would have been rebalanced away and the node would have to build back up from scratch. But it's not dropping to zero, which means that the node is still a member of some of the ranges it had been a part of before the restart.

But I think I had my graphs misaligned earlier: the under-replicated ranges metric does not go to zero during the outage; it decreases slowly throughout the outage and then drops to zero after the dead node comes back online. This is concerning (it's taking a long time for this node without much data to be recovered - we need to make preemptive snapshots for an under-replicated range use the higher "raft" snapshot rate limit), but it wouldn't explain the 0 QPS.
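A minimal sketch of the rate-limit selection suggested there, with assumed names and values (not CockroachDB's actual settings or snapshot plumbing): snapshots that repair an under-replicated range get the faster recovery limit, and only pure rebalancing uses the slower one.

```go
package sketch

// Illustrative limits; the real values come from cluster settings.
const (
	rebalanceSnapshotRate = 2 << 20 // 2 MB/s: background rebalancing
	recoverySnapshotRate  = 8 << 20 // 8 MB/s: "raft"/recovery snapshots
)

// snapshotRateLimit picks a byte rate for sending a snapshot. Preemptive
// snapshots normally use the slower rebalance limit, but if the range is
// under-replicated the snapshot is really repair work and gets the faster
// recovery limit.
func snapshotRateLimit(preemptive, rangeUnderReplicated bool) int64 {
	if preemptive && !rangeUnderReplicated {
		return rebalanceSnapshotRate
	}
	return recoverySnapshotRate
}
```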
And the rate of recovery during the outage is much slower during the 0 QPS event than during the preceding chaos event. Compare the under-replicated metric on the left and right:

I'm guessing this is a symptom rather than the cause.
Another incident just occurred:

In this case, we also had range unavailability:

The associated chaos event started at 18:24. This also coincided with a growth in the number of ranges on the cluster. It's interesting that the cluster recovered even before the down node restarted. I have no explanation for that. Still investigating.
In this event, everything seemed to be horked up on a single range:
Immediately before this, I see a
About a minute before that:
So it seems like we were trying to rebalance a range. We successfully added a new replica and were then trying to remove a replica, and that removal failed. Note that the chaos event was for
This looks like it might be #10506. We initially had replicas on nodes 4, 6, and 8, while 4 was down. We added a replica on node 9 (with votes from 6 and 8), then tried to remove the replica on 8 (we should try to choose node 4 in this case instead). This would require votes from all three of the live nodes (6, 8, 9). This should work (a node can vote for its own removal), but an ill-timed log truncation could have caused this range to get blocked until a raft snapshot could be processed (and we do see some raft snapshots in this time period).

Of course, just after a chaos event a raft snapshot would also block any range that was reduced to 2/3 nodes, so this doesn't really explain things unless we can also figure out why the rebalance might cause a range to fall behind. Maybe
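To make the quorum arithmetic in that scenario concrete, here is a small worked example; it only restates the reasoning above (the node numbers come from the comment, everything else is illustrative):

```go
package main

import "fmt"

// quorum returns the number of voters needed for a Raft group of size n.
func quorum(n int) int { return n/2 + 1 }

func main() {
	total := 4    // replicas on n4, n6, n8, n9 after adding n9
	live := 3     // n6, n8, n9 (n4 is down)
	caughtUp := 2 // n6, n8; n9 is still catching up (possibly needing a snapshot)

	fmt.Println("quorum needed:", quorum(total))                             // 3
	fmt.Println("possible with live replicas:", live >= quorum(total))       // true, once n9 catches up
	fmt.Println("possible before n9 catches up:", caughtUp >= quorum(total)) // false -> proposals stall
}
```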
We see raft snapshots during this outage, but nothing for

I'm trying to think of whether there is some logging I can add that would verify this scenario in advance of implementing #10506. I suppose when we're about to remove a replica we can have the leader output the Raft log state of the followers.
Right, n9 would not be able to vote until it caught up. The most severe form of this would be if the log was truncated and it needed a new snapshot, but I suppose even without that it would need the 30s of logs that were added during the time it took to send the snapshot (which would make it worse off than all the other ranges which also need 100% participation to make progress).

#10506 wouldn't really help here. If that were fixed, we'd remove node 4 instead of node 8, but we'd still need node 9 to catch up and vote. Fixing this appears to require more prioritization of "foreground" raft messages compared to snapshots, although we're already limiting snapshots pretty severely.
There is a flip side to #10506: we could avoid adding a replica to a range if the range is currently under-replicated. When
I can reproduce some sort of related badness locally:
Boom! 0 QPS.
Tweak Allocator.RebalanceTarget to not generate a rebalance target if adding a new replica would cause quorum to be violated. For example, rebalancing a 3-replica range requires temporarily up-replicating to 4 replicas, which would violate quorum if one of the replicas is on a down node. Not rebalancing means we'll have to wait for the node to be declared dead and then go through the dead-replica removal process before adding a new replica (3->2->3), for which we'll be able to maintain quorum at all times.

Fixes cockroachdb#15163 (at least part of it)
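A minimal sketch of the quorum guard that commit describes; the type and function names here are illustrative stand-ins, not the actual Allocator.RebalanceTarget code:

```go
package sketch

// ReplicaInfo is a stand-in for the real replica descriptor.
type ReplicaInfo struct {
	NodeID int
	Live   bool
}

// okToRebalance reports whether a rebalance may temporarily up-replicate the
// range without dropping below quorum, given which existing replicas are live.
func okToRebalance(existing []ReplicaInfo) bool {
	live := 0
	for _, r := range existing {
		if r.Live {
			live++
		}
	}
	newTotal := len(existing) + 1 // rebalance goes 3 -> 4 before removing a replica
	newQuorum := newTotal/2 + 1
	// Example: 3 replicas with 1 on a down node -> live=2 < quorum(4)=3, so skip
	// the rebalance and let dead-replica removal handle it (3->2->3) instead.
	return live >= newQuorum
}
```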
Gah, the 0 QPS episodes documented at the start of this issue are likely not fixed. Will get a new binary pushed to
Here is a 0 QPS event that occurred this morning:

Interestingly, the start doesn't line up with a chaos event:

Similar to the events described at the start of this issue, the leader-not-leaseholder metric is non-zero during this event:

The slow raft proposal metric shows something is horked up at the Raft level:

Those slow raft proposals are all from a single node:

Looking at the logs on that node shows that a single range is having problems,
Preceding that warning are errors about failing to send a snapshot for that range:
The retries are likely a bit silly. We're probably trying to send the snapshot to the same node over and over. But that is only a performance issue, and a small one, because generating a snapshot primarily occurs after the remote has accepted the reservation. I'm not seeing anything so far that would indicate why
@bdarnell Mind taking a look at this as well? Perhaps I'm missing something about what happened to
Interestingly, the blip in node liveness at 05:10:19 does not correspond to any actual node deaths. That was a node liveness hiccup, but at the end of that hiccup
Time accounted for in the raft scheduler drops to ~zero. There are two things that we could be waiting on in that loop that are not accounted for: either the store mutex or

We see some changes around 4:48, 5 minutes before the full outage (could the store pool be incorrectly considering something dead at the 5-minute mark? Or maybe it's just taking 5 minutes for all the client threads to hit r58). MsgVotes start then, with a higher number of MsgVotes than MsgVoteResps, which is unexpected (but consistent with past instances of this problem); I'm not sure what it might mean.
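For intuition on why those waits wouldn't show up in the metric, here's a hypothetical worker-loop shape (the names and structure are assumptions, not the actual raft scheduler code): only the processing call itself is bracketed by the timer, so blocking on the store mutex or on the incoming work queue never reaches the recorded time.

```go
package sketch

import (
	"sync"
	"time"
)

// workerLoop sketches how a scheduler worker might account for its time.
// Everything outside the Now()/Since() bracket - waiting for work and
// waiting for the store mutex - is invisible to the recorded metric.
func workerLoop(
	storeMu *sync.Mutex,
	work <-chan int64,
	lookup func(rangeID int64) func(),
	record func(time.Duration),
) {
	for rangeID := range work { // blocking here is not timed
		storeMu.Lock() // nor is waiting for the mutex
		process := lookup(rangeID)
		storeMu.Unlock()

		start := time.Now()
		process() // only this span feeds the "time in raft scheduler" metric
		record(time.Since(start))
	}
}
```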
I wonder what is causing the "dropped" Raft messages. I wonder if we're dropping
Another 0 QPS episode just occurred on

Notice that the incident occurred outside of a chaos event. There were a handful of slow Raft proposals, but nothing persistent:

I'm not actually clear what changed at 01:28 to allow the system to recover.
I don't yet know why everything got blocked up, but node 1 seems to have been the problem child. There are two things pointing at that:
The gossip thrashing is really bad, and I'm not sure why, or whether it's related to the lack of forward progress. I'll prioritize looking into #9819.
Other interesting tidbits that don't necessarily explain anything by themselves:
Another zero QPS event on blue. SQL:

I'm not sure it's related, but poking around the admin UI I discovered that blue 7's liveness gossip was last updated at 2017-04-28 15:22:35.085695583 +0000 UTC (50+ hours ago) according to blue 7's own gossip. blue 7 also doesn't have a node descriptor in gossip.

/debug/request on blue 7 contains a few wedged requests that have been running for over 16 hours:
There's no gossip entry for node 7 because node 7 is now node 11 as of a couple days ago, which was caused by @arjunravinarayan recreating the instance:
Something else has happened now; node 1 has exactly 1 replica, while all the others have ~1400.
Node 1 had a full disk due to a buildup of
Possibly fixed by #15573. I'm going to optimistically close this. Will re-open if another 0 QPS episode with similar symptoms recurs.
blue, the chaos test cluster, is experiencing periods of 0 QPS:

These seem to correspond to a spike in the raft-leader-not-leaseholder metric:

Note that we're not seeing unavailable ranges, which have been the cause of previous 0 QPS issues on blue. My suspicion is that the raft-leader-not-leaseholder spikes indicate that we're not able to acquire the lease for certain ranges despite a quorum of replicas being available. Without the lease, no traffic. I haven't dug in any further and don't have any other evidence to validate my suspicion.