
storage: simplify campaigning logic after PreVote is enabled #18365

Closed · petermattis opened this issue Sep 8, 2017 · 4 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-cleanup (Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.)

@petermattis (Collaborator)

Without PreVote, campaigning is disruptive to a range that currently has a leader because it forces a Raft election. But after splits, and to speed up tests, we want to elect a Raft leader for a range quickly. So we've added various hacks to eagerly campaign replicas (e.g. the leaseholder of a dormant replica will campaign upon creating its Raft group). When PreVote is enabled it seems feasible to eliminate these hacks and simply always campaign when a replica initializes its Raft group. The PreVote mechanism should prevent any disruption if there is already a Raft leader.
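The claim above can be illustrated with a minimal sketch (invented names, not CockroachDB's actual code): under PreVote, a would-be candidate first asks its peers whether they would grant a vote, without incrementing its term, and peers that have heard from a live leader refuse. So eagerly campaigning is harmless when a leader already exists, yet still elects one quickly when it doesn't.

```go
package main

import "fmt"

// peer models a replica's view for the purposes of a pre-vote decision.
type peer struct {
	term      int
	hasLeader bool // heard from a leader within the election timeout
}

// grantPreVote models a peer's response to a pre-vote request: refuse if
// we currently have a leader or the candidate's term is stale.
func (p *peer) grantPreVote(candidateTerm int) bool {
	return !p.hasLeader && candidateTerm >= p.term
}

// campaign reports whether the candidate wins a pre-vote majority.
func campaign(candidateTerm int, peers []*peer) bool {
	votes := 1 // vote for self
	for _, p := range peers {
		if p.grantPreVote(candidateTerm) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	// Range with an established leader: the eager campaign is rejected,
	// so there is no term bump and no leadership disruption.
	withLeader := []*peer{{term: 5, hasLeader: true}, {term: 5, hasLeader: true}}
	fmt.Println(campaign(5, withLeader)) // false

	// Leaderless range (e.g. right after a split): the campaign succeeds
	// instead of waiting out an election timeout.
	leaderless := []*peer{{term: 5, hasLeader: false}, {term: 5, hasLeader: false}}
	fmt.Println(campaign(5, leaderless)) // true
}
```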

@bdarnell Am I missing anything here that would require the existing campaigning logic once PreVote is enabled?

@petermattis petermattis added this to the 1.2 milestone Sep 8, 2017
@petermattis petermattis self-assigned this Sep 8, 2017
@bdarnell (Contributor) commented Sep 8, 2017

The delayed campaigning is not just to prevent disruption to the existing leader; it also avoids sending O(ranges) messages as a new node is warming up. I've paged out all the replica-initialization details, but it might be OK if the common case for replica initialization (a restarted node rejoining a cluster where everything else is up) goes through paths where we already learn about the leader (and therefore don't need to campaign).

@petermattis (Collaborator, Author)

Yeah, we wouldn't want to campaign at startup. I wasn't proposing to get rid of the lazy Raft loading. My proposal is to get rid of the logic for deciding whether to campaign a dormant replica when some operation is being performed and instead always campaign such replicas.
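The proposed simplification can be sketched as follows (hypothetical names, not CockroachDB's actual API): instead of each caller deciding whether a dormant replica should campaign when its Raft group is created, always campaign and let PreVote make it a no-op when a leader already exists.

```go
package main

import "fmt"

var campaigns int

// campaign is a stand-in for raftGroup.Campaign().
func campaign() { campaigns++ }

// Before: a per-caller heuristic decided whether to campaign.
func initRaftGroupOld(shouldCampaign bool) {
	if shouldCampaign {
		campaign()
	}
}

// After: unconditionally campaign on Raft group creation; PreVote keeps
// this cheap when a leader already exists.
func initRaftGroupNew() {
	campaign()
}

func main() {
	initRaftGroupOld(false) // heuristic said no: a leaderless range waits for a timeout
	initRaftGroupNew()      // always campaigns
	fmt.Println(campaigns)  // 1
}
```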

@petermattis petermattis assigned bdarnell and unassigned petermattis Jan 25, 2018
@bdarnell bdarnell modified the milestones: 2.0, 2.1 Feb 26, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 18, 2018
Fold the necessary checks into withRaftGroupLocked and remove
unnecessary arguments. This has the effect of campaigning somewhat
more than before but that's OK since PreVote minimizes the disruption.

Fixes cockroachdb#18365

Release note: None
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 23, 2018
@bdarnell bdarnell added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 26, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 2, 2018
craig bot pushed a commit that referenced this issue May 2, 2018
24920: storage: Simplify raft automatic campaigning after PreVote r=bdarnell a=bdarnell

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic. 

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one). 

Fixes #18365

Co-authored-by: Ben Darnell <[email protected]>
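The campaign-on-unquiesce trigger mentioned in the PR description can be sketched like this (invented names, not the actual implementation): when a quiesced range wakes up, it campaigns only if it doesn't already know of a live leader, so a healthy range unquiesces without forcing an election.

```go
package main

import "fmt"

type replica struct {
	quiesced    bool
	leaderKnown bool // Raft state records a current leader
	campaigned  bool
}

// maybeUnquiesce wakes a quiesced replica and campaigns if no leader is known.
func (r *replica) maybeUnquiesce() {
	if !r.quiesced {
		return
	}
	r.quiesced = false
	if !r.leaderKnown {
		r.campaigned = true // stand-in for raftGroup.Campaign()
	}
}

func main() {
	orphan := &replica{quiesced: true, leaderKnown: false}
	orphan.maybeUnquiesce()
	fmt.Println(orphan.campaigned) // true: a leaderless range elects quickly

	healthy := &replica{quiesced: true, leaderKnown: true}
	healthy.maybeUnquiesce()
	fmt.Println(healthy.campaigned) // false: no disruption to the leader
}
```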
@craig craig bot closed this as completed in #24920 May 2, 2018
@nvanbenschoten (Member)

I recently began seeing very high latencies (orders of magnitude larger than expected) when running tpccbench/nodes=3/cpu=4 with a total warehouse count of 100 and with 10. This size cluster can usually handle around 400 warehouses, so this is surprising.

I ran some request tracing and found a common theme between each of the slow SQL queries. They each contain a number of KV operations that all look like:

[screenshot, Jun 5 2018: Jaeger UI trace of one of the slow KV operations]

Notice that a lease request is retried multiple times because of reasonTicks, reasonNewLeader, and then reasonNewLeaderOrConfigChange, and it ends up taking around 3 or 4 seconds.

I was not seeing this on v2.0 but was on master, so I performed a bisection. This pointed at 44e3977 as the culprit. At the commit before, I reliably see normal latencies and none of these retry loops, but at that commit I reliably see this issue. The change seems innocuous enough, but the following jumped out at me:

This has the effect of campaigning somewhat
more than before but that's OK since PreVote minimizes the disruption.

Is it possible that this assumption is incorrect?

This issue and the resulting change were based on the idea that:

The PreVote mechanism should prevent any disruption if there is already a Raft leader.

What is the effect when there is not already a Raft leader? Is it possible that this change could slow down leader elections if multiple replicas attempt to become a leader at the same time when one does not yet exist?

@bdarnell (Contributor) commented Jun 5, 2018

I'm looking at this now for #26391. The biggest issue is that when a MsgVote arrives, we wake up and campaign before processing the message, causing the initial election to fail and requiring a timeout before the next one. I've fixed that, although I'm still seeing some slowness in elections that I'm trying to investigate.
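The ordering fix described here can be sketched as follows (invented names, illustrative only): if a replica is being woken up *by* another replica's (Pre)Vote request, it should deliver that message rather than launch a competing campaign, which would make the initial election fail and force a timeout.

```go
package main

import "fmt"

// shouldCampaignOnWake reports whether a freshly woken replica should
// campaign. If the wakeup was triggered by a pending (Pre)Vote message,
// a peer is already campaigning, so we should vote instead of competing.
func shouldCampaignOnWake(pendingMsgs []string) bool {
	for _, m := range pendingMsgs {
		if m == "MsgVote" || m == "MsgPreVote" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(shouldCampaignOnWake([]string{"MsgPreVote"})) // false: grant the peer's vote
	fmt.Println(shouldCampaignOnWake(nil))                    // true: no competing candidate
}
```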

What is the effect when there is not already a Raft leader? Is it possible that this change could slow down leader elections if multiple replicas attempt to become a leader at the same time when one does not yet exist?

Yes, that's a risk. It's always been the case, although this commit might have increased the chances somewhat. My assumption is that the wakeup events are unlikely to hit multiple replicas simultaneously.
