
storage: simplify campaigning logic after PreVote is enabled #18365

Closed · petermattis opened this issue Sep 8, 2017 · 4 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-cleanup (Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.)

@petermattis (Collaborator)

Without PreVote, campaigning is disruptive to a range that currently has a leader because it forces a Raft election. But after splits, and to speed up tests, we want to elect a Raft leader for a range quickly. So we've added various hacks to eagerly campaign replicas (e.g. the leaseholder of a dormant replica will campaign upon creating its Raft group). When PreVote is enabled it seems feasible to eliminate these hacks and simply always campaign when a replica initializes its Raft group. The PreVote mechanism should prevent any disruption if there is already a Raft leader.
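The claim above can be illustrated with a minimal sketch (invented names, not CockroachDB's actual code): under PreVote, a would-be candidate first asks its peers whether they would grant a vote, without incrementing its term, and peers that have heard from a live leader refuse. So eagerly campaigning is harmless when a leader already exists, yet still elects one quickly when it doesn't.

```go
package main

import "fmt"

// peer models a replica's view for the purposes of a pre-vote decision.
type peer struct {
	term      int
	hasLeader bool // heard from a leader within the election timeout
}

// grantPreVote models a peer's response to a pre-vote request: refuse if
// we currently have a leader or the candidate's term is stale.
func (p *peer) grantPreVote(candidateTerm int) bool {
	return !p.hasLeader && candidateTerm >= p.term
}

// campaign reports whether the candidate wins a pre-vote majority.
func campaign(candidateTerm int, peers []*peer) bool {
	votes := 1 // vote for self
	for _, p := range peers {
		if p.grantPreVote(candidateTerm) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	// Range with an established leader: the eager campaign is rejected,
	// so there is no term bump and no leadership disruption.
	withLeader := []*peer{{term: 5, hasLeader: true}, {term: 5, hasLeader: true}}
	fmt.Println(campaign(5, withLeader)) // false

	// Leaderless range (e.g. right after a split): the campaign succeeds
	// instead of waiting out an election timeout.
	leaderless := []*peer{{term: 5, hasLeader: false}, {term: 5, hasLeader: false}}
	fmt.Println(campaign(5, leaderless)) // true
}
```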

@bdarnell Am I missing anything here that would require the existing campaigning logic once PreVote is enabled?

@petermattis petermattis added this to the 1.2 milestone Sep 8, 2017
@petermattis petermattis self-assigned this Sep 8, 2017
@bdarnell (Contributor) commented Sep 8, 2017

The delayed campaigning is not just to prevent disruption to the existing leader; it also avoids sending O(ranges) messages as a new node is warming up. I've paged out all the replica-initialization details, but it might be OK if the common case for replica initialization (a restarted node rejoining a cluster where everything else is up) goes through paths where we already learn about the leader (and therefore don't need to campaign).

@petermattis (Collaborator, Author)

Yeah, we wouldn't want to campaign at startup. I wasn't proposing to get rid of the lazy Raft loading. My proposal is to get rid of the logic for deciding whether to campaign a dormant replica when some operation is being performed and instead always campaign such replicas.
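The proposed simplification can be sketched as follows (hypothetical names, not CockroachDB's actual API): instead of each caller deciding whether a dormant replica should campaign when its Raft group is created, always campaign and let PreVote make it a no-op when a leader already exists.

```go
package main

import "fmt"

var campaigns int

// campaign is a stand-in for raftGroup.Campaign().
func campaign() { campaigns++ }

// Before: a per-caller heuristic decided whether to campaign.
func initRaftGroupOld(shouldCampaign bool) {
	if shouldCampaign {
		campaign()
	}
}

// After: unconditionally campaign on Raft group creation; PreVote keeps
// this cheap when a leader already exists.
func initRaftGroupNew() {
	campaign()
}

func main() {
	initRaftGroupOld(false) // heuristic said no: a leaderless range waits for a timeout
	initRaftGroupNew()      // always campaigns
	fmt.Println(campaigns)  // 1
}
```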

@petermattis petermattis assigned bdarnell and unassigned petermattis Jan 25, 2018
@bdarnell bdarnell modified the milestones: 2.0, 2.1 Feb 26, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 18, 2018
Fold the necessary checks into withRaftGroupLocked and remove
unnecessary arguments. This has the effect of campaigning somewhat
more than before but that's OK since PreVote minimizes the disruption.

Fixes cockroachdb#18365

Release note: None
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 23, 2018
@bdarnell bdarnell added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 26, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 2, 2018
craig bot pushed a commit that referenced this issue May 2, 2018
24920: storage: Simplify raft automatic campaigning after PreVote r=bdarnell a=bdarnell

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic. 

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one). 

Fixes #18365

Co-authored-by: Ben Darnell <[email protected]>
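The campaign-on-unquiesce trigger mentioned in the PR description can be sketched like this (invented names, not the actual implementation): when a quiesced range wakes up, it campaigns only if it doesn't already know of a live leader, so a healthy range unquiesces without forcing an election.

```go
package main

import "fmt"

type replica struct {
	quiesced    bool
	leaderKnown bool // Raft state records a current leader
	campaigned  bool
}

// maybeUnquiesce wakes a quiesced replica and campaigns if no leader is known.
func (r *replica) maybeUnquiesce() {
	if !r.quiesced {
		return
	}
	r.quiesced = false
	if !r.leaderKnown {
		r.campaigned = true // stand-in for raftGroup.Campaign()
	}
}

func main() {
	orphan := &replica{quiesced: true, leaderKnown: false}
	orphan.maybeUnquiesce()
	fmt.Println(orphan.campaigned) // true: a leaderless range elects quickly

	healthy := &replica{quiesced: true, leaderKnown: true}
	healthy.maybeUnquiesce()
	fmt.Println(healthy.campaigned) // false: no disruption to the leader
}
```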
@craig craig bot closed this as completed in #24920 May 2, 2018
@nvanbenschoten (Member)

I recently began seeing very high latencies (orders of magnitude larger than expected) when running tpccbench/nodes=3/cpu=4 with a total warehouse count of 100 and with 10. This size cluster can usually handle around 400 warehouses, so this is surprising.

I ran some request tracing and found a common theme between each of the slow SQL queries. They each contain a number of KV operations that all look like:

[screenshot, Jun 5 2018: Jaeger UI trace of one of the slow KV operations]

Notice that a lease request is retried multiple times because of reasonTicks, reasonNewLeader, and then reasonNewLeaderOrConfigChange, and it ends up taking around 3 or 4 seconds.

I was not seeing this on v2.0 but was on master, so I performed a bisection. This pointed at 44e3977 as the culprit. At the commit before, I reliably see normal latencies and none of these retry loops, but at that commit I reliably see this issue. The change seems innocuous enough, but the following jumped out at me:

This has the effect of campaigning somewhat
more than before but that's OK since PreVote minimizes the disruption.

Is it possible that this assumption is incorrect?

This issue and the resulting change were based on the idea that:

The PreVote mechanism should prevent any disruption if there is already a Raft leader.

What is the effect when there is not already a Raft leader? Is it possible that this change could slow down leader elections if multiple replicas attempt to become a leader at the same time when one does not yet exist?

@bdarnell (Contributor) commented Jun 5, 2018

I'm looking at this now for #26391. The biggest issue is that when a MsgVote arrives, we wake up and campaign before processing the message, causing the initial election to fail and requiring a timeout before the next one. I've fixed that, although I'm still seeing some slowness in elections that I'm trying to investigate.
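The ordering fix described here can be sketched as follows (invented names, illustrative only): if a replica is being woken up *by* another replica's (Pre)Vote request, it should deliver that message rather than launch a competing campaign, which would make the initial election fail and force a timeout.

```go
package main

import "fmt"

// shouldCampaignOnWake reports whether a freshly woken replica should
// campaign. If the wakeup was triggered by a pending (Pre)Vote message,
// a peer is already campaigning, so we should vote instead of competing.
func shouldCampaignOnWake(pendingMsgs []string) bool {
	for _, m := range pendingMsgs {
		if m == "MsgVote" || m == "MsgPreVote" {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(shouldCampaignOnWake([]string{"MsgPreVote"})) // false: grant the peer's vote
	fmt.Println(shouldCampaignOnWake(nil))                    // true: no competing candidate
}
```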

What is the effect when there is not already a Raft leader? Is it possible that this change could slow down leader elections if multiple replicas attempt to become a leader at the same time when one does not yet exist?

Yes, that's a risk. It's always been the case, although this commit might have increased the chances somewhat. My assumption is that the wakeup events are unlikely to hit multiple replicas simultaneously.
