Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: optimize Raft ticking of quiesced ranges #17609

Closed
petermattis opened this issue Aug 12, 2017 · 5 comments
Closed

perf: optimize Raft ticking of quiesced ranges #17609

petermattis opened this issue Aug 12, 2017 · 5 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Milestone

Comments

@petermattis
Copy link
Collaborator

After creating 1m mostly empty ranges on sky (each node contains ~15k ranges), the nodes are churning through 50% of the cpu on the machine and a significant fraction of that is Raft tick processing. Stats indicate that basically all of the ranges are quiescent.

screen shot 2017-08-11 at 8 20 42 pm

pprof001.svg.zip

Very strange to see the synchronization primitives so prominently. Ditto for sync.Map.

Once Raft pre-vote is enabled, we can get rid of the call to TickQuiesced. Perhaps we can also avoid enqueueing the replica on the Raft scheduler as well. I doubt it is good that ticking shares the same scheduler resource as other Raft processing. Might be worthwhile to have a separate set of goroutines for ticking.

Note the default tick interval is 200ms and we call an election after 15 missed ticks (making the Raft election timeout 3s). Perhaps we should increase the tick interval and reduce the timeout ticks. @bdarnell Can you remind me of the downsides to adjusting these values?

@petermattis petermattis added this to the 1.1 milestone Aug 12, 2017
@petermattis petermattis self-assigned this Aug 12, 2017
@petermattis
Copy link
Collaborator Author

#17612 is certainly exacerbating the problem here: we have 3x as many replicas as expected which means we're spending 3x as much time ticking.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 12, 2017
Add fast-path for ticking quiesced/dormant replicas which avoids going
through the Raft scheduler and avoids grabbing Replica.raftMu which can
be held for significant periods of time.

Fixes cockroachdb#17609
petermattis added a commit to petermattis/cockroach that referenced this issue Aug 13, 2017
Motivated by seeing sync.Map.Load() high on profiles when there are
large numbers of replicas on a node. See cockroachdb#17609.

name               old time/op  new time/op  delta
StoreGetReplica-8  9.12ns ±10%  3.22ns ± 5%  -64.65%  (p=0.000 n=10+8)
@bdarnell
Copy link
Contributor

Note the default tick interval is 200ms and we call an election after 15 missed ticks (making the Raft election timeout 3s). Perhaps we should increase the tick interval and reduce the timeout ticks. @bdarnell Can you remind me of the downsides to adjusting these values?

The specified election timeout is a lower bound; the actual timeout is randomized between N and 2*N ticks. Setting the tick interval to 1s and the election timeout to 3 ticks (for example) would increase the risk of failed elections as two followers would be more likely to call their elections at the same time. There's also another tick-related value: we currently send a heartbeat every 5 ticks and would need to adjust that accordingly too.

Changing the effective heartbeat interval or election timeout (as opposed to changing the number of ticks per heartbeat/election while leaving the overall time the same) is tricky to do without downtime.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 14, 2017
Add fast-path for ticking quiesced/dormant replicas which avoids going
through the Raft scheduler and avoids grabbing Replica.raftMu which can
be held for significant periods of time.

Fixes cockroachdb#17609
petermattis added a commit to petermattis/cockroach that referenced this issue Aug 14, 2017
Add fast-path for ticking quiesced/dormant replicas which avoids going
through the Raft scheduler and avoids grabbing Replica.raftMu which can
be held for significant periods of time.

Fixes cockroachdb#17609
@petermattis
Copy link
Collaborator Author

Still need to investigate adjusting the Raft tick settings.

@petermattis petermattis reopened this Aug 14, 2017
@petermattis
Copy link
Collaborator Author

Once Raft pre-vote lands and we don't have to tick quiesced ranges, another optimization possibility here is to keep track of active replicas (which are usually a small fraction of the total replicas on a store) so that we only loop over the active replicas on each tick cycle. There are some locking challenges in doing this, but they don't seem insurmountable.

@petermattis
Copy link
Collaborator Author

I don't think there is anything left to do here for 1.1.

@petermattis petermattis modified the milestones: 1.2, 1.1 Aug 17, 2017
@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Sep 27, 2017
@petermattis petermattis assigned bdarnell and unassigned petermattis Jan 25, 2018
@bdarnell bdarnell modified the milestones: 2.0, 2.1 Feb 26, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 20, 2018
This means that idle replicas no longer have a per-tick CPU cost,
which is one of the bottlenecks limiting the amount of data we can
handle per store.

Fixes cockroachdb#17609

Release note (performance improvement): Reduced CPU overhead of idle
ranges
@nvanbenschoten nvanbenschoten added the A-kv-replication Relating to Raft, consensus, and coordination. label Apr 24, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 15, 2018
This means that idle replicas no longer have a per-tick CPU cost,
which is one of the bottlenecks limiting the amount of data we can
handle per store.

Fixes cockroachdb#17609

Release note (performance improvement): Reduced CPU overhead of idle
ranges
craig bot pushed a commit that referenced this issue May 21, 2018
24956: storage: Maintain a separate set of unquiesced replicas r=petermattis a=bdarnell

This means that idle replicas no longer have a per-tick CPU cost,
which is one of the bottlenecks limiting the amount of data we can
handle per store.

Fixes #17609

Release note (performance improvement): Reduced CPU overhead of idle
ranges


The first five commits are from #24920; that PR should be merged and tested in isolation first. 

25735: sql: fix null normalization r=RaduBerinde a=RaduBerinde

The normalization rules are happy to convert `NULL::TEXT` to `NULL`.
While both expressions evaluate to `DNull`, the `ResolvedType()` is
different. It seems unsound for normalization to change the type.

This issue is shown by trying to run a query containing
`ARRAY_AGG(NULL::TEXT)` through distsql planning: by the time the
distsql planner looks at it, the `NULL::TEXT` is just `DNull` (with
the `Unknown` type) and the distsql planner cannot find the builtin.

This change fixes the normalization rules by retaining the cast in
this case. In general, any expression that statically evaluates to
NULL gets a cast to the original expression type. The same is done in
the opt execbuilder.

Fixes #25724.

Release note (bug fix): Fixed query errors in some cases involving a
NULL constant that is cast to a specific type.

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@craig craig bot closed this as completed in #24956 May 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

4 participants