perf: optimize Raft ticking of quiesced ranges #17609
#17612 is certainly exacerbating the problem here: we have 3x as many replicas as expected, which means we're spending 3x as much time ticking.
Add a fast path for ticking quiesced/dormant replicas that avoids going through the Raft scheduler and avoids grabbing Replica.raftMu, which can be held for significant periods of time. Fixes cockroachdb#17609
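For illustration, here is a minimal sketch of what such a fast path could look like. The `Store`/`Replica` stand-ins and field names are hypothetical, not the actual CockroachDB code; `TickQuiesced` mirrors the etcd/raft helper referenced later in this thread.

```go
package main

import "sync"

// Stand-ins for the real types; all names here are illustrative.
type raftGroup struct{ electionElapsed int }

// TickQuiesced mirrors the etcd/raft helper mentioned in this issue:
// it advances the logical clock without running full tick processing.
func (g *raftGroup) TickQuiesced() { g.electionElapsed++ }

type Replica struct {
	mu struct {
		sync.Mutex
		quiescent bool
		raftGroup *raftGroup
	}
}

type Store struct {
	raftTickQueue chan *Replica // fed to the Raft scheduler's workers
}

// tickReplica sketches the fast path: a quiesced replica is ticked
// directly under the lightweight Replica.mu, skipping the scheduler
// queue whose processing must acquire the long-held Replica.raftMu.
func (s *Store) tickReplica(r *Replica) {
	r.mu.Lock()
	if r.mu.quiescent {
		r.mu.raftGroup.TickQuiesced() // fast path: just bump the clock
		r.mu.Unlock()
		return
	}
	r.mu.Unlock()
	s.raftTickQueue <- r // slow path: full Raft tick via the scheduler
}

func main() {
	s := &Store{raftTickQueue: make(chan *Replica, 1)}
	r := &Replica{}
	r.mu.quiescent = true
	r.mu.raftGroup = &raftGroup{}
	s.tickReplica(r) // takes the fast path
}
```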
Motivated by seeing sync.Map.Load() high on profiles when there are large numbers of replicas on a node. See cockroachdb#17609.

```
name               old time/op  new time/op  delta
StoreGetReplica-8  9.12ns ±10%  3.22ns ± 5%  -64.65%  (p=0.000 n=10+8)
```
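For context, that delta line is `benchstat` output, and the `-8` suffix is `GOMAXPROCS` during the run. A hypothetical benchmark of the same shape (not the actual CockroachDB benchmark) might look like:

```go
package store_test

import (
	"sync"
	"testing"
)

// replicas stands in for the sync.Map that backed Store.GetReplica;
// this is a hypothetical reconstruction for illustration.
var replicas sync.Map

func BenchmarkStoreGetReplica(b *testing.B) {
	const rangeID = int64(42)
	replicas.Store(rangeID, new(int))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if _, ok := replicas.Load(rangeID); !ok {
				b.Error("replica not found")
			}
		}
	})
}
```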
The specified election timeout is a lower bound; the actual timeout is randomized between one and two times the configured value. Changing the effective heartbeat interval or election timeout (as opposed to changing the number of ticks per heartbeat/election while leaving the overall time the same) is tricky to do without downtime.
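As a sketch of why the configured timeout is only a lower bound: etcd/raft draws a fresh randomized timeout in `[electionTicks, 2*electionTicks)` to avoid split votes. The helper below is illustrative, not the library's code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedElectionTicks mimics etcd/raft's randomization: each node
// waits somewhere between 1x and 2x the configured election ticks
// before calling an election.
func randomizedElectionTicks(electionTicks int) int {
	return electionTicks + rand.Intn(electionTicks)
}

func main() {
	const tickIntervalMs = 200 // default tick interval quoted in this issue
	const electionTicks = 15   // default ticks before calling an election
	for i := 0; i < 3; i++ {
		n := randomizedElectionTicks(electionTicks)
		fmt.Printf("election after %d ticks (%dms)\n", n, n*tickIntervalMs)
	}
}
```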
Still need to investigate adjusting the Raft tick settings.
Once Raft pre-vote lands and we don't have to tick quiesced ranges, another optimization possibility here is to keep track of active replicas (which are usually a small fraction of the total replicas on a store) so that we only loop over the active replicas on each tick cycle. There are some locking challenges in doing this, but they don't seem insurmountable; a sketch follows below.
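A rough sketch of that bookkeeping, with illustrative names (this resembles what #24956 eventually implemented, but is not the actual code):

```go
package main

import "sync"

type RangeID int64

type Replica struct{ id RangeID }

// Store tracks only the replicas that are awake: quiescing removes a
// replica from the set, unquiescing adds it back, and the tick cycle
// walks the set instead of every replica on the store.
type Store struct {
	mu         sync.Mutex
	unquiesced map[RangeID]*Replica
}

func (s *Store) setQuiesced(r *Replica, quiesced bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if quiesced {
		delete(s.unquiesced, r.id)
	} else {
		s.unquiesced[r.id] = r
	}
}

// tick copies the set under the lock and processes the copy outside
// it, so replicas can quiesce/unquiesce concurrently; this is one way
// around the locking challenges mentioned above.
func (s *Store) tick(enqueue func(*Replica)) {
	s.mu.Lock()
	batch := make([]*Replica, 0, len(s.unquiesced))
	for _, r := range s.unquiesced {
		batch = append(batch, r)
	}
	s.mu.Unlock()
	for _, r := range batch {
		enqueue(r) // hand off to the Raft scheduler for a real tick
	}
}

func main() {
	s := &Store{unquiesced: map[RangeID]*Replica{}}
	s.setQuiesced(&Replica{id: 1}, false)
	s.tick(func(r *Replica) { /* tick r */ })
}
```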
I don't think there is anything left to do here for 1.1.
This means that idle replicas no longer have a per-tick CPU cost, which is one of the bottlenecks limiting the amount of data we can handle per store. Fixes cockroachdb#17609 Release note (performance improvement): Reduced CPU overhead of idle ranges
24956: storage: Maintain a separate set of unquiesced replicas r=petermattis a=bdarnell

This means that idle replicas no longer have a per-tick CPU cost, which is one of the bottlenecks limiting the amount of data we can handle per store. Fixes #17609

Release note (performance improvement): Reduced CPU overhead of idle ranges

The first five commits are from #24920; that PR should be merged and tested in isolation first.

25735: sql: fix null normalization r=RaduBerinde a=RaduBerinde

The normalization rules are happy to convert `NULL::TEXT` to `NULL`. While both expressions evaluate to `DNull`, the `ResolvedType()` is different. It seems unsound for normalization to change the type.

This issue is shown by trying to run a query containing `ARRAY_AGG(NULL::TEXT)` through distsql planning: by the time the distsql planner looks at it, the `NULL::TEXT` is just `DNull` (with the `Unknown` type) and the distsql planner cannot find the builtin.

This change fixes the normalization rules by retaining the cast in this case. In general, any expression that statically evaluates to NULL gets a cast to the original expression type. The same is done in the opt execbuilder.

Fixes #25724.

Release note (bug fix): Fixed query errors in some cases involving a NULL constant that is cast to a specific type.

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
After creating 1M mostly empty ranges on sky (each node contains ~15k ranges), the nodes are churning through 50% of the CPU on the machine, and a significant fraction of that is Raft tick processing. Stats indicate that basically all of the ranges are quiescent.

pprof001.svg.zip
Very strange to see the synchronization primitives so prominently. Ditto for sync.Map.

Once Raft pre-vote is enabled, we can get rid of the call to TickQuiesced. Perhaps we can avoid enqueueing the replica on the Raft scheduler as well. I doubt it is good that ticking shares the same scheduler resource as other Raft processing. Might be worthwhile to have a separate set of goroutines for ticking.

Note the default tick interval is 200ms and we call an election after 15 missed ticks (making the Raft election timeout 3s). Perhaps we should increase the tick interval and reduce the timeout ticks. @bdarnell Can you remind me of the downsides to adjusting these values?
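To make the arithmetic and the "separate goroutines for ticking" idea concrete, here is an illustrative sketch; the constants match the defaults quoted above, but the structure and names are assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted above: a 200ms tick and 15 missed ticks before an
// election, i.e. a 3s Raft election timeout.
const (
	raftTickInterval = 200 * time.Millisecond
	electionTicks    = 15
)

// startTicker sketches a dedicated ticker loop decoupled from the
// Raft scheduler's workers, so heavy Raft processing cannot delay
// tick delivery. Function and parameter names are illustrative.
func startTicker(tickAll func(), stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(raftTickInterval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				tickAll()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	fmt.Println("election timeout:", electionTicks*raftTickInterval) // 3s
	stop := make(chan struct{})
	startTicker(func() { /* tick every replica on the store */ }, stop)
	time.Sleep(time.Second)
	close(stop)
}
```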