perf: optimize Raft ticking of quiesced ranges #17609
#17612 is certainly exacerbating the problem here: we have 3x as many replicas as expected, which means we're spending 3x as much time ticking.
Add a fast path for ticking quiesced/dormant replicas that avoids going through the Raft scheduler and avoids grabbing Replica.raftMu, which can be held for significant periods of time. Fixes cockroachdb#17609
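For illustration, here is a minimal sketch of what such a fast path could look like. The `Store`/`Replica` stand-ins and field names are hypothetical, not the actual CockroachDB code; `TickQuiesced` mirrors the etcd/raft helper referenced later in this thread.

```go
package main

import "sync"

// Stand-ins for the real types; all names here are illustrative.
type raftGroup struct{ electionElapsed int }

// TickQuiesced mirrors the etcd/raft helper mentioned in this issue:
// it advances the logical clock without running full tick processing.
func (g *raftGroup) TickQuiesced() { g.electionElapsed++ }

type Replica struct {
	mu struct {
		sync.Mutex
		quiescent bool
		raftGroup *raftGroup
	}
}

type Store struct {
	raftTickQueue chan *Replica // fed to the Raft scheduler's workers
}

// tickReplica sketches the fast path: a quiesced replica is ticked
// directly under the lightweight Replica.mu, skipping the scheduler
// queue whose processing must acquire the long-held Replica.raftMu.
func (s *Store) tickReplica(r *Replica) {
	r.mu.Lock()
	if r.mu.quiescent {
		r.mu.raftGroup.TickQuiesced() // fast path: just bump the clock
		r.mu.Unlock()
		return
	}
	r.mu.Unlock()
	s.raftTickQueue <- r // slow path: full Raft tick via the scheduler
}

func main() {
	s := &Store{raftTickQueue: make(chan *Replica, 1)}
	r := &Replica{}
	r.mu.quiescent = true
	r.mu.raftGroup = &raftGroup{}
	s.tickReplica(r) // takes the fast path
}
```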
Motivated by seeing sync.Map.Load() high on profiles when there are large numbers of replicas on a node. See cockroachdb#17609.

```
name               old time/op  new time/op  delta
StoreGetReplica-8  9.12ns ±10%  3.22ns ± 5%  -64.65%  (p=0.000 n=10+8)
```
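For context, that delta line is `benchstat` output, and the `-8` suffix is `GOMAXPROCS` during the run. A hypothetical benchmark of the same shape (not the actual CockroachDB benchmark) might look like:

```go
package store_test

import (
	"sync"
	"testing"
)

// replicas stands in for the sync.Map that backed Store.GetReplica;
// this is a hypothetical reconstruction for illustration.
var replicas sync.Map

func BenchmarkStoreGetReplica(b *testing.B) {
	const rangeID = int64(42)
	replicas.Store(rangeID, new(int))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if _, ok := replicas.Load(rangeID); !ok {
				b.Error("replica not found")
			}
		}
	})
}
```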
The specified election timeout is a lower bound; the actual timeout is randomized between one and two times the configured value. Changing the effective heartbeat interval or election timeout (as opposed to changing the number of ticks per heartbeat/election while leaving the overall time the same) is tricky to do without downtime.
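As a sketch of why the configured timeout is only a lower bound: etcd/raft draws a fresh randomized timeout in `[electionTicks, 2*electionTicks)` to avoid split votes. The helper below is illustrative, not the library's code.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedElectionTicks mimics etcd/raft's randomization: each node
// waits somewhere between 1x and 2x the configured election ticks
// before calling an election.
func randomizedElectionTicks(electionTicks int) int {
	return electionTicks + rand.Intn(electionTicks)
}

func main() {
	const tickIntervalMs = 200 // default tick interval quoted in this issue
	const electionTicks = 15   // default ticks before calling an election
	for i := 0; i < 3; i++ {
		n := randomizedElectionTicks(electionTicks)
		fmt.Printf("election after %d ticks (%dms)\n", n, n*tickIntervalMs)
	}
}
```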
Still need to investigate adjusting the Raft tick settings.
Once Raft pre-vote lands and we don't have to tick quiesced ranges, another optimization possibility here is to keep track of active replicas (which are usually a small fraction of the total replicas on a store) so that we only loop over the active replicas on each tick cycle. There are some locking challenges in doing this, but they don't seem insurmountable; a sketch follows below.
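A rough sketch of that bookkeeping, with illustrative names (this resembles what #24956 eventually implemented, but is not the actual code):

```go
package main

import "sync"

type RangeID int64

type Replica struct{ id RangeID }

// Store tracks only the replicas that are awake: quiescing removes a
// replica from the set, unquiescing adds it back, and the tick cycle
// walks the set instead of every replica on the store.
type Store struct {
	mu         sync.Mutex
	unquiesced map[RangeID]*Replica
}

func (s *Store) setQuiesced(r *Replica, quiesced bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if quiesced {
		delete(s.unquiesced, r.id)
	} else {
		s.unquiesced[r.id] = r
	}
}

// tick copies the set under the lock and processes the copy outside
// it, so replicas can quiesce/unquiesce concurrently; this is one way
// around the locking challenges mentioned above.
func (s *Store) tick(enqueue func(*Replica)) {
	s.mu.Lock()
	batch := make([]*Replica, 0, len(s.unquiesced))
	for _, r := range s.unquiesced {
		batch = append(batch, r)
	}
	s.mu.Unlock()
	for _, r := range batch {
		enqueue(r) // hand off to the Raft scheduler for a real tick
	}
}

func main() {
	s := &Store{unquiesced: map[RangeID]*Replica{}}
	s.setQuiesced(&Replica{id: 1}, false)
	s.tick(func(r *Replica) { /* tick r */ })
}
```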
I don't think there is anything left to do here for 1.1.
This means that idle replicas no longer have a per-tick CPU cost, which is one of the bottlenecks limiting the amount of data we can handle per store. Fixes cockroachdb#17609 Release note (performance improvement): Reduced CPU overhead of idle ranges
24956: storage: Maintain a separate set of unquiesced replicas r=petermattis a=bdarnell

This means that idle replicas no longer have a per-tick CPU cost, which is one of the bottlenecks limiting the amount of data we can handle per store. Fixes #17609

Release note (performance improvement): Reduced CPU overhead of idle ranges

The first five commits are from #24920; that PR should be merged and tested in isolation first.

25735: sql: fix null normalization r=RaduBerinde a=RaduBerinde

The normalization rules are happy to convert `NULL::TEXT` to `NULL`. While both expressions evaluate to `DNull`, the `ResolvedType()` is different. It seems unsound for normalization to change the type.

This issue is shown by trying to run a query containing `ARRAY_AGG(NULL::TEXT)` through distsql planning: by the time the distsql planner looks at it, the `NULL::TEXT` is just `DNull` (with the `Unknown` type) and the distsql planner cannot find the builtin.

This change fixes the normalization rules by retaining the cast in this case. In general, any expression that statically evaluates to NULL gets a cast to the original expression type. The same is done in the opt execbuilder.

Fixes #25724.

Release note (bug fix): Fixed query errors in some cases involving a NULL constant that is cast to a specific type.

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
After creating 1M mostly empty ranges on sky (each node contains ~15k ranges), the nodes are churning through 50% of the CPU on the machine, and a significant fraction of that is Raft tick processing. Stats indicate that basically all of the ranges are quiescent.

pprof001.svg.zip
Very strange to see the synchronization primitives so prominently. Ditto for sync.Map.

Once Raft pre-vote is enabled, we can get rid of the call to TickQuiesced. Perhaps we can avoid enqueueing the replica on the Raft scheduler as well. I doubt it is good that ticking shares the same scheduler resource as other Raft processing. Might be worthwhile to have a separate set of goroutines for ticking.

Note the default tick interval is 200ms and we call an election after 15 missed ticks (making the Raft election timeout 3s). Perhaps we should increase the tick interval and reduce the timeout ticks. @bdarnell Can you remind me of the downsides to adjusting these values?
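To make the arithmetic and the "separate goroutines for ticking" idea concrete, here is an illustrative sketch; the constants match the defaults quoted above, but the structure and names are assumptions, not the actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted above: a 200ms tick and 15 missed ticks before an
// election, i.e. a 3s Raft election timeout.
const (
	raftTickInterval = 200 * time.Millisecond
	electionTicks    = 15
)

// startTicker sketches a dedicated ticker loop decoupled from the
// Raft scheduler's workers, so heavy Raft processing cannot delay
// tick delivery. Function and parameter names are illustrative.
func startTicker(tickAll func(), stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(raftTickInterval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				tickAll()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	fmt.Println("election timeout:", electionTicks*raftTickInterval) // 3s
	stop := make(chan struct{})
	startTicker(func() { /* tick every replica on the store */ }, stop)
	time.Sleep(time.Second)
	close(stop)
}
```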