
kvserver: add replication delay and per replicated-to store stats #114181

Status: Draft · wants to merge 1 commit into master
Conversation

sumeerbhola (Collaborator):

proposalQuotaAndDelayTracker is introduced to track stats on a per-range basis at the leader. These stats include:

  • Delay incurred per raft entry in reaching quorum and the replica/store responsible for the largest delay. This is used in tracing.
  • Per replica/store mean delay in persisting raft log entries. This can be easily replaced with a histogram if desired.
  • Per replica/store proposal quota consumed.
  • Per replica/store duration where it was responsible for proposal quota exhaustion.

These stats are aggregated, per replicated-to store (including the local store), across all ranges for which the local store is the leader. The top-k replicated-to stores responsible for some undesired behavior are periodically logged.

The overall objective is to give more visibility into where the slowness is in the replication path, since the slowness may not be at the store evaluating the proposal, and to provide such visibility both for individual requests (via tracing) and aggregated across stores. The aggregations could help quickly pinpoint slow stores or slow paths to stores, which may otherwise be hard to identify in large clusters.

Epic: none

Release note: None
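
As a rough illustration of what the per-store aggregate could look like, here is a minimal sketch; all names are hypothetical and the actual proposalQuotaAndDelayTracker layout may differ (StoreID stands in for roachpb.StoreID):

```go
package kvserver

import (
	"sync"
	"time"
)

type StoreID int32 // stand-in for roachpb.StoreID

// perStoreStats is a hypothetical aggregate kept per replicated-to
// store (including the local store).
type perStoreStats struct {
	persistDelaySum   time.Duration // total raft log persist delay
	persistDelayCount int64         // number of samples, for the mean
	quotaConsumed     int64         // proposal quota consumed
	exhaustionDur     time.Duration // time this store blocked the quota pool
}

// storeAggregator rolls these up across all ranges for which the
// local store is the leader; a periodic task could log the top-k
// stores by mean persist delay or exhaustion duration.
type storeAggregator struct {
	mu     sync.Mutex
	stores map[StoreID]*perStoreStats
}

func (a *storeAggregator) recordPersistDelay(id StoreID, d time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.stores == nil {
		a.stores = map[StoreID]*perStoreStats{}
	}
	s := a.stores[id]
	if s == nil {
		s = &perStoreStats{}
		a.stores[id] = s
	}
	s.persistDelaySum += d
	s.persistDelayCount++
}
```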


sumeerbhola (Collaborator, Author) commented:

@erikgrinaker this is a draft -- I'd like your opinion on the value of the specific stats, the (in)accuracy with which they are being measured here (which is also affected by the integration points, which are possibly flawed), and the abstractions.


erikgrinaker (Contributor) commented:

Thanks for looking into this!

Given the ongoing discussions to remove the quota pool entirely (#106063), it seems unfortunate to tie this so closely to the quota pool. If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?

> Delay incurred per raft entry in reaching quorum and the replica/store responsible for the largest delay. This is used in tracing.

This is generally useful.

Ideally we would want to decompose this into commit latency (which is quorum-dependent) and application latency (which only depends on the local node). I don't know if we can easily map proposals to log entries prior to application, but we could add some time tracking around Replica.stepRaftGroup() which records when the commit index advanced, and which replica/store sent the message that advanced it. During application, we could retrieve and report this timing information (if we have access to the log indices), and compute the application latency separately.

I don't have this stuff paged in currently, so I don't know if this approach is optimal or practical.
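For concreteness, a minimal sketch of that decomposition (hypothetical names; timestamps would be captured at proposal time, when the commit index is observed to advance while stepping messages, and at application):

```go
package kvserver

import "time"

// entryTiming tracks one log entry from proposal to application.
type entryTiming struct {
	proposedAt  time.Time // entry proposed/appended on the leader
	committedAt time.Time // leader's commit index reached this entry
	advancedBy  int32     // StoreID whose MsgAppResp advanced the commit index
}

// Commit latency is quorum-dependent; application latency depends
// only on the local node.
func (t entryTiming) latencies(appliedAt time.Time) (commit, apply time.Duration) {
	return t.committedAt.Sub(t.proposedAt), appliedAt.Sub(t.committedAt)
}
```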

> Per replica/store mean delay in persisting raft log entries. This can be easily replaced with a histogram if desired.

This also seems generally useful. Could we handle this simply by inspecting MsgApp and MsgAppResp with a bit of state tracking?
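For example, a sketch of that state tracking, assuming we can observe outgoing MsgApp and incoming MsgAppResp per follower (all names hypothetical):

```go
package kvserver

import "time"

// msgAppTimes remembers when a MsgApp ending at a given log index
// was sent to each follower, keyed by replica ID then last index.
type msgAppTimes struct {
	sent map[uint64]map[uint64]time.Time
}

func (m *msgAppTimes) onSendMsgApp(replicaID, lastIndex uint64, now time.Time) {
	if m.sent == nil {
		m.sent = map[uint64]map[uint64]time.Time{}
	}
	byIndex := m.sent[replicaID]
	if byIndex == nil {
		byIndex = map[uint64]time.Time{}
		m.sent[replicaID] = byIndex
	}
	byIndex[lastIndex] = now
}

// onMsgAppResp returns the follower's persist delay if the response
// matches a tracked send.
func (m *msgAppTimes) onMsgAppResp(replicaID, index uint64, now time.Time) (time.Duration, bool) {
	if sentAt, ok := m.sent[replicaID][index]; ok {
		delete(m.sent[replicaID], index)
		return now.Sub(sentAt), true
	}
	return 0, false
}
```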

> Per replica/store proposal quota consumed.
> Per replica/store duration where it was responsible for proposal quota exhaustion.

For the quota pool specifically, we already have trace events for delays. If we feel like it's worth adding some targeted observability for backports alone, I think we'd primarily want delay histograms (#83263), and possibly the identity of the last store (like you've added here).

sumeerbhola (Collaborator, Author) commented:

> If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?

That's fine with me. This seemed too large to me to backport, unless we were actually seeing frequent escalations where this would help.

I can easily rip out the quota pool stuff. What this needs is certain integration points, and I'd like feedback on the best places for that integration:

  • when to start tracking the latency of a raft entry: This currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.

  • progress tracking per replica: This is currently happening on each raft.Ready which eventually calls proposalQuotaAndDelayTracker.updateReplicaAtStore with the replica's tracker.Progress.Match. I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted? Presumably we don't want RaftMessageRequest.Message.Commit since that represents what that replica thinks has committed (i.e., quorum has been achieved, which is too late). Or perhaps you were suggesting "inspecting MsgApp and MsgAppResp with a bit of state tracking" -- where should I look in the code?

  • Regarding separating commit latency and application latency: the previous bullet will tell us which replica was the last to persist. It doesn't tell us what was needed to achieve quorum, since we don't know the size of the quorum. My assumption had been that the last replica to persist an entry before application gives us an ok-ish understanding of at least one slow node (if something is too slow and persists after application, we will ignore it, since the call to proposalQuotaAndDelayTracker.lastStoreToProgress happens on application). Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion? If I can intercept that advancement of the commit index, I can include only the last replica that persisted before that advancement, since that was the slowest replica. Also, I can start counting the application delay from there on.

  • proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.


erikgrinaker (Contributor) commented:

> when to start tracking the latency of a raft entry: This currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.

We could also consider starting to track it when we either insert it into the proposal buffer (Insert) or when we flush it (FlushLockedWithRaftGroup), but we don't know the log index at that point. This would allow tracking the time to append to the local log separately, but perhaps the log commit latency is a good enough proxy.
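e.g., a sketch of that alternative (the createdAt field and tracker calls are illustrative, not the actual propBuf API):

```go
// At Insert time the proposal is known but its log index is not.
func (b *propBuf) insert(p *proposal) {
	p.createdAt = timeutil.Now() // hypothetical field, stamped at Insert
	// ... existing insertion logic ...
}

// At flush, once raft assigns the entry's index, tracking can start
// from the earlier timestamp so the local append time is included.
func (b *propBuf) flushed(index uint64, p *proposal) {
	b.tracker.startTracking(index, p.createdAt) // hypothetical tracker
}
```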

> I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted?

Yes, a RaftMessageRequest.Message with Type: MsgAppResp will contain the replica's latest log index in Index. This works both for regular appends and snapshot application.

https://github.com/etcd-io/raft/blob/7c4d93ad08e00f34330d09b16b9a7d13271c885b/raft.go#L1745-L1753

We could for example record this somewhere around here:

switch req.Message.Type {
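i.e., roughly the following (the tracker call is hypothetical; req.Message.Index and req.FromReplica.StoreID are the existing fields):

```go
switch req.Message.Type {
case raftpb.MsgAppResp:
	// req.Message.Index is the follower's latest durable log index;
	// attribute it to the sender's store (hypothetical tracker call).
	tracker.recordDurableIndex(req.FromReplica.StoreID, req.Message.Index, timeutil.Now())
}
```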

> Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion?

Replica.raftBasicStatusRLocked() has the commit index as BasicStatus.HardState.Commit. This is advanced when stepping a MsgAppResp:

https://github.com/etcd-io/raft/blob/7c4d93ad08e00f34330d09b16b9a7d13271c885b/raft.go#L1513-L1517

The naïve approach would be to inspect the basic status commit index before/after stepping an MsgAppResp here:

err := raftGroup.Step(req.Message)

I'm not aware of any better ways to be notified about commit index advancement, but there could be (we could also add an upstream callback). I don't think this should be prohibitively expensive, but it's worth measuring.
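A sketch of that naïve approach, wrapping the Step call above (the tracker call is hypothetical):

```go
before := raftGroup.BasicStatus().HardState.Commit
err := raftGroup.Step(req.Message)
if after := raftGroup.BasicStatus().HardState.Commit; err == nil && after > before {
	// The commit index advanced; credit the store that sent this
	// MsgAppResp and mark entries (before, after] as committed now
	// (hypothetical tracker call).
	tracker.onCommitAdvanced(before, after, req.FromReplica.StoreID, timeutil.Now())
}
```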

> proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.

Yeah, this is called during ApplySideEffects(), which seems like a reasonable place to record application and garbage collect stats.
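e.g., a sketch of that GC step (hypothetical tracker state):

```go
// onApplied records application of all entries up to appliedIndex
// and garbage collects their tracking state, since no further
// events will be reported for them.
func (t *tracker) onApplied(appliedIndex uint64, now time.Time) {
	for idx, et := range t.entries {
		if idx <= appliedIndex {
			_ = et // record commit/apply latencies for idx here
			delete(t.entries, idx)
		}
	}
}
```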


erikgrinaker (Contributor) commented:

PS: use raftGroup.BasicStatus() instead of Replica.raftBasicStatusRLocked(), since we have the raft node handy.

sumeerbhola (Collaborator, Author) commented:

Thanks for the pointers! I'll rework the PR to not concern itself with the quota pool.

