kvserver: add replication delay and per replicated-to store stats #114181
Conversation
@erikgrinaker this is a draft -- I'd like your opinion on the value of the specific stats, the (in)accuracy with which they are being measured here (which is also affected by the integration points, which are possibly flawed), and the abstractions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
Force-pushed from 110681e to c48400d.
proposalQuotaAndDelayTracker is introduced to track stats on a per range basis at the leader. These stats include:
- Delay incurred per raft entry in reaching quorum and the replica/store responsible for the largest delay. This is used in tracing.
- Per replica/store mean delay in persisting raft log entries. This can be easily replaced with a histogram if desired.
- Per replica/store proposal quota consumed.
- Per replica/store duration where it was responsible for proposal quota exhaustion.
These stats are aggregated across all ranges on a local store, for which the local store is a leader, per store that is replicated to (including the local store). The top-k replicated-to stores responsible for some undesired behavior are periodically logged.
The overall objective here is to give more visibility into where the slowness is in the replication path, since the slowness may not be at the store evaluating the proposal. And to provide such visibility both for individual requests (via tracing) and aggregated across stores. The aggregations could help quickly pinpoint slow stores or slow paths to stores, which may otherwise be harder to figure out in large clusters.
Epic: none
Release note: None
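As an illustration of the aggregation described above, here is a minimal sketch of per-store stats and top-k reporting; the storeStats and topKByExhaustion names and fields are hypothetical, not the PR's actual proposalQuotaAndDelayTracker implementation.

```go
// Hypothetical sketch of per-store replication-delay aggregation and top-k
// reporting; not the PR's actual code.
package main

import (
	"fmt"
	"sort"
	"time"
)

// storeID identifies a replicated-to store.
type storeID int

// storeStats aggregates delay and proposal-quota attribution for one store.
type storeStats struct {
	persistDelaySum   time.Duration // total delay persisting raft log entries
	persistDelayCount int64         // number of entries measured
	quotaConsumed     int64         // proposal quota bytes attributed to this store
	exhaustionDur     time.Duration // time this store was the quota-exhaustion culprit
}

// meanPersistDelay returns the mean per-entry persistence delay.
func (s storeStats) meanPersistDelay() time.Duration {
	if s.persistDelayCount == 0 {
		return 0
	}
	return s.persistDelaySum / time.Duration(s.persistDelayCount)
}

// topKByExhaustion returns the k stores responsible for the most
// quota-exhaustion time, the kind of thing that could be periodically logged.
func topKByExhaustion(m map[storeID]storeStats, k int) []storeID {
	ids := make([]storeID, 0, len(m))
	for id := range m {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		return m[ids[i]].exhaustionDur > m[ids[j]].exhaustionDur
	})
	if len(ids) > k {
		ids = ids[:k]
	}
	return ids
}

func main() {
	stats := map[storeID]storeStats{
		1: {persistDelaySum: 50 * time.Millisecond, persistDelayCount: 10, exhaustionDur: 2 * time.Second},
		2: {persistDelaySum: 400 * time.Millisecond, persistDelayCount: 10, exhaustionDur: 30 * time.Second},
	}
	for _, id := range topKByExhaustion(stats, 1) {
		fmt.Printf("s%d: mean persist delay %s, exhaustion %s\n", id, stats[id].meanPersistDelay(), stats[id].exhaustionDur)
	}
}
```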
Force-pushed from c48400d to 09bd852.
Thanks for looking into this! Given the ongoing discussions to remove the quota pool entirely (#106063), it seems unfortunate to tie this so closely to the quota pool. If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?
This is generally useful. Ideally we would want to decompose this into commit latency (which is quorum-dependent) and application latency (which only depends on the local node). I don't know if we can easily map proposals to log entries prior to application, but we could add some time tracking around that. I don't have this stuff paged in currently, so I don't know if this approach is optimal or practical.
This also seems generally useful. Could we handle this simply by inspecting
For the quota pool specifically, we already have trace events for delays. If we feel like it's worth adding some targeted observability for backports alone, I think we'd primarily want delay histograms (#83263), and possibly the identity of the last store (like you've added here).
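To make the suggested commit vs. application latency decomposition concrete, here is a minimal sketch assuming we can timestamp a proposal at append, commit, and apply; the proposalTimeline type is illustrative, not CockroachDB code.

```go
// Hypothetical sketch of the commit vs. application latency decomposition:
// commitLatency is quorum-dependent, applyLatency depends only on the local
// node. Names are illustrative, not CockroachDB code.
package proposaltiming

import "time"

// proposalTimeline records when a single proposal passed each phase at the
// leader.
type proposalTimeline struct {
	proposed  time.Time // proposed / appended to the leader's local log
	committed time.Time // leader's commit index advanced past the entry
	applied   time.Time // entry applied to the local state machine
}

// commitLatency is the quorum-dependent portion of the end-to-end latency.
func (t proposalTimeline) commitLatency() time.Duration { return t.committed.Sub(t.proposed) }

// applyLatency is the purely local portion, from commit to application.
func (t proposalTimeline) applyLatency() time.Duration { return t.applied.Sub(t.committed) }
```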
If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?
That's fine with me. This seemed too large to me to backport, unless we were actually seeing frequent escalations where this would help.
I can easily rip out the quota pool stuff. What this needs is certain integration points, and I'd like to get feedback on what are the best places for that integration:
- When to start tracking the latency of a raft entry: this currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.
- Progress tracking per replica: this currently happens on each raft.Ready, which eventually calls proposalQuotaAndDelayTracker.updateReplicaAtStore with the replica's tracker.Progress.Match. I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted? Presumably we don't want RaftMessageRequest.Message.Commit, since that represents what that replica thinks has committed (i.e., quorum has been achieved, which is too late). Or perhaps you were suggesting "inspecting MsgApp and MsgAppResp with a bit of state tracking" -- where should I look in the code? (See the sketch after this list for the kind of per-replica tracking in question.)
- Regarding separating commit latency and application latency: the previous bullet will tell us which replica was the last to persist. It doesn't tell us what was needed to achieve quorum, since we don't know the size of the quorum. My assumption had been that the last replica to persist an entry before application happens gives us an ok-ish understanding of at least one slow node (if something is too slow and persists after application, we will ignore it, since the call to proposalQuotaAndDelayTracker.lastStoreToProgress happens on application). Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion? If I can intercept that advancement of the commit index, I can include only the last replica that persisted before that advancement, since that was the slowest replica. Also, I can start counting the application delay from there on.
- proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.
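The sketch referenced in the second bullet: a hypothetical per-range tracker that records when each entry started replicating and attributes persistence delay per store. entryTracker and its methods are made up for illustration and are not the PR's code.

```go
// Rough sketch (hypothetical types, not the PR's code) of per-range tracking,
// at the leader, of when each store persisted each raft log entry, so delay
// can be attributed per store.
package replicadelay

import "time"

type storeID int

type entryTracker struct {
	// startTime[idx] is when the entry at raft log index idx began replicating.
	startTime map[uint64]time.Time
	// persisted[s] is the highest log index store s is known to have durably
	// written (derived from tracker.Progress.Match or MsgAppResp, say).
	persisted map[storeID]uint64
	// persistDelay[s] accumulates the replication-to-persistence delay observed
	// for store s across entries.
	persistDelay map[storeID]time.Duration
}

func newEntryTracker() *entryTracker {
	return &entryTracker{
		startTime:    map[uint64]time.Time{},
		persisted:    map[storeID]uint64{},
		persistDelay: map[storeID]time.Duration{},
	}
}

// onAppend records that the entry at idx started replicating at now.
func (t *entryTracker) onAppend(idx uint64, now time.Time) {
	t.startTime[idx] = now
}

// onStoreProgress records that store s has persisted up to idx and attributes
// the elapsed delay of every newly covered entry to s.
func (t *entryTracker) onStoreProgress(s storeID, idx uint64, now time.Time) {
	prev := t.persisted[s]
	if idx <= prev {
		return
	}
	t.persisted[s] = idx
	for i := prev + 1; i <= idx; i++ {
		if start, ok := t.startTime[i]; ok {
			t.persistDelay[s] += now.Sub(start)
		}
	}
}
```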
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
when to start tracking the latency of a raft entry: This currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.
We could also consider starting to track it when we either insert it into the proposal buffer (Insert), or when we flush it (FlushLockedWithRaftGroup), but we don't know the log index at this point. This would allow tracking the time to append to the local log separately, but perhaps the log commit latency is a good enough proxy.
I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted?
Yes, a RaftMessageRequest.Message with Type: MsgAppResp will contain the replica's latest log index in Index. This works both for regular appends and snapshot application.
We could for example record this somewhere around here:
cockroach/pkg/kv/kvserver/replica_raft.go, line 601 in 09bd852:
switch req.Message.Type {
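A hedged sketch of what recording MsgAppResp indexes at that point could look like; the message struct only mirrors the relevant raftpb.Message fields, and the tracker and onIncomingMessage hook are hypothetical, not the PR's code.

```go
// Hypothetical sketch of recording per-replica persisted log indexes from
// incoming MsgAppResp messages; not the PR's code.
package persistedindex

import "time"

type messageType int

const (
	msgApp messageType = iota
	msgAppResp
)

// message stands in for the subset of raftpb.Message used here.
type message struct {
	Type   messageType
	From   uint64 // replica ID of the sender
	Index  uint64 // for a successful MsgAppResp: last log index the sender has appended
	Reject bool   // true if the append was rejected
}

// tracker records, per replica, the highest log index known to be durable on
// that replica, and when we learned it.
type tracker struct {
	persisted map[uint64]uint64
	updatedAt map[uint64]time.Time
}

func newTracker() *tracker {
	return &tracker{persisted: map[uint64]uint64{}, updatedAt: map[uint64]time.Time{}}
}

// onIncomingMessage would be invoked where the leader steps incoming raft
// messages (e.g. near the switch on req.Message.Type referenced above).
func (t *tracker) onIncomingMessage(m message, now time.Time) {
	if m.Type != msgAppResp || m.Reject {
		return
	}
	if m.Index > t.persisted[m.From] {
		t.persisted[m.From] = m.Index
		t.updatedAt[m.From] = now
	}
}
```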
Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion?
Replica.raftBasicStatusRLocked() has the commit index as BasicStatus.HardState.Commit. This is advanced when stepping a MsgAppResp.
The naïve approach would be to inspect the basic status commit index before/after stepping an MsgAppResp here:
cockroach/pkg/kv/kvserver/replica_raft.go, line 621 in 09bd852:
err := raftGroup.Step(req.Message)
I'm not aware of any better ways to be notified about commit index advancement, but there could be (we could also add an upstream callback). I don't think this should be prohibitively expensive, but it's worth measuring.
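A sketch of that naïve before/after check, assuming a thin interface over the raft group; the raftGroup interface and the onCommitAdvanced callback are hypothetical stand-ins for the RawNode BasicStatus/Step API, not actual CockroachDB or raft code.

```go
// Hypothetical sketch of detecting commit-index advancement by comparing the
// commit index before and after stepping a message.
package commitwatch

// raftGroup is the subset of the raft API assumed here.
type raftGroup interface {
	// CommitIndex returns the current commit index, e.g. via
	// BasicStatus().HardState.Commit.
	CommitIndex() uint64
	// Step feeds an incoming message (such as a MsgAppResp) to the raft group.
	Step(msg any) error
}

// stepAndWatchCommit steps msg and invokes onCommitAdvanced if the commit
// index moved forward as a result.
func stepAndWatchCommit(rg raftGroup, msg any, onCommitAdvanced func(prev, cur uint64)) error {
	before := rg.CommitIndex()
	if err := rg.Step(msg); err != nil {
		return err
	}
	if after := rg.CommitIndex(); after > before {
		onCommitAdvanced(before, after)
	}
	return nil
}
```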
proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.
Yeah, this is called during ApplySideEffects(), which seems like a reasonable place to record application and garbage collect stats.
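A small sketch of the garbage-collection half, assuming per-index replication start times are the only state kept; the gcApplied helper is hypothetical, not the PR's code.

```go
// Hypothetical sketch of garbage collecting per-entry tracking state once
// application has been recorded (e.g. from the ApplySideEffects path).
package applygc

import "time"

// gcApplied drops per-index tracking state (here, replication start times) up
// to and including the applied index, after which no further delay
// attribution is needed for those entries.
func gcApplied(startTime map[uint64]time.Time, appliedIndex uint64) {
	for idx := range startTime {
		if idx <= appliedIndex {
			delete(startTime, idx)
		}
	}
}
```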
Reviewed 1 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
PS: use
Thanks for the pointers! I'll rework the PR to not concern itself with the quota pool.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)