
kvserver: add replication delay and per replicated-to store stats #114181

Status: Draft · wants to merge 1 commit into master
Conversation

sumeerbhola (Collaborator):

proposalQuotaAndDelayTracker is introduced to track stats on a per-range basis at the leader. These stats include:

  • Delay incurred per raft entry in reaching quorum and the replica/store responsible for the largest delay. This is used in tracing.
  • Per replica/store mean delay in persisting raft log entries. This can be easily replaced with a histogram if desired.
  • Per replica/store proposal quota consumed.
  • Per replica/store duration where it was responsible for proposal quota exhaustion.

These stats are aggregated, per replicated-to store (including the local store), across all ranges for which the local store is the leader. The top-k replicated-to stores responsible for some undesired behavior are periodically logged.

The overall objective is to give more visibility into where the slowness is in the replication path, since the slowness may not be at the store evaluating the proposal, and to provide such visibility both for individual requests (via tracing) and aggregated across stores. The aggregations could help quickly pinpoint slow stores or slow paths to stores, which may otherwise be hard to identify in large clusters.

Epic: none

Release note: None
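
As a rough illustration of what the per-store aggregate could look like, here is a minimal sketch; all names are hypothetical and the actual proposalQuotaAndDelayTracker layout may differ (StoreID stands in for roachpb.StoreID):

```go
package kvserver

import (
	"sync"
	"time"
)

type StoreID int32 // stand-in for roachpb.StoreID

// perStoreStats is a hypothetical aggregate kept per replicated-to
// store (including the local store).
type perStoreStats struct {
	persistDelaySum   time.Duration // total raft log persist delay
	persistDelayCount int64         // number of samples, for the mean
	quotaConsumed     int64         // proposal quota consumed
	exhaustionDur     time.Duration // time this store blocked the quota pool
}

// storeAggregator rolls these up across all ranges for which the
// local store is the leader; a periodic task could log the top-k
// stores by mean persist delay or exhaustion duration.
type storeAggregator struct {
	mu     sync.Mutex
	stores map[StoreID]*perStoreStats
}

func (a *storeAggregator) recordPersistDelay(id StoreID, d time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.stores == nil {
		a.stores = map[StoreID]*perStoreStats{}
	}
	s := a.stores[id]
	if s == nil {
		s = &perStoreStats{}
		a.stores[id] = s
	}
	s.persistDelaySum += d
	s.persistDelayCount++
}
```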


sumeerbhola (Collaborator, Author) commented:

@erikgrinaker this is a draft -- I'd like your opinion on the value of the specific stats, the (in)accuracy with which they are being measured here (which is also affected by the integration points, which are possibly flawed), and the abstractions.


erikgrinaker (Contributor) commented:

Thanks for looking into this!

Given the ongoing discussions to remove the quota pool entirely (#106063), it seems unfortunate to tie this so closely to the quota pool. If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?

> Delay incurred per raft entry in reaching quorum and the replica/store responsible for the largest delay. This is used in tracing.

This is generally useful.

Ideally we would want to decompose this into commit latency (which is quorum-dependent) and application latency (which only depends on the local node). I don't know if we can easily map proposals to log entries prior to application, but we could add some time tracking around Replica.stepRaftGroup() which records when the commit index advanced, and which replica/store sent the message that advanced it. During application, we could retrieve and report this timing information (if we have access to the log indices), and compute the application latency separately.

I don't have this stuff paged in currently, so I don't know if this approach is optimal or practical.
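For concreteness, a minimal sketch of that decomposition (hypothetical names; timestamps would be captured at proposal time, when the commit index is observed to advance while stepping messages, and at application):

```go
package kvserver

import "time"

// entryTiming tracks one log entry from proposal to application.
type entryTiming struct {
	proposedAt  time.Time // entry proposed/appended on the leader
	committedAt time.Time // leader's commit index reached this entry
	advancedBy  int32     // StoreID whose MsgAppResp advanced the commit index
}

// Commit latency is quorum-dependent; application latency depends
// only on the local node.
func (t entryTiming) latencies(appliedAt time.Time) (commit, apply time.Duration) {
	return t.committedAt.Sub(t.proposedAt), appliedAt.Sub(t.committedAt)
}
```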

> Per replica/store mean delay in persisting raft log entries. This can be easily replaced with a histogram if desired.

This also seems generally useful. Could we handle this simply by inspecting MsgApp and MsgAppResp with a bit of state tracking?
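For example, a sketch of that state tracking, assuming we can observe outgoing MsgApp and incoming MsgAppResp per follower (all names hypothetical):

```go
package kvserver

import "time"

// msgAppTimes remembers when a MsgApp ending at a given log index
// was sent to each follower, keyed by replica ID then last index.
type msgAppTimes struct {
	sent map[uint64]map[uint64]time.Time
}

func (m *msgAppTimes) onSendMsgApp(replicaID, lastIndex uint64, now time.Time) {
	if m.sent == nil {
		m.sent = map[uint64]map[uint64]time.Time{}
	}
	byIndex := m.sent[replicaID]
	if byIndex == nil {
		byIndex = map[uint64]time.Time{}
		m.sent[replicaID] = byIndex
	}
	byIndex[lastIndex] = now
}

// onMsgAppResp returns the follower's persist delay if the response
// matches a tracked send.
func (m *msgAppTimes) onMsgAppResp(replicaID, index uint64, now time.Time) (time.Duration, bool) {
	if sentAt, ok := m.sent[replicaID][index]; ok {
		delete(m.sent[replicaID], index)
		return now.Sub(sentAt), true
	}
	return 0, false
}
```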

> Per replica/store proposal quota consumed.
> Per replica/store duration where it was responsible for proposal quota exhaustion.

For the quota pool specifically, we already have trace events for delays. If we feel like it's worth adding some targeted observability for backports alone, I think we'd primarily want delay histograms (#83263), and possibly the identity of the last store (like you've added here).

sumeerbhola (Collaborator, Author) commented:

> If we do remove it, I suppose we could still backport this for increased observability, but perhaps we should invest our efforts towards something that would be useful in a world without the quota pool?

That's fine with me. This seemed too large to me to backport, unless we were actually seeing frequent escalations where this would help.

I can easily rip out the quota pool stuff. What this needs is certain integration points, and I'd like feedback on the best places for that integration:

  • when to start tracking the latency of a raft entry: This currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.

  • progress tracking per replica: This is currently happening on each raft.Ready which eventually calls proposalQuotaAndDelayTracker.updateReplicaAtStore with the replica's tracker.Progress.Match. I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted? Presumably we don't want RaftMessageRequest.Message.Commit since that represents what that replica thinks has committed (i.e., quorum has been achieved, which is too late). Or perhaps you were suggesting "inspecting MsgApp and MsgAppResp with a bit of state tracking" -- where should I look in the code?

  • Regarding separating commit latency and application latency: the previous bullet will tell us which replica was the last to persist. It doesn't tell us what was needed to achieve quorum, since we don't know the size of the quorum. My assumption had been that the last replica to persist an entry before application gives us an ok-ish understanding of at least one slow node (if something is too slow and persists after application, we will ignore it, since the call to proposalQuotaAndDelayTracker.lastStoreToProgress happens on application). Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion? If I can intercept that advancement of the commit index, I can include only the last replica that persisted before that advancement, since that was the slowest replica. Also, I can start counting the application delay from there on.

  • proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.


erikgrinaker (Contributor) commented:

> when to start tracking the latency of a raft entry: This currently happens in the call to proposalQuotaAndDelayTracker.raftAppend in handleRaftReadyRaftMuLocked.

We could also consider starting to track it when we either insert it into the proposal buffer (Insert) or when we flush it (FlushLockedWithRaftGroup), but we don't know the log index at that point. This would allow tracking the time to append to the local log separately, but perhaps the log commit latency is a good enough proxy.
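e.g., a sketch of that alternative (the createdAt field and tracker calls are illustrative, not the actual propBuf API):

```go
// At Insert time the proposal is known but its log index is not.
func (b *propBuf) insert(p *proposal) {
	p.createdAt = timeutil.Now() // hypothetical field, stamped at Insert
	// ... existing insertion logic ...
}

// At flush, once raft assigns the entry's index, tracking can start
// from the earlier timestamp so the local append time is included.
func (b *propBuf) flushed(index uint64, p *proposal) {
	b.tracker.startTracking(index, p.createdAt) // hypothetical tracker
}
```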

> I think you are suggesting we should use Replica.stepRaftGroup() instead. Is there a field in RaftMessageRequest that tells us what that particular replica store has persisted?

Yes, a RaftMessageRequest.Message with Type: MsgAppResp will contain the replica's latest log index in Index. This works both for regular appends and snapshot application.

https://github.com/etcd-io/raft/blob/7c4d93ad08e00f34330d09b16b9a7d13271c885b/raft.go#L1745-L1753

We could for example record this somewhere around here:

switch req.Message.Type {
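i.e., roughly the following (the tracker call is hypothetical; req.Message.Index and req.FromReplica.StoreID are the existing fields):

```go
switch req.Message.Type {
case raftpb.MsgAppResp:
	// req.Message.Index is the follower's latest durable log index;
	// attribute it to the sender's store (hypothetical tracker call).
	tracker.recordDurableIndex(req.FromReplica.StoreID, req.Message.Index, timeutil.Now())
}
```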

> Is there a place I can intercept advancement of the leader's commit index, which is what precedes application -- is that the Replica.stepRaftGroup suggestion?

Replica.raftBasicStatusRLocked() has the commit index as BasicStatus.HardState.Commit. This is advanced when stepping a MsgAppResp:

https://github.com/etcd-io/raft/blob/7c4d93ad08e00f34330d09b16b9a7d13271c885b/raft.go#L1513-L1517

The naïve approach would be to inspect the basic status commit index before/after stepping an MsgAppResp here:

err := raftGroup.Step(req.Message)

I'm not aware of any better ways to be notified about commit index advancement, but there could be (we could also add an upstream callback). I don't think this should be prohibitively expensive, but it's worth measuring.
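A sketch of that naïve approach, wrapping the Step call above (the tracker call is hypothetical):

```go
before := raftGroup.BasicStatus().HardState.Commit
err := raftGroup.Step(req.Message)
if after := raftGroup.BasicStatus().HardState.Commit; err == nil && after > before {
	// The commit index advanced; credit the store that sent this
	// MsgAppResp and mark entries (before, after] as committed now
	// (hypothetical tracker call).
	tracker.onCommitAdvanced(before, after, req.FromReplica.StoreID, timeutil.Now())
}
```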

> proposalQuotaAndDelayTracker.lastStoreToProgress is called after application for tracing. If there is no quota pool, I can use this as the signal to garbage collect stats for earlier indices.

Yeah, this is called during ApplySideEffects(), which seems like a reasonable place to record application and garbage collect stats.
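e.g., a sketch of that GC step (hypothetical tracker state):

```go
// onApplied records application of all entries up to appliedIndex
// and garbage collects their tracking state, since no further
// events will be reported for them.
func (t *tracker) onApplied(appliedIndex uint64, now time.Time) {
	for idx, et := range t.entries {
		if idx <= appliedIndex {
			_ = et // record commit/apply latencies for idx here
			delete(t.entries, idx)
		}
	}
}
```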


erikgrinaker (Contributor) commented:

PS: use raftGroup.BasicStatus() instead of Replica.raftBasicStatusRLocked(), since we have the raft node handy.

sumeerbhola (Collaborator, Author) commented:

Thanks for the pointers! I'll rework the PR to not concern itself with the quota pool.

