kvserver: log replicate queue processing errors via KvDistribution channel

In cockroachdb#86007 we introduced logging of replicate queue processing errors, along with
their collected traces, and in cockroachdb#85688 we introduced a new logging channel
`KvDistribution`. This change moves those error logs to the `KvDistribution` channel and
only includes the collected traces when verbosity for the module is >=1.

Release justification: Low risk observability change.
Release note: None
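To make the new behavior concrete, here is a minimal, hypothetical sketch of the pattern this change adopts. The function name, its parameters, and the `fmt.Stringer` recording type are illustrative and not the actual queue code; only `log.ExpensiveLogEnabled`, the `log.KvDistribution` channel, and the message formats come from the diff below.

package kvserver // illustrative placement only

import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// logReplicaProcessingError sketches the gated-trace pattern: the error itself
// is always logged to the KvDistribution channel, while the (verbose) trace
// recording is appended only when verbosity for this file is >= 1.
func logReplicaProcessingError(ctx context.Context, err error, rec fmt.Stringer) {
	var traceOutput string
	if log.ExpensiveLogEnabled(ctx, 1) {
		traceOutput = fmt.Sprintf("\ntrace:\n%s", rec)
	}
	log.KvDistribution.Warningf(ctx, "error processing replica: %v%s", err, traceOutput)
}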
AlexTalks committed Aug 31, 2022
1 parent 6452f34 commit 8b8994c
Showing 2 changed files with 18 additions and 15 deletions.
32 changes: 17 additions & 15 deletions pkg/kv/kvserver/replicate_queue.go
@@ -678,16 +678,16 @@ func (rq *replicateQueue) process(
 		ctx, repl, rq.canTransferLeaseFrom, false /* scatter */, false, /* dryRun */
 	)
 	if isSnapshotError(err) {
-		// If ChangeReplicas failed because the snapshot failed, we log the
-		// error but then return success indicating we should retry the
-		// operation. The most likely causes of the snapshot failing are a
-		// declined reservation or the remote node being unavailable. In either
-		// case we don't want to wait another scanner cycle before reconsidering
-		// the range.
+		// If ChangeReplicas failed because the snapshot failed, we attempt to
+		// retry the operation. The most likely causes of the snapshot failing
+		// are a declined reservation (i.e. snapshot queue too long, or timeout
+		// while waiting in queue) or the remote node being unavailable. In
+		// either case we don't want to wait another scanner cycle before
+		// reconsidering the range.
 		// NB: The reason we are retrying snapshot failures immediately is that
 		// the recipient node will be "blocked" by a snapshot send failure for a
-		// few seconds. By retrying immediately we will choose the "second best"
-		// node to send to.
+		// few seconds. By retrying immediately we will choose another equally
+		// "good" target store chosen by the allocator.
 		// TODO(baptist): This is probably suboptimal behavior. In the case where
 		// there is only one option for a recipient, we will block the entire
 		// replicate queue until we are able to send this through. Also even if
@@ -799,23 +799,25 @@ func (rq *replicateQueue) processOneChangeWithTracing(
 		loggingThreshold := rq.logTracesThresholdFunc(rq.store.cfg.Settings, repl)
 		exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
 
-		loggingNeeded := err != nil || exceededDuration
-		if loggingNeeded {
+		// While errors and slow replica processing should always be logged, only
+		// output collected traces if the verbosity for the module is set to >=1.
+		var traceOutput string
+		traceLoggingNeeded := (err != nil || exceededDuration) && log.ExpensiveLogEnabled(ctx, 1)
+		if traceLoggingNeeded {
 			// If we have tracing spans from execChangeReplicasTxn, filter it from
 			// the recording so that we can render the traces to the log without it,
 			// as the traces from this span (and its children) are highly verbose.
 			rec = filterTracingSpans(sp.GetConfiguredRecording(),
 				replicaChangeTxnGetDescOpName, replicaChangeTxnUpdateDescOpName,
 			)
+			traceOutput = fmt.Sprintf("\ntrace:\n%s", rec)
 		}
 
 		if err != nil {
-			// TODO(sarkesian): Utilize Allocator log channel once available.
-			log.Warningf(ctx, "error processing replica: %v\ntrace:\n%s", err, rec)
+			log.KvDistribution.Warningf(ctx, "error processing replica: %v%s", err, traceOutput)
 		} else if exceededDuration {
-			// TODO(sarkesian): Utilize Allocator log channel once available.
-			log.Infof(ctx, "processing replica took %s, exceeding threshold of %s\ntrace:\n%s",
-				processDuration, loggingThreshold, rec)
+			log.KvDistribution.Infof(ctx, "processing replica took %s, exceeding threshold of %s%s",
+				processDuration, loggingThreshold, traceOutput)
 		}
 	}

1 change: 1 addition & 0 deletions pkg/kv/kvserver/replicate_queue_test.go
@@ -625,6 +625,7 @@ func TestReplicateQueueDecommissioningNonVoters(t *testing.T) {
 func TestReplicateQueueTracingOnError(t *testing.T) {
 	defer leaktest.AfterTest(t)()
 	s := log.ScopeWithoutShowLogs(t)
+	_ = log.SetVModule("replicate_queue=1")
 	defer s.Close(t)
 
 	// NB: This test injects a fake failure during replica rebalancing, and we use
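Since traces are now rendered only when verbosity for the module is >=1, the test above raises the `replicate_queue` vmodule level so its assertions on trace output keep passing. A hedged, standalone illustration of that knob follows; the test name and body are hypothetical, and only `log.SetVModule` and the "replicate_queue=1" spec come from the diff.

package kvserver_test // illustrative placement only

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// TestGatedTraceVerbosity is a hypothetical example: bumping the per-file
// vmodule level makes log.ExpensiveLogEnabled(ctx, 1) return true inside
// replicate_queue.go, so the queue's trace output is emitted again.
func TestGatedTraceVerbosity(t *testing.T) {
	if err := log.SetVModule("replicate_queue=1"); err != nil {
		t.Fatal(err)
	}
	// ... exercise the replicate queue and assert that the logged entries
	// include the "trace:" payload ...
}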
