kvserver: log traces from replicate queue on errors or slow processing #86007
Conversation
Force-pushed from 52f0bea to 29bbace
Force-pushed from 29bbace to 0fb0c80
Will give this a closer look later today, but for now:
pkg/kv/kvserver/replicate_queue.go
Outdated
    if err != nil {
        rec := sp.FinishAndGetConfiguredRecording()
        log.Warningf(ctx, "error processing replica: %v", err)
        if rec != nil {
I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.
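A minimal sketch of the combined form (variable names follow the quoted diff; this is an assumption about the shape, not the committed code):

```go
// Emit the error and the collected recording as a single log entry so that
// other goroutines' log lines cannot be interleaved between them.
if err != nil {
	rec := sp.FinishAndGetConfiguredRecording()
	if rec != nil {
		log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
	} else {
		log.Warningf(ctx, "error processing replica: %v", err)
	}
}
```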
pkg/kv/kvserver/replicate_queue.go
Outdated
    rec := sp.FinishAndGetConfiguredRecording()
    log.Infof(ctx, "processing replica took %s, exceeding threshold of %s", processDuration, loggingThreshold)
    if rec != nil {
        log.Infof(ctx, "trace:\n%s", rec)
Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.
TFTR! Can update these a bit more Monday.
pkg/kv/kvserver/replicate_queue.go
line 632 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.
Good point, thanks!
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.
It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202
pkg/kv/kvserver/replicate_queue.go
line 623 at r1 (raw file):
    scatter, dryRun bool,
) (requeue bool, err error) {
    tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
        tracing.WithRecording(tracingpb.RecordingVerbose))
Discussed with @andreimatei - sounds like recording is ok, but we may want to reduce some portions of the trace to simply RecordingStructured rather than RecordingVerbose, such as everything under the log-range-event child span. Perhaps also the transactions in replica_command.go, such as in execChangeReplicasTxn(..), as the verbosity on these transactions can be very high (while potentially less insightful compared to the allocator or snapshot child spans).
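For illustration, opening such a child span with structured-only recording might look roughly like this (a sketch built from the option names discussed in this thread; the span name and tracer variable are placeholders):

```go
// Record only structured events for the range-event logging work, so the
// verbose messages from its low-level KV operations stay out of the
// replicate queue's trace.
ctx, sp := tracing.EnsureChildSpan(ctx, tracer, "log-range-event",
	tracing.WithRecording(tracingpb.RecordingStructured))
defer sp.Finish()
```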
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202
Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.
For slow traces, can we use the "take snapshot" button to get traces for those slow ops? That way we can see in-flight ops; I think that should be the right way to debug issues such as a slow decommission.
I worry about a couple of things with this PR. Spamming the log is definitely one thing, but maybe more importantly, it's not clear to me that we need a new mechanism for filtering logs, and something specific to the allocator only - we already have vlog, which is good for some use cases, and we have tracing. Can we use one of those two instead of introducing this flag?
Back to spamming: it's not trivial to configure this flag. Say we set it to 5 - is that too much during high load on the cluster? Are we going to have 100 of these traces in the log files? That would log about 50MB (assuming the example you pointed to below is an average one, though maybe it's an extreme, rare one?).
pkg/kv/kvserver/queue.go
line 91 at r1 (raw file):
// provided, otherwise nil can be passed in.
func makeRateLimitedTimeoutFuncBySlowdownMultiplier(
    slowdownMultiplier int64,
I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.
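For reference, the call shape this settled into later in the PR looks roughly like this (quoting a later revision; shown as a plain assignment purely for illustration):

```go
// The caller produces the integer slowdown and passes the rate settings;
// the constructor no longer reads any flag itself.
logTracesThresholdFunc := makeRateLimitedTimeoutFuncByPermittedSlowdown(
	permittedRangeScanSlowdown/2, // caller produces the int
	rebalanceSnapshotRate, recoverySnapshotRate,
)
```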
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.
The new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?
FWIW the size of the gist is 444KB, which is huge IMO.
Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
The new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?
FWIW the size of the gist is 444KB, which is huge IMO.
Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.
I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.
As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.
See https://gist.github.com/AlexTalks/01ccf2d04748ea59cd883190a9cce922 for an updated example trace with both #86436 and #86483 incorporated. This isn't really a new filter on the logs, but is somewhat comparable to the already existing settings (cockroach/pkg/sql/conn_executor_exec.go, line 2323 in 677ef2c).
That said, I can certainly understand that a new setting may not be the right way to go as it might be confusing / overkill - my current thought is that with the more concise trace (in the example above), it might be best to log traces on all errors, and log traces when replica processing takes over 50% of the timeout (without a new setting). This could be either to the …
Force-pushed from 0fb0c80 to 5f21a6e
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
    updateCh               chan time.Time
    lastLeaseTransfer      atomic.Value // read and written by scanner & queue goroutines
    logTracesThresholdFunc queueProcessTimeoutFunc
this needs a comment. But does it really need to be a field, as opposed to directly a method?
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
        tracing.WithRecording(tracingpb.RecordingVerbose))
    processStart := timeutil.Now()
    defer func() {
Why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.
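Roughly the non-defer shape being suggested (a sketch; the argument list is abbreviated):

```go
// Do the timing and trace bookkeeping inline, after the call, rather than
// routing control flow through a deferred closure.
processStart := timeutil.Now()
requeue, err := rq.processOneChange(ctx /* ..., scatter, dryRun */)
processDuration := timeutil.Since(processStart)
// ... decide here whether err or processDuration warrants logging the trace ...
```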
pkg/kv/kvserver/replicate_queue.go
line 762 at r2 (raw file):
    exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
    if err != nil {
        rec := sp.FinishAndGetConfiguredRecording()
Lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().
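A sketch of the suggested restructuring (assumed shape, using the span APIs named in this comment):

```go
ctx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
	tracing.WithRecording(tracingpb.RecordingVerbose))
// Finish the span unconditionally and read the recording outside the error
// branch, so both the error path and the slow-processing path can use it.
defer sp.Finish()

requeue, err := rq.processOneChange(ctx /* ... */)
rec := sp.GetConfiguredRecording()
if err != nil {
	log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
}
```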
pkg/kv/kvserver/replicate_queue_test.go
line 692 at r2 (raw file):
        foundErr = true
    }
    if matched, matchErr := regexp.MatchString(`trace:.*`, logMsg.Msg().StripMarkers()); matchErr == nil && matched {
hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.
Force-pushed from 5f21a6e to 610652a
pkg/kv/kvserver/queue.go
line 91 at r1 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.
OK - removed the flag entirely actually, heh.
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this needs a comment. But does it really need to be a field, as opposed to directly a method?
OK - added comment.
This follows the pattern of the timeout threshold function, which is part of queueConfig (cockroach/pkg/kv/kvserver/queue.go, lines 327 to 328 in 003c036):
    // processTimeout returns the timeout for processing a replica.
    processTimeoutFunc queueProcessTimeoutFunc
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.
OK - fair point, the defer isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same in processOneChange).
pkg/kv/kvserver/replicate_queue.go
line 762 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().
OK - done.
pkg/kv/kvserver/replicate_queue_test.go
line 692 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.
OK - flagged manually enqueued replicas so that the traces won't be rendered and output to logs.
pkg/kv/kvserver/replicate_queue.go
line 632 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Good point, thanks!
Ok, done
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.
As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.
Ok - discussed with @lidorcarmel one-on-one and I believe we are ok with this.
Force-pushed from 610652a to 45a3198
While we introduced logging of traces from replicate queue processing in cockroachdb#86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates a child tracing span within `execChangeReplicasTxn` which will only record structured events if it is called within a parent tracing span with verbose recording enabled. This change decreases the size of the trace of processing a replica via the replicate queue by a factor of 10, allowing the resulting trace to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
OK - added comment.
This follows the pattern of the timeout threshold function, which is part of queueConfig (cockroach/pkg/kv/kvserver/queue.go, lines 327 to 328 in 003c036):
    // processTimeout returns the timeout for processing a replica.
    processTimeoutFunc queueProcessTimeoutFunc
Well, but in the case of the queueConfig, I think we want to be able to customize the function on a per-queue basis (but share the baseQueue code), or something. I don't think that applies here. You need to tell me more than that you're following that pattern.
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
OK - fair point, the
defer
isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same inprocessOneChange
).
FWIW, naming return values is fine. Using the respective variable is more unusual.
pkg/kv/kvserver/replicate_queue.go
line 529 at r4 (raw file):
    purgCh:                 time.NewTicker(replicateQueuePurgatoryCheckInterval).C,
    updateCh:               make(chan time.Time, 1),
    logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),
line way over 100 chars
Code quote:
logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),
pkg/kv/kvserver/replicate_queue.go
line 756 at r4 (raw file):
    scatter, dryRun bool,
) (requeue bool, _ error) {
    tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
Let's get rid of this tCtx. I don't really know what it stands for or exactly why we decide to use ctx again later. Also see below.
pkg/kv/kvserver/replicate_queue.go
line 765 at r4 (raw file):
    // If processing a manually enqueued replica, it is unnecessary to log traces
    // as they will be collected and rendered by the caller.
    if ctx.Value(ManuallyEnqueuedKey{}) != nil {
If this ManuallyEnqueuedKey stays, let's lift checking for it to the caller, and call processOneChange instead of processOneChangeWithTracing when the magic key is set.
But I don't think this kind of context magic is necessary. I think we shouldn't log the trace with the ctx of the trace; I think we should log with context.Background() (perhaps annotated with ctx's logging tags). This way the trace won't be polluted with a recording of itself. Logging a trace recording is an unusual thing, so it deserves to be logged with an unusual ctx.
If you do this, it means that you won't be able to test this functionality by looking at a higher-level trace. As far as I'm concerned, that's a good thing. I think we should test exactly what we want to test - that some message made it to a log channel. I believe there are facilities for that, but I don't know them.
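A sketch of the suggested "unusual ctx" (the logtags calls are an assumption about how the tags would be carried over, not the committed code):

```go
// Log the recording with a fresh context that carries only ctx's logging
// tags, so the log call does not get captured into the very recording it
// is printing.
logCtx := logtags.AddTags(context.Background(), logtags.FromContext(ctx))
log.Warningf(logCtx, "error processing replica: %v, trace:\n%s", err, rec)
```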
pkg/kv/kvserver/replicate_queue_test.go
line 678 at r4 (raw file):
    testutils.RunTrueAndFalse(t, "manual", func(t *testing.T, manual bool) {
        if manual {
            ctx = context.WithValue(ctx, kvserver.ManuallyEnqueuedKey{}, struct{}{})
if this stays, it needs a comment. But hopefully the difference between "manual" and not manual will go away.
pkg/kv/kvserver/replicate_queue_test.go
line 699 at r4 (raw file):
    foundParent = true
    for _, logMsg := range recSpan.Logs {
        if errRegexp.MatchString(logMsg.Msg().StripMarkers()) {
foundErr := errRegexp.MatchString()... and same for foundTrace.
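i.e. roughly (test-local names such as traceRegexp are assumptions):

```go
for _, logMsg := range recSpan.Logs {
	msg := logMsg.Msg().StripMarkers()
	// Assign the match result directly rather than via an if-then-set-true block.
	foundErr = foundErr || errRegexp.MatchString(msg)
	foundTrace = foundTrace || traceRegexp.MatchString(msg)
}
```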
Force-pushed from 45a3198 to 2d2ec9b
Dismissed @andreimatei from 2 discussions.
While we introduced logging of traces from replicate queue processing in cockroachdb#86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates child tracing spans within `execChangeReplicasTxn` which can be filtered out prior to logging the traces in the replicate queue processing loop. This decreases the size of the logged trace by a factor of 10, allowing the resulting log message to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None
bors r=andreimatei
👎 Rejected by code reviews
bors r=andreimatei
Build failed (retrying...)
Build succeeded
85464: ui: add Transaction and Statement fingerprint id r=maryliag a=j82w
This adds the Transaction and Statement fingerprint id to the SQL Activity Statements overview page. The columns are hidden by default. Closes #78509. (Screenshots attached in the original PR description.)
Release justification: Category 2: Bug fixes and low-risk updates to new functionality
Release note (ui change): Adds Transaction and Statement fingerprint id to correlating SQL Activity overview pages. New columns are hidden by default.

86483: kvserver: decrease trace verbosity for change replicas db operations r=AlexTalks a=AlexTalks
While we introduced logging of traces from replicate queue processing in #86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates child tracing spans within `execChangeReplicasTxn` which can be filtered out prior to logging the traces in the replicate queue processing loop. This decreases the size of the logged trace by a factor of 10, allowing the resulting log message to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None

86779: ui: statement insight details page v1 r=ericharmeling a=ericharmeling
This PR adds the Statement Insights Details page to the DB Console. https://www.loom.com/share/e9c7be3eb5da47b784b119b6fc9bbeb9
Release justification: low-risk updates to new functionality
Release note (ui change): Added statement insight details page to DB Console.

86829: sql: include regions information into the sampled query telemetry r=yuzefovich a=yuzefovich
**execstats: remove unused regions field**
This field was never actually populated nor used since it was introduced in 9f716a7.
Release justification: low-risk cleanup.
Release note: None
**sql: include regions information into the sampled query telemetry**
This commit adds the regions (where SQL processors were planned for the query) to the sampled query telemetry. This required a couple of minor changes to derive the regions information stored in the instrumentation helper earlier (before the logging takes place). Fixes: #85427.
Release justification: low-risk improvement.
Release note (sql change): The structured payloads used for telemetry logs now include the new `Regions` field which indicates the regions of the nodes where SQL processing ran for the query.

86951: flowinfra: improve BenchmarkFlowSetup r=yuzefovich a=yuzefovich
This commit improves `BenchmarkFlowSetup` by pulling out parsing of the statement from the hot path as well as more precise usage of the timer (to exclude the construction of the internal planner). This way the benchmark is better focused on its goal - the performance of the planning and execution.
Release justification: test-only change.
Release note: None

86964: rowexec: close all ValueGenerators in the project set processor r=yuzefovich a=yuzefovich
Previously, we forgot to close the value generators when a new input row is read by the project set processor, which could lead to leaking of resources. This is now fixed. Most of the value generators don't need to release any resources, a few need to close their memory account (previously this would result in a sentry report, but no actual leak would occur in production builds); the only concerning one is `crdb_internal.payloads_for_trace` where we would not close the `InternalRows` iterator. But that seems like an internal debugging tool, so most likely the users weren't impacted. Fixes: #85418.
Release justification: bug fix.
Release note: None (It seems like in most cases the users would not see the impact.)

86965: pgcode: use custom error codes for home region errors r=msirek a=rafiss
The XC error class is reserved for Cockroach-specific errors to avoid conflicting with any future error codes that Postgres might add.
Release note: None
Release justification: low risk change to new functionality.

Co-authored-by: j82w <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
kvserver: log replicate queue processing errors via KvDistribution channel
While we introduced logging of replicate queue processing errors, as well as the collected traces, in cockroachdb#86007, we also introduced a new logging channel `KvDistribution` in cockroachdb#85688. This changes the log channel of those errors to the new `KvDistribution` channel, and also modifies the output to only log the traces if verbosity for the module is >=1.
Release justification: Low risk observability change.
Release note: None
kvserver: log replicate queue processing errors via KvDistribution channel
While we introduced logging of replicate queue processing errors, as well as the collected traces, in cockroachdb#86007, we also introduced a new logging channel `KvDistribution` in cockroachdb#85688. This changes the log channel of those errors and traces to the new `KvDistribution` channel.
Release justification: Low risk observability change.
Release note: None
87159: kvserver: log replicate queue processing errors via KvDistribution channel r=AlexTalks a=AlexTalks
While we introduced logging of replicate queue processing errors, as well as the collected traces, in #86007, we also introduced a new logging channel `KvDistribution` in #85688. This changes the log channel of those errors to the new `KvDistribution` channel, and also modifies the output to only log the traces if verbosity for the module is >=1.
Release justification: Low risk observability change.
Release note: None

87256: roachtest: loq_recovery fix incorrect error handling r=aliher1911 a=aliher1911
Previously, if an attempt to set snapshot rates failed, the whole test failed. That is not correct, as we report any unhealthy outcomes as histogram values and succeed the test.
Release justification: roachtest fix
Release note: None
Fixes #85721

87283: allocator: increase log verbosity r=lidorcarmel a=lidorcarmel
Now that we log distribution logs to a separate log file (#85688), we can enable more logging by default (without setting vmodule). This PR enables most log messages in the store rebalancer, and some in the replicate queue. The rationale is that the store rebalancer iterates over 128 replicas every minute and therefore we can allow logging per replica. We roughly expect: 200 bytes per log message x 128 x 10 messages = 250K per minute, or 15MB per hour. The replicate queue tries to avoid per-replica logging (e.g. in functions such as shouldEnqueue); instead it will log when an action such as adding a replica is executed. In a roachtest, running drain and decommission when adding a new node, we see only a few MBs per hour in the kv-distribution log file.
Release justification: low risk, will help debugging.
Release note: None

87286: rowexec: make sure to close generator builtin with error in Start r=yuzefovich a=yuzefovich
Previously, if an error is returned by the `Start` method of the generator builtin, we would never close it, which could lead to leaking of resources. This is now fixed. Fixes: #87248.
Release justification: bug fix.
Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
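A rough sketch of the behavior #87159 describes (the per-channel logger call and the verbosity gate are assumptions based on the description above, not the committed code):

```go
// Route replicate queue processing errors to the KvDistribution channel,
// and only attach the full trace recording when verbosity is >= 1.
if err != nil {
	if log.V(1) {
		log.KvDistribution.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
	} else {
		log.KvDistribution.Warningf(ctx, "error processing replica: %v", err)
	}
}
```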
While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.
Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.
Release justification: Low risk observability change.
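For a concrete sense of the behavior described above, a minimal sketch assembled from the snippets quoted in the review (the exact wiring in the merged code may differ):

```go
// After processing one change, log the collected recording when processing
// failed, or when it took longer than the threshold (50% of the timeout).
processDuration := timeutil.Since(processStart)
exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
rec := sp.GetConfiguredRecording()
if err != nil {
	log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
} else if exceededDuration {
	log.Infof(ctx, "processing replica took %s, exceeding threshold of %s, trace:\n%s",
		processDuration, loggingThreshold, rec)
}
```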