kvserver: log traces from replicate queue on errors or slow processing #86007
Conversation
Force-pushed from 52f0bea to 29bbace
Force-pushed from 29bbace to 0fb0c80
Will give this a closer look later today, but for now:
pkg/kv/kvserver/replicate_queue.go
Outdated
    if err != nil {
        rec := sp.FinishAndGetConfiguredRecording()
        log.Warningf(ctx, "error processing replica: %v", err)
        if rec != nil {
I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.
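A minimal sketch of the combined form (variable names follow the quoted diff; this is an assumption about the shape, not the committed code):

```go
// Emit the error and the collected recording as a single log entry so that
// other goroutines' log lines cannot be interleaved between them.
if err != nil {
	rec := sp.FinishAndGetConfiguredRecording()
	if rec != nil {
		log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
	} else {
		log.Warningf(ctx, "error processing replica: %v", err)
	}
}
```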
pkg/kv/kvserver/replicate_queue.go
Outdated
    rec := sp.FinishAndGetConfiguredRecording()
    log.Infof(ctx, "processing replica took %s, exceeding threshold of %s", processDuration, loggingThreshold)
    if rec != nil {
        log.Infof(ctx, "trace:\n%s", rec)
Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.
TFTR! Can update these a bit more Monday.
pkg/kv/kvserver/replicate_queue.go
line 632 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.
Good point, thanks!
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.
It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202
pkg/kv/kvserver/replicate_queue.go
line 623 at r1 (raw file):
    scatter, dryRun bool,
) (requeue bool, err error) {
    tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
        tracing.WithRecording(tracingpb.RecordingVerbose))
Discussed with @andreimatei - sounds like recording is ok, but we may want to reduce some portions of the trace to simply RecordingStructured rather than RecordingVerbose, such as everything under the log-range-event child span. Perhaps also the transactions in replica_command.go, such as in execChangeReplicasTxn(..), as the verbosity on these transactions can be very high (while potentially less insightful compared to the allocator or snapshot child spans).
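For illustration, opening such a child span with structured-only recording might look roughly like this (a sketch built from the option names discussed in this thread; the span name and tracer variable are placeholders):

```go
// Record only structured events for the range-event logging work, so the
// verbose messages from its low-level KV operations stay out of the
// replicate queue's trace.
ctx, sp := tracing.EnsureChildSpan(ctx, tracer, "log-range-event",
	tracing.WithRecording(tracingpb.RecordingStructured))
defer sp.Finish()
```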
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202
Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.
For slow traces, can we use the "take snapshot" button to get traces for those slow ops? That way we can see in-flight ops; I think that should be the right way to debug issues such as a slow decommission.
I worry about a couple of things with this PR. Spamming the log is definitely one thing, but maybe more importantly, it's not clear to me that we need a new mechanism for filtering logs, and something specific to the allocator only - we already have vlog, which is good for some use cases, and we have tracing. Can we use one of those two instead of introducing this flag?
Back to spamming: it's not trivial to configure this flag. Say we set it to 5 - is that too much during high load on the cluster? Are we going to have 100 of these traces in the log files? That would log about 50MB (assuming the example you pointed to below is an average one, though maybe it's an extreme, rare one?).
pkg/kv/kvserver/queue.go
line 91 at r1 (raw file):
// provided, otherwise nil can be passed in.
func makeRateLimitedTimeoutFuncBySlowdownMultiplier(
    slowdownMultiplier int64,
I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.
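For reference, the call shape this settled into later in the PR looks roughly like this (quoting a later revision; shown as a plain assignment purely for illustration):

```go
// The caller produces the integer slowdown and passes the rate settings;
// the constructor no longer reads any flag itself.
logTracesThresholdFunc := makeRateLimitedTimeoutFuncByPermittedSlowdown(
	permittedRangeScanSlowdown/2, // caller produces the int
	rebalanceSnapshotRate, recoverySnapshotRate,
)
```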
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.
The new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?
FWIW the size of the gist is 444KB, which is huge IMO.
Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
The new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?
FWIW the size of the gist is 444KB, which is huge IMO.
Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.
I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.
As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.
See https://gist.github.com/AlexTalks/01ccf2d04748ea59cd883190a9cce922 for an updated example trace with both #86436 and #86483 incorporated. This isn't really a new filter on the logs, but is somewhat comparable to the already existing settings (cockroach/pkg/sql/conn_executor_exec.go, line 2323 in 677ef2c).
That said, I can certainly understand that a new setting may not be the right way to go as it might be confusing / overkill - my current thought is that with the more concise trace (in the example above), it might be best to log traces on all errors, and log traces when replica processing takes over 50% of the timeout (without a new setting). This could be either to the …
Force-pushed from 0fb0c80 to 5f21a6e
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
    updateCh               chan time.Time
    lastLeaseTransfer      atomic.Value // read and written by scanner & queue goroutines
    logTracesThresholdFunc queueProcessTimeoutFunc
this needs a comment. But does it really need to be a field, as opposed to directly a method?
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
        tracing.WithRecording(tracingpb.RecordingVerbose))
    processStart := timeutil.Now()
    defer func() {
Why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.
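Roughly the non-defer shape being suggested (a sketch; the argument list is abbreviated):

```go
// Do the timing and trace bookkeeping inline, after the call, rather than
// routing control flow through a deferred closure.
processStart := timeutil.Now()
requeue, err := rq.processOneChange(ctx /* ..., scatter, dryRun */)
processDuration := timeutil.Since(processStart)
// ... decide here whether err or processDuration warrants logging the trace ...
```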
pkg/kv/kvserver/replicate_queue.go
line 762 at r2 (raw file):
    exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
    if err != nil {
        rec := sp.FinishAndGetConfiguredRecording()
Lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().
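A sketch of the suggested restructuring (assumed shape, using the span APIs named in this comment):

```go
ctx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
	tracing.WithRecording(tracingpb.RecordingVerbose))
// Finish the span unconditionally and read the recording outside the error
// branch, so both the error path and the slow-processing path can use it.
defer sp.Finish()

requeue, err := rq.processOneChange(ctx /* ... */)
rec := sp.GetConfiguredRecording()
if err != nil {
	log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
}
```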
pkg/kv/kvserver/replicate_queue_test.go
line 692 at r2 (raw file):
        foundErr = true
    }
    if matched, matchErr := regexp.MatchString(`trace:.*`, logMsg.Msg().StripMarkers()); matchErr == nil && matched {
hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.
Force-pushed from 5f21a6e to 610652a
pkg/kv/kvserver/queue.go
line 91 at r1 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.
OK - removed the flag entirely actually, heh.
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this needs a comment. But does it really need to be a field, as opposed to directly a method?
OK - added comment.
This follows the pattern of the timeout threshold function, which is part of queueConfig (cockroach/pkg/kv/kvserver/queue.go, lines 327 to 328 in 003c036):
    // processTimeout returns the timeout for processing a replica.
    processTimeoutFunc queueProcessTimeoutFunc
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.
OK - fair point, the defer isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same in processOneChange).
pkg/kv/kvserver/replicate_queue.go
line 762 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().
OK - done.
pkg/kv/kvserver/replicate_queue_test.go
line 692 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.
OK - flagged manually enqueued replicas so that the traces won't be rendered and output to logs.
pkg/kv/kvserver/replicate_queue.go
line 632 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Good point, thanks!
Ok, done
pkg/kv/kvserver/replicate_queue.go
line 639 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.
As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.
Ok - discussed with @lidorcarmel one-on-one and I believe we are ok with this.
Force-pushed from 610652a to 45a3198
While we introduced logging of traces from replicate queue processing in cockroachdb#86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates a child tracing span within `execChangeReplicasTxn` which will only record structured events if it is called within a parent tracing span with verbose recording enabled. This change decreases the size of the trace of processing a replica via the replicate queue by a factor of 10, allowing the resulting trace to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None
pkg/kv/kvserver/replicate_queue.go
line 517 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
OK - added comment.
This follows the pattern of the timeout threshold function, which is part of queueConfig (cockroach/pkg/kv/kvserver/queue.go, lines 327 to 328 in 003c036):
    // processTimeout returns the timeout for processing a replica.
    processTimeoutFunc queueProcessTimeoutFunc
Well, but in the case of the queueConfig, I think we want to be able to customize the function on a per-queue basis (but share the baseQueue code), or something. I don't think that applies here. You need to tell me more than that you're following that pattern.
pkg/kv/kvserver/replicate_queue.go
line 757 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
OK - fair point, the
defer
isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same inprocessOneChange
).
FWIW, naming return values is fine. Using the respective variable is more unusual.
pkg/kv/kvserver/replicate_queue.go
line 529 at r4 (raw file):
    purgCh:                 time.NewTicker(replicateQueuePurgatoryCheckInterval).C,
    updateCh:               make(chan time.Time, 1),
    logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),
line way over 100 chars
Code quote:
logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),
pkg/kv/kvserver/replicate_queue.go
line 756 at r4 (raw file):
    scatter, dryRun bool,
) (requeue bool, _ error) {
    tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
Let's get rid of this tCtx. I don't really know what it stands for or exactly why we decide to use ctx again later. Also see below.
pkg/kv/kvserver/replicate_queue.go
line 765 at r4 (raw file):
    // If processing a manually enqueued replica, it is unnecessary to log traces
    // as they will be collected and rendered by the caller.
    if ctx.Value(ManuallyEnqueuedKey{}) != nil {
If this ManuallyEnqueuedKey stays, let's lift checking for it to the caller, and call processOneChange instead of processOneChangeWithTracing when the magic key is set.
But I don't think this kind of context magic is necessary. I think we shouldn't log the trace with the ctx of the trace; I think we should log with context.Background() (perhaps annotated with ctx's logging tags). This way the trace won't be polluted with a recording of itself. Logging a trace recording is an unusual thing, so it deserves to be logged with an unusual ctx.
If you do this, it means that you won't be able to test this functionality by looking at a higher-level trace. As far as I'm concerned, that's a good thing. I think we should test exactly what we want to test - that some message made it to a log channel. I believe there are facilities for that, but I don't know them.
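A sketch of the suggested "unusual ctx" (the logtags calls are an assumption about how the tags would be carried over, not the committed code):

```go
// Log the recording with a fresh context that carries only ctx's logging
// tags, so the log call does not get captured into the very recording it
// is printing.
logCtx := logtags.AddTags(context.Background(), logtags.FromContext(ctx))
log.Warningf(logCtx, "error processing replica: %v, trace:\n%s", err, rec)
```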
pkg/kv/kvserver/replicate_queue_test.go
line 678 at r4 (raw file):
    testutils.RunTrueAndFalse(t, "manual", func(t *testing.T, manual bool) {
        if manual {
            ctx = context.WithValue(ctx, kvserver.ManuallyEnqueuedKey{}, struct{}{})
if this stays, it needs a comment. But hopefully the difference between "manual" and not manual will go away.
pkg/kv/kvserver/replicate_queue_test.go
line 699 at r4 (raw file):
    foundParent = true
    for _, logMsg := range recSpan.Logs {
        if errRegexp.MatchString(logMsg.Msg().StripMarkers()) {
foundErr := errRegexp.MatchString()... and same for foundTrace.
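i.e. roughly (test-local names such as traceRegexp are assumptions):

```go
for _, logMsg := range recSpan.Logs {
	msg := logMsg.Msg().StripMarkers()
	// Assign the match result directly rather than via an if-then-set-true block.
	foundErr = foundErr || errRegexp.MatchString(msg)
	foundTrace = foundTrace || traceRegexp.MatchString(msg)
}
```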
Force-pushed from 45a3198 to 2d2ec9b
Dismissed @andreimatei from 2 discussions.
While we introduced logging of traces from replicate queue processing in cockroachdb#86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates child tracing spans within `execChangeReplicasTxn` which can be filtered out prior to logging the traces in the replicate queue processing loop. This decreases the size of the logged trace by a factor of 10, allowing the resulting log message to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None
bors r=andreimatei
👎 Rejected by code reviews
bors r=andreimatei
Build failed (retrying...)
Build succeeded
85464: ui: add Transaction and Statement fingerprint id r=maryliag a=j82w
This adds the Transaction and Statement fingerprint id to the SQL Activity Statements overview page. The columns are hidden by default. Closes #78509. (Screenshots attached in the original PR description.)
Release justification: Category 2: Bug fixes and low-risk updates to new functionality
Release note (ui change): Adds Transaction and Statement fingerprint id to correlating SQL Activity overview pages. New columns are hidden by default.

86483: kvserver: decrease trace verbosity for change replicas db operations r=AlexTalks a=AlexTalks
While we introduced logging of traces from replicate queue processing in #86007, it was noted that when collecting traces with verbose recording enabled, a significant fraction of these traces were from low-level database operations, particularly during the transaction run in `execChangeReplicasTxn(..)`. This function performs necessary bookkeeping of range descriptors when replica movement occurs; however, the high verbosity on the low-level database operations within the transaction is not necessary when examining traces from higher-level operations like replicate queue processing. Hence, this change creates child tracing spans within `execChangeReplicasTxn` which can be filtered out prior to logging the traces in the replicate queue processing loop. This decreases the size of the logged trace by a factor of 10, allowing the resulting log message to be much more succinct and focused.
Release justification: Low-risk, high benefit observability change.
Release note: None

86779: ui: statement insight details page v1 r=ericharmeling a=ericharmeling
This PR adds the Statement Insights Details page to the DB Console. https://www.loom.com/share/e9c7be3eb5da47b784b119b6fc9bbeb9
Release justification: low-risk updates to new functionality
Release note (ui change): Added statement insight details page to DB Console.

86829: sql: include regions information into the sampled query telemetry r=yuzefovich a=yuzefovich
**execstats: remove unused regions field**
This field was never actually populated nor used since it was introduced in 9f716a7.
Release justification: low-risk cleanup.
Release note: None
**sql: include regions information into the sampled query telemetry**
This commit adds the regions (where SQL processors were planned for the query) to the sampled query telemetry. This required a couple of minor changes to derive the regions information stored in the instrumentation helper earlier (before the logging takes place). Fixes: #85427.
Release justification: low-risk improvement.
Release note (sql change): The structured payloads used for telemetry logs now include the new `Regions` field which indicates the regions of the nodes where SQL processing ran for the query.

86951: flowinfra: improve BenchmarkFlowSetup r=yuzefovich a=yuzefovich
This commit improves `BenchmarkFlowSetup` by pulling out parsing of the statement from the hot path as well as more precise usage of the timer (to exclude the construction of the internal planner). This way the benchmark is better focused on its goal - the performance of the planning and execution.
Release justification: test-only change.
Release note: None

86964: rowexec: close all ValueGenerators in the project set processor r=yuzefovich a=yuzefovich
Previously, we forgot to close the value generators when a new input row is read by the project set processor, which could lead to leaking of resources. This is now fixed. Most of the value generators don't need to release any resources, a few need to close their memory account (previously this would result in a sentry report, but no actual leak would occur in production builds); the only concerning one is `crdb_internal.payloads_for_trace` where we would not close the `InternalRows` iterator. But that seems like an internal debugging tool, so most likely the users weren't impacted. Fixes: #85418.
Release justification: bug fix.
Release note: None (It seems like in most cases the users would not see the impact.)

86965: pgcode: use custom error codes for home region errors r=msirek a=rafiss
The XC error class is reserved for Cockroach-specific errors to avoid conflicting with any future error codes that Postgres might add.
Release note: None
Release justification: low risk change to new functionality.

Co-authored-by: j82w <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
kvserver: log replicate queue processing errors via KvDistribution channel
While we introduced logging of replicate queue processing errors, as well as the collected traces, in cockroachdb#86007, we also introduced a new logging channel `KvDistribution` in cockroachdb#85688. This changes the log channel of those errors to the new `KvDistribution` channel, and also modifies the output to only log the traces if verbosity for the module is >=1.
Release justification: Low risk observability change.
Release note: None
kvserver: log replicate queue processing errors via KvDistribution channel
While we introduced logging of replicate queue processing errors, as well as the collected traces, in cockroachdb#86007, we also introduced a new logging channel `KvDistribution` in cockroachdb#85688. This changes the log channel of those errors and traces to the new `KvDistribution` channel.
Release justification: Low risk observability change.
Release note: None
87159: kvserver: log replicate queue processing errors via KvDistribution channel r=AlexTalks a=AlexTalks
While we introduced logging of replicate queue processing errors, as well as the collected traces, in #86007, we also introduced a new logging channel `KvDistribution` in #85688. This changes the log channel of those errors to the new `KvDistribution` channel, and also modifies the output to only log the traces if verbosity for the module is >=1.
Release justification: Low risk observability change.
Release note: None

87256: roachtest: loq_recovery fix incorrect error handling r=aliher1911 a=aliher1911
Previously, if an attempt to set snapshot rates failed, the whole test failed. That is not correct, as we report any unhealthy outcomes as histogram values and succeed the test.
Release justification: roachtest fix
Release note: None
Fixes #85721

87283: allocator: increase log verbosity r=lidorcarmel a=lidorcarmel
Now that we log distribution logs to a separate log file (#85688), we can enable more logging by default (without setting vmodule). This PR enables most log messages in the store rebalancer, and some in the replicate queue. The rationale is that the store rebalancer iterates over 128 replicas every minute and therefore we can allow logging per replica. We roughly expect: 200 bytes per log message x 128 x 10 messages = 250K per minute, or 15MB per hour. The replicate queue tries to avoid per-replica logging (e.g. in functions such as shouldEnqueue); instead it will log when an action such as adding a replica is executed. In a roachtest, running drain and decommission when adding a new node, we see only a few MBs per hour in the kv-distribution log file.
Release justification: low risk, will help debugging.
Release note: None

87286: rowexec: make sure to close generator builtin with error in Start r=yuzefovich a=yuzefovich
Previously, if an error is returned by the `Start` method of the generator builtin, we would never close it, which could lead to leaking of resources. This is now fixed. Fixes: #87248.
Release justification: bug fix.
Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
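A rough sketch of the behavior #87159 describes (the per-channel logger call and the verbosity gate are assumptions based on the description above, not the committed code):

```go
// Route replicate queue processing errors to the KvDistribution channel,
// and only attach the full trace recording when verbosity is >= 1.
if err != nil {
	if log.V(1) {
		log.KvDistribution.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
	} else {
		log.KvDistribution.Warningf(ctx, "error processing replica: %v", err)
	}
}
```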
While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.
Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.
Release justification: Low risk observability change.
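For a concrete sense of the behavior described above, a minimal sketch assembled from the snippets quoted in the review (the exact wiring in the merged code may differ):

```go
// After processing one change, log the collected recording when processing
// failed, or when it took longer than the threshold (50% of the timeout).
processDuration := timeutil.Since(processStart)
exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
rec := sp.GetConfiguredRecording()
if err != nil {
	log.Warningf(ctx, "error processing replica: %v, trace:\n%s", err, rec)
} else if exceededDuration {
	log.Infof(ctx, "processing replica took %s, exceeding threshold of %s, trace:\n%s",
		processDuration, loggingThreshold, rec)
}
```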