kvserver: log traces from replicate queue on errors or slow processing #86007

Merged 1 commit into cockroachdb:master on Aug 26, 2022

Conversation

Contributor

@AlexTalks AlexTalks commented Aug 12, 2022

While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.

Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.

Release justification: Low risk observability change.
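To make the intended behavior concrete, a minimal standalone sketch of the mechanism (the function and variable names here are illustrative stand-ins, not the actual replicate-queue code): time the processing attempt, and on error or once it runs past 50% of the timeout, log the message together with the collected trace.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// processOnce stands in for one replicate-queue processing attempt; it
// returns a rendered trace of what it did plus any error.
func processOnce(ctx context.Context) (trace string, err error) {
	time.Sleep(60 * time.Millisecond) // pretend to do some work
	return "<collected trace spans would be rendered here>", errors.New("example: snapshot failed")
}

func main() {
	const timeout = 100 * time.Millisecond
	loggingThreshold := timeout / 2 // log whenever we pass 50% of the timeout

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	trace, err := processOnce(ctx)
	elapsed := time.Since(start)

	// On error, or when processing ran past the threshold, emit the message
	// and the collected trace as a single log entry.
	switch {
	case err != nil:
		log.Printf("error processing replica: %v\ntrace:\n%s", err, trace)
	case elapsed > loggingThreshold:
		log.Printf("processing replica took %s, exceeding threshold of %s\ntrace:\n%s",
			elapsed, loggingThreshold, trace)
	}
}
```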

@AlexTalks AlexTalks requested a review from a team as a code owner August 12, 2022 03:54
@cockroach-teamcity
Member

This change is Reviewable

Member

@nvanbenschoten nvanbenschoten left a comment

Will give this a closer look later today, but for now:

if err != nil {
	rec := sp.FinishAndGetConfiguredRecording()
	log.Warningf(ctx, "error processing replica: %v", err)
	if rec != nil {

I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.
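For illustration, a sketch of the single-call form being suggested (reusing the names from the quoted snippet; the exact message format is a guess):

```go
// One log entry keeps the error and its trace together, so other log
// statements cannot be interleaved between them.
if err != nil {
	rec := sp.FinishAndGetConfiguredRecording()
	log.Warningf(ctx, "error processing replica: %v\ntrace:\n%s", err, rec)
}
```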

rec := sp.FinishAndGetConfiguredRecording()
log.Infof(ctx, "processing replica took %s, exceeding threshold of %s", processDuration, loggingThreshold)
if rec != nil {
	log.Infof(ctx, "trace:\n%s", rec)

Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.

Contributor Author

@AlexTalks AlexTalks left a comment

TFTR! Can update these a bit more Monday.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 632 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think we want to break log lines across multiple log.Warning or log.Info calls. This will allow them to be interleaved with other log statements and could make the logs confusing.

Good point, thanks!


pkg/kv/kvserver/replicate_queue.go line 639 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give an example of what one of these trace recordings looks like? If we're logging to the main log file then there's a limit to how noisy we can be.

It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202

Contributor Author

@AlexTalks AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @kvoli, @lidorcarmel, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 623 at r1 (raw file):

	scatter, dryRun bool,
) (requeue bool, err error) {
	tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica", tracing.WithRecording(tracingpb.RecordingVerbose))

Discussed with @andreimatei - sounds like recording is ok, but we may want to reduce some portions of the trace to simply RecordingStructured rather than RecordingVerbose, such as everything under the log-range-event child span. Perhaps also the transactions in replica_command.go, such as in execChangeReplicasTxn(..), as the verbosity on these transactions can be very high (while potentially less insightful compared to allocator or snapshot child spans).
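For example, a hedged sketch of what a reduced-verbosity child span could look like (the option names mirror the quoted EnsureChildSpan call; the span name and exactly how execChangeReplicasTxn would wire this up are assumptions):

```go
// Inside execChangeReplicasTxn(..): record only structured events for this
// subtree, so a verbose parent recording is not dominated by the low-level
// txn operations.
ctx, sp := tracing.EnsureChildSpan(ctx, tracer, "change-replicas-txn",
	tracing.WithRecording(tracingpb.RecordingStructured))
defer sp.Finish()
```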


pkg/kv/kvserver/replicate_queue.go line 639 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

It is certainly fairly verbose - if we were to split, perhaps it would be good to have a log for the store queues. See an example here: https://gist.github.com/AlexTalks/56de6fc500ceea2bed622d2f5f3fb202

Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.

Contributor

@lidorcarmel lidorcarmel left a comment

for slow traces, can we use the "take snapshot" button to get traces for those slow ops? this way we can see in-flight ops, and I think that's the right way to debug the kind of issue such as a slow decommission.

I worry about a couple of things with this PR. Spamming the log is definitely one, but maybe more important, it's not clear to me that we need a new mechanism for filtering logs, and something specific to the allocator only. We already have vlog, which is good for some use cases, and we have tracing - can we use one of those 2 instead of introducing this flag?

back to spamming, it's not trivial to configure this flag. Say we set it to 5 - is that too much during high load on the cluster? are we going to have 100 of these traces in the log files? that would log about 50MB (assuming the example you pointed to below is an average one, though maybe it's an extreme, rare one?).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/queue.go line 91 at r1 (raw file):

// provided, otherwise nil can be passed in.
func makeRateLimitedTimeoutFuncBySlowdownMultiplier(
	slowdownMultiplier int64,

I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.


pkg/kv/kvserver/replicate_queue.go line 639 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Also talked with @andreimatei about piggybacking onto @lidorcarmel's new allocator log channel, introduced in #85688.

the new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?

FWIW the size of the gist is 444KB, which is huge IMO.

Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.

Contributor Author

@AlexTalks AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 639 at r1 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

the new channel is not in yet, and I don't think you want to make this PR depend on the new channel, right?

FWIW the size of the gist is 444KB, which is huge IMO.

Worth keeping in mind - we don't get these logs if the op is stuck forever, we only log after it's done.

I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.

As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.

@AlexTalks
Contributor Author

See https://gist.github.com/AlexTalks/01ccf2d04748ea59cd883190a9cce922 for an updated example trace with both #86436 and #86483 incorporated.

This isn't really a new filter on the logs, but is somewhat comparable to the already existing settings sql.trace.txn.enable_threshold and sql.trace.stmt.enable_threshold, which set a time threshold for a SQL transaction (or statement) that, when exceeded, has all recorded traces logged to the DEV channel:

log.Infof(ctx, "%s took %s, exceeding threshold of %s:\n%s", opName, elapsed, threshold, dump)

That said I can certainly understand that a new setting may not be the right way to go as it might be confusing / overkill - my current thought is that with the more concise trace (in the example above), it might be best to log traces on all errors, and log traces when replica processing takes over 50% of the timeout (without a new setting). This could be either to the DEV channel or to the new allocator channel being introduced. Part of the reasoning there is because if a snapshot operation times out, the timeout occurs on the client side of the RPC, and thus we won't get back the remote traces (which include details like how long queueing took and timing breakdowns).

@AlexTalks
Contributor Author

Updated to remove flag and only log traces on replica processing error or exceeding 50% of the timeout duration. See also related PRs #86436 and #86483.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @kvoli, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 517 at r2 (raw file):

	updateCh               chan time.Time
	lastLeaseTransfer      atomic.Value // read and written by scanner & queue goroutines
	logTracesThresholdFunc queueProcessTimeoutFunc

this needs a comment. But does it really need to be a field, as opposed to directly a method?


pkg/kv/kvserver/replicate_queue.go line 757 at r2 (raw file):

		tracing.WithRecording(tracingpb.RecordingVerbose))
	processStart := timeutil.Now()
	defer func() {

why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.
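A rough sketch of the straight-line shape being suggested (a fragment only; argument lists are elided and names follow the surrounding discussion):

```go
tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
	tracing.WithRecording(tracingpb.RecordingVerbose))
processStart := timeutil.Now()

requeue, err := rq.processOneChange(tCtx /* ...original arguments elided... */)

// Straight-line flow: inspect the outcome right after the call instead of
// inside a defer that reads named return values.
processDuration := timeutil.Now().Sub(processStart)
if err != nil || processDuration > loggingThreshold {
	// log the error and/or the collected trace here
}
return requeue, err
```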


pkg/kv/kvserver/replicate_queue.go line 762 at r2 (raw file):

		exceededDuration := loggingThreshold > time.Duration(0) && processDuration > loggingThreshold
		if err != nil {
			rec := sp.FinishAndGetConfiguredRecording()

lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().
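A sketch of that shape (the identifiers are the ones named in this thread; the surrounding code is assumed):

```go
ctx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",
	tracing.WithRecording(tracingpb.RecordingVerbose))
// Finish the span unconditionally via defer...
defer sp.Finish()

// ...process the replica...

// ...then fetch the recording wherever it is needed, on both the error and
// the slow-processing paths, without FinishAndGetConfiguredRecording.
rec := sp.GetConfiguredRecording()
```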


pkg/kv/kvserver/replicate_queue_test.go line 692 at r2 (raw file):

					foundErr = true
				}
				if matched, matchErr := regexp.MatchString(`trace:.*`, logMsg.Msg().StripMarkers()); matchErr == nil && matched {

hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.

@AlexTalks AlexTalks added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Aug 23, 2022
@AlexTalks AlexTalks requested a review from a team August 24, 2022 03:51
@AlexTalks AlexTalks requested a review from a team as a code owner August 24, 2022 03:51
Contributor Author

@AlexTalks AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @kvoli, @lidorcarmel, and @nvanbenschoten)


pkg/kv/kvserver/queue.go line 91 at r1 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

I think it will be cleaner to move the flag out of this func and pass an int; the caller should produce the int.

OK - removed the flag entirely actually, heh.


pkg/kv/kvserver/replicate_queue.go line 517 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this needs a comment. But does it really need to be a field, as opposed to directly a method?

OK - added comment.

This follows the pattern of the timeout threshold function, which is part of queueConfig

// processTimeout returns the timeout for processing a replica.
processTimeoutFunc queueProcessTimeoutFunc


pkg/kv/kvserver/replicate_queue.go line 757 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why put all this in a defer, as opposed to moving it below the processOneChange() call? Flow control through defer is unusual, and the use of named retvals (err) is not ideal.

OK - fair point, the defer isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same in processOneChange).


pkg/kv/kvserver/replicate_queue.go line 762 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

lift this out of the if. You can do defer sp.Finish(), and get the recording with sp.GetConfiguredRecording().

OK - done.


pkg/kv/kvserver/replicate_queue_test.go line 692 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

hmmm so does this mean that now the debug page that runs a range through the queue renders most of the trace twice? I think we should fix this somehow.

OK - flagged manually enqueued replicas so that the traces won't be rendered and output to logs.

Contributor Author

@AlexTalks AlexTalks left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @kvoli, @lidorcarmel, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 632 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Good point, thanks!

Ok, done


pkg/kv/kvserver/replicate_queue.go line 639 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I'm open to either, assuming that the new channel is going to be in 22.2. As for the size - working on reducing that based on some of the suggestions mentioned above.

As for being stuck, I'm not sure this is a significant concern because we have timeouts on both the individual snapshot operations as well as the replicate queue processing as a whole.

Ok - discussed with @lidorcarmel one-on-one and I believe we are ok with this.

AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 24, 2022
While we introduced logging of traces from replicate queue processing
in cockroachdb#86007, it was noted that when collecting traces with verbose recording
enabled, a significant fraction of these traces were from low-level
database operations, particularly during the transaction run in
`execChangeReplicasTxn(..)`. This function performs necessary
bookkeeping of range descriptors when replica movement occurs; however,
the high verbosity on the low-level database operations within the
transaction is not necessary when examining traces from higher-level
operations like replicate queue processing. Hence, this change
creates a child tracing span within `execChangeReplicasTxn` which will
only record structured events if it is called within a parent tracing
span with verbose recording enabled. This change decreases the size of
the trace of processing a replica via the replicate queue by a factor of
10, allowing the resulting trace to be much more succinct and focused.

Release justification: Low-risk, high benefit observability change.
Release note: None
Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @andreimatei, @kvoli, @lidorcarmel, and @nvanbenschoten)


pkg/kv/kvserver/replicate_queue.go line 517 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

OK - added comment.

This follows the pattern of the timeout threshold function, which is part of queueConfig

// processTimeout returns the timeout for processing a replica.
processTimeoutFunc queueProcessTimeoutFunc

well, but in the case of the queueConfig, I think we want to be able to customize the function on a per-queue basis (while sharing the baseQueue code), or something. I don't think that applies here. You need to tell me more than that you're following that pattern.


pkg/kv/kvserver/replicate_queue.go line 757 at r2 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

OK - fair point, the defer isn't getting us anything here. I left the named retval for requeue so the meaning is clear, however (we already do the same in processOneChange).

FWIW, naming return values is fine. Using the respective variable is more unusual.


pkg/kv/kvserver/replicate_queue.go line 529 at r4 (raw file):

		purgCh:                 time.NewTicker(replicateQueuePurgatoryCheckInterval).C,
		updateCh:               make(chan time.Time, 1),
		logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),

line way over 100 chars

Code quote:

		logTracesThresholdFunc: makeRateLimitedTimeoutFuncByPermittedSlowdown(permittedRangeScanSlowdown/2, rebalanceSnapshotRate, recoverySnapshotRate),

pkg/kv/kvserver/replicate_queue.go line 756 at r4 (raw file):

	scatter, dryRun bool,
) (requeue bool, _ error) {
	tCtx, sp := tracing.EnsureChildSpan(ctx, rq.Tracer, "process replica",

Let's get rid of this tCtx. I don't really know what it stands for or exactly why we decide to use ctx again later. Also see below.


pkg/kv/kvserver/replicate_queue.go line 765 at r4 (raw file):

	// If processing a manually enqueued replica, it is unnecessary to log traces
	// as they will be collected and rendered by the caller.
	if ctx.Value(ManuallyEnqueuedKey{}) != nil {

If this ManuallyEnqueuedKey stays, let's lift checking for it to the caller, and call processOneChange instead of processOneChangeWithTracing when the magic key is set.
But I don't think this kind of context magic is necessary. I think we shouldn't log the trace with the ctx of the trace; I think we should log it with context.Background() (perhaps annotated with ctx's logging tags). This way the trace won't be polluted with a recording of itself. Logging a trace recording is an unusual thing, so it deserves to be logged with an unusual ctx.

If you do this, it means that you won't be able to test this functionality by looking at a higher-level trace. As far as I'm concerned, that's a good thing. I think we should test exactly what we want to test - that some message made it to a log channel. I believe there are facilities for that, but I don't know them.
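For illustration, a sketch of the "unusual ctx" idea; the logtags calls are one plausible way to carry over the tags, but treat the exact wiring as an assumption:

```go
// Don't log the recording through the traced ctx, or the log entry itself
// would be captured into the very trace it is printing. Use a fresh context
// that only carries the original logging tags.
logCtx := logtags.AddTags(context.Background(), logtags.FromContext(ctx))
log.Infof(logCtx, "error processing replica: %v\ntrace:\n%s", err, rec)
```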


pkg/kv/kvserver/replicate_queue_test.go line 678 at r4 (raw file):

	testutils.RunTrueAndFalse(t, "manual", func(t *testing.T, manual bool) {
		if manual {
			ctx = context.WithValue(ctx, kvserver.ManuallyEnqueuedKey{}, struct{}{})

if this stays, it needs a comment. But hopefully the difference between "manual" and not manual will go away.


pkg/kv/kvserver/replicate_queue_test.go line 699 at r4 (raw file):

				foundParent = true
				for _, logMsg := range recSpan.Logs {
					if errRegexp.MatchString(logMsg.Msg().StripMarkers()) {

foundErr := errRegexp.MatchString()... and same for foundTrace.
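i.e., something like the following, with traceRegexp standing in for the inline `trace:.*` pattern used above:

```go
for _, logMsg := range recSpan.Logs {
	msg := logMsg.Msg().StripMarkers()
	// Direct boolean assignment instead of the if-then-set pattern.
	foundErr = foundErr || errRegexp.MatchString(msg)
	foundTrace = foundTrace || traceRegexp.MatchString(msg)
}
```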

Contributor Author

@AlexTalks AlexTalks left a comment

Dismissed @andreimatei from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @kvoli)

AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 25, 2022
While we introduced logging of traces from replicate queue processing
in cockroachdb#86007, it was noted that when collecting traces with verbose recording
enabled, a significant fraction of these traces were from low-level
database operations, particularly during the transaction run in
`execChangeReplicasTxn(..)`. This function performs necessary
bookkeeping of range descriptors when replica movement occurs; however,
the high verbosity on the low-level database operations within the
transaction is not necessary when examining traces from higher-level
operations like replicate queue processing. Hence, this change
creates child tracing spans within `execChangeReplicasTxn` which
can be filtered out prior to logging the traces in the replicate queue
processing loop. This decreases the size of the logged trace by a factor
of 10, allowing the resulting log message to be much more succinct and
focused.

Release justification: Low-risk, high benefit observability change.
Release note: None
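As a standalone sketch of the filtering step described above (the span struct below is a hypothetical stand-in, not the real tracingpb types):

```go
// recordedSpan is a hypothetical stand-in for one span in a trace recording.
type recordedSpan struct {
	Operation string
	Rendered  string
}

// filterSpans drops spans whose operation name is in the skip set (e.g. the
// child spans created under execChangeReplicasTxn) before the recording is
// rendered into a log message.
func filterSpans(rec []recordedSpan, skip map[string]bool) []recordedSpan {
	out := make([]recordedSpan, 0, len(rec))
	for _, sp := range rec {
		if !skip[sp.Operation] {
			out = append(out, sp)
		}
	}
	return out
}
```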
Contributor

@andreimatei andreimatei left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @kvoli)

@AlexTalks AlexTalks requested review from lidorcarmel and removed request for lidorcarmel August 26, 2022 00:54
@cockroachdb cockroachdb deleted a comment from craig bot Aug 26, 2022
@cockroachdb cockroachdb deleted a comment from craig bot Aug 26, 2022
@AlexTalks
Contributor Author

bors r=andreimatei

@craig
Contributor

craig bot commented Aug 26, 2022

👎 Rejected by code reviews

@AlexTalks AlexTalks dismissed lidorcarmel’s stale review August 26, 2022 01:06

Discussed via Slack

AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 26, 2022
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 26, 2022
@AlexTalks
Contributor Author

bors r=andreimatei

@craig
Contributor

craig bot commented Aug 26, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Aug 26, 2022

Build succeeded:

@craig craig bot merged commit db1554b into cockroachdb:master Aug 26, 2022
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 26, 2022
craig bot pushed a commit that referenced this pull request Aug 26, 2022
85464: ui: add Transaction and Statement fingerprint id r=maryliag a=j82w

This adds the Transaction and Statement fingerprint id
to the SQL Activity Statements overview page. The columns
are hidden by default.

closes #78509

<img width="1337" alt="Screen Shot 2022-08-03 at 7 14 21 AM" src="https://user-images.githubusercontent.com/8868107/182594974-73bea276-0fb0-4cd4-a106-60e31962b1b9.png">

<img width="1341" alt="Screen Shot 2022-08-03 at 7 21 43 AM" src="https://user-images.githubusercontent.com/8868107/182596064-7a02e7da-3e07-4616-9be9-1a807b87b33d.png">

Release justification: Category 2: Bug fixes and
low-risk updates to new functionality

Release note (ui change): Adds Transaction and Statement fingerprint
id to correlating SQL Activity overview pages. New columns are
hidden by default.

86483: kvserver: decrease trace verbosity for change replicas db operations r=AlexTalks a=AlexTalks

While we introduced logging of traces from replicate queue processing
in #86007, it was noted that when collecting traces with verbose recording
enabled, a significant fraction of these traces were from low-level
database operations, particularly during the transaction run in
`execChangeReplicasTxn(..)`. This function performs necessary
bookkeeping of range descriptors when replica movement occurs; however,
the high verbosity on the low-level database operations within the
transaction is not necessary when examining traces from higher-level
operations like replicate queue processing. Hence, this change
creates child tracing spans within `execChangeReplicasTxn` which
can be filtered out prior to logging the traces in the replicate queue
processing loop. This decreases the size of the logged trace by a factor
of 10, allowing the resulting log message to be much more succinct and
focused.

Release justification: Low-risk, high benefit observability change.
Release note: None

86779: ui: statement insight details page v1 r=ericharmeling a=ericharmeling

This PR adds the Statement Insights Details page to the DB Console.

https://www.loom.com/share/e9c7be3eb5da47b784b119b6fc9bbeb9

Release justification: low-risk updates to new functionality

Release note (ui change): Added statement insight details page to DB Console.

86829: sql: include regions information into the sampled query telemetry r=yuzefovich a=yuzefovich

**execstats: remove unused regions field**

This field was never actually populated nor used since it was introduced
in 9f716a7.

Release justification: low-risk cleanup.

Release note: None

**sql: include regions information into the sampled query telemetry**

This commit adds the regions (where SQL processors were planned for the
query) to the sampled query telemetry. This required a couple of minor
changes to derive the regions information stored in the instrumentation
helper earlier (before the logging takes place).

Fixes: #85427.

Release justification: low-risk improvement.

Release note (sql change): The structured payloads used for telemetry
logs now include the new `Regions` field which indicates the regions of
the nodes where SQL processing ran for the query.

86951: flowinfra: improve BenchmarkFlowSetup r=yuzefovich a=yuzefovich

This commit improves `BenchmarkFlowSetup` by pulling out parsing of the
statement from the hot path as well as more precise usage of the timer
(to exclude the construction of the internal planner). This way the
benchmark is better focused on its goal - the performance of the
planning and execution.

Release justification: test-only change.

Release note: None

86964: rowexec: close all ValueGenerators in the project set processor r=yuzefovich a=yuzefovich

Previously, we forgot to close the value generators when a new input row
is read by the project set processor, which could lead to leaking
resources. This is now fixed. Most of the value generators don't need to
release any resources; a few need to close their memory account
(previously this would result in a sentry report, but no actual leak
would occur in production builds), the only concerning one is
`crdb_internal.payloads_for_trace` where we would not close the
`InternalRows` iterator. But that seems like an internal debugging tool,
so most likely the users weren't impacted.

Fixes: #85418.

Release justification: bug fix.

Release note: None (It seems like in most cases the users would not see
the impact.)

86965: pgcode: use custom error codes for home region errors r=msirek a=rafiss

The XC error class is reserved for Cockroach-specific errors to avoid
conflicting with any future error codes that Postgres might add.

Release note: None

Release justification: low risk change to new functionality.

Co-authored-by: j82w <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 31, 2022
kvserver: log replicate queue processing errors via KvDistribution channel

While we introduced logging of replicate queue processing errors, as well
as the collected traces, in cockroachdb#86007, we also introduced a new logging
channel `KvDistribution` in cockroachdb#85688. This changes the log channel of
those errors to the new `KvDistribution` channel, and also modifies the
output to only log the traces if verbosity for the module is >=1.

Release justification: Low risk observability change.
Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Sep 1, 2022
kvserver: log replicate queue processing errors via KvDistribution channel

While we introduced logging of replicate queue processing errors, as well
as the collected traces, in cockroachdb#86007, we also introduced a new logging
channel `KvDistribution` in cockroachdb#85688. This changes the log channel of
those errors and traces to the new `KvDistribution` channel.

Release justification: Low risk observability change.
Release note: None
craig bot pushed a commit that referenced this pull request Sep 1, 2022
87159: kvserver: log replicate queue processing errors via KvDistribution channel r=AlexTalks a=AlexTalks

While we introduced logging of replicate queue processing errors, as well
as the collected traces, in #86007, we also introduced a new logging
channel `KvDistribution` in #85688. This changes the log channel of
those errors to the new `KvDistribution` channel, and also modifies the
output to only log the traces if verbosity for the module is >=1.

Release justification: Low risk observability change.
Release note: None

87256: roachtest: loq_recovery fix incorrect error handling r=aliher1911 a=aliher1911

Previously, if attempt to set snapshot rates failed, then whole test
fails. It is not correct as we report any unhealthy outcomes as
histogram value and succeed the test.

Release justification: roachtest fix
Release note: None

Fixes #85721

87283: allocator: increase log verbosity r=lidorcarmel a=lidorcarmel

Now that we log distribution logs to a separate log file (#85688), we can
enable more logging by default (without setting vmodule).

This PR enables most log messages in the store rebalancer, and some in the
replicate queue. The rationale is that the store rebalancer iterates over
128 replicas every minute and therefore we can allow logging per replica.
We roughly expect:
200 bytes per log message x 128 x 10 messages = 250K per minute,
or 15MB per hour. The replicate queue tries to avoid per-replica logging
(e.g. in functions such as shouldEnqueue); instead it will log when an
action such as adding a replica is executed.

In a roachtest, running drain and decommission when adding a new node we
see only a few MBs per hour in the kv-distribution log file.

Release justification: low risk, will help debugging.
Release note: None

87286: rowexec: make sure to close generator builtin with error in Start r=yuzefovich a=yuzefovich

Previously, if an error is returned by the `Start` method of the generator
builtin, we would never close it, which could lead to leaking
resources. This is now fixed.

Fixes: #87248.

Release justification: bug fix.

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>