Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: add logging tags for rpc connect events #71243

Merged
merged 5 commits into from
Jan 21, 2022
Merged

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 6, 2021

Epic: CRDB-11517
First commit from #75237.

This ensures that the log messages sent to the DEV channel relating to
RPC low-level activity (e.g. conn failures, heartbeat etc) include the
remote node ID and address and connection class.

This also connects the heartbeat task to the tracing infra properly.

Example before:

[n6] 87  dialing n4: 127.0.0.1:41706 (default)

After:

[n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87  dialing

@knz knz requested review from erikgrinaker and tbg October 6, 2021 20:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -172,7 +172,15 @@ func (r *RemoteClockMonitor) UpdateOffset(
latencyAvg = ewma.NewMovingAverage(avgLatencyMeasurementAge)
r.mu.latenciesNanos[addr] = latencyAvg
}
latencyAvg.Add(float64(roundTripLatency.Nanoseconds()))

// If the roundtrip jumps by 50% beyond the previously recorded average, report it in logs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that the idea of this PR is to have something simple that works, and not to have accurate detection, so please feel free to ignore my comment. But I'm was curious how people do this kind of detection usually. The issue is of course that the quantile you're tracking (avg here) might change to more than 1.5x in more than 1 round, and then you’ll miss it. I found this by searching online: https://en.wikipedia.org/wiki/CUSUM which seems to address this issue by tracking an aggregate deviation, but probably there are more relevant methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have attempted to look into CUSUM. I don't think that method is applicable, becasue we don't have an expected range of values beforehand.

@knz
Copy link
Contributor Author

knz commented Oct 7, 2021

In addition to Alex' comment above, while I was trying this out I found the following:

  • there's noise when the latency increases from zero. The operator arguably doesn't care when the avg latency increases from 1ms to 2ms if they are expecting up to 100ms. (But they would if they were expecting just 1ms.) So we need to have some kind of threshold based on expected inter-node latency.
    Maybe use MaxOffset/2 as threshold?
  • supposing we had some threshold, we'd get noise too when the latency is jittery around the threshold. So we probably want some hysteresis function.

@erikgrinaker
Copy link
Contributor

Maybe use MaxOffset/2 as threshold?

These seem like orthogonal concerns. Presumably someone with atomic clocks and MaxOffset=10ms wouldn't want to get spammed with latency warnings.

supposing we had some threshold, we'd get noise too when the latency is jittery around the threshold. So we probably want some hysteresis function.

Yup.

@tbg tbg removed their request for review October 28, 2021 10:44
@jtsiros jtsiros added A-logging In and around the logging infrastructure. T-server-and-security DB Server & Security labels Dec 7, 2021
@knz
Copy link
Contributor Author

knz commented Dec 8, 2021

@rauchenstein the next step for this PR is to implement a hysteresis (separate trhreshold for downward and upward gradients, computed as e.g. +/- 50% of last 30s mean) and see empirically if it delivers acceptable results.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. Thanks Jay.
One nit and I think this is good to go.

Reviewed 2 of 3 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @shralex)


pkg/rpc/clock_offset.go, line 68 at r7 (raw file):

func (t *resettingMaxTrigger) triggers(value, resetThreshold, triggerThreshold float64) bool {
	if *t {
		// "recently triggered" state

nit: here and below, make sentences and ensure capitals and final periods.


pkg/rpc/clock_offset.go, line 211 at r7 (raw file):

		// If the roundtrip jumps by 50% beyond the previously recorded average, report it in logs.
		// Don't report it again until it falls belowe 40% above the average.

nit: below*


pkg/rpc/clock_offset_test.go, line 230 at r7 (raw file):

func TestResettingMaxTrigger(t *testing.T) {
	defer leaktest.AfterTest(t)()

nice 💯

@rauchenstein rauchenstein force-pushed the 20211006-rpc branch 2 times, most recently from bac49bf to b735fee Compare December 21, 2021 19:50
Copy link
Contributor

@rauchenstein rauchenstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @shralex)


pkg/rpc/clock_offset.go, line 68 at r7 (raw file):

Previously, knz (kena) wrote…

nit: here and below, make sentences and ensure capitals and final periods.

Done.


pkg/rpc/clock_offset.go, line 211 at r7 (raw file):

Previously, knz (kena) wrote…

nit: below*

Done.


pkg/rpc/clock_offset_test.go, line 230 at r7 (raw file):

Previously, knz (kena) wrote…

nice 💯

thanks. ack

@knz
Copy link
Contributor Author

knz commented Dec 22, 2021

you can rebase this on top of #74195 (which should be merged by the time you read this) to remove the first 2 commits from this PR and make it easier to review.

craig bot pushed a commit that referenced this pull request Dec 22, 2021
74195: rpc: minor improvements r=tbg a=knz

Prereq to #71243.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@rauchenstein rauchenstein force-pushed the 20211006-rpc branch 2 times, most recently from b6bb523 to 9019453 Compare December 29, 2021 17:54
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this will be useful.

Please update the commit message and PR description. The title is a bit vague, and is identical to a previous PR/commit with the same title (which confused me at first since I mixed up this PR with the other one). Also, the message alludes to "two changes", but it only makes one change.

if remoteNodeID == 0 {
dialCtx = logtags.AddTag(dialCtx, "rnode?", nil)
} else {
dialCtx = logtags.AddTag(dialCtx, "rnode", remoteNodeID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the standard node formatting here, i.e. n7 rather than 7. We can maybe make this more compact by using e.g. r=n7.

} else {
dialCtx = logtags.AddTag(dialCtx, "rnode", remoteNodeID)
}
dialCtx = logtags.AddTag(dialCtx, "raddr", target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets marked as sensitive and redacted in logs. Is that intentional? I'm not sure if we consider IP addresses to be sensitive, but it would make debugging a bit harder.

@knz
Copy link
Contributor Author

knz commented Jan 4, 2022

Thanks for the review. I'll mark this PR as draft for now, as it completely breaks the RPC heartbeat protocol for a reason I can't figure out just yet.

@knz knz marked this pull request as draft January 4, 2022 14:01
@knz knz changed the title rpc: better logging for network activity rpc: add logging tags for rpc connect events Jan 4, 2022
@knz
Copy link
Contributor Author

knz commented Jan 20, 2022

This is rebased and simplified.
PTAL

@rauchenstein FYI, the problem that we couldn't solve earlier, is that in context_test.go, the call to the runHeartbeat() method must be done using the rpcCtx.masterCtx, because that has a particular cancellation policy necessary for the HB task to terminate when the test terminates too.

knz added a commit to knz/cockroach that referenced this pull request Jan 20, 2022
This commit removes a code path that would tickle an assertion failure
if we were to later fix the context propagation in the RPC heartbeat
method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI
and so we can't compare it in a client interceptor.

Release note: None
@knz knz requested a review from a team as a code owner January 20, 2022 19:16
@knz knz requested a review from andreimatei January 20, 2022 19:17
knz added a commit to knz/cockroach that referenced this pull request Jan 20, 2022
This commit removes a code path that would tickle an assertion failure
if we were to later fix the context propagation in the RPC heartbeat
method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI
and so we can't compare it in a client interceptor.

Release note: None
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @shralex)


pkg/rpc/context.go, line 1149 at r17 (raw file):

// that it respects the same cancellation policy.
func (rpcCtx *Context) grpcDialRaw(
	dialCtx context.Context, target string, remoteNodeID roachpb.NodeID, class ConnectionClass,

you can't use ctx ?


pkg/rpc/context.go, line 1141 at r20 (raw file):

	dialCtx := rpcCtx.masterCtx
	if remoteNodeID == 0 {
		dialCtx = logtags.AddTag(dialCtx, "rnode?", nil)

maybe rnode, '?' ?


pkg/rpc/context.go, line 1302 at r21 (raw file):

		conn.grpcConn, redialChan, conn.dialErr = rpcCtx.grpcDialRaw(dialCtx, target, remoteNodeID, class)
		if conn.dialErr == nil {
			dialCtx = logtags.AddTag(dialCtx, "heartbeat", nil)

make this a new ctx := ... variable. I don't think you want to overwrite dialCtx.

craig bot pushed a commit that referenced this pull request Jan 21, 2022
74318: tracing: add /debug/tracez rendering the active spans  r=andreimatei a=andreimatei

`/debug/tracez` lets users take a snapshot of the active spans registry
and render the new snapshot, or one of the previously taken snapshots.
The Tracer can hold up to 10 snapshots in memory.

It looks like this:
![Screenshot from 2022-01-04 19-03-39](https://user-images.githubusercontent.com/377201/148140272-306658d5-5b9c-4f2a-b59c-28df9c5ed10c.png)


When visualizing a snapshot, the page lets you do a number of things:
1. List all the spans.
2. See the (current) stack trace for each span's goroutine (if the
   goroutine was still running at the time when the snapshot was
   captured). Stack traces can be toggled visible/hidden.
3. Sort the spans by name or start time.
4. Filter the span according to text search. The search works across
   the name and stack trace.
5. Go from a span to the full trace containing that span.   

For the table Javascript providing sorting and filtering, this patch
embeds the library from https://listjs.com/ .

Limitations:
- for now, only the registry of the local node is snapshotted. In the
  fuiture I'll collect info from all nodes.
- for now, the relationships between different spans are not represented
  in any way. I'll work on the ability to go from a span to the whole
  trace that the span is part of.
- for now, tags and structured and unstructured log messages that a span
  might have are not displayed in any way.

At the moment, span creation is not enabled in production by default
(i.e. the Tracer is put in `TracingModeOnDemand` by default, instead of
the required `TracingModeActiveSpansRegistry`). This patch does not change
that, so in order to benefit from /debug/tracez in all its glory, one
has to run with `COCKROACH_REAL_SPANS=1` for now. Not for long, though.

Release note: None

74867: sql: Support CREATE DATABASE WITH OWNER r=Fenil-P a=Fenil-P

fixes #67817

Release note (sql change): Allow users to specify the owner when creating a database. 
			                      Similar to postgresql: CREATE DATABASE name [ [ WITH ] [ OWNER [=] user_name ]



74871: sql: add a tracing tag with the txn ID r=andreimatei a=andreimatei

This patch adds the txn's ID as a tag to the tracing span representing a
SQL txn. I'm creating a UI to explore the current spans, and this ID
will make it easy to navigate between a query/request blocking on a lock
held by some other txn, and the activity of that other txn.

Release note: None

75114: sql: directly specify columns in TableReader r=RaduBerinde a=RaduBerinde

~Note: the first commit is #74922.~

The internal columns of the TableReader (as well as the row fetcher)
are all the columns of the table, with only a subset of values
actually produced. This design decision has been carried over way past
the point where it makes sense (I admit, it's questionable whether it
ever made sense). For one, "all the columns" is ambiguous (does it
contain non-public columns? does it include system columns?) leading
to various flags and inherent fragility. Second, it relies on the
execution engine to figure out (based on the PostProcessSpec) which
columns are actually needed, which the optimizer already figures out
for us now.

This commit changes the TableReader spec and the interface of
row.Fetcher to always produce a given specific set of column IDs. The
diagram for table readers now specifies the columns by name.

The JoinReader, InvertedJoiner, ZigzagJoiner are not changed in this
commit (but they should be cleaned up as well).

Release note: None


75175: colfetcher: fix the bytes read statistic collection r=yuzefovich a=yuzefovich

During 21.2 release we adjusted the `cFetcher` to be `Close`d eagerly
when it is returning the zero-length batch. This was done in order to
release some references in order for the memory to be GCed sooner;
additionally, the `cFetcher` started being used for the index join where
the fetcher is restarted from scratch for every batch of spans, so it
seemed reasonable to close it automatically.

However, that eager closure broke "bytes read" statistic collection
since the `row.KVFetcher` was responsible for providing it, and we were
zeroing it out. This commit fixes this problem by the `cFetcher`
memorizing the number of bytes it has read in `Close`. Some care needs
to be taken to not double-count the bytes read in the index join, so
a couple of helper methods have been introduced.

Additionally this commit applies the same eager-close optimization to
the `cFetcher` when the last batch is returned (which makes it so that
if we've just exhausted all KVs, we close the fetcher - previously, we
would set the zero length on the batch and might never get into
`stateFinished`).

Fixes: #75128.

Release note (bug fix): Previously, CockroachDB could incorrectly report
`KV bytes read` statistic in `EXPLAIN ANALYZE` output. The bug is
present only in 21.2.x versions.

75215: cmd/github-post: fix Pebble metamorphic reproduction command r=jbowens a=jbowens

When posting a github issue for a Pebble metamorphic test failure, include the
correct `-ops` flag.

Discovered because cockroachdb/pebble#1459 contained a
reproduction command that contained too few ops to reproduce the issue.

Release note: none

75228: logictestccl: skip flaky TestCCLLogic/fakedist-metadata/partitioning_enum r=mgartner a=mgartner

Informs #75227

Release note: None

75237: cli,rpc: don't check the active cluster version in the CLI r=andreimatei a=knz

This commit removes a code path that would tickle an assertion failure
if we were to later fix the context propagation in the RPC heartbeat
method (see PR #71243): there's no "active cluster version" in the CLI
and so we can't compare it in a client interceptor.

Release note: None

75254: scripts: add `dev generate --mirror` to `bump-pebble.sh` script r=jbowens a=nicktrav

CI now expects that dependencies are mirrored to cloud storage and will
fail if the TODO for mirroring the repo is left unaddressed in the
`DEPS.bzl` file.

Add a mirroring step to the `bump-pebble.sh` script.

Release note: none

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Fenil Patel <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
knz added 4 commits January 21, 2022 18:32
This ensures that the log messages sent to the DEV channel relating to
RPC low-level activity (e.g. conn failures, heartbeat etc) include the
remote node ID and address and connection class.

Example before:
```
[n6] 87  dialing
```

After:
```
[n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87  dialing
```

Release note: None
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @shralex)


pkg/rpc/context.go, line 1149 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you can't use ctx ?

Done.


pkg/rpc/context.go, line 1141 at r20 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe rnode, '?' ?

Done.


pkg/rpc/context.go, line 1302 at r21 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

make this a new ctx := ... variable. I don't think you want to overwrite dialCtx.

Done.

This removes now-redundant logging output (since the same data is
included in tags) and adds a tag for the heartbeat background task.

For example, before:
```
[n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87  dialing n4: 127.0.0.1:41706 (default)
```

After:
```
[n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87  dialing
```

Release note: None
@knz
Copy link
Contributor Author

knz commented Jan 21, 2022

TFYR!

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Jan 21, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 21, 2022

Build succeeded:

@craig craig bot merged commit 22f45e5 into cockroachdb:master Jan 21, 2022
@knz knz deleted the 20211006-rpc branch January 21, 2022 21:12
gtr pushed a commit to gtr/cockroach that referenced this pull request Jan 24, 2022
This commit removes a code path that would tickle an assertion failure
if we were to later fix the context propagation in the RPC heartbeat
method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI
and so we can't compare it in a client interceptor.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-logging In and around the logging infrastructure. T-server-and-security DB Server & Security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants