-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: add logging tags for rpc connect events #71243
Conversation
pkg/rpc/clock_offset.go
Outdated
@@ -172,7 +172,15 @@ func (r *RemoteClockMonitor) UpdateOffset( | |||
latencyAvg = ewma.NewMovingAverage(avgLatencyMeasurementAge) | |||
r.mu.latenciesNanos[addr] = latencyAvg | |||
} | |||
latencyAvg.Add(float64(roundTripLatency.Nanoseconds())) | |||
|
|||
// If the roundtrip jumps by 50% beyond the previously recorded average, report it in logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that the idea of this PR is to have something simple that works, and not to have accurate detection, so please feel free to ignore my comment. But I'm was curious how people do this kind of detection usually. The issue is of course that the quantile you're tracking (avg here) might change to more than 1.5x in more than 1 round, and then you’ll miss it. I found this by searching online: https://en.wikipedia.org/wiki/CUSUM which seems to address this issue by tracking an aggregate deviation, but probably there are more relevant methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have attempted to look into CUSUM. I don't think that method is applicable, becasue we don't have an expected range of values beforehand.
In addition to Alex' comment above, while I was trying this out I found the following:
|
These seem like orthogonal concerns. Presumably someone with atomic clocks and MaxOffset=10ms wouldn't want to get spammed with latency warnings.
Yup. |
@rauchenstein the next step for this PR is to implement a hysteresis (separate trhreshold for downward and upward gradients, computed as e.g. +/- 50% of last 30s mean) and see empirically if it delivers acceptable results. |
e14971f
to
3547e8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. Thanks Jay.
One nit and I think this is good to go.
Reviewed 2 of 3 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @shralex)
pkg/rpc/clock_offset.go, line 68 at r7 (raw file):
func (t *resettingMaxTrigger) triggers(value, resetThreshold, triggerThreshold float64) bool { if *t { // "recently triggered" state
nit: here and below, make sentences and ensure capitals and final periods.
pkg/rpc/clock_offset.go, line 211 at r7 (raw file):
// If the roundtrip jumps by 50% beyond the previously recorded average, report it in logs. // Don't report it again until it falls belowe 40% above the average.
nit: below*
pkg/rpc/clock_offset_test.go, line 230 at r7 (raw file):
func TestResettingMaxTrigger(t *testing.T) { defer leaktest.AfterTest(t)()
nice 💯
bac49bf
to
b735fee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @shralex)
pkg/rpc/clock_offset.go, line 68 at r7 (raw file):
Previously, knz (kena) wrote…
nit: here and below, make sentences and ensure capitals and final periods.
Done.
pkg/rpc/clock_offset.go, line 211 at r7 (raw file):
Previously, knz (kena) wrote…
nit: below*
Done.
pkg/rpc/clock_offset_test.go, line 230 at r7 (raw file):
Previously, knz (kena) wrote…
nice 💯
thanks. ack
b735fee
to
a8538e2
Compare
you can rebase this on top of #74195 (which should be merged by the time you read this) to remove the first 2 commits from this PR and make it easier to review. |
74195: rpc: minor improvements r=tbg a=knz Prereq to #71243. Co-authored-by: Raphael 'kena' Poss <[email protected]>
b6bb523
to
9019453
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this will be useful.
Please update the commit message and PR description. The title is a bit vague, and is identical to a previous PR/commit with the same title (which confused me at first since I mixed up this PR with the other one). Also, the message alludes to "two changes", but it only makes one change.
pkg/rpc/context.go
Outdated
if remoteNodeID == 0 { | ||
dialCtx = logtags.AddTag(dialCtx, "rnode?", nil) | ||
} else { | ||
dialCtx = logtags.AddTag(dialCtx, "rnode", remoteNodeID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the standard node formatting here, i.e. n7
rather than 7
. We can maybe make this more compact by using e.g. r=n7
.
} else { | ||
dialCtx = logtags.AddTag(dialCtx, "rnode", remoteNodeID) | ||
} | ||
dialCtx = logtags.AddTag(dialCtx, "raddr", target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets marked as sensitive and redacted in logs. Is that intentional? I'm not sure if we consider IP addresses to be sensitive, but it would make debugging a bit harder.
Thanks for the review. I'll mark this PR as draft for now, as it completely breaks the RPC heartbeat protocol for a reason I can't figure out just yet. |
This is rebased and simplified. @rauchenstein FYI, the problem that we couldn't solve earlier, is that in |
This commit removes a code path that would tickle an assertion failure if we were to later fix the context propagation in the RPC heartbeat method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI and so we can't compare it in a client interceptor. Release note: None
This commit removes a code path that would tickle an assertion failure if we were to later fix the context propagation in the RPC heartbeat method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI and so we can't compare it in a client interceptor. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @shralex)
pkg/rpc/context.go, line 1149 at r17 (raw file):
// that it respects the same cancellation policy. func (rpcCtx *Context) grpcDialRaw( dialCtx context.Context, target string, remoteNodeID roachpb.NodeID, class ConnectionClass,
you can't use ctx
?
pkg/rpc/context.go, line 1141 at r20 (raw file):
dialCtx := rpcCtx.masterCtx if remoteNodeID == 0 { dialCtx = logtags.AddTag(dialCtx, "rnode?", nil)
maybe rnode, '?'
?
pkg/rpc/context.go, line 1302 at r21 (raw file):
conn.grpcConn, redialChan, conn.dialErr = rpcCtx.grpcDialRaw(dialCtx, target, remoteNodeID, class) if conn.dialErr == nil { dialCtx = logtags.AddTag(dialCtx, "heartbeat", nil)
make this a new ctx := ...
variable. I don't think you want to overwrite dialCtx
.
74318: tracing: add /debug/tracez rendering the active spans r=andreimatei a=andreimatei `/debug/tracez` lets users take a snapshot of the active spans registry and render the new snapshot, or one of the previously taken snapshots. The Tracer can hold up to 10 snapshots in memory. It looks like this: ![Screenshot from 2022-01-04 19-03-39](https://user-images.githubusercontent.com/377201/148140272-306658d5-5b9c-4f2a-b59c-28df9c5ed10c.png) When visualizing a snapshot, the page lets you do a number of things: 1. List all the spans. 2. See the (current) stack trace for each span's goroutine (if the goroutine was still running at the time when the snapshot was captured). Stack traces can be toggled visible/hidden. 3. Sort the spans by name or start time. 4. Filter the span according to text search. The search works across the name and stack trace. 5. Go from a span to the full trace containing that span. For the table Javascript providing sorting and filtering, this patch embeds the library from https://listjs.com/ . Limitations: - for now, only the registry of the local node is snapshotted. In the fuiture I'll collect info from all nodes. - for now, the relationships between different spans are not represented in any way. I'll work on the ability to go from a span to the whole trace that the span is part of. - for now, tags and structured and unstructured log messages that a span might have are not displayed in any way. At the moment, span creation is not enabled in production by default (i.e. the Tracer is put in `TracingModeOnDemand` by default, instead of the required `TracingModeActiveSpansRegistry`). This patch does not change that, so in order to benefit from /debug/tracez in all its glory, one has to run with `COCKROACH_REAL_SPANS=1` for now. Not for long, though. Release note: None 74867: sql: Support CREATE DATABASE WITH OWNER r=Fenil-P a=Fenil-P fixes #67817 Release note (sql change): Allow users to specify the owner when creating a database. Similar to postgresql: CREATE DATABASE name [ [ WITH ] [ OWNER [=] user_name ] 74871: sql: add a tracing tag with the txn ID r=andreimatei a=andreimatei This patch adds the txn's ID as a tag to the tracing span representing a SQL txn. I'm creating a UI to explore the current spans, and this ID will make it easy to navigate between a query/request blocking on a lock held by some other txn, and the activity of that other txn. Release note: None 75114: sql: directly specify columns in TableReader r=RaduBerinde a=RaduBerinde ~Note: the first commit is #74922.~ The internal columns of the TableReader (as well as the row fetcher) are all the columns of the table, with only a subset of values actually produced. This design decision has been carried over way past the point where it makes sense (I admit, it's questionable whether it ever made sense). For one, "all the columns" is ambiguous (does it contain non-public columns? does it include system columns?) leading to various flags and inherent fragility. Second, it relies on the execution engine to figure out (based on the PostProcessSpec) which columns are actually needed, which the optimizer already figures out for us now. This commit changes the TableReader spec and the interface of row.Fetcher to always produce a given specific set of column IDs. The diagram for table readers now specifies the columns by name. The JoinReader, InvertedJoiner, ZigzagJoiner are not changed in this commit (but they should be cleaned up as well). Release note: None 75175: colfetcher: fix the bytes read statistic collection r=yuzefovich a=yuzefovich During 21.2 release we adjusted the `cFetcher` to be `Close`d eagerly when it is returning the zero-length batch. This was done in order to release some references in order for the memory to be GCed sooner; additionally, the `cFetcher` started being used for the index join where the fetcher is restarted from scratch for every batch of spans, so it seemed reasonable to close it automatically. However, that eager closure broke "bytes read" statistic collection since the `row.KVFetcher` was responsible for providing it, and we were zeroing it out. This commit fixes this problem by the `cFetcher` memorizing the number of bytes it has read in `Close`. Some care needs to be taken to not double-count the bytes read in the index join, so a couple of helper methods have been introduced. Additionally this commit applies the same eager-close optimization to the `cFetcher` when the last batch is returned (which makes it so that if we've just exhausted all KVs, we close the fetcher - previously, we would set the zero length on the batch and might never get into `stateFinished`). Fixes: #75128. Release note (bug fix): Previously, CockroachDB could incorrectly report `KV bytes read` statistic in `EXPLAIN ANALYZE` output. The bug is present only in 21.2.x versions. 75215: cmd/github-post: fix Pebble metamorphic reproduction command r=jbowens a=jbowens When posting a github issue for a Pebble metamorphic test failure, include the correct `-ops` flag. Discovered because cockroachdb/pebble#1459 contained a reproduction command that contained too few ops to reproduce the issue. Release note: none 75228: logictestccl: skip flaky TestCCLLogic/fakedist-metadata/partitioning_enum r=mgartner a=mgartner Informs #75227 Release note: None 75237: cli,rpc: don't check the active cluster version in the CLI r=andreimatei a=knz This commit removes a code path that would tickle an assertion failure if we were to later fix the context propagation in the RPC heartbeat method (see PR #71243): there's no "active cluster version" in the CLI and so we can't compare it in a client interceptor. Release note: None 75254: scripts: add `dev generate --mirror` to `bump-pebble.sh` script r=jbowens a=nicktrav CI now expects that dependencies are mirrored to cloud storage and will fail if the TODO for mirroring the repo is left unaddressed in the `DEPS.bzl` file. Add a mirroring step to the `bump-pebble.sh` script. Release note: none Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Fenil Patel <[email protected]> Co-authored-by: Radu Berinde <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Nick Travers <[email protected]>
Release note: None
Release note: None
Release note: None
This ensures that the log messages sent to the DEV channel relating to RPC low-level activity (e.g. conn failures, heartbeat etc) include the remote node ID and address and connection class. Example before: ``` [n6] 87 dialing ``` After: ``` [n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87 dialing ``` Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @shralex)
pkg/rpc/context.go, line 1149 at r17 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you can't use
ctx
?
Done.
pkg/rpc/context.go, line 1141 at r20 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe
rnode, '?'
?
Done.
pkg/rpc/context.go, line 1302 at r21 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
make this a new
ctx := ...
variable. I don't think you want to overwritedialCtx
.
Done.
This removes now-redundant logging output (since the same data is included in tags) and adds a tag for the heartbeat background task. For example, before: ``` [n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87 dialing n4: 127.0.0.1:41706 (default) ``` After: ``` [n6,rnode=4,raddr=127.0.0.1:41706,class=default] 87 dialing ``` Release note: None
TFYR! bors r=andreimatei |
Build failed (retrying...): |
Build succeeded: |
This commit removes a code path that would tickle an assertion failure if we were to later fix the context propagation in the RPC heartbeat method (see PR cockroachdb#71243): there's no "active cluster version" in the CLI and so we can't compare it in a client interceptor. Release note: None
Epic: CRDB-11517
First commit from #75237.
This ensures that the log messages sent to the DEV channel relating to
RPC low-level activity (e.g. conn failures, heartbeat etc) include the
remote node ID and address and connection class.
This also connects the heartbeat task to the tracing infra properly.
Example before:
After: