
server: spammy trace logging on context cancellation #105378

Closed
erikgrinaker opened this issue Jun 22, 2023 · 5 comments · Fixed by #106280
Labels: A-disaster-recovery, C-bug, T-disaster-recovery

Comments

erikgrinaker (Contributor) commented Jun 22, 2023

In #102793, we began logging trace events on context cancellation. However, this results in traces being logged constantly, mostly for internal requests. We see this frequently in roachtests, e.g. #104709 and #105260, often due to HeartbeatTxn, QueryTxn, and EndTxn requests. The traces are often, but not always, empty. We need to do something about these.

Here's an example; there were thousands of these within a few minutes.

I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352  batch request ‹HeartbeatTxn [/Local/Range/Table/106/1/-8578380985326469808/RangeDescriptor,/Min), [txn: c9232c77]› failed with error: ‹result is ambiguous: after 0.02s of attempting command: context canceled›
I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352 +trace:
I230619 12:36:59.982466 46378 server/node.go:1317 ⋮ [T1,n7] 1352 +‹<empty recording>›

Jira issue: CRDB-29006

erikgrinaker added the C-bug and T-disaster-recovery labels on Jun 22, 2023
blathers-crl bot commented Jun 22, 2023

cc @cockroachdb/disaster-recovery

erikgrinaker (Contributor, Author) commented Jun 23, 2023

I think part of the problem here is that we're checking the error kind (e.g. context.Canceled) of the returned error, but we don't check if the caller's context is still valid.

cockroach/pkg/server/node.go, lines 1315 to 1320 at a1bea39:

if pErr != nil && errors.IsAny(pErr.GoError(), context.Canceled, context.DeadlineExceeded) {
    if sp := tracing.SpanFromContext(ctx); sp != nil && !sp.IsNoop() {
        log.Infof(ctx, "batch request %s failed with error: %s\ntrace:\n%s", args.String(),
            pErr.GoError().Error(), sp.GetConfiguredRecording().String())
    }
}

For example, if we make another RPC call somewhere below us in the stack and something goes wrong with it, we return a context.Canceled up the stack; that will fire here and log a message, even though our upstream caller may still be around, with a perfectly valid context, waiting for the error to be returned.

We should instead be checking ctx.Err() here, because we're interested in whether the caller's context has been canceled, not whether one of our child contexts has been canceled.

I don't know if this will fully address the problem, but it's at least part of the problem.
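
A minimal sketch of that alternative check, reusing the same variables as the snippet above (ctx, pErr, args, and the tracing/log helpers); the only change is gating on the caller's own ctx.Err() instead of the kind of the returned error:

if pErr != nil && ctx.Err() != nil {
    if sp := tracing.SpanFromContext(ctx); sp != nil && !sp.IsNoop() {
        // Fires only when the caller's own context is done, not when some
        // child context further down the stack was cancelled.
        log.Infof(ctx, "batch request %s failed with error: %s\ntrace:\n%s",
            args.String(), pErr.GoError().Error(), sp.GetConfiguredRecording().String())
    }
}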

adityamaru (Contributor) commented:

That makes sense, I'll try changing it to check the caller's context instead and see if it still spams the logs. Another conservative option is to allowlist the batch requests that are actually allowed to print on context timeout + cancellation; for the time being this could be only ExportRequests. A rough sketch of that idea follows below.
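
For illustration, the allowlist idea might look roughly like the hypothetical helper below; containsExport, the kvpb package path, and the request-union accessors are all assumptions here, sketched only to show the shape of the check:

// Hypothetical helper: report whether the batch contains an ExportRequest,
// so that only such batches get to log their trace on cancellation.
func containsExport(ba *kvpb.BatchRequest) bool {
    for _, ru := range ba.Requests {
        if _, ok := ru.GetInner().(*kvpb.ExportRequest); ok {
            return true
        }
    }
    return false
}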

erikgrinaker (Contributor, Author) commented:

Yeah, that'd help. I also noticed that most of the spans were empty, so they probably hadn't fully enabled tracing (or maybe we don't return traces on context cancellation). We could maybe add some additional conditions here to filter out traces that are pointless to log, along the lines of the sketch below.
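
A rough sketch of such a filter: fetch the recording once and skip the log line entirely when nothing was recorded. Whether len(rec) is the right emptiness check for the Recording type is an assumption here:

if pErr != nil && errors.IsAny(pErr.GoError(), context.Canceled, context.DeadlineExceeded) {
    if sp := tracing.SpanFromContext(ctx); sp != nil && !sp.IsNoop() {
        // Avoid the "<empty recording>" spam: only log when the span
        // actually recorded something (len() check is an assumption).
        if rec := sp.GetConfiguredRecording(); len(rec) > 0 {
            log.Infof(ctx, "batch request %s failed with error: %s\ntrace:\n%s",
                args.String(), pErr.GoError().Error(), rec.String())
        }
    }
}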

erikgrinaker (Contributor, Author) commented:

Still seeing a lot of these, e.g. over on #106108:

I230704 11:53:36.734161 12422 server/node.go:1318 ⋮ [T1,n2] 546  batch request ‹PushTxn(PUSH_TOUCH,00000000->e875be52) [/Local/Range/Table/106/1/5169142570105423792/RangeDescriptor,/Min)› failed with error: ‹context canceled›
I230704 11:53:36.734161 12422 server/node.go:1318 ⋮ [T1,n2] 546 +trace:
I230704 11:53:36.734161 12422 server/node.go:1318 ⋮ [T1,n2] 546 +‹<empty recording>›
I230704 11:53:36.739966 12421 server/node.go:1318 ⋮ [T1,n2] 547  batch request ‹PushTxn(PUSH_TOUCH,00000000->133562db) [/Local/Range/Table/106/1/6459124673162035792/RangeDescriptor,/Min)› failed with error: ‹context deadline exceeded›
I230704 11:53:36.739966 12421 server/node.go:1318 ⋮ [T1,n2] 547 +trace:
I230704 11:53:36.739966 12421 server/node.go:1318 ⋮ [T1,n2] 547 +‹<empty recording>›
I230704 11:53:36.828270 12413 server/node.go:1318 ⋮ [T1,n2] 548  batch request ‹PushTxn(PUSH_TOUCH,00000000->fb9da637) [/Local/Range/Table/106/1/-5298140780411085008/RangeDescriptor,/Min)› failed with error: ‹context canceled›
I230704 11:53:36.828270 12413 server/node.go:1318 ⋮ [T1,n2] 548 +trace:
I230704 11:53:36.828270 12413 server/node.go:1318 ⋮ [T1,n2] 548 +‹<empty recording>›
I230704 11:53:37.105523 12436 server/node.go:1318 ⋮ [T1,n2] 549  batch request ‹PushTxn(PUSH_TOUCH,00000000->6cc31794) [/Local/Range/Table/106/1/-598920262133427008/RangeDescriptor,/Min)› failed with error: ‹context canceled›
I230704 11:53:37.105523 12436 server/node.go:1318 ⋮ [T1,n2] 549 +trace:
I230704 11:53:37.105523 12436 server/node.go:1318 ⋮ [T1,n2] 549 +‹<empty recording>›
I230704 11:53:37.194077 12453 server/node.go:1318 ⋮ [T1,n2] 550  batch request ‹PushTxn(PUSH_TOUCH,00000000->cc5ace84) [/Local/Range/Table/106/1/6753977725289261392/RangeDescriptor,/Min)› failed with error: ‹context deadline exceeded›
I230704 11:53:37.194077 12453 server/node.go:1318 ⋮ [T1,n2] 550 +trace:
I230704 11:53:37.194077 12453 server/node.go:1318 ⋮ [T1,n2] 550 +‹<empty recording>›
I230704 11:53:38.188970 12532 server/node.go:1318 ⋮ [T1,n2] 551  batch request ‹PushTxn(PUSH_TOUCH,00000000->97175142) [/Local/Range/Table/106/1/598920262133426992/RangeDescriptor,/Min)› failed with error: ‹context canceled›
I230704 11:53:38.188970 12532 server/node.go:1318 ⋮ [T1,n2] 551 +trace:
I230704 11:53:38.188970 12532 server/node.go:1318 ⋮ [T1,n2] 551 +‹<empty recording>›

adityamaru added a commit to adityamaru/cockroach that referenced this issue Jul 6, 2023
This change adds a conditional to only print the trace of a context-cancelled request if the trace is not empty. This avoids the log spam noticed in cockroachdb#105378.

Fixes: cockroachdb#105378
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jul 6, 2023
This change adds a conditional to only print the trace of a context-cancelled request if the trace is not empty. This avoids the log spam noticed in cockroachdb#105378.

Fixes: cockroachdb#105378
Release note: None
craig bot pushed a commit that referenced this issue Jul 7, 2023
106280: server: don't print trace if empty r=knz a=adityamaru

This change adds a conditional to only print the trace of a context-cancelled request if the trace is not empty. This avoids the log spam noticed in #105378.

Fixes: #105378
Release note: None

Co-authored-by: adityamaru <[email protected]>
craig bot closed this as completed in c829b15 on Jul 7, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jul 12, 2023
This change adds a conditional to only print the trace of a context-cancelled request if the trace is not empty. This avoids the log spam noticed in cockroachdb#105378.

Fixes: cockroachdb#105378
Release note: None