release-23.1: backupccl,kvserver: log failed ExportRequest trace on client and server #104214
Conversation
Thanks for opening a backport. Please check the backport criteria before merging.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Please hold off on this backport. I'm seeing tons of log spam in roachtests. Will open an issue in a bit.
@erikgrinaker with #105378 fixed, are we okay merging this with the log spam change included? This will help debug export requests timing out in CC, such as https://github.com/cockroachlabs/support/issues/2452
Closing in favour of #106611.
Backport 1/1 commits from #102793 on behalf of @adityamaru.
/cc @cockroachdb/release
This change strives to improve observability around
backups that fail because of timed-out ExportRequests.
Currently, there is very little indication of what the request
was doing when the client cancelled the context after
the pre-determined timeout window. With this change we
now log the trace of the ExportRequest that failed. If
the ExportRequest was served locally, then the trace will be
part of the sender's tracing span. However, if the request
was served on a remote node then the sender does not wait
for the server side evaluation to notice the context cancellation.
To work around this, we also print the trace on the server side
if the request encountered a context cancellation and the
associated tracing span is not a noop.
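For illustration only, here is a minimal, self-contained Go sketch of the server-side pattern described above; it is not the actual CockroachDB code, and the `recordingSpan` interface and function names are hypothetical stand-ins. The idea: when evaluation ends because the client cancelled its context and the request carries a non-noop tracing span, dump the span's recording to the server's log so the trace survives even though the sender has already given up.

```go
// Package exporttrace is a hypothetical illustration; it is not part of the
// CockroachDB source tree.
package exporttrace

import (
	"context"
	"errors"
	"log"
)

// recordingSpan is a stand-in for a tracing span. IsNoop and Recording are
// assumed methods, used only to illustrate the control flow.
type recordingSpan interface {
	IsNoop() bool
	Recording() string
}

// maybeLogTraceOnCancellation logs the span's recording when an
// ExportRequest's evaluation failed because the client cancelled its context.
// A noop span carries no recording, so in that case nothing is printed.
func maybeLogTraceOnCancellation(ctx context.Context, sp recordingSpan, evalErr error) {
	cancelled := errors.Is(evalErr, context.Canceled) || ctx.Err() != nil
	if !cancelled {
		return
	}
	if sp == nil || sp.IsNoop() {
		return
	}
	log.Printf("export request failed with context cancellation; trace:\n%s", sp.Recording())
}
```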
This change also adds a private cluster setting
`bulkio.backup.export_request_verbose_tracing`
that, if set to true, will send all backup export requests
with verbose tracing mode.
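As a rough sketch of how such a knob gates verbose tracing (again hypothetical; the real change reads a cluster setting and uses CockroachDB's tracing package rather than the stand-ins below), the client would flip the request's span into verbose recording before sending when the setting is on:

```go
// Hypothetical illustration of gating verbose tracing on a boolean knob;
// not the actual backupccl code.
package exporttrace

import "context"

// verboseSpan is a stand-in for a tracing span that can be switched into
// verbose recording mode; SetVerbose is an assumed method.
type verboseSpan interface {
	SetVerbose(bool)
}

// sendExportRequestWithTracing sends one export request via the supplied
// send function. exportRequestVerboseTracing stands in for the value of
// bulkio.backup.export_request_verbose_tracing.
func sendExportRequestWithTracing(
	ctx context.Context,
	sp verboseSpan,
	exportRequestVerboseTracing bool,
	send func(context.Context) error,
) error {
	if exportRequestVerboseTracing && sp != nil {
		// Verbose recording captures every log message and child operation in
		// the span, at some overhead, which is why the knob is off by default
		// and kept behind a private cluster setting.
		sp.SetVerbose(true)
	}
	return send(ctx)
}
```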
To debug a backup failing with a timed-out export request we
can now:

- `SET CLUSTER SETTING bulkio.backup.export_request_verbose_tracing = true;`
- `SET CLUSTER SETTING trace.snapshot.rate = '1m';`
Once the backup times out we can look at the logs
for the server-side and client-side ExportRequest traces
to understand where we were stuck executing for so long.
Fixes: #86047
Release note: None
Release justification: improving observability into a common cause of escalations