jobs: surface the trace_id linked to job execution #65322
Conversation
@irfansharif adding you as a reviewer mainly for the first commit that changes the nature of the span associated with a job.
Hmm, the CI failure indicates that I have to ...
I think I just scared myself because of a unit test that was mocking out a job record and not going through the registry. We should not be leaking any spans. For posterity:
This is RFAL!
Nice, LGTM. This seems like it will give some nice immediate returns! 😎 I don't think I have enough context re: the new root spans being created in the 1st commit, but curious to hear @irfansharif's thoughts.
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @irfansharif, and @pbardea)
pkg/jobs/adopt.go, line 258 at r1 (raw file):
// If the job is being resumed by a registry, AnnotateCtxWithSpan will create // a new root span as ctx is not annotated with a parent span. ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())
More a curiosity question - can you link me to any resources that can help me understand the scenarios & semantics surrounding "resumed" jobs?
As discussed offline, I'd be curious to see what this PR would look like if we created a root span only when running the job (both the very first time and during resumption), instead of creating the span when the job is first created and not using it until execution. That way, when running the job, we can unconditionally create a root span instead of "check if a span exists; if so, use it; if not, create a new one because we're being resumed".
Also, from our offline discussions, the first commit message wants to be changed. Forked spans don't have to propagate payloads back to the parent span. Everything else around storing the trace ID (and updating it on resumption) for future introspectability sounds sane to me. I'm excited for us to flesh out the primitives to read in-flight trace data from across the cluster, instead of the simpler primitive we have today that only renders a full recording of the trace's remote children after it's done.
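A minimal sketch of that suggestion, assuming the tracing helpers already quoted in this review (`StartSpanCtx`, `WithForceRealSpan`); the wrapper name and `run` callback are placeholders, not the PR's actual code:

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// runWithRootSpan is a sketch: every execution of a job, whether it's the
// first run or a resumption, gets its own root span, so there is no
// "reuse the creation-time span if it exists" branch to maintain.
func runWithRootSpan(
	ctx context.Context, tracer *tracing.Tracer, spanName string,
	run func(context.Context) error,
) error {
	ctx, span := tracer.StartSpanCtx(ctx, spanName, tracing.WithForceRealSpan())
	defer span.Finish()
	return run(ctx)
}
```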
pkg/jobs/adopt.go (outdated)

	defer span.Finish()
	if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata,
		ju *JobUpdater) error {
		md.Payload.TraceId = span.TraceID()
This proto field should be named TraceID, there's a gogoproto override that lets you do that.
nice catch, done.
pkg/jobs/jobs.go (outdated)

	// FinishSpan will mark the span linked to the StartableJob as Finish()'d.
	func (sj *StartableJob) FinishSpan() {
		if sj.span != nil {
The nil check is unnecessary; it's fine to call .Finish on a nil span. Inlining sj.span.Finish at call sites seems fine to me.
cockroach/pkg/util/tracing/span.go
Line 79 in ab73621
if sp == nil || sp.i.isNoop() || atomic.AddInt32(&sp.numFinishCalled, 1) != 1 {
Makes sense. I deleted the method, though, since we're no longer bookkeeping a span on job creation.
pkg/jobs/registry.go (outdated)

		resumerCtx = tracing.ContextWithSpan(resumerCtx, span)
	}
	resumerCtx, span = r.settings.Tracer.StartSpanCtx(resumerCtx, "job",
		tracing.WithForceRealSpan())
How often do these jobs execute? How many active jobs are typically present in any running cluster under stable operation? Real spans have a tiny amount of overhead, and before this PR they're only created for sampled statements. Just sanity checking that this isn't in the hot path or anything.
Along this line, one idea is to give jobs a verbosity level of their own that they can declare when creating their job record. That way particular types of jobs can opt-in to be traced (e.g. we're pretty confident there won't be a crazy amount of backup/restore jobs so those jobs can always enable tracing).
This is a good idea, I haven't added it in yet because I want to flesh this out a little.
A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.
So I'm inclined to allow jobs to opt-in to being a real, root span, but not allowing them to choose between verbose/non-verbose. What do you think?
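To make the three modes concrete, here is a rough sketch (same imports as the earlier sketch). `StartSpanCtx` and `WithForceRealSpan` appear elsewhere in this review; `SetVerbose` is an assumption about how a span would be upgraded to verbose recording:

```go
// startJobSpan sketches the opt-in being discussed: a noop span by default, a
// real span for traceable jobs, and (optionally) verbose recording on top.
// SetVerbose is an assumed helper; the rest mirrors snippets in this review.
func startJobSpan(
	ctx context.Context, tracer *tracing.Tracer, name string, traced, verbose bool,
) (context.Context, *tracing.Span) {
	if !traced {
		// Noop span: records nothing unless tracing is globally enabled.
		return tracer.StartSpanCtx(ctx, name)
	}
	// Real span: can record structured payloads even when non-verbose.
	ctx, sp := tracer.StartSpanCtx(ctx, name, tracing.WithForceRealSpan())
	if verbose {
		// Verbose real span: additionally keeps free-form log messages.
		sp.SetVerbose(true)
	}
	return ctx, sp
}
```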
That sounds good; my mention of jobs having a "verbosity" was to err on the side of making that flag an enum in the proto rather than a bool, so that we can adjust later on if we want to add more fine-grained control (if it was to be something we put in the proto -- I could imagine a future where some jobs might want to be sampled and all executions of that job would want to be traced), but I'm +1 on a traceable job interface like Yevgeniy suggested below.
Updated to only create a root span when we start job execution. Also updated the first commit message to reflect this.
I'm not sure we have any docs around the specifics of job resumption. We have an old RFC on the jobs system, though I think it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job when created is recorded once in the `system.jobs` table, following which it is an acceptable candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the `system.jobs` table for adoptable jobs. When a registry adopts a job, it recreates the job from the information stored during creation, and begins execution. A job can subsequently be `PAUSED` and `RESUMED` multiple times, and this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job/was previously running the job will be the one to pick up the execution of the job on unpause.

A job might also be a candidate for resumption if the node running the job crashes. In this case, the current execution of the job will fail but will not mark the job entry as failed, thereby allowing other idle nodes to pick up its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!
Reviewed 2 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @pbardea)
pkg/jobs/registry.go, line 540 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
This is a good idea, I haven't added it in yet because I want to flesh this out a little.
A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.
So I'm inclined to allow jobs to opt-in to being a real, root span, but not allowing them to choose between verbose/non-verbose. What do you think?
@adityamaru FYI: opting into being traced is easy. I would just add a marker interface to jobs:
type TraceableJob interface {}
You can even add some methods there too (like: I want to be traced verbose or some such).
Have the registry test whether the resumer also implements this interface and create spans appropriately.
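For reference, a sketch of that marker interface and how the registry might consult it; the ForceRealSpan method matches the shape quoted later in this thread (pkg/jobs/jobs.go, line 100 at r3), while the registry-side snippet is a simplified illustration, not the PR's actual code:

```go
// TraceableJob is the marker interface suggested above: a Resumer implements
// it to opt in to being traced with a real root span.
type TraceableJob interface {
	// ForceRealSpan forces the registry to create a real span instead of the
	// low-overhead non-recordable noop span.
	ForceRealSpan() bool
}

// Simplified registry-side check (illustration only):
//
//	if tj, ok := resumer.(TraceableJob); ok && tj.ForceRealSpan() {
//		ctx, span = r.settings.Tracer.StartSpanCtx(ctx, spanName, tracing.WithForceRealSpan())
//	} else {
//		ctx, span = r.settings.Tracer.StartSpanCtx(ctx, spanName)
//	}
//	defer span.Finish()
```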
The use of pkg/tracing in this way LGTM.
pkg/jobs/adopt.go (outdated)

@@ -248,8 +248,23 @@ func (r *Registry) runJob(
	defer cleanup()
	spanName := fmt.Sprintf(`%s-%d`, typ, job.ID())
	var span *tracing.Span
	ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName)

	// Wrap the context with a root span that is tied to the execution of the
I'd drop the "Wrap the context" verbiage; let's just say "Create a new root span to trace the execution of the current instance of job...".
done.
pkg/util/log/ambient_context.go (outdated)

@@ -158,7 +158,7 @@ func (ac *AmbientContext) annotateCtxInternal(ctx context.Context) context.Context
 //
 // The caller is responsible for closing the span (via Span.Finish).
 func (ac *AmbientContext) AnnotateCtxWithSpan(
-	ctx context.Context, opName string,
+	ctx context.Context, opName string, os ...tracing.SpanOption,
The diffs in these files can be reverted.
done.
LGTM mod the discussion about opting into this.
Also, should there be a release note somewhere?
@pbardea added Yevgeniy's suggestion for opting in - https://github.com/cockroachdb/cockroach/pull/65322/files#diff-651e5b9db62dc86f9ef2a2d1937717a587117d43e9a589db3a43531817f2ca3bR257
In the future, we can add a ForceVerboseSpan if we deem it necessary. re: release note, I added one to the second commit about our new column in `SHOW JOBS`.

edit: oops, I think I broke all the tests with an NPE.
Previously, a job on creation would create a new context and wrap it with a "forked span" from the parent ctx. This meant that the span would inherit the traceID from its remote parent. When the job began executing, we would create yet another span. This span would be a child of the "forked span" mentioned above, if the job was being started on the same registry that created it. If the job was being resumed by a new registry, we would create a root span.

This change deletes the span created during job creation, and always creates a root span when the job begins executing. We did not use the span created on job creation for anything, and therefore this seems like an appropriate simplification. Creating a root span allows us to identify only those spans linked to job execution using the trace_id assigned to this root span. This will be helpful when we build tooling to improve job observability that will pull all the relevant spans.

This change also stores the trace_id associated with the current job resumer in the job's payload. This allows us to use it at a later time to inspect what a job is doing at a given point in time!

Release note: None
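Pieced together from the snippets quoted in this review, the flow the commit message describes looks roughly like the following inside runJob (a sketch: error handling and the exact Update signature are abbreviated):

```go
// Create a root span for this execution of the job (first run or resumption).
ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())
defer span.Finish()

// Persist the root span's trace ID so later tooling can collect every span
// belonging to this execution of the job.
if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
	md.Payload.TraceID = span.TraceID()
	ju.UpdatePayload(md.Payload)
	return nil
}); err != nil {
	return err
}
```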
This change surfaces the trace_id associated with the current execution of the job via the crdb_internal.jobs table and as a column in `SHOW JOBS`. This will allow users to pull the inflight traces for a job for debugging purposes.

Release note (sql change): `SHOW JOBS` now shows the `trace_id` of the trace that is associated with the current execution of the job. This allows pulling inflight traces for a job for debugging purposes.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)
pkg/jobs/adopt.go, line 258 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
I'm not sure we have any docs around the specifics of job resumption. We have an old RFC on the jobs system, though I think it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job when created is recorded once in the `system.jobs` table, following which it is an acceptable candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the `system.jobs` table for adoptable jobs. When a registry adopts a job, it recreates the job from the information stored during creation, and begins execution. A job can subsequently be `PAUSED` and `RESUMED` multiple times, and this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job/was previously running the job will be the one to pick up the execution of the job on unpause.

A job might also be a candidate for resumption if the node running the job crashes. In this case, the current execution of the job will fail but will not mark the job entry as failed, thereby allowing other idle nodes to pick up its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!
Makes sense - thanks for the clear explanation, much appreciated
TFTRs! bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)
pkg/jobs/adopt.go, line 268 at r3 (raw file):
	ju *JobUpdater) error {
	md.Payload.TraceID = span.TraceID()
	ju.UpdatePayload(md.Payload)
Should this be progress?
Update: as per offline discussion we'll change this in a follow-up
pkg/jobs/jobs.go, line 100 at r3 (raw file):
// ForceRealSpan forces the registry to create a real Span instead of a
// low-overhead non-recordable noop span.
ForceRealSpan() bool
nit, take it or leave it: it looks like every implementation of this is going to return `true`, so we could even drop the return value here -- implementing the interface would itself indicate that we want a real span/traceable job.
Build succeeded:
good point, addressed over in #65454.
The trace_id associated with a job is updated on every resumption of the job. If the job is paused and resumed several times, we don't want to rewrite the payload every single time. Thus, we move the field to the job progress, which is expected to be rewritten frequently. Also addresses a comment from cockroachdb#65322.

Release note: None
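Assuming UpdateProgress mirrors the UpdatePayload call shown earlier in this review, the follow-up amounts to something like this sketch:

```go
// Sketch of the follow-up in #65454: record the trace ID in the progress,
// which is rewritten on every resumption anyway, instead of the payload.
// UpdateProgress is assumed to be the progress counterpart of UpdatePayload.
if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
	md.Progress.TraceID = span.TraceID()
	ju.UpdateProgress(md.Progress)
	return nil
}); err != nil {
	return err
}
```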
62719: kvserver: simplify and document handling of ReplicaPlaceholder r=erikgrinaker a=tbg

This commit cleans up the handling of ReplicaPlaceholder and documents their semantics better. I embarked on this while looking into an unrelated test flake. It occurred to me that we were not installing placeholders before the snapshot transfer, but only after. This seems like a bad idea now that our average snapshot size has gone up, as placeholders help avoid duplicate work; we now install them before accepting the range data. In practice, we currently only allow one snapshot inflight per store, so it is not clear that that improvement is buying us anything. I think that the improved clarity in this commit will, though.

Concretely, we now have two simple rules for placeholders:

1. they only exist for uninitialized replicas (and must exist for those, or we can end up with overlapping replicas)
2. you write it --> you (and *only* you) remove it.

1 was true before but the documentation was not clear. 2 was not true, as there were a few out-of-band places that removed placeholders (and didn't clearly explain why - sounded like we were previously leaking placeholders, maybe @ajwerner remembers?).

There's also a bit of extra cleanup that clarifies things - for example, since the caller to `applySnapshot` already checked for an empty snapshot, the placeholder removal within it was effectively dead code and is now removed. Also, in receiveSnapshot, there was an optimistic check that allowed us to discard snapshots early. It was always confusing to keep track of which one is which, and now that we've lifted the authoritative check to the top of receiveSnapshot, we got to delete the other one.

Release note: None

65454: jobs: move trace_id from payload to progress r=pbardea a=adityamaru

The trace_id associated with a job is updated on every resumption of the job. If the job is paused and resumed several times, we don't want to rewrite the payload every single time. Thus, we move the field to the job progress, which is expected to be rewritten frequently. Also addresses a comment from #65322.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
In cockroachdb#65322 we added code to record a trace ID into job progress. This code added a new error return path to runJob which did not unregister the job from the in-memory state in the registry. This PR fixes that oversight by deferring that call and allowing jobs to be retried when writing that trace ID fails.

Release note (bug fix): Fixed a bug in jobs where failures to write to the jobs table can prevent subsequent adoption of a job until the previous node dies or the job is paused.
67671: jobs: fix bug where failure to record the trace ID prevents adoption, remove RTT r=ajwerner a=ajwerner

In #65322 we added code to record a trace ID into job progress. This code added a new error return path to runJob which did not unregister the job from the in-memory state in the registry. This PR fixes that oversight by deferring that call and allowing jobs to be retried when writing that trace ID fails.

**jobs: only write the trace ID into jobs if nonzero**

We don't trace most jobs. It's an extra write to the jobs table for each run to just mark it as 0.

Release note (performance improvement): Eliminated a round-trip when running most jobs.

Release note (bug fix): Fixed a bug in jobs where failures to write to the jobs table can prevent subsequent adoption of a job until the previous node dies or the job is paused.

Co-authored-by: Andrew Werner <[email protected]>
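Roughly, the fix described above makes the new error path clean up the in-memory registration; in this sketch, writeTraceID and unregister are stand-in names for illustration, not the actual helpers:

```go
// Before (sketch): returning early here left the job claimed in the
// registry's in-memory state, so no other node (or later attempt) could
// adopt it.
//
//	if err := writeTraceID(ctx, job, span); err != nil {
//		return err
//	}
//
// After (sketch): unregister via defer so every exit path, including the new
// one, releases the in-memory claim and the job can be retried or adopted.
defer r.unregister(job.ID())
if err := writeTraceID(ctx, job, span); err != nil {
	return err
}
```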
Please see individual commit messages.
Informs: #64992