
jobs: surface the trace_id linked to job execution #65322

Merged: 2 commits, May 18, 2021

Conversation

adityamaru (Contributor)

Please see individual commit messages.

Informs: #64992

@adityamaru adityamaru requested a review from a team May 17, 2021 10:11
@cockroach-teamcity (Member)

This change is Reviewable

@adityamaru adityamaru requested review from a team, pbardea and irfansharif and removed request for a team May 17, 2021 10:12
@adityamaru (Contributor, Author)

@irfansharif adding you as a reviewer mainly for the first commit that changes the nature of the span associated with a job.

@adityamaru (Contributor, Author)

Hmm, the CI failure indicates that I have to Finish() the job root span somewhere to avoid leaking it. I'm not clear on how we previously Finish()'d the forked span, but the test cluster checks the node's registry for any unfinished root spans before shutting down. Please hold off on reviewing this until I figure it out!

@adityamaru (Contributor, Author)

I think I just scared myself because of a unit test that mocks out a job record without going through the registry. We should not be leaking any spans. For posterity:

  • The root span created when the job is created for the first time is marked as Finished once runJob returns in the Start() method.
  • If the job is resumed on a node other than the one that created it, the root span created on resumption in runJob has a defer span.Finish().

This is RFAL!

@abarganier (Contributor) left a comment

Nice, LGTM. This seems like it will give some nice immediate returns! 😎 I don't think I have enough context re: the new root spans being created in the 1st commit, but curious to hear @irfansharif's thoughts.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @irfansharif, and @pbardea)


pkg/jobs/adopt.go, line 258 at r1 (raw file):

	// If the job is being resumed by a registry, AnnotateCtxWithSpan will create
	// a new root span as ctx is not annotated with a parent span.
	ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())

More a curiosity question - can you link me to any resources that can help me understand the scenarios & semantics surrounding "resumed" jobs?

@irfansharif (Contributor) left a comment

As discussed offline, I'd be curious to see what this PR would look like if we only created a root span when running the job (both the very first time and during resumption), instead of creating the span when the job is first created and not using it until execution. That way, when running the job, we could unconditionally create a root span instead of "check if a span exists; if so, use it; if not, create a new one because we're being resumed".

Also, from our offline discussions, the first commit message needs to be changed: forked spans don't have to propagate payloads back to the parent span. Everything else around storing the trace ID (and updating it on resumption) for future introspectability sounds sane to me. I'm excited for us to flesh out the primitives to read in-flight trace data from across the cluster, instead of the simpler primitive we have today that only renders a full recording of the trace's remote children after the trace is done.

defer span.Finish()
if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata,
ju *JobUpdater) error {
md.Payload.TraceId = span.TraceID()
Contributor:

This proto field should be named TraceID; there's a gogoproto override that lets you do that.
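For reference, the override being referred to looks roughly like this in the proto definition (the field number and placement here are illustrative, not the actual jobs.proto):

	// Illustrative only: gogoproto's customname option makes the generated Go
	// field TraceID instead of the default TraceId, without changing the wire format.
	uint64 trace_id = 99 [(gogoproto.customname) = "TraceID"];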

Contributor Author:

nice catch, done.

pkg/jobs/jobs.go

// FinishSpan will mark the span linked to the StartableJob as Finish()'d.
func (sj *StartableJob) FinishSpan() {
if sj.span != nil {
Contributor:

The nil check is unnecessary; it's fine to call .Finish on a nil span. Inlining sj.span.Finish at call sites seems fine to me.

if sp == nil || sp.i.isNoop() || atomic.AddInt32(&sp.numFinishCalled, 1) != 1 {

Contributor Author:

Makes sense. I deleted the method, though, since we're no longer bookkeeping a span on job creation.

resumerCtx = tracing.ContextWithSpan(resumerCtx, span)
}
resumerCtx, span = r.settings.Tracer.StartSpanCtx(resumerCtx, "job",
tracing.WithForceRealSpan())
Contributor:

How often do these jobs execute? How many active jobs are typically present in a running cluster under stable operation? Real spans have a tiny amount of overhead, and before this PR they're only created for sampled statements. Just sanity checking that this isn't in the hot path or anything.

Contributor:

Along this line, one idea is to give jobs a verbosity level of their own that they can declare when creating their job record. That way particular types of jobs can opt in to being traced (e.g. we're pretty confident there won't be a crazy number of backup/restore jobs, so those jobs could always enable tracing).

Contributor Author:

This is a good idea; I haven't added it yet because I want to flesh this out a little.

A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.

So I'm inclined to allow jobs to opt in to being traced with a real root span, but not allow them to choose between verbose and non-verbose (a rough sketch of the three modes follows below). What do you think?
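For concreteness, a rough sketch of the three flavours described above (the method names approximate the pkg/tracing API at the time and are illustrative, not exact):

	// Noop span: the default when tracing is off; records nothing.
	ctx, sp := tracer.StartSpanCtx(ctx, "job")

	// Real, non-verbose span: keeps structured payloads, drops free-form logs.
	ctx, sp = tracer.StartSpanCtx(ctx, "job", tracing.WithForceRealSpan())
	sp.RecordStructured(payload) // payload is any structured proto message: retained
	log.Event(ctx, "free-form")  // free-form log: discarded

	// Real, verbose span: everything is recorded.
	sp.SetVerbose(true)
	log.Event(ctx, "free-form") // now retained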

@pbardea (Contributor), May 18, 2021:

That sounds good. My mention of jobs having a "verbosity" was to err on the side of making that flag an enum in the proto rather than a bool, so that we can adjust later if we want to add more fine-grained control (if it were to be something we put in the proto). I could imagine a future where some jobs might want to be sampled and all executions of that job traced, but I'm +1 on a traceable-job interface like Yevgeniy suggested below.

@adityamaru (Contributor, Author)

Previously, irfansharif wrote…

Like discussed offline, I'd be curious to see what this PR would look like if we created a root span when only running the job (both the very first time, and also during resumption), instead of creating the span when the job is first created, and not using it until executing it. That way, when running the job, we can unconditionally create a root span instead of "check if a span exists, if so use it, if not, create a new one because we're being resumed".

Updated to only creating a root span when we start job execution. Also updated the first commit message to reflect this.

@adityamaru (Contributor, Author)

Previously, abarganier wrote…

pkg/jobs/adopt.go, line 258 at r1 (raw file):

	// If the job is being resumed by a registry, AnnotateCtxWithSpan will create
	// a new root span as ctx is not annotated with a parent span.
	ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())

More a curiosity question - can you link me to any resources that can help me understand the scenarios & semantics surrounding "resumed" jobs?

I'm not sure we have any docs on the specifics of job resumption. We have an old RFC on the jobs system, though it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job, when created, is recorded once in the system.jobs table, after which it becomes a candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the system.jobs table for adoptable jobs. When a registry adopts a job, it reconstructs the job from the information stored at creation and begins execution. A job can subsequently be PAUSED and RESUMED multiple times; this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job, or that was previously running it, will be the one to pick up its execution on unpause.
A job might also become a candidate for resumption if the node running it crashes. In that case, the current execution fails, but the job entry is not marked as failed, which allows other idle nodes to pick up its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!
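For illustration, a very rough sketch of that adoption flow (every name below is made up for exposition; this is not the actual registry code):

	// Illustrative pseudocode of job adoption, not the real implementation.
	func (r *Registry) adoptionLoop(ctx context.Context) {
		for {
			// Each node's registry periodically scans system.jobs for jobs that
			// are runnable and not currently claimed by another node.
			for _, j := range r.findAdoptableJobs(ctx) { // hypothetical helper
				if r.claim(ctx, j) { // record a claim so other nodes skip this job
					// Reconstruct the job from the record written at creation
					// time and execute it on this node.
					go r.runJob(ctx, j)
				}
			}
			time.Sleep(adoptInterval) // poll again later (hypothetical interval)
		}
	}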

@miretskiy miretskiy requested a review from abarganier May 18, 2021 13:11
@miretskiy (Contributor) left a comment

Reviewed 2 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @pbardea)


pkg/jobs/registry.go, line 540 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

This is a good idea, I haven't added it in yet because I want to flesh this out a little.

A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.

So I'm inclined to allow jobs to opt-in to being a real, root span, but not allowing them to choose between verbose/non-verbose. What do you think?

@adityamaru FYI: opting into being traced is easy. I would just add a marker interface to jobs:

type TraceableJob interface {}

You can even add some methods there too (like "I want to be traced verbosely" or some such). Have the registry test whether the resumer also implements this interface and create spans appropriately; see the sketch below.
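For example, something along these lines (a sketch of the suggestion; the StartSpanCtx/WithForceRealSpan calls mirror the ones already in this diff, the rest is illustrative):

	// TraceableJob is a marker interface: resumers that implement it opt in to
	// having a real (non-noop) root span created for their execution.
	type TraceableJob interface {
		ForceRealSpan() // could also carry options, e.g. verbosity
	}

	// In the registry, when starting the job:
	var opts []tracing.SpanOption
	if _, ok := resumer.(TraceableJob); ok {
		opts = append(opts, tracing.WithForceRealSpan())
	}
	ctx, span := r.settings.Tracer.StartSpanCtx(ctx, spanName, opts...)
	defer span.Finish()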

@irfansharif (Contributor) left a comment

The use of pkg/tracing in this way LGTM.

@@ -248,8 +248,23 @@ func (r *Registry) runJob(
defer cleanup()
spanName := fmt.Sprintf(`%s-%d`, typ, job.ID())
var span *tracing.Span
ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName)

// Wrap the context with a root span that is tied to the execution of the
Contributor:

I'd drop the "Wrap the context" verbiage; let's just say "Create a new root span to trace the execution of the current instance of job...".

Contributor Author:

done.

@@ -158,7 +158,7 @@ func (ac *AmbientContext) annotateCtxInternal(ctx context.Context) context.Conte
//
// The caller is responsible for closing the span (via Span.Finish).
func (ac *AmbientContext) AnnotateCtxWithSpan(
ctx context.Context, opName string,
ctx context.Context, opName string, os ...tracing.SpanOption,
Contributor:

The diffs in these files can be reverted.

Contributor Author:

done.

@pbardea (Contributor) left a comment

LGTM mod the discussion about opting into this.

Also, should there be a release note somewhere?

@adityamaru (Contributor, Author) left a comment

@pbardea added Yevgeniy's suggestion for opting in - https://github.com/cockroachdb/cockroach/pull/65322/files#diff-651e5b9db62dc86f9ef2a2d1937717a587117d43e9a589db3a43531817f2ca3bR257

In the future we can add a ForceVerboseSpan if we deem it necessary. Re: the release note, I added one to the second commit about the new column in `SHOW JOBS`.

Edit: oops, I think I broke all the tests with an NPE.

Previously, a job on creation would create a new context
and wrap it with a "forked span" from the parent ctx. This
meant that the span would inherit the traceID from its
remote parent. When the job began executing, we would create
yet another span. This span would be a child of the "forked
span" mentioned above, if the job was being started on the
same registry that created it. If the job was being resumed
by a new registry, we would create a root span.

This change deletes the root span created during job creation,
and always creates a root span when the job begins executing.
We did not use the span created on job creation for anything
and therefore this seems like an appropriate simplification.

Creating a root span allows us to identify only those spans linked
to job execution using the trace_id assigned to this root span.
This will be helpful when we build tooling to improve job
observability that will pull all the relevant spans.

This change also stores the trace_id associated with the current
job resumer in the job's payload. This allows us to inspect, at a
later time, what a job is doing at a given point in its execution.

Release note: None
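Putting the pieces of this commit together, the core of the change in runJob looks roughly like this (assembled from the diff fragments quoted above; simplified, not the verbatim code):

	spanName := fmt.Sprintf(`%s-%d`, typ, job.ID())

	// Always create a real root span for this execution of the job, whether it
	// is being started for the first time or resumed on another node.
	ctx, span := r.settings.Tracer.StartSpanCtx(ctx, spanName, tracing.WithForceRealSpan())
	defer span.Finish()

	// Persist the trace ID of this execution so tooling can later collect all
	// spans belonging to the job.
	if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
		md.Payload.TraceID = span.TraceID()
		ju.UpdatePayload(md.Payload)
		return nil
	}); err != nil {
		return err
	}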

This change surfaces the trace_id associated with the current
execution of the job in the crdb_internal.jobs table and as a
column in `SHOW JOBS`. This will allow users to pull the
in-flight traces for a job for debugging purposes.

Release note (sql change): `SHOW JOBS` now shows the `trace_id`
of the trace associated with the current execution of the
job. This allows pulling in-flight traces for a job for debugging
purposes.
@abarganier (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)


pkg/jobs/adopt.go, line 258 at r1 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

I'm not sure we have any docs around the specifics of job resumption. We have an old RFC on the jobs system though I think it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job when created is recorded once in the system.jobs table following which it is an acceptable candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the system.jobs table for adoptable jobs. When a registry adopts a job, it recreates the job from the information stored during creation, and begins execution. A job can subsequently be PAUSED and RESUMED multiple times, and this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job/was previously running the job, will be the one to pick up the execution of the job on unpause.
A job might also be a candidate for resumption if the node running the job crashes. In this case, the current execution of the job will fail but will not mark the job entry as failed, thereby allowing other idle nodes to pickup its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!

Makes sense - thanks for the clear explanation, much appreciated

@adityamaru (Contributor, Author)

TFTRs!

bors r+

@pbardea (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)


pkg/jobs/adopt.go, line 268 at r3 (raw file):

		ju *JobUpdater) error {
		md.Payload.TraceID = span.TraceID()
		ju.UpdatePayload(md.Payload)

Should this be progress?

Update: as per offline discussion we'll change this in a follow-up


pkg/jobs/jobs.go, line 100 at r3 (raw file):

	// ForceRealSpan forces the registry to create a real Span instead of a
	// low-overhead non-recordable noop span.
	ForceRealSpan() bool

nit, take it or leave it: it looks like every implementation of this is going to return true, so we could drop the return value entirely; implementing the interface would itself indicate that we want a real span / traceable job.

@craig (bot) commented May 18, 2021

Build succeeded:

@adityamaru (Contributor, Author)

Previously, pbardea wrote…

pkg/jobs/jobs.go, line 100 at r3 (raw file):

	// ForceRealSpan forces the registry to create a real Span instead of a
	// low-overhead non-recordable noop span.
	ForceRealSpan() bool

nit, take it or leave it: but it looks like every implementation of this is going to return True, so we can not even have a return value here and implementing it indicates that we want a real span/traceable job.

good point, addressed over in #65454.

adityamaru added a commit to adityamaru/cockroach that referenced this pull request May 19, 2021
The trace_id associated with a job is updated on every
resumption of the job. If the job is paused and resumed
several times, we don't want to rewrite the payload every
single time. Thus, we move the field to the job progress
which is expected to be rewritten frequently.

Also addresses a comment from cockroachdb#65322.

Release note: None
craig bot pushed a commit that referenced this pull request May 19, 2021
62719: kvserver: simplify and document handling of ReplicaPlaceholder r=erikgrinaker a=tbg

This commit cleans up the handling of ReplicaPlaceholder and documents their
semantics better. I embarked on this while looking into an unrelated test
flake. It occurred to me that we were not installing placeholders before the
snapshot transfer, but only after. This seems like a bad idea now that our
average snapshot size has gone up, as placeholders help avoid duplicate work;
we now install them before accepting the range data.

In practice, we currently only allow one snapshot in flight per store, so it
is not clear that this improvement is buying us anything. I think that the
improved clarity in this commit will, though.

Concretely, we now have two simple rules for placeholders:

1. they only exist for uninitialized replicas (and must exist for those, or we
   can end up with overlapping replicas)
2. you write it --> you (and *only* you) remove it.


1. was true before, but the documentation was not clear. 2. was not true, as
there were a few out-of-band places that removed placeholders (and didn't
clearly explain why - it sounded like we were previously leaking placeholders,
maybe @ajwerner remembers?).

There's also a bit of extra cleanup that clarifies things - for example, since
the caller to `applySnapshot` already checked for an empty snapshot, the
placeholder removal within it was effectively dead code and is now removed.
Also, in receiveSnapshot, there was an optimistic check that allowed us to
discard snapshots early. It was always confusing to keep track of which check
is which, and now that we've lifted the authoritative check to the top of
receiveSnapshot, we got to delete the other one.

Release note: None


65454: jobs: move trace_id from payload to progress r=pbardea a=adityamaru

The trace_id associated with a job is updated on every
resumption of the job. If the job is paused and resumed
several times, we don't want to rewrite the payload every
single time. Thus, we move the field to the job progress
which is expected to be rewritten frequently.

Also addresses a comment from #65322.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 15, 2021
In cockroachdb#65322 we added code to record a trace ID into job progress. This code
added a new error return path to runJob which did not unregister the job
from the in-memory state in the registry. This PR fixes that oversight
by deferring that call and allowing jobs to be retried when writing that
trace ID fails.

Release note (bug fix): Fixed a bug in jobs where failures to write to the
jobs table can prevent subsequent adoption of a job until the previous
node dies or the job is paused.
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Jul 16, 2021
craig bot pushed a commit that referenced this pull request Jul 16, 2021
67671: jobs: fix bug where failure to record the trace ID prevents adoption, remove RTT r=ajwerner a=ajwerner

In #65322 we added code to record a trace ID into job progress. This code
added a new error return path to runJob which did not unregister the job
from the in-memory state in the registry. This PR fixes that oversight
by deferring that call and allowing jobs to be retried when writing that
trace ID fails.

**jobs: only write the trace ID into jobs if nonzero**
We don't trace most jobs. It's an extra write to the jobs table for each
run to just mark it as 0.

Release note (performance improvement): Eliminated a round-trip when running
most jobs.

Release note (bug fix): Fixed a bug in jobs where failures to write to the
jobs table can prevent subsequent adoption of a job until the previous
node dies or the job is paused.

Co-authored-by: Andrew Werner <[email protected]>
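For reference, the rough shape of that follow-up fix (an illustrative sketch, not the exact diff):

	// Unregister the job from the registry's in-memory state however runJob
	// returns, including the new error path added when persisting the trace ID
	// fails. (Sketch only; the real method name may differ.)
	defer r.unregister(jobID)

	// Avoid an extra write to the jobs table for untraced jobs: only persist
	// the trace ID when it is nonzero.
	if traceID := span.TraceID(); traceID != 0 {
		if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
			md.Progress.TraceID = traceID // trace_id now lives on the progress, per #65454
			ju.UpdateProgress(md.Progress)
			return nil
		}); err != nil {
			return err
		}
	}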