jobs: surface the trace_id linked to job execution #65322
Conversation
@irfansharif adding you as a reviewer mainly for the first commit that changes the nature of the span associated with a job.
Hmm, the CI failure indicates that I have to ...
I think I just scared myself because of a unit test that was mocking out a job record and not going through the registry. We should not be leaking any spans. For posterity:
This is RFAL!
Nice, LGTM. This seems like it will give some nice immediate returns! 😎 I don't think I have enough context re: the new root spans being created in the 1st commit, but curious to hear @irfansharif's thoughts.
Reviewed 7 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @irfansharif, and @pbardea)
pkg/jobs/adopt.go, line 258 at r1 (raw file):
// If the job is being resumed by a registry, AnnotateCtxWithSpan will create // a new root span as ctx is not annotated with a parent span. ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())
More a curiosity question - can you link me to any resources that can help me understand the scenarios & semantics surrounding "resumed" jobs?
As discussed offline, I'd be curious to see what this PR would look like if we created a root span only when running the job (both the very first time and during resumption), instead of creating the span when the job is first created and not using it until execution. That way, when running the job, we can unconditionally create a root span instead of "check if a span exists; if so, use it; if not, create a new one because we're being resumed".
Also, from our offline discussions, the first commit message wants to be changed. Forked spans don't have to propagate payloads back to the parent span. Everything else around storing the trace ID (and updating it on resumption) for future introspectability sounds sane to me. I'm excited for us to flesh out the primitives to read in-flight trace data from across the cluster, instead of the simpler primitive we have today that only renders a full recording of the trace's remote children after it's done.
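A minimal sketch of that suggestion, assuming the tracing helpers already quoted in this review (`StartSpanCtx`, `WithForceRealSpan`); the wrapper name and `run` callback are placeholders, not the PR's actual code:

```go
package jobs

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// runWithRootSpan is a sketch: every execution of a job, whether it's the
// first run or a resumption, gets its own root span, so there is no
// "reuse the creation-time span if it exists" branch to maintain.
func runWithRootSpan(
	ctx context.Context, tracer *tracing.Tracer, spanName string,
	run func(context.Context) error,
) error {
	ctx, span := tracer.StartSpanCtx(ctx, spanName, tracing.WithForceRealSpan())
	defer span.Finish()
	return run(ctx)
}
```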
pkg/jobs/adopt.go (outdated)

	defer span.Finish()
	if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata,
		ju *JobUpdater) error {
		md.Payload.TraceId = span.TraceID()
This proto field should be named TraceID, there's a gogoproto override that lets you do that.
nice catch, done.
pkg/jobs/jobs.go (outdated)

	// FinishSpan will mark the span linked to the StartableJob as Finish()'d.
	func (sj *StartableJob) FinishSpan() {
		if sj.span != nil {
The nil check is unnecessary; it's fine to call .Finish on a nil span. Inlining sj.span.Finish at call sites seems fine to me.
cockroach/pkg/util/tracing/span.go
Line 79 in ab73621
if sp == nil || sp.i.isNoop() || atomic.AddInt32(&sp.numFinishCalled, 1) != 1 {
Makes sense. I deleted the method, though, since we're no longer bookkeeping a span on job creation.
pkg/jobs/registry.go (outdated)

		resumerCtx = tracing.ContextWithSpan(resumerCtx, span)
	}
	resumerCtx, span = r.settings.Tracer.StartSpanCtx(resumerCtx, "job",
		tracing.WithForceRealSpan())
How often do these jobs execute? How many active jobs are typically present in any running cluster under stable operation? Real spans have a tiny amount of overhead, and before this PR they're only created for sampled statements. Just sanity checking that this isn't in the hot path or anything.
Along this line, one idea is to give jobs a verbosity level of their own that they can declare when creating their job record. That way particular types of jobs can opt-in to be traced (e.g. we're pretty confident there won't be a crazy amount of backup/restore jobs so those jobs can always enable tracing).
This is a good idea, I haven't added it in yet because I want to flesh this out a little.
A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.
So I'm inclined to allow jobs to opt-in to being a real, root span, but not allowing them to choose between verbose/non-verbose. What do you think?
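To make the three modes concrete, here is a rough sketch (same imports as the earlier sketch). `StartSpanCtx` and `WithForceRealSpan` appear elsewhere in this review; `SetVerbose` is an assumption about how a span would be upgraded to verbose recording:

```go
// startJobSpan sketches the opt-in being discussed: a noop span by default, a
// real span for traceable jobs, and (optionally) verbose recording on top.
// SetVerbose is an assumed helper; the rest mirrors snippets in this review.
func startJobSpan(
	ctx context.Context, tracer *tracing.Tracer, name string, traced, verbose bool,
) (context.Context, *tracing.Span) {
	if !traced {
		// Noop span: records nothing unless tracing is globally enabled.
		return tracer.StartSpanCtx(ctx, name)
	}
	// Real span: can record structured payloads even when non-verbose.
	ctx, sp := tracer.StartSpanCtx(ctx, name, tracing.WithForceRealSpan())
	if verbose {
		// Verbose real span: additionally keeps free-form log messages.
		sp.SetVerbose(true)
	}
	return ctx, sp
}
```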
That sounds good; my mention of jobs having a "verbosity" was to err on the side of making that flag an enum in the proto rather than a bool, so that we can adjust later on if we want to add more fine-grained control (if it was to be something we put in the proto -- I could imagine a future where some jobs might want to be sampled and all executions of that job would want to be traced), but I'm +1 on a traceable job interface like Yevgeniy suggested below.
Updated to only create a root span when we start job execution. Also updated the first commit message to reflect this.
I'm not sure we have any docs around the specifics of job resumption. We have an old RFC on the jobs system, though I think it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job when created is recorded once in the `system.jobs` table, following which it is an acceptable candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the `system.jobs` table for adoptable jobs. When a registry adopts a job, it recreates the job from the information stored during creation, and begins execution. A job can subsequently be `PAUSED` and `RESUMED` multiple times, and this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job/was previously running the job will be the one to pick up the execution of the job on unpause.

A job might also be a candidate for resumption if the node running the job crashes. In this case, the current execution of the job will fail but will not mark the job entry as failed, thereby allowing other idle nodes to pick up its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!
Reviewed 2 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @irfansharif, and @pbardea)
pkg/jobs/registry.go, line 540 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
This is a good idea, I haven't added it in yet because I want to flesh this out a little.
A span can either be a noop (this is the default unless tracing is on) or a "real" span. A noop span records nothing. A "real" span can further be in verbose mode or not. A non-verbose, real span allows for recording structured payloads but discards all free form logging. A verbose, real span allows for all forms of recording. In the initial google doc, knz had suggested we stick to structured payloads for tracing job execution.
So I'm inclined to allow jobs to opt-in to being a real, root span, but not allowing them to choose between verbose/non-verbose. What do you think?
@adityamaru FYI: opting into being traced is easy. I would just add a marker interface to jobs:
type TraceableJob interface {}
You can even add some methods there too (like: I want to be traced verbose or some such).
Have the registry test whether the resumer also implements this interface and create spans appropriately.
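For reference, a sketch of that marker interface and how the registry might consult it; the ForceRealSpan method matches the shape quoted later in this thread (pkg/jobs/jobs.go, line 100 at r3), while the registry-side snippet is a simplified illustration, not the PR's actual code:

```go
// TraceableJob is the marker interface suggested above: a Resumer implements
// it to opt in to being traced with a real root span.
type TraceableJob interface {
	// ForceRealSpan forces the registry to create a real span instead of the
	// low-overhead non-recordable noop span.
	ForceRealSpan() bool
}

// Simplified registry-side check (illustration only):
//
//	if tj, ok := resumer.(TraceableJob); ok && tj.ForceRealSpan() {
//		ctx, span = r.settings.Tracer.StartSpanCtx(ctx, spanName, tracing.WithForceRealSpan())
//	} else {
//		ctx, span = r.settings.Tracer.StartSpanCtx(ctx, spanName)
//	}
//	defer span.Finish()
```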
The use of pkg/tracing in this way LGTM.
pkg/jobs/adopt.go (outdated)

@@ -248,8 +248,23 @@ func (r *Registry) runJob(
	defer cleanup()
	spanName := fmt.Sprintf(`%s-%d`, typ, job.ID())
	var span *tracing.Span
	ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName)

	// Wrap the context with a root span that is tied to the execution of the
I'd drop the "Wrap the context" verbiage; let's just say "Create a new root span to trace the execution of the current instance of job...".
done.
pkg/util/log/ambient_context.go (outdated)

@@ -158,7 +158,7 @@ func (ac *AmbientContext) annotateCtxInternal(ctx context.Context) context.Context
 //
 // The caller is responsible for closing the span (via Span.Finish).
 func (ac *AmbientContext) AnnotateCtxWithSpan(
-	ctx context.Context, opName string,
+	ctx context.Context, opName string, os ...tracing.SpanOption,
The diffs in these files can be reverted.
done.
LGTM mod the discussion about opting into this.
Also, should there be a release note somewhere?
@pbardea added Yevgeniy's suggestion for opting in - https://github.com/cockroachdb/cockroach/pull/65322/files#diff-651e5b9db62dc86f9ef2a2d1937717a587117d43e9a589db3a43531817f2ca3bR257
In the future, we can add a ForceVerboseSpan if we deem it necessary. re: release note, I added one to the second commit about our new column in `SHOW JOBS`.

edit: oops, I think I broke all the tests with an NPE.
Previously, a job on creation would create a new context and wrap it with a "forked span" from the parent ctx. This meant that the span would inherit the traceID from its remote parent. When the job began executing, we would create yet another span. This span would be a child of the "forked span" mentioned above, if the job was being started on the same registry that created it. If the job was being resumed by a new registry, we would create a root span.

This change deletes the span created during job creation, and always creates a root span when the job begins executing. We did not use the span created on job creation for anything, and therefore this seems like an appropriate simplification. Creating a root span allows us to identify only those spans linked to job execution using the trace_id assigned to this root span. This will be helpful when we build tooling to improve job observability that will pull all the relevant spans.

This change also stores the trace_id associated with the current job resumer in the job's payload. This allows us to use it at a later time to inspect what a job is doing at a given point in time!

Release note: None
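Pieced together from the snippets quoted in this review, the flow the commit message describes looks roughly like the following inside runJob (a sketch: error handling and the exact Update signature are abbreviated):

```go
// Create a root span for this execution of the job (first run or resumption).
ctx, span = r.ac.AnnotateCtxWithSpan(ctx, spanName, tracing.WithForceRealSpan())
defer span.Finish()

// Persist the root span's trace ID so later tooling can collect every span
// belonging to this execution of the job.
if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
	md.Payload.TraceID = span.TraceID()
	ju.UpdatePayload(md.Payload)
	return nil
}); err != nil {
	return err
}
```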
This change surfaces the trace_id associated with the current execution of the job via the crdb_internal.jobs table and as a column in `SHOW JOBS`. This will allow users to pull the inflight traces for a job for debugging purposes.

Release note (sql change): `SHOW JOBS` now shows the `trace_id` of the trace that is associated with the current execution of the job. This allows pulling inflight traces for a job for debugging purposes.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)
pkg/jobs/adopt.go, line 258 at r1 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
I'm not sure we have any docs around the specifics of job resumption. We have an old RFC on the jobs system, though I think it has undergone fairly significant changes since then. In general, job creation and job execution are independent in the jobs infrastructure. A job when created is recorded once in the `system.jobs` table, following which it is an acceptable candidate for adoption. Every node in the cluster runs a registry, and one of the processes in this registry polls the `system.jobs` table for adoptable jobs. When a registry adopts a job, it recreates the job from the information stored during creation, and begins execution. A job can subsequently be `PAUSED` and `RESUMED` multiple times, and this works by effectively toggling the adoptability of the job. There is no guarantee that the node that created the job/was previously running the job will be the one to pick up the execution of the job on unpause.

A job might also be a candidate for resumption if the node running the job crashes. In this case, the current execution of the job will fail but will not mark the job entry as failed, thereby allowing other idle nodes to pick up its execution.

I'm happy to talk more about the job subsystem offline, if you have any questions!
Makes sense - thanks for the clear explanation, much appreciated
TFTRs! bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @adityamaru, @irfansharif, and @miretskiy)
pkg/jobs/adopt.go, line 268 at r3 (raw file):
	ju *JobUpdater) error {
	md.Payload.TraceID = span.TraceID()
	ju.UpdatePayload(md.Payload)
Should this be progress?
Update: as per offline discussion we'll change this in a follow-up
pkg/jobs/jobs.go, line 100 at r3 (raw file):
// ForceRealSpan forces the registry to create a real Span instead of a
// low-overhead non-recordable noop span.
ForceRealSpan() bool
nit, take it or leave it: it looks like every implementation of this is going to return `true`, so we could even drop the return value here -- implementing the interface would itself indicate that we want a real span/traceable job.
Build succeeded:
good point, addressed over in #65454.
The trace_id associated with a job is updated on every resumption of the job. If the job is paused and resumed several times, we don't want to rewrite the payload every single time. Thus, we move the field to the job progress, which is expected to be rewritten frequently. Also addresses a comment from cockroachdb#65322.

Release note: None
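Assuming UpdateProgress mirrors the UpdatePayload call shown earlier in this review, the follow-up amounts to something like this sketch:

```go
// Sketch of the follow-up in #65454: record the trace ID in the progress,
// which is rewritten on every resumption anyway, instead of the payload.
// UpdateProgress is assumed to be the progress counterpart of UpdatePayload.
if err := job.Update(ctx, nil /* txn */, func(txn *kv.Txn, md JobMetadata, ju *JobUpdater) error {
	md.Progress.TraceID = span.TraceID()
	ju.UpdateProgress(md.Progress)
	return nil
}); err != nil {
	return err
}
```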
62719: kvserver: simplify and document handling of ReplicaPlaceholder r=erikgrinaker a=tbg

This commit cleans up the handling of ReplicaPlaceholder and documents their semantics better. I embarked on this while looking into an unrelated test flake. It occurred to me that we were not installing placeholders before the snapshot transfer, but only after. This seems like a bad idea now that our average snapshot size has gone up, as placeholders help avoid duplicate work; we now install them before accepting the range data. In practice, we currently only allow one snapshot inflight per store, so it is not clear that that improvement is buying us anything. I think that the improved clarity in this commit will, though.

Concretely, we now have two simple rules for placeholders:

1. they only exist for uninitialized replicas (and must exist for those, or we can end up with overlapping replicas)
2. you write it --> you (and *only* you) remove it.

1 was true before but the documentation was not clear. 2 was not true, as there were a few out-of-band places that removed placeholders (and didn't clearly explain why - sounded like we were previously leaking placeholders, maybe @ajwerner remembers?).

There's also a bit of extra cleanup that clarifies things - for example, since the caller to `applySnapshot` already checked for an empty snapshot, the placeholder removal within it was effectively dead code and is now removed. Also, in receiveSnapshot, there was an optimistic check that allowed us to discard snapshots early. It was always confusing to keep track of which one is which, and now that we've lifted the authoritative check to the top of receiveSnapshot, we got to delete the other one.

Release note: None

65454: jobs: move trace_id from payload to progress r=pbardea a=adityamaru

The trace_id associated with a job is updated on every resumption of the job. If the job is paused and resumed several times, we don't want to rewrite the payload every single time. Thus, we move the field to the job progress, which is expected to be rewritten frequently. Also addresses a comment from #65322.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
In cockroachdb#65322 we added code to record a trace ID into job progress. This code added a new error return path to runJob which did not unregister the job from the in-memory state in the registry. This PR fixes that oversight by deferring that call and allowing jobs to be retried when writing that trace ID fails.

Release note (bug fix): Fixed a bug in jobs where failures to write to the jobs table can prevent subsequent adoption of a job until the previous node dies or the job is paused.
67671: jobs: fix bug where failure to record the trace ID prevents adoption, remove RTT r=ajwerner a=ajwerner

In #65322 we added code to record a trace ID into job progress. This code added a new error return path to runJob which did not unregister the job from the in-memory state in the registry. This PR fixes that oversight by deferring that call and allowing jobs to be retried when writing that trace ID fails.

**jobs: only write the trace ID into jobs if nonzero**

We don't trace most jobs. It's an extra write to the jobs table for each run to just mark it as 0.

Release note (performance improvement): Eliminated a round-trip when running most jobs.

Release note (bug fix): Fixed a bug in jobs where failures to write to the jobs table can prevent subsequent adoption of a job until the previous node dies or the job is paused.

Co-authored-by: Andrew Werner <[email protected]>
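Roughly, the fix described above makes the new error path clean up the in-memory registration; in this sketch, writeTraceID and unregister are stand-in names for illustration, not the actual helpers:

```go
// Before (sketch): returning early here left the job claimed in the
// registry's in-memory state, so no other node (or later attempt) could
// adopt it.
//
//	if err := writeTraceID(ctx, job, span); err != nil {
//		return err
//	}
//
// After (sketch): unregister via defer so every exit path, including the new
// one, releases the in-memory claim and the job can be retried or adopted.
defer r.unregister(job.ID())
if err := writeTraceID(ctx, job, span); err != nil {
	return err
}
```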
Please see individual commit messages.
Informs: #64992