-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: aggregate OperationMetadata on span Finish() #81079
tracing: aggregate OperationMetadata on span Finish() #81079
Conversation
fda6549
to
ec175fc
Compare
I'd review #81020 first, and then I'll rebase this on top of that. |
ec175fc
to
c6667ea
Compare
c6667ea
to
2e263fb
Compare
2e263fb
to
e522300
Compare
1487c26
to
607e18a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @benbardin, and @stevendanna)
-- commits
line 15 at r3:
"its" what?
pkg/util/tracing/crdbspan.go
line 128 at r3 (raw file):
// children in its Recording. childrenMetadata therefore provides a bucketed // view of the various operations that are being traced as part of a span. childrenMetadata map[string]tracingpb.RecordedSpan_OperationMetadata
does this belong in recording
? (i.e., it's only populated when the span is recording, right?)
pkg/util/tracing/crdbspan.go
line 543 at r3 (raw file):
// finished_D // // We'd expect `parent` to have:
s/we'd expect parent to have/parent will have...
pkg/util/tracing/crdbspan.go
line 547 at r3 (raw file):
rootChild := childRecording[0] // Record the finished rootChilds' metadata. s.mu.childrenMetadata[rootChild.Operation] = s.mu.childrenMetadata[rootChild.Operation].Combine(
Instead of all this, I think it'd be better if recordings always included the ChildrenMetadata
. Have you considered having GetRecording()
do all this aggregation?
pkg/util/tracing/crdbspan.go
line 555 at r3 (raw file):
// Record the metadata of rootChilds' finished children. // // ChildrenMetadata was populated in GetRecording(...) with rootChilds'
I think this comment lies. GetRecording()
doesn't seem to have a role here, does it?
pkg/util/tracing/tracingpb/recorded_span.proto
line 112 at r2 (raw file):
reserved 5,10,11; // Next ID: 19
nit: I don't see this elsewhere around here. I don't think it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 547 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Instead of all this, I think it'd be better if recordings always included the
ChildrenMetadata
. Have you considered havingGetRecording()
do all this aggregation?
Every tracingpb.RecordedSpan in a Recording will always have its ChildrenMetadata
populated in GetRecording()
via the call to getRecordingNoChildrenLocked
. This represents the spans' Finish()ed children. When the span is finishing we need to do more than just that:
- We need to create an entry for the finishing span itself, which is what we are doing here.
- We need to recurse on the finishing spans' open children and create entries for them and their children.
I don't think it makes sense to do 1) in GetRecording
because I do not expect the span whose recording I am getting to be part of the ChildrenMetadata
map I am populating in the recording. That is why we do it here when we are finishing the span and updating its parent.
- is done during GetRecording for Structured because of the fact that we only return 1 tracingpb.RecordedSpan. I'm not convinced it makes sense to do it in the case of GetRecording for Verbose since we do return all tracingpb.RecordedSpans in the Recording, so we can do the recursion on render/finish. This seems more in line with our definition of the two modes?
pkg/util/tracing/crdbspan.go
line 555 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think this comment lies.
GetRecording()
doesn't seem to have a role here, does it?
getRecordingNoChildrenLocked
which is called from GetRecording in the Verbose mode is responsible for populating ChildrenMetadata
in each tracingpb.RecordedSpan. It is populated with all the Finish()ed children of the span at the time the recording is pulled.
607e18a
to
0379b5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @benbardin, and @stevendanna)
Previously, andreimatei (Andrei Matei) wrote…
"its" what?
its Recording - fixed
pkg/util/tracing/crdbspan.go
line 128 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
does this belong in
recording
? (i.e., it's only populated when the span is recording, right?)
Done.
pkg/util/tracing/crdbspan.go
line 543 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/we'd expect parent to have/parent will have...
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 547 at r3 (raw file):
I don't think it makes sense to do 1) in GetRecording because I do not expect the span whose recording I am getting to be part of the ChildrenMetadata map I am populating in the recording. That is why we do it here when we are finishing the span and updating its parent.
This part makes sense.
2 is done during GetRecording for Structured because of the fact that we only return 1 tracingpb.RecordedSpan. I'm not convinced it makes sense to do it in the case of GetRecording for Verbose since we do return all tracingpb.RecordedSpans in the Recording, so we can do the recursion on render/finish. This seems more in line with our definition of the two modes?
I think verbose and structured recordings should behave the same with respect to returning the childrenMetadata
; I think they should both return it, and they should both include the stats about open children. Even though one can recurse through a verbose recording and compute whatever they want, I think it'd be very useful that this standardized childrenMeta
is always there (by the argument that a verbose recording can be processed by the caller, you wouldn't need to include the stats of the finished spans either). Otherwise, I'm pretty sure the difference in this respect between structured recordings and verbose recordings is going to surprise us every time.
Also, I think the distinction you're trying to draw about open children and finished children is inconsistent. In a structure like the following:
root
c1
c11
c2
if c1
finishes, and I collect root's
(verbose) recording, I get: {c11, unfinished:true}, {c1}
. So, I get info about some unfinished children (c11) but not others (c2). That's surprising!
WDYT?
Not to mention that, if we make verbose recordings include the stats for open children, the code in this function should be uniform regardless of the recording mode. As it stands, there's too much code :P. I think we should aim for this code to only need to look at rootChild
and not the other spans.
pkg/util/tracing/crdbspan.go
line 555 at r3 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
getRecordingNoChildrenLocked
which is called from GetRecording in the Verbose mode is responsible for populatingChildrenMetadata
in each tracingpb.RecordedSpan. It is populated with all the Finish()ed children of the span at the time the recording is pulled.
I see.
pkg/util/tracing/crdbspan.go
line 178 at r4 (raw file):
// children in its Recording. childrenMetadata therefore provides a bucketed // view of the various operations that are being traced as part of a span. childrenMetadata map[string]tracingpb.RecordedSpan_OperationMetadata
since we're using this proto in code that otherwise does not have much to do with RecordedSpan
, let's make the OperationMetadata
be defined as a top-level message, instead of a sub-message inside of RecordedSpan
pkg/util/tracing/crdbspan.go
line 573 at r4 (raw file):
// been flattened into `childRecording` by `GetRecording(...)` for _, rec := range childRecording[1:] { if !rec.Finished {
see, I think this if !rec.Finished
is really ugly. Ain't nobody gonna understand what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 547 at r3 (raw file):
That's surprising!
Indeed, you're right. It makes sense to push this logic into GetRecording for both Verbose and Structured. I've changed the PR to reflect this.
pkg/util/tracing/crdbspan.go
line 555 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I see.
Done.
pkg/util/tracing/crdbspan.go
line 178 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
since we're using this proto in code that otherwise does not have much to do with
RecordedSpan
, let's make theOperationMetadata
be defined as a top-level message, instead of a sub-message inside ofRecordedSpan
Done.
pkg/util/tracing/crdbspan.go
line 573 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
see, I think this
if !rec.Finished
is really ugly. Ain't nobody gonna understand what's going on.
agreed, moved this logic to GetRecording.
0379b5c
to
b31ca53
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider including the new stats in Recording.String()
. If it turns out to be overwhelming, we can back it out, or trim it, or something. But I'd be curious to see how the full thing looks for starters.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, @benbardin, and @stevendanna)
e94c3ec
to
4da1e18
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this yup, I was going to work on making this usable from the built-in that pulls a recording as the next PR. As part of that I'll also stick this in the stringer.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 492 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: join these two paragraphs into one, to make it clear that they're in contrast to one another. Otherwise, I don't think the first one will mean much to readers.
I think you were referring to the two comments. I've deleted one of them since it wasn't really adding much.
pkg/util/tracing/crdbspan.go
line 496 at r5 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Looks better!
Done.
pkg/util/tracing/crdbspan.go
line 430 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Instead of taking this lock, I think it'd be better to take the data you need from
openChildRecording[0]
. It's all there, right?
Done.
pkg/util/tracing/crdbspan.go
line 443 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can't this be
childrenMetadata[opName] = childrenMetadata[opName].Combine(metadata)
?
If so, consider extracting this loop into a function, called perhaps
RollupChildrenMeta(childrenMeta, openChildRecording[0].ChildrenMetadata)
, cause you have the same loop elsewhere and I think it'd benefit from a name.
Done.
pkg/util/tracing/crdbspan.go
line 457 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/from s' children/from s' open children
I scratched my head for a bit trying to understand if the
Combine
calls below are necessary
childrenMetadata contains entries from both open and finished children, so here we're just rolling up all those entries into the root of the recording.
pkg/util/tracing/crdbspan.go
line 495 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of adding this new
includeRootStructuredEvents
param, I think I'd rather keepgetChildrenStructuredEventsRecursivelyLocked
(and rename it back togetStructuredEventsRecursivelyLocked
), always including the root, and assignres.StructuredRecords = buffer
below. And then the whole bit below about managing the space inres.StructuredRecords
goes away. I think the only thing that we lose is wasted work copying the structured events ingetRecordingNoChildrenLocked
, but I don't think that's a big deal.I'd write this as
res := s.getRecordingNoChildrenLocked( tracingpb.RecordingStructured, false, // finishing - since we're only asking for the structured recording, the argument doesn't matter ) res.StructuredRecords = s.getStructuredEventsRecursivelyLocked(res.StructuredRecords[:], includeDetachedChildren)
hmm res.StructuredRecords = s.getStructuredEventsRecursivelyLocked(res.StructuredRecords[:], includeDetachedChildren)
would double count the structured records in the root won't it? res.StructuredRecords[:]
would already have the events for the root, and then we'd again append the roots' events to that slice.
I've changed it to take in a buffer instead and then deleted the entire block around memory management below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 457 at r6 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
childrenMetadata contains entries from both open and finished children, so here we're just rolling up all those entries into the root of the recording.
oh so then can we replace this whole block with rootSpan.ChildrenMetada = childrenMetadata
?
pkg/util/tracing/crdbspan.go
line 495 at r6 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
hmm
res.StructuredRecords = s.getStructuredEventsRecursivelyLocked(res.StructuredRecords[:], includeDetachedChildren)
would double count the structured records in the root won't it?res.StructuredRecords[:]
would already have the events for the root, and then we'd again append the roots' events to that slice.I've changed it to take in a buffer instead and then deleted the entire block around memory management below.
ugh, sorry, I meant res.StructuredRecords[:0]
. I think you can use that to avoid allocating buffer
.
pkg/util/tracing/crdbspan.go
line 411 at r7 (raw file):
} s.mu.Lock()
nit: put everything until the unlock in a { }
block for visibility, now that the block has gotten pretty large.
pkg/util/tracing/crdbspan.go
line 480 at r7 (raw file):
defer s.mu.Unlock() // Fetch s' recording without including its children.
nit: this comment doesn't add anything to the function name
pkg/util/tracing/crdbspan.go
line 494 at r7 (raw file):
// open. childrenMetadata := make(map[string]tracingpb.OperationMetadata) s.getChildrenMetadataRecursivelyLocked(childrenMetadata,
can we pass in res.ChildrenMetadata
here, and get rid of childrenMetadata
4da1e18
to
e78b1e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @benbardin, and @stevendanna)
pkg/util/tracing/crdbspan.go
line 457 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
oh so then can we replace this whole block with
rootSpan.ChildrenMetada = childrenMetadata
?
Done.
pkg/util/tracing/crdbspan.go
line 495 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
ugh, sorry, I meant
res.StructuredRecords[:0]
. I think you can use that to avoid allocatingbuffer
.
Done.
pkg/util/tracing/crdbspan.go
line 494 at r7 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can we pass in
res.ChildrenMetadata
here, and get rid ofchildrenMetadata
Done.
I'm seeing the CI failure on other builds in staging too https://teamcity.cockroachdb.com/viewLog.html?buildId=5547912&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_CheckGeneratedCode. I'll bring it up with whoever owns web UI tests but I don't think it is related to this diff. TFTR! bors r=andreimatei |
Build failed: |
bors r=andreimatei |
Build failed: |
This change adds a `ChildrenMetadata` map to `crdbspan` that is a mapping from operation to the operations' aggregated metadata. This map is updated whenever a child of the `crdbSpan` finishes, with metadata from all spans in the finishing childs' Recording. The map is therefore a bucketed view of all the operations being traced by a span. The motivation for this change is to surface more metadata about the suboperations being traced in a spans' Recording. This could in turn provide more o11y into why a job is slow/stuck, or where the performance of a distributed operation is bottlenecked. As part of a span Finish()ing, the span fetches its Recording with the spans' configured verbosity. Prior to this change the recording would then be processed as follows: *Verbose Recording* In the case of Verbose recording the spans in the recording are added to the parents' `finishedChildren` slice provided we have not exceeded the maximum number of children a parent can track. *Structured Recording* In the case of a Structured recording, only the StructuredEvents from the spans in the recording are copied into the parent. With this change, in both the Verbose and Structured recording mode, a finishing span is also responsible for rolling up the OperationMetadata of all the spans in its recording. This involves updating the parents' `childrenMetadata` mapping with: 1) an entry for the finishing span. 2) an entry for each of the finishing spans' Finish()ed children. 3) an entry for each of the finishing spans' open children, and their children recursively The logic for 2) and 3) is subsumed in the method responsible for getting the finishing spans' recording. Notably, GetRecording(...) for both Structured and Verbose recordings, populate the root of the recording with OperationMetadata of all finished and open children in the recording. As an example when we are done finishing `child`: ``` parent child grandchild ``` We'd expect `parent` to have: `{child: 2s, grandchild: 1s}` Given that Finish()ing a child, and importing a remote recording into a span share the same code path, the above semantics also apply to a remote recording being imported into a parent span. Fixes: cockroachdb#80391 Release note: None
e78b1e4
to
4ddc350
Compare
Now its TestDockerCLI acting up 😢 , let's try again. bors r=andreimatei |
Build failed: |
Another flake, this time on #83253 bors r=andreimatei |
Build succeeded: |
When the children of a finished span won't fit in a parent with a verbose recording type, we don't store the children but still want their structured records. Unfortunately, we did this in the recordFinishedChildredLocked case by falling through from the verbose recording case to the structured recording case. In cockroachdb#81079, we added an assertion in the structured recording case that only applies to structured recordings. The result is Finish() failing with: panic: RecordingStructured has 12 recordings; expected 1 Now, we don't fall through and instead gather the structured records of all children. Release note: None
In cockroachdb#81079, we added an assertion that failed if a child recording of a RecordingStructured span had more than one span recording. However, this is problematic for couple of reasons: 1) The assertion was on a code path shared with RecordingVerbose spans; and, 2) A RecordingStructured span can have a RecordingVerbose child. The RecordingVerbose child is likely to have more than one span recording. As a result, we have seen roachtests failing with panic: RecordingStructured has 12 recordings; expected 1 Here, we remove the assertion. Release note: None
In cockroachdb#81079, we added an assertion that failed if a child recording of a RecordingStructured span had more than one span recording. However, this is problematic for couple of reasons: 1) The assertion was on a code path shared with RecordingVerbose spans; and, 2) A RecordingStructured span can have a RecordingVerbose child. The RecordingVerbose child is likely to have more than one span recording. As a result, we have seen roachtests failing with panic: RecordingStructured has 12 recordings; expected 1 Here, we remove the assertion. Release note: None
83625: tracing: remove panic in Finish r=adityamaru, a=stevendanna In #81079, we added an assertion that failed if a child recording of a RecordingStructured span had more than one span recording. However, this is problematic for couple of reasons: 1) The assertion was on a code path shared with RecordingVerbose spans; and, 2) A RecordingStructured span can have a RecordingVerbose child. The RecordingVerbose child is likely to have more than one span recording. As a result, we have seen roachtests failing with panic: RecordingStructured has 12 recordings; expected 1 Here, we remove the assertion. Fixes #83502 Release note: None 83642: server/systemconfigwatcher/systemconfigwatchertest: fix race r=ajwerner a=ajwerner Fixes #83258 ``` ================== WARNING: DATA RACE Write at 0x00c001b75c00 by goroutine 264: github.com/cockroachdb/cockroach/pkg/roachpb.KeyValueByKey.Swap() github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/data.go:2400 +0x190 github.com/cockroachdb/cockroach/pkg/roachpb.(*KeyValueByKey).Swap() <autogenerated>:1 +0xb3 sort.medianOfThree() GOROOT/src/sort/sort.go:90 +0x72 sort.doPivot() GOROOT/src/sort/sort.go:114 +0x9e sort.quickSort() GOROOT/src/sort/sort.go:203 +0xac sort.Sort() GOROOT/src/sort/sort.go:231 +0x64 github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest.runTest.func2() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest/test_system_config_watcher.go:110 +0x4d1 github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest.runTest.func3.1() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest/test_system_config_watcher.go:118 +0x4e github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError.func1() github.com/cockroachdb/cockroach/pkg/testutils/soon.go:69 +0x7a github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration() github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:207 +0x191 github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError() github.com/cockroachdb/cockroach/pkg/testutils/soon.go:75 +0x1d1 github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin() github.com/cockroachdb/cockroach/pkg/testutils/soon.go:57 +0x96 github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon() github.com/cockroachdb/cockroach/pkg/testutils/soon.go:40 +0x8d github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest.runTest.func3() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest/test_system_config_watcher.go:117 +0x164 github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest.runTest() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest/test_system_config_watcher.go:121 +0x5cf github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest.TestSystemConfigWatcher.func1() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/systemconfigwatchertest/test_system_config_watcher.go:62 +0xdc testing.tRunner() GOROOT/src/testing/testing.go:1259 +0x22f testing.(*T).Run·dwrap·21() GOROOT/src/testing/testing.go:1306 +0x47 Previous read at 0x00c001b75c00 by goroutine 502: runtime.slicecopy() GOROOT/src/runtime/slice.go:284 +0x0 github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedbuffer.MergeKVs() github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedbuffer/kvs.go:47 +0x354 github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher.(*Cache).handleUpdate() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/cache.go:253 +0x431 github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher.(*Cache).handleUpdate-fm() github.com/cockroachdb/cockroach/pkg/server/systemconfigwatcher/cache.go:232 +0x11e github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache.(*Watcher).handleUpdate() github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache/watcher.go:346 +0x277 github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache.(*Watcher).Run() github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache/watcher.go:327 +0x1b04 github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache.Start.func1() github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed/rangefeedcache/watcher.go:189 +0x3d1 github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2() github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x551 ``` Clone the values to make sure we don't have a race. Release note: None Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
This change adds a
ChildrenMetadata
map tocrdbspan
that is amapping from operation to the operations' aggregated metadata. This
map is updated whenever a child of the
crdbSpan
finishes, with metadatafrom all spans in the finishing childs' Recording. The map is therefore
a bucketed view of all the operations being traced by a span.
The motivation for this change is to surface more metadata about the
suboperations being traced in a spans' Recording. This could in turn provide
more o11y into why a job is slow/stuck, or where the performance of a
distributed operation is bottlenecked.
As part of a span Finish()ing, the span fetches its Recording with the spans'
configured verbosity. Prior to this change the recording would then be
processed as follows:
Verbose Recording
In the case of Verbose recording the spans in the recording are added to the parents'
finishedChildren
slice provided we have not exceeded the maximum number ofchildren a parent can track.
Structured Recording
In the case of a Structured recording, only the StructuredEvents from the spans in
the recording are copied into the parent.
With this change, in both the Verbose and Structured recording mode, a finishing span
is also responsible for rolling up the OperationMetadata of all the spans in its
recording. This involves updating the parents'
childrenMetadata
mapping with:The logic for 2) and 3) is subsumed in the method responsible for getting
the finishing spans' recording. Notably, GetRecording(...) for both Structured and Verbose
recordings, populate the root of the recording with OperationMetadata of all
finished and open children in the recording.
As an example when we are done finishing
child
:We'd expect
parent
to have:{child: 10s, finished_C: 4s, finished_D: 3s, open_A: 3s, finished_B: 1s}
Given that Finish()ing a child, and importing a remote recording into a span
share the same code path, the above semantics also apply to a remote recording
being imported into a parent span.
Fixes: #80391
Release note: None