bulk: implement a bulk aggregator that uses tracing to surface information about suboperations #80388
Labels: A-disaster-recovery, C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-disaster-recovery

Comments
adityamaru added the C-enhancement and A-disaster-recovery labels on Apr 22, 2022
cc @cockroachdb/bulk-io
adityamaru added a commit to adityamaru/cockroach that referenced this issue on May 14, 2022
This change introduces an Aggregator that can be registered as an EventListener and LazyTag on a tracing span. The Aggregator will be notified of every StructuredEvent emitted in the span's recording, and will subsume the event into a running aggregate stored in memory. This running aggregate will be rendered on the span's recording as a LazyTag that can be consumed via the `tracez` page on the debug console.

Currently, the backup processor is the only bulk processor that creates and registers an Aggregator. As an initial prototype, this change teaches every ExportRequest to emit ExportStats that will be aggregated by the Aggregator.

Note that this change also removes various StructuredEvents that were emitted by the backup processor but never proved useful. These were added as part of the `cockroach debug job-trace` command, which we hope to replace with this Aggregator.

Informs: cockroachdb#80388

Release note: None
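To make the shape of this concrete, below is a minimal Go sketch of an aggregator that plays both roles described above: an event listener that folds each structured event into an in-memory running aggregate, and a lazy tag that renders that aggregate on demand. The `StructuredEvent`, `ExportStats`, `Notify`, and `Render` names here are simplified stand-ins modeled on this description, not the actual interfaces in CockroachDB's tracing package.

```go
package main

import (
	"fmt"
	"sync"
)

// StructuredEvent is a simplified stand-in for the structured payloads
// recorded on tracing spans; the real type is proto-backed.
type StructuredEvent interface {
	EventName() string
}

// ExportStats is a hypothetical event emitted by an ExportRequest.
type ExportStats struct {
	NumFiles int64
	Bytes    int64
}

func (ExportStats) EventName() string { return "ExportStats" }

// Aggregator is registered on a span as both an event listener (so it sees
// every StructuredEvent in the recording) and a lazy tag (so the running
// aggregate is rendered on demand, e.g. by the tracez page).
type Aggregator struct {
	mu    sync.Mutex
	stats ExportStats
}

// Notify folds an incoming event into the running aggregate.
func (a *Aggregator) Notify(ev StructuredEvent) {
	s, ok := ev.(ExportStats)
	if !ok {
		return
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	a.stats.NumFiles += s.NumFiles
	a.stats.Bytes += s.Bytes
}

// Render returns the aggregate as key/value tags, the shape a LazyTag is
// expected to produce when the span's recording is inspected.
func (a *Aggregator) Render() map[string]string {
	a.mu.Lock()
	defer a.mu.Unlock()
	return map[string]string{
		"num_files": fmt.Sprint(a.stats.NumFiles),
		"bytes":     fmt.Sprint(a.stats.Bytes),
	}
}

func main() {
	agg := &Aggregator{}
	// In the real change the span's recording machinery would call Notify;
	// here we simulate two ExportRequests reporting their stats.
	agg.Notify(ExportStats{NumFiles: 3, Bytes: 1 << 20})
	agg.Notify(ExportStats{NumFiles: 1, Bytes: 4 << 20})
	fmt.Println(agg.Render()) // map[bytes:5242880 num_files:4]
}
```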
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Jul 7, 2022
This change introduces an Aggregator object that is capable of listening for Structured events emitted in a recording, aggregating them, and rendering them as LazyTags. We also introduce an AggregatorEvent interface that can be implemented by a Structured event, thereby making it eligible for aggregation in the Aggregator.

The first user of the Aggregator will be every backup data processor that is set up on the nodes in the cluster during a backup. The Aggregator lives as long as the processor, and listens for AggregatorEvents emitted by any span that is a child of the processor's span. This includes both local children and remote children whose recordings have been imported into a local span. The Aggregator stores a running aggregate of each AggregatorEvent it is notified about, bucketed by the event's tag name. This aggregate will be rendered on the tracing span as a LazyTag.

This change teaches every ExportRequest to emit an AggregatorEvent. Going forward, we expect many more operations in bulk jobs to define and emit such events, providing visibility into otherwise opaque operations. We also clean up some of the StructuredEvents that were previously added but have not proved useful, and change some of the tracing span operation names to be more intuitive.

Fixes: cockroachdb#80388

Release note: None
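As a rough illustration of the interface described above, here is a self-contained Go sketch of an AggregatorEvent that knows how to combine with another instance of itself, and an Aggregator that buckets running aggregates by tag name. The interface shape (`Combine`, `Tag`) and the `exportStats` type are assumptions for illustration and may differ from the real definitions.

```go
package main

import "fmt"

// AggregatorEvent is a simplified stand-in for the interface described above:
// a structured event opts into aggregation by knowing how to fold another
// instance of itself into a running total and how to name its bucket.
type AggregatorEvent interface {
	// Combine merges other into the receiver's running aggregate.
	Combine(other AggregatorEvent)
	// Tag is the bucket name under which the aggregate is rendered.
	Tag() string
}

// Aggregator buckets running aggregates by the event's tag name.
type Aggregator struct {
	buckets map[string]AggregatorEvent
}

func NewAggregator() *Aggregator {
	return &Aggregator{buckets: make(map[string]AggregatorEvent)}
}

// Notify is invoked for every AggregatorEvent seen in the recording of the
// processor's span, including events imported from remote child spans.
func (a *Aggregator) Notify(ev AggregatorEvent) {
	if existing, ok := a.buckets[ev.Tag()]; ok {
		existing.Combine(ev)
		return
	}
	a.buckets[ev.Tag()] = ev
}

// exportStats is a hypothetical event type implementing AggregatorEvent.
type exportStats struct{ files, bytes int64 }

func (s *exportStats) Combine(other AggregatorEvent) {
	o := other.(*exportStats)
	s.files += o.files
	s.bytes += o.bytes
}

func (s *exportStats) Tag() string { return "ExportStats" }

func main() {
	agg := NewAggregator()
	agg.Notify(&exportStats{files: 2, bytes: 1024})
	agg.Notify(&exportStats{files: 5, bytes: 4096})
	for tag, ev := range agg.buckets {
		fmt.Printf("%s: %+v\n", tag, ev) // ExportStats: &{files:7 bytes:5120}
	}
}
```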
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 12, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 16, 2022
This change introduces an Aggregator object that is capable of listening for Structured events emitted in a recording, aggregating them, and rendering them as LazyTags. We also introduce an AggregatorEvent interface that can be implemented by a Structured event, thereby making it eligible for aggregation in the Aggregator.

The first user of the Aggregator will be every backup data processor that is set up on the nodes in the cluster during a backup. The Aggregator lives as long as the processor, and listens for AggregatorEvents emitted by any span that is a child of the processor's span. This includes both local children and remote children whose recordings have been imported into a local span. The Aggregator stores a running aggregate of each AggregatorEvent it is notified about, bucketed by the event's tag name. This aggregate will be rendered on the tracing span as a LazyTag.

This change teaches every ExportRequest to emit an AggregatorEvent. Going forward, we expect many more operations in bulk jobs to define and emit such events, providing visibility into otherwise opaque operations. We also clean up some of the StructuredEvents that were previously added but have not proved useful, and change some of the tracing span operation names to be more intuitive.

To view these aggregated events, one can navigate to the `/tracez` endpoint of the debug console, take a snapshot, and search for either `BACKUP` or the job ID to filter for tracing spans on that node. The span associated with the backup processor will be decorated with tags that correspond to the fields in the introduced `ExportStats` proto message. Note that these stats are only aggregated on a per-node basis.

Fixes: cockroachdb#80388

Release note: None
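The flow of events from an individual ExportRequest up to the processor's Aggregator might be sketched as follows. The `span`, `RecordStructured`, `importRecording`, and `exportStats` names here are toy stand-ins; the real change uses CockroachDB's tracing spans and the `ExportStats` proto message, whose exact fields may differ from this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// span is a toy stand-in for a tracing span: it buffers structured events
// and can hand them to a parent when its recording is imported.
type span struct {
	op     string
	events []any
}

func (s *span) RecordStructured(ev any) { s.events = append(s.events, ev) }

// importRecording models a remote child's recording being folded into the
// local parent; in the real system this happens when an RPC response carries
// the child span's recording back to the node running the processor.
func (parent *span) importRecording(child *span) {
	parent.events = append(parent.events, child.events...)
}

// exportStats mirrors the kind of fields sketched for the ExportStats proto
// message; the exact field set in the real proto may differ.
type exportStats struct {
	NumFiles int64
	DataSize int64
	Duration time.Duration
}

// evalExportRequest pretends to evaluate one ExportRequest in its own child
// span and records the stats it observed.
func evalExportRequest(parent *span) {
	child := &span{op: "export request"}
	start := time.Now()
	// ... write SSTs to external storage ...
	child.RecordStructured(exportStats{NumFiles: 1, DataSize: 64 << 20, Duration: time.Since(start)})
	parent.importRecording(child)
}

func main() {
	processorSpan := &span{op: "backup processor"}
	evalExportRequest(processorSpan)
	evalExportRequest(processorSpan)
	// The Aggregator registered on processorSpan would fold these events into
	// a single ExportStats tag rendered on /tracez.
	fmt.Println(len(processorSpan.events), "ExportStats events to aggregate")
}
```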
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 17, 2022
craig bot pushed a commit that referenced this issue on Aug 17, 2022
84043: bulk, backupccl: introduce a Structured event Aggregator r=adityamaru a=adityamaru

This change introduces an Aggregator object that is capable of listening for Structured events emitted in a recording, aggregating them, and rendering them as LazyTags. We also introduce an AggregatorEvent interface that can be implemented by a Structured event, thereby making it eligible for aggregation in the Aggregator.

The first user of the Aggregator will be every backup data processor that is set up on the nodes in the cluster during a backup. The Aggregator lives as long as the processor, and listens for AggregatorEvents emitted by any span that is a child of the processor's span. This includes both local children and remote children whose recordings have been imported into a local span. The Aggregator stores a running aggregate of each AggregatorEvent it is notified about, bucketed by the event's tag name. This aggregate will be rendered on the tracing span as a LazyTag.

This change teaches every ExportRequest to emit an AggregatorEvent. Going forward, we expect many more operations in bulk jobs to define and emit such events, providing visibility into otherwise opaque operations. We also clean up some of the StructuredEvents that were previously added but have not proved useful, and change some of the tracing span operation names to be more intuitive.

To view these aggregated events, one can navigate to the `/tracez` endpoint of the debug console, take a snapshot, and search for either `BACKUP` or the job ID to filter for tracing spans on that node. The span associated with the backup processor will be decorated with tags that correspond to the fields in the introduced `ExportStats` proto message. Note that these stats are only aggregated on a per-node basis.

Fixes: #80388

Release note: None

Release justification: low risk change for improved observability into backups.

86322: ttl: use JobRegistry.UpdateJobWithTxn to avoid race condition r=rafiss a=ecwall

fixes #85800

UpdateJobWithTxn updates the job progress with useReadLock=true to prevent concurrent update attempts from undoing each other.

Release justification: Fix test flake.

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
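The second PR in this merge fixes a lost-update race on job progress. As a toy model of why holding a lock across the whole read-modify-write matters (in the real fix, a read lock taken on the job row inside the transaction), consider the sketch below; `jobProgress` and its methods are hypothetical and unrelated to the actual JobRegistry API.

```go
package main

import (
	"fmt"
	"sync"
)

// jobProgress is a toy stand-in for the job record that the TTL job updates.
type jobProgress struct {
	mu       sync.Mutex
	rowsAged int64
}

// updateRacy does a read-modify-write without holding the lock across the
// whole update, so two concurrent updates can base themselves on the same
// snapshot and one of them is silently undone.
func (p *jobProgress) updateRacy(delta int64) {
	p.mu.Lock()
	snapshot := p.rowsAged
	p.mu.Unlock()
	// ... other work between the read and the write ...
	p.mu.Lock()
	p.rowsAged = snapshot + delta
	p.mu.Unlock()
}

// updateLocked holds the lock for the full read-modify-write, analogous to
// updating the job while the transaction holds a read lock on the job row.
func (p *jobProgress) updateLocked(delta int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.rowsAged += delta
}

func main() {
	var wg sync.WaitGroup
	racy, locked := &jobProgress{}, &jobProgress{}
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); racy.updateRacy(1) }()
		go func() { defer wg.Done(); locked.updateLocked(1) }()
	}
	wg.Wait()
	// The racy total usually comes up short of 1000; the locked one never does.
	fmt.Println("racy total:", racy.rowsAged, "locked total:", locked.rowsAged)
}
```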
Stemming from a conversation with @andreimatei and @dt.
A bulk aggregator is an object that is responsible for aggregating and rendering information about suboperations spawned by a bulk processor. To begin with, we will focus on aggregating two data points:

1. The duration of each operation.
2. Interesting `StructuredEvent`s that can be used to convey more detailed metrics, such as SST bytes ingested, wait time per store, number of SST batcher flushes, etc.

For 1) we will start by addressing #80391. This will allow us to simply ask our root tracing span for a snapshot of its `childSpan -> duration` mapping at render time. Note, the bulk aggregator will be set as a `LazyTag` on the processor's root context, which allows us to write bespoke logic in our `Render` function.

For 2) we need to build out the structured event listener in #80395. This is an important piece because we do not want the bulk aggregator to miss events as they are rotated out of the tracing span's ring buffer. Bulk jobs are known to run for hours and send thousands of RPCs per node, so it is certain that structured events will fall out of the buffer during the execution of the job. We would register the bulk aggregator as an event listener on the processor's root span, which will allow us to intercept interesting events and roll them up into a running aggregate.
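A toy model of why the listener in 2) is needed, rather than reading events back out of the span at the end: the span's recording retains only a bounded number of structured events, while a listener folds each event into a running aggregate at emission time. The `ringBuffer` and `listener` types below are illustrative only, not the tracing package's actual buffering.

```go
package main

import "fmt"

// ringBuffer keeps only the last N structured events, so reading them back at
// the end of a long job misses most of them.
type ringBuffer struct {
	buf  []int64 // event payloads, e.g. bytes exported per request
	next int
}

func newRingBuffer(n int) *ringBuffer { return &ringBuffer{buf: make([]int64, n)} }

func (r *ringBuffer) record(v int64) {
	r.buf[r.next%len(r.buf)] = v // older events are overwritten
	r.next++
}

// listener is notified at emission time and folds the event into a running
// aggregate, so rotation in the buffer does not lose information.
type listener struct{ totalBytes int64 }

func (l *listener) notify(v int64) { l.totalBytes += v }

func main() {
	ring := newRingBuffer(8)
	agg := &listener{}
	for i := 0; i < 10000; i++ { // thousands of ExportRequests over hours
		ring.record(1 << 20)
		agg.notify(1 << 20)
	}
	var fromRing int64
	for _, v := range ring.buf {
		fromRing += v
	}
	fmt.Println("buffer sees:", fromRing>>20, "MiB; listener sees:", agg.totalBytes>>20, "MiB")
}
```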
The goal is to build a generic aggregator that can keep track of any suboperation that wraps its execution in a child span, without the overhead of notifying the aggregator at the various call sites. This is very attractive since threading `notify` calls through different parts of the bulk code is a lot more of a lift than maintaining good hygiene about wrapping important operations in child spans.

There is more exploratory work to be done on how we will present all the information in the bulk aggregator on the tracez page, but that is outside the scope of this issue. #80198 broke some ground and we might be able to use some of the changes there.
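To illustrate the call-site hygiene being argued for, here is a small sketch in which a suboperation only wraps itself in a (toy) child span and never talks to the aggregator directly; its duration still ends up in the aggregate available at render time. The `tracer`, `startSpan`, and `flushSSTBatch` names are hypothetical stand-ins, not the real tracing API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tracer is a toy stand-in: finishing a child span reports its duration to
// whatever aggregate is attached to the root, so call sites only need to wrap
// themselves in a span.
type tracer struct{ durations map[string]time.Duration }

// startSpan returns a closure that records the elapsed time under the
// operation name when the span "finishes".
func (t *tracer) startSpan(op string) func() {
	start := time.Now()
	return func() { t.durations[op] += time.Since(start) }
}

// flushSSTBatch is a hypothetical suboperation; note that it never calls the
// aggregator directly -- wrapping itself in a span is enough.
func flushSSTBatch(ctx context.Context, t *tracer) {
	defer t.startSpan("sst batcher flush")()
	time.Sleep(5 * time.Millisecond) // stand-in for the actual flush
}

func main() {
	t := &tracer{durations: make(map[string]time.Duration)}
	ctx := context.Background()
	for i := 0; i < 3; i++ {
		flushSSTBatch(ctx, t)
	}
	// At render time the aggregator can snapshot the childSpan -> duration map.
	fmt.Println(t.durations)
}
```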
Epic: CRDB-10262
Jira issue: CRDB-15736