jobs/cdc: add metrics for paused jobs #89752
Seems that we already have metrics for this #85467 (comment). Closing for now.
I don't think this is correct: idleness is an entirely different concept.
It relates to the number of jobs that are currently running but idle,
so that they can be shut down by serverless.
It has nothing to do with this PR.
…On Thu, Oct 13, 2022 at 4:53 PM Jayant Shrivastava ***@***.***> wrote:
Seems that we already have metrics for this #85467 (comment)
Closing for now.
Leaving a note for when I get back to this. https://github.com/jayshrivastava/cockroach/tree/rowfetcher-2
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @miretskiy, @samiskin, and @shermanCRL)
pkg/jobs/registry.go
line 934 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
you probably want to set noncancellable bit?
Done.
pkg/jobs/registry.go
line 972 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
right; but you're not declaring; you're creating it.
Do declare: var metricUpdate map[jobspb.Type]int
Done.
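The distinction the reviewer draws between declaring and creating a map can be sketched as follows. This is a minimal illustration, not the code from the PR: `jobType` is a hypothetical stand-in for `jobspb.Type`, and `countPaused` is an invented helper. A map that is only declared with `var` is nil and costs no allocation; it is created with `make` the first time an entry is written.

```go
package main

import "fmt"

// jobType is a hypothetical placeholder for jobspb.Type in this sketch.
type jobType int

// countPaused tallies paused jobs per type. The map is declared up front
// but only created (allocated) once the first entry is needed, so the
// function returns nil when there is nothing to report.
func countPaused(paused []jobType) map[jobType]int {
	var metricUpdate map[jobType]int // declared only: nil, no allocation
	for _, t := range paused {
		if metricUpdate == nil {
			metricUpdate = make(map[jobType]int) // created on first use
		}
		metricUpdate[t]++
	}
	return metricUpdate
}

func main() {
	fmt.Println(countPaused(nil) == nil)
	fmt.Println(countPaused([]jobType{1, 1, 2}))
}
```

Reading from a nil map is safe in Go and yields the zero value; only writes require the map to have been created first.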
pkg/jobs/testing_knobs.go
line 93 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
not sure we need a pointer -- i guess it's consistent, so ... fine...
Done.
Reviewed 9 of 28 files at r8, 4 of 6 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava and @samiskin)
pkg/jobs/registry.go
line 1148 at r9 (raw file):
err = ctx.Err() return case <-time.After(PollJobsMetricsInterval.Get(&r.settings.SV)):
you probably want to create a timer for this.
pkg/upgrade/upgrades/create_jobs_metrics_polling_job.go
line 44 at r9 (raw file):
} // If there isn't a row for the key visualizer job, create the job.
comment needs updating.
I suspect this code is repeated with key visualizer logic? Consider adding a helper (createBootstrapJob or some such)
bors r=miretskiy TYFR!
Build failed (retrying...):
Build failed (retrying...):
bors r-
Canceled.
This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`.

This is implemented by a job which periodically queries `system.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts its job registry, it uses a transaction to create an adoptable stats polling job if one does not already exist.

This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster.

Resolves: cockroachdb#85467

Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are updated at an interval defined by the cluster setting `jobs.metrics.interval.poll`, which defaults to 10 seconds.

Epic: None
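The per-type counting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `job` struct, the `pausedGauges` helper, and the in-memory slice standing in for a `system.jobs` query are all hypothetical. Only the metric name pattern `jobs.<job type>.currently_paused` comes from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// job is a hypothetical, simplified view of a row in system.jobs.
type job struct {
	Type   string // e.g. "changefeed"
	Status string // e.g. "paused", "running"
}

// pausedGauges returns the value each currently_paused gauge should take.
// Every known type starts at zero so that a gauge drops back to 0 once
// all jobs of that type have been resumed.
func pausedGauges(knownTypes []string, jobs []job) map[string]int {
	gauges := make(map[string]int, len(knownTypes))
	for _, t := range knownTypes {
		gauges["jobs."+t+".currently_paused"] = 0
	}
	for _, j := range jobs {
		if j.Status == "paused" {
			gauges["jobs."+j.Type+".currently_paused"]++
		}
	}
	return gauges
}

func main() {
	gauges := pausedGauges(
		[]string{"backup", "changefeed"},
		[]job{{"changefeed", "paused"}, {"changefeed", "running"}, {"changefeed", "paused"}},
	)
	names := make([]string, 0, len(gauges))
	for name := range gauges {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	for _, name := range names {
		fmt.Println(name, gauges[name])
	}
}
```

Recomputing every gauge from scratch on each poll, rather than incrementing and decrementing on state changes, keeps the reported values consistent with the table even if a pause or resume event is missed.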
bors r=miretskiy
Build succeeded:
Prior PR cockroachdb#89752 added a metrics poller job which produces per-job-type stats on the number of paused jobs. This PR extends the metrics poller to also collect stats related to protected timestamps created by jobs. Namely, two new metrics are added per job type:

* `jobs.<job type>.protected_record_count` -- keeps track of the number of protected timestamp records held by the jobs.
* `jobs.<job type>.protected_age_sec` -- keeps track of the age of the oldest protected timestamp held by those jobs.

The metrics improve observability into the protected timestamp system, and allow operators to alert when protected timestamp records are too old, since that prevents garbage collection from occurring (and if GC is not performed for too long, cluster performance degrades).

Follow-on work will also make this functionality available for schedules.

Epic: CRDB-21953
Fixes cockroachdb#78354

Release note (enterprise change): Jobs that utilize the protected timestamp system (BACKUP, CHANGEFEED, IMPORT, etc.) now produce metrics that can be monitored to detect cases when a job leaves a stale protected timestamp, preventing garbage collection from occurring.