jobs/cdc: add metrics for paused jobs #89752

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

jayshrivastava
Contributor

@jayshrivastava jayshrivastava commented Oct 11, 2022

This change adds new metrics to count paused jobs for every job type. For
example, the metric for paused changefeed jobs is
jobs.changefeed.currently_paused. These metrics are counted at an
interval defined by the cluster setting jobs.metrics.interval.poll.

This is implemented by a job which periodically queries system.jobs
to count the number of paused jobs. This job is of the newly added type
jobspb.TypePollJobsStats. When a node starts its job registry, it will
create an adoptable stats polling job if it does not exist already using a
transaction.
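
As a rough illustration of the counting step described above, the sketch below tallies paused jobs per job type from an in-memory slice standing in for rows of system.jobs. The `jobRow` type, the `countPaused` helper, and the sample data are hypothetical and are not taken from this PR; the actual job queries system.jobs and updates the per-type metrics.

```go
package main

import "fmt"

// jobRow is a hypothetical stand-in for one row of system.jobs.
type jobRow struct {
	jobType string
	status  string
}

// countPaused returns the number of paused jobs for each job type.
func countPaused(rows []jobRow) map[string]int {
	counts := make(map[string]int)
	for _, r := range rows {
		if r.status == "paused" {
			counts[r.jobType]++
		}
	}
	return counts
}

func main() {
	rows := []jobRow{
		{"changefeed", "paused"},
		{"changefeed", "running"},
		{"backup", "paused"},
		{"changefeed", "paused"},
	}
	for jobType, n := range countPaused(rows) {
		// Metric name pattern follows the PR description.
		fmt.Printf("jobs.%s.currently_paused = %d\n", jobType, n)
	}
}
```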

This change adds a test which pauses and resumes changefeeds while asserting
the value of the jobs.changefeed.currently_paused metric. It also adds a
logictest to ensure one instance of the stats polling job is created in a
cluster.

Resolves: #85467

Release note (general change): This change adds new metrics to count
paused jobs for every job type. For example, the metric for paused
changefeed jobs is jobs.changefeed.currently_paused. These metrics
are updated at an interval defined by the cluster setting
jobs.metrics.interval.poll, which defaults to 10 seconds.

Epic: None

@jayshrivastava
Contributor Author

jayshrivastava commented Oct 11, 2022

edit (v2):
image

@cockroach-teamcity
Member

This change is Reviewable

@jayshrivastava jayshrivastava marked this pull request as ready for review October 11, 2022 15:15
@jayshrivastava jayshrivastava requested a review from a team as a code owner October 11, 2022 15:15
@jayshrivastava jayshrivastava requested review from shermanCRL and removed request for a team October 11, 2022 15:15
pkg/ccl/changefeedccl/metrics.go (outdated review thread, resolved)
pkg/ccl/changefeedccl/changefeed_stmt.go (outdated review thread, resolved)
@jayshrivastava
Contributor Author

Seems that we already have metrics for this #85467 (comment). Closing for now.

@miretskiy
Contributor

miretskiy commented Oct 13, 2022 via email

@jayshrivastava jayshrivastava requested a review from a team as a code owner October 14, 2022 17:54
@jayshrivastava jayshrivastava force-pushed the paused-metrics branch 3 times, most recently from 211a9cb to 8cd140d Compare October 14, 2022 18:11
@jayshrivastava jayshrivastava force-pushed the paused-metrics branch 4 times, most recently from a35029b to 1706650 Compare October 17, 2022 15:15
pkg/jobs/adopt.go (outdated review thread, resolved)
@jayshrivastava
Contributor Author

Leaving a note for when I get back to this. https://github.com/jayshrivastava/cockroach/tree/rowfetcher-2
Need to work on

  • adding a virtual index on status to the internal jobs table so we don't have to scan the whole thing
  • scheduling this metric update less often and also having one node do it

@jayshrivastava jayshrivastava requested a review from a team October 31, 2022 20:38
@jayshrivastava jayshrivastava requested review from a team as code owners October 31, 2022 20:38
@jayshrivastava jayshrivastava changed the title changefeedccl: add metrics for paused changefeed jobs jobs/cdc: add metrics for paused jobs Oct 31, 2022
@jayshrivastava jayshrivastava force-pushed the paused-metrics branch 3 times, most recently from 9411852 to fb89d23 Compare February 6, 2023 21:13
Contributor Author

@jayshrivastava jayshrivastava left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @miretskiy, @samiskin, and @shermanCRL)


pkg/jobs/registry.go line 934 at r5 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you probably want to set noncancellable bit?

Done.


pkg/jobs/registry.go line 972 at r5 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

right; but you're not declaring; you're creating it.
Do declare:

var metricUpdate map[jobspb.Type]int

Done.
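
The distinction raised in this thread can be sketched as follows: a `var` declaration yields a nil map, which can be read but not written, whereas `make` allocates a map right away. Here `jobType` is a hypothetical stand-in for `jobspb.Type`, and both helper functions are illustrative only, not code from the PR.

```go
package main

import "fmt"

// jobType is a hypothetical stand-in for jobspb.Type.
type jobType int

// declaredOnly declares a map without allocating it; the result is nil.
func declaredOnly() map[jobType]int {
	var metricUpdate map[jobType]int
	return metricUpdate
}

// created allocates an empty, writable map.
func created() map[jobType]int {
	metricUpdate := make(map[jobType]int)
	return metricUpdate
}

func main() {
	d, c := declaredOnly(), created()
	// Reading from or taking the length of a nil map is safe.
	fmt.Println(d == nil, len(d))
	fmt.Println(c == nil, len(c))
	// Writing requires an allocated map; the same write on d would panic.
	c[jobType(1)]++
	fmt.Println(c[jobType(1)])
}
```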


pkg/jobs/testing_knobs.go line 93 at r5 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure we need a pointer -- i guess it's consistent, so ... fine...

Done.

@jayshrivastava jayshrivastava force-pushed the paused-metrics branch 2 times, most recently from 369144c to ef93a0f Compare February 6, 2023 21:55
@shermanCRL shermanCRL removed their request for review February 7, 2023 15:03
Contributor

@miretskiy miretskiy left a comment

Reviewed 9 of 28 files at r8, 4 of 6 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jayshrivastava and @samiskin)


pkg/jobs/registry.go line 1148 at r9 (raw file):

				err = ctx.Err()
				return
			case <-time.After(PollJobsMetricsInterval.Get(&r.settings.SV)):

you probably want to create timer for this.


pkg/upgrade/upgrades/create_jobs_metrics_polling_job.go line 44 at r9 (raw file):

		}

		// If there isn't a row for the key visualizer job, create the job.

comment needs updating.

I suspect this code is repeated with key visualizer logic? Consider adding a helper (createBootstrapJob or some such)

@jayshrivastava
Contributor Author

bors r=miretskiy TYFR!

@craig
Contributor

craig bot commented Feb 7, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Feb 7, 2023

Build failed (retrying...):

@jayshrivastava
Contributor Author

bors r-

@craig
Contributor

craig bot commented Feb 7, 2023

Canceled.

This change adds new metrics to count paused jobs for every job type. For
example, the metric for paused changefeed jobs is
`jobs.changefeed.currently_paused`. These metrics are counted at an
interval defined by the cluster setting `jobs.metrics.interval.poll`.

This is implemented by a job which periodically queries `system.jobs`
to count the number of paused jobs. This job is of the newly added type
`jobspb.TypePollJobsStats`. When a node starts its job registry, it will
create an adoptable stats polling job if it does not exist already using a
transaction.

This change adds a test which pauses and resumes changefeeds while asserting
the value of the `jobs.changefeed.currently_paused` metric. It also adds a
logictest to ensure one instance of the stats polling job is created in a
cluster.

Resolves: cockroachdb#85467

Release note (general change): This change adds new metrics to count
paused jobs for every job type. For example, the metric for paused
changefeed jobs is `jobs.changefeed.currently_paused`. These metrics
are updated at an interval defined by the cluster setting
`jobs.metrics.interval.poll`, which defaults to 10 seconds.

Epic: None
@jayshrivastava
Contributor Author

bors r=miretskiy

@craig
Contributor

craig bot commented Feb 8, 2023

Build succeeded:

@craig craig bot merged commit 844a370 into cockroachdb:master Feb 8, 2023
@jayshrivastava jayshrivastava deleted the paused-metrics branch February 8, 2023 18:15
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Feb 15, 2023
Prior PR cockroachdb#89752 added
a metrics poller job which produces per job type stats on the
number of paused jobs.

This PR extends the metrics poller to also collect stats related
to protected timestamps created by jobs.
Namely, two new metrics per job type are added:

* `jobs.<job type>.protected_record_count` -- keeps track of the number
  of protected timestamp records held by the jobs.
* `jobs.<job type>.protected_age_sec` -- keeps track of the age
  of the oldest protected timestamp held by those jobs.

The metrics improve observability into the protected timestamp system,
and allow operators to alert when protected timestamp records are
too old, since that prevents garbage collection from occurring
(and if GC is not performed for too long, cluster performance
degrades).

Follow on work will also make this functionality available for
schedules.

Epic: CRDB-21953
Fixes cockroachdb#78354

Release note (enterprise change): Jobs that utilize the protected timestamp
system (BACKUP, CHANGEFEED, IMPORT, etc.) now produce metrics that
can be monitored to detect cases when a job leaves a stale protected
timestamp, preventing garbage collection from occurring.
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Feb 16, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Feb 23, 2023
Development

Successfully merging this pull request may close these issues.

jobs: add metric for number of paused jobs
3 participants