
spanconfig/job: improve retry behaviour under failures #78117

Merged
merged 1 commit into from
Mar 22, 2022

Conversation

irfansharif

@irfansharif irfansharif commented Mar 19, 2022

Previously, if the reconciliation job failed (say, with retryable buffer
overflow errors from the sqlwatcher[1]), we relied on the jobs
subsystem's backoff mechanism to re-kick the reconciliation job. The
retry loop there, however, is far too coarse, with a max back-off of
24h -- far too long for the span config reconciliation job. Instead we can
control the retry behavior directly within the reconciliation job,
something this PR now does. We still want to bound the number of
internal retries, possibly bouncing the job elsewhere in the
cluster afterwards. To do so, we now use the spanconfig.Manager's
periodic checks (every 10m per node) -- we avoid the jobs subsystem's
retry loop by marking every error as a permanent one.

Release justification: low risk, high benefit change
Release note: None

Footnotes

  1. In future PRs we'll introduce tests adding 100k-1M tables in large
    batches; when sufficiently large it's possible to blow past the
    sqlwatcher's rangefeed buffer limits on incremental updates. In
    these scenarios we want to gracefully fail + recover by re-starting
    the reconciler and re-running the initial scan.

@irfansharif irfansharif requested a review from a team as a code owner March 19, 2022 05:13
@cockroach-teamcity

This change is Reviewable


@arulajmani arulajmani left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)


pkg/spanconfig/spanconfigmanager/manager_test.go, line 284 at r1 (raw file):

		syncutil.Mutex
		err   error
		count int

Unused?


pkg/spanconfig/spanconfigmanager/manager_test.go, line 315 at r1 (raw file):

			if jobs.Status(jobStatus) != status {
				return errors.Newf("expected jobID %d to have status %, got %s", jobID, jobs.StatusFailed, jobStatus)

s/jobs.StatusFailed/status/g?

@irfansharif irfansharif force-pushed the 220318.reconcilier-retry branch from 978c558 to 470cc68 Compare March 22, 2022 01:56

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)


pkg/spanconfig/spanconfigmanager/manager_test.go, line 284 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Unused?

Done.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 315 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/jobs.StatusFailed/status/g?

Done.

@irfansharif

Thanks!

bors r+

@craig

craig bot commented Mar 22, 2022

Build succeeded:

@craig craig bot merged commit 10e0c5d into cockroachdb:master Mar 22, 2022
@irfansharif irfansharif deleted the 220318.reconcilier-retry branch March 22, 2022 05:31
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 22, 2022
Fixes cockroachdb#75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This patch simply papers over the mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (work since we
  would have flushed out RESTORE's descriptor deletions and separately
  handle the RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for cockroachdb#73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

```
   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id
```

Release note: None
craig bot pushed a commit that referenced this pull request Apr 26, 2022
79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds. Only works with initial_scan='only', and
does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif


80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed
sessions and did not support the ability to filter by username or
session status. This commit adds the "Closed" session status to closed
sessions and adds the ability to filter by username and session status.

Session Status:
https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames:
https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now
display closed sessions; sessions overview page now has username and session
status filters

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Sherman Grewal <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Apr 27, 2022