
spanconfig/job: improve retry behaviour under failures #78117

Merged
merged 1 commit into from
Mar 22, 2022

Conversation

irfansharif

@irfansharif irfansharif commented Mar 19, 2022

Previously, if the reconciliation job failed (say, with retryable buffer
overflow errors from the sqlwatcher[1]), we relied on the jobs
subsystem's backoff mechanism to re-kick the reconciliation job. The
retry loop there, however, is far too coarse, with a max back-off of
24h -- far too long for the span config reconciliation job. Instead we can
control the retry behavior directly within the reconciliation job,
something this PR now does. We still want to bound the number of
internal retries, possibly bouncing the job elsewhere in the
cluster afterwards. To do so, we now use the spanconfig.Manager's
periodic checks (every 10m per node) -- we avoid the jobs subsystem's
retry loop by marking every error as a permanent one.

Release justification: low risk, high benefit change
Release note: None

Footnotes

  1. In future PRs we'll introduce tests adding 100k-1M tables in large
    batches; when sufficiently large it's possible to blow past the
    sqlwatcher's rangefeed buffer limits on incremental updates. In
    these scenarios we want to gracefully fail + recover by re-starting
    the reconciler and re-running the initial scan.

@irfansharif irfansharif requested a review from a team as a code owner March 19, 2022 05:13
@cockroach-teamcity

This change is Reviewable


@arulajmani arulajmani left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)


pkg/spanconfig/spanconfigmanager/manager_test.go, line 284 at r1 (raw file):

		syncutil.Mutex
		err   error
		count int

Unused?


pkg/spanconfig/spanconfigmanager/manager_test.go, line 315 at r1 (raw file):

			if jobs.Status(jobStatus) != status {
				return errors.Newf("expected jobID %d to have status %, got %s", jobID, jobs.StatusFailed, jobStatus)

s/jobs.StatusFailed/status/g?

@irfansharif irfansharif force-pushed the 220318.reconcilier-retry branch from 978c558 to 470cc68 Compare March 22, 2022 01:56

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @arulajmani)


pkg/spanconfig/spanconfigmanager/manager_test.go, line 284 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Unused?

Done.


pkg/spanconfig/spanconfigmanager/manager_test.go, line 315 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/jobs.StatusFailed/status/g?

Done.

@irfansharif

Thanks!

bors r+

@craig

craig bot commented Mar 22, 2022

Build succeeded:

@craig craig bot merged commit 10e0c5d into cockroachdb:master Mar 22, 2022
@irfansharif irfansharif deleted the 220318.reconcilier-retry branch March 22, 2022 05:31
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 22, 2022
Fixes cockroachdb#75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This patch simply papers over the mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (work since we
  would have flushed out RESTORE's descriptor deletions and separately
  handle the RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for cockroachdb#73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

```
   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id
```

Release note: None
craig bot pushed a commit that referenced this pull request Apr 26, 2022
79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds. Only works with initial_scan='only', and
does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif


80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed
sessions and did not support the ability to filter by username or
session status. This commit adds the "Closed" session status to closed
sessions and adds the ability to filter by username and session status.

Session Status:
https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames:
https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now
display closed sessions; sessions overview page now has username and session
status filters

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Sherman Grewal <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Apr 27, 2022