Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfig: handle mismatched desc types post-restore #80397

Merged
merged 1 commit into from
Apr 27, 2022

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Apr 22, 2022

Fixes #75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This paper simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:

  • pausing the reconciliation job during restore (prototyped in spanconfig: handle duplicate descriptor ID post-restore #80339);
  • observing a reconciler checkpoint in the restore job (work since we
    would have flushed out RESTORE's descriptor deletions and separately
    handle the RESTORE's descriptor additions -- them having different
    types would not fire the assertion);
  • re-keying restored descriptors to not re-use the same IDs as existing
    schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for #73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:

  1. We'll use checkpoints and incrementally reconciler during job-internal
    retries (added in spanconfig/job: improve retry behaviour under failures #78117);
  2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
    itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner April 22, 2022 18:03
@irfansharif irfansharif requested a review from a team April 22, 2022 18:03
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Fixes cockroachdb#75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This paper simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (work since we
  would have flushed out RESTORE's descriptor deletions and separately
  handle the RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for \cockroachdb#73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconciler during job-internal
   retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id

Release note: None
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @arulajmani)

@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 26, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Apr 26, 2022

Build failed:

@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 27, 2022

Build succeeded:

@craig craig bot merged commit d6240ac into cockroachdb:master Apr 27, 2022
@irfansharif irfansharif deleted the 220422.retry-combine branch April 28, 2022 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

spanconfig: assertion failure in sqlconfigwatcher.combine
3 participants