spanconfig: handle mismatched desc types post-restore #80397

irfansharif · 2022-04-22T18:03:37Z

Fixes #75831, an annoying bug at the intersection of the span configs
infrastructure and backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, which violates an invariant the span configs infrastructure
relies on. This PR simply papers over the mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:

- pausing the reconciliation job during restore (prototyped in
  spanconfig: handle duplicate descriptor ID post-restore #80339);
- observing a reconciler checkpoint in the restore job (this would work
  since we would have flushed out RESTORE's descriptor deletions and
  would separately handle RESTORE's descriptor additions -- them having
  different types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for #73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:

1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in spanconfig/job: improve retry behaviour under failures #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
restart is due to RESTORE-induced errors. This is a bit unfortunate, and
if we want to improve on (2), we'd have to persist job state (think
"poison pill") that ensures that we ignore the persisted checkpoint. As
of this PR, the only users of job-persisted checkpoints are the
migrations rolling out this infrastructure. That said, we'll now have a
mechanism to force a full reconciliation attempt -- we can:

   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';

   -- bouncing the job forces it to ignore its checkpoint
   PAUSE JOB $job_id;
   RESUME JOB $job_id;
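
The checkpoint pattern described above (incremental on job-internal retries, full when the job itself is bounced) can be sketched in Go roughly as follows. This is an illustrative sketch, not the implementation in this PR; `timestamp`, `retryKind` and `startingCheckpoint` are hypothetical names, and the zero timestamp is assumed to mean "no checkpoint".

```go
package main

import "fmt"

// timestamp is a simplified stand-in for the reconciler's checkpoint. In this
// sketch the zero value means "no checkpoint", i.e. reconcile everything.
type timestamp int64

// retryKind distinguishes the two situations in the pattern above.
type retryKind int

const (
	jobInternalRetry retryKind = iota // retry inside an already-running job
	jobRestart                        // job was paused/resumed or otherwise bounced
)

// startingCheckpoint picks where reconciliation resumes from: the last
// checkpoint for job-internal retries, and nothing (a full reconciliation)
// whenever the job itself restarts.
func startingCheckpoint(kind retryKind, last timestamp) timestamp {
	if kind == jobRestart {
		return 0
	}
	return last
}

func main() {
	last := timestamp(1650650617)
	fmt.Println("internal retry resumes from:", startingCheckpoint(jobInternalRetry, last))
	fmt.Println("job restart resumes from:", startingCheckpoint(jobRestart, last))
}
```

Under this policy, the PAUSE JOB / RESUME JOB sequence above is simply one way of producing a job restart, which is why it forces a full reconciliation.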

Release note: None

cockroach-teamcity · 2022-04-22T18:03:46Z


ajwerner

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @arulajmani)

irfansharif · 2022-04-26T18:22:12Z

bors r+

craig · 2022-04-26T20:55:02Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-04-26T22:14:57Z

Build failed:

GitHub CI (Cockroach)

irfansharif · 2022-04-26T23:20:09Z

bors r+

craig · 2022-04-27T00:20:18Z

Build succeeded:

GitHub CI (Cockroach)

irfansharif requested review from ajwerner, arulajmani and adityamaru April 22, 2022 18:03

irfansharif requested a review from a team as a code owner April 22, 2022 18:03

irfansharif requested a review from a team April 22, 2022 18:03

irfansharif force-pushed the 220422.retry-combine branch from f2e4f7f to 1e34408 Compare April 22, 2022 18:54

irfansharif added the backport-22.1.x label Apr 25, 2022

ajwerner approved these changes Apr 26, 2022

craig bot merged commit d6240ac into cockroachdb:master Apr 27, 2022

blathers-crl bot mentioned this pull request Apr 27, 2022

release-22.1: spanconfig: handle mismatched desc types post-restore #80603

Merged

irfansharif deleted the 220422.retry-combine branch April 28, 2022 19:04
