spanconfig: assertion failure in sqlconfigwatcher.combine #75831
Stashing this away in some project board; I'll either get to it when free (esp. if it comes up more prominently in CI) or during stability next month.
+cc @arulajmani. Repros pretty easily under stress in the original SHA after adding a fatal at that point. Doesn't repro on master, so I wonder if it was fixed by @adityamaru's #75122 -- that PR sanitized how we sorted sqlwatcher updates. The fatal in question:
$ git diff
diff --git i/pkg/spanconfig/spanconfigsqlwatcher/buffer.go w/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
index 7def45a2da..f81ec3d512 100644
--- i/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
+++ w/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
@@ -298,5 +298,6 @@ func combine(d1 catalog.DescriptorType, d2 catalog.DescriptorType) (catalog.Desc
if d2 == catalog.Any {
return d1, nil
}
+ log.Fatalf(context.TODO(), "cannot combine %s and %s", d1, d2)
return catalog.Any, errors.AssertionFailedf("cannot combine %s and %s", d1, d2)
}

Using the following to repro:
Aside: I understand the impetus of not wanting to crash the server with Fatals for every small thing, but there's something to be said for developing more hardened components if we were more liberal with our fatals, given they'd fail tests/post issues/etc. Short of building proper plumbing for errors.AssertionFailedf to also file issues, for internal data structures I sleep better at night with fatals littered around.
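For context, here's a rough reconstruction of the combine helper the diff above instruments. Only the tail (the catalog.Any check and the assertion) appears in the diff; the earlier branches are inferred from it and may not match the actual cockroach source exactly.

package spanconfigsqlwatcher

import (
	"github.com/cockroachdb/cockroach/pkg/sql/catalog"
	"github.com/cockroachdb/errors"
)

// combine merges the descriptor types buffered for the same descriptor ID.
// catalog.Any acts as a wildcard that combines with anything; two concrete
// but different types (say a schema and a database descriptor) cannot be
// combined, which is the assertion this issue trips.
func combine(d1 catalog.DescriptorType, d2 catalog.DescriptorType) (catalog.DescriptorType, error) {
	if d1 == d2 {
		return d1, nil
	}
	if d1 == catalog.Any {
		return d2, nil
	}
	if d2 == catalog.Any {
		return d1, nil
	}
	return catalog.Any, errors.AssertionFailedf("cannot combine %s and %s", d1, d2)
}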
Would this really be such a big lift? It feels like we should do this at the jobs layer. We've already done this at the SQL layer.
Mind filing an issue? What I think I want at some level is errors.AssertionFailedf errors failing tests.
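As a rough sketch of that idea (nothing like this exists as-is; the helper and its placement are hypothetical), a test harness could promote assertion-failure errors into hard test failures using the predicates in github.com/cockroachdb/errors:

package testutils

import (
	"testing"

	"github.com/cockroachdb/errors"
)

// FailOnAssertionError is a hypothetical helper: rather than letting an
// error produced via errors.AssertionFailedf flow back as an ordinary error
// (and possibly get swallowed), it turns it into a test failure -- roughly
// the "assertion failures should fail tests" behavior discussed above.
func FailOnAssertionError(t *testing.T, err error) {
	t.Helper()
	if err != nil && errors.HasAssertionFailure(err) {
		t.Fatalf("internal assertion failure: %+v", err)
	}
}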
Bisecting for the fix, so bad = good and good = bad.
The bisect pointed to #76003 as the PR that "fixed" the issue. It probably just masks something else; I'll stop being lazy and actually investigate.
Repro for cockroachdb#75831.

dev test pkg/ccl/backupccl -f=TestRestoreOldBackupMissingOfflineIndexes -v --show-logs
dev test pkg/ccl/backupccl -f=TestRestoreOldBackupMissingOfflineIndexes --stress --timeout 1m

Over a rangefeed established over `system.descriptor`, we can observe two descriptors with the same ID but for different objects (schema and database types respectively).

I220420 12:57:10.625054 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 1984 xxx: received rangefeed event at 1650459430.623719000,0 for descriptor schema:<name:"public" id:51 state:PUBLIC offline_reason:"" modification_time:<wall_time:1650459430623719000 > version:1 parent_id:50 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"public" privileges:516 with_grant_option:0 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"admin" version:2 > >
I220420 12:57:10.747180 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 2059 xxx: received rangefeed event at 1650459430.740418000,0 for descriptor database:<name:"postgres" id:51 modification_time:<wall_time:1650459430740418000 > version:2 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"root" version:2 > schemas:<key:"public" value:<id:68 dropped:false > > state:PUBLIC offline_reason:"" default_privileges:<type:DATABASE > >

Release note: None
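That pair of events is exactly what trips the assertion: the sqlwatcher's buffer keeps a descriptor-type entry per ID and merges repeated events for the same ID via combine. A hypothetical, much-simplified sketch of that bookkeeping (the real structure in pkg/spanconfig/spanconfigsqlwatcher/buffer.go is more involved):

package spanconfigsqlwatcher

import (
	"github.com/cockroachdb/cockroach/pkg/sql/catalog"
	"github.com/cockroachdb/cockroach/pkg/sql/catalog/descpb"
)

// buffer is a toy stand-in for the sqlwatcher's event buffer: it tracks the
// descriptor type last seen for each descriptor ID.
type buffer struct {
	byID map[descpb.ID]catalog.DescriptorType
}

// recordEvent folds a new rangefeed event into the buffer. A schema
// descriptor followed by a database descriptor with the same ID (as in the
// log lines above) makes combine return its "cannot combine" assertion error.
func (b *buffer) recordEvent(id descpb.ID, typ catalog.DescriptorType) error {
	existing, ok := b.byID[id]
	if !ok {
		b.byID[id] = typ
		return nil
	}
	combined, err := combine(existing, typ) // see the combine sketch earlier
	if err != nil {
		return err
	}
	b.byID[id] = combined
	return nil
}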
#80239 has a repro. Over a rangefeed established over `system.descriptor`, we can observe two descriptors with the same ID but of different types (schema and database respectively).
It seems we start off with the
But this has implications for the incremental reconciler, which filters for missing table IDs in order to delete span configs that no longer apply: cockroach/pkg/spanconfig/spanconfigreconciler/reconciler.go, lines 435 to 436 at 63a248c.
If a table is being dropped due to RESTORE, the incremental reconciler wants to know about it. Unintuitively, the behavior today where we error out is actually saner, but only incidentally so -- we'll fail the reconciliation job and, on a subsequent attempt, start with a full reconciliation (because we're not using persisted checkpoints yet; once #73694 lands, this too will be busted).
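In rough strokes, the filtering in question looks something like the sketch below; descriptorExists, deleteSpanConfigsFor, and upsertSpanConfigsFor are hypothetical stand-ins for the actual reconciler plumbing:

package sketch

import "context"

// ID stands in for a descriptor ID (descpb.ID in the real code).
type ID int64

// Hypothetical stand-ins for the real reconciler plumbing.
var (
	descriptorExists     func(context.Context, ID) (bool, error)
	deleteSpanConfigsFor func(context.Context, ID) error
	upsertSpanConfigsFor func(context.Context, ID) error
)

// reconcileIncrementally sketches the step discussed above: of the IDs the
// sqlwatcher reported as updated, those whose descriptors are now missing
// have their span configs deleted; the rest get their span configs
// recomputed. A table dropped as part of a RESTORE should land in the
// "missing" bucket, which is why the incremental reconciler wants to hear
// about it.
func reconcileIncrementally(ctx context.Context, updatedIDs []ID) error {
	for _, id := range updatedIDs {
		exists, err := descriptorExists(ctx, id)
		if err != nil {
			return err
		}
		if !exists {
			if err := deleteSpanConfigsFor(ctx, id); err != nil {
				return err
			}
			continue
		}
		if err := upsertSpanConfigsFor(ctx, id); err != nil {
			return err
		}
	}
	return nil
}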
The only case where a descriptor gets overwritten at all is full cluster restore. During a full cluster restore, I do think it's possible to overwrite a descriptor, though I don't recall the details. The best thing I think we can do in this case is kick off a new full reconciliation. Given that before a full cluster restore the cluster has absolutely zero user-created data, this shouldn't be more expensive than an incremental pass would be. What do you think about the idea of adding logic to deal with this case by just restarting a full reconciliation?
What's a reliable way to detect that we're observing descriptor rewrites from where the span configs infrastructure is placed? Is it this
Fixes cockroachdb#75831.

Release note: None
Recap: a full cluster RESTORE first drops the existing descriptors (step 1): cockroach/pkg/ccl/backupccl/restore_planning.go, lines 1710 to 1718 at 46b0a77.
It will then write fresh descriptors from the backup image (step 2): cockroach/pkg/ccl/backupccl/restore_job.go, lines 926 to 927 at 46b0a77.
It's possible that the descriptors we're writing out in step 2 have the same IDs as the descriptors we're dropping above, but are of different types. For example, we might be replacing a schema descriptor with ID=X with a database descriptor having the same ID. The span configs infrastructure observes a stream of rangefeed events over `system.descriptor` and relies on a given ID always referring to the same descriptor type (the invariant the assertion above enforces).

Talked with Arul+Aditya to come up with a few ideas on what we could do, among them:
(a) having the restore job observe a reconciler checkpoint past its descriptor deletions;
(b) pausing the reconciliation job during restore.

I have a sketch PR to fix this using (b) over at #80339. (a) would've been fine too, but it might be undesirable to couple restore success to our ability to observe a reconciler checkpoint. I'm not sure how reasonable that concern is -- (a) feels simpler to implement, with fewer moving pieces (we don't have to pause/resume a job) -- but I implemented (b) anyway.
I'm also open to not doing anything for 22.1. The "fix" here is invasive, but we're currently recovering from the internal error just fine. Perhaps we could file an issue and re-evaluate for 22.2.
Fixes #75831, an annoying bug in the intersection between the span configs infrastructure + backup/restore. It's possible to observe mismatched descriptor types for the same ID post-RESTORE, violating an invariant the span configs infrastructure relies on. This patch simply papers over the mismatch, kicking off a full reconciliation process to recover if it occurs. Doing something "better" is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in #80339);
- observing a reconciler checkpoint in the restore job (this would work since we would have flushed out RESTORE's descriptor deletions and would separately handle RESTORE's descriptor additions -- them having different types would then not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing schema objects.

While here, we add a bit of plumbing/testing to make the future work/testing for #73694 (using reconciler checkpoints on retries) easier. This PR also sets the stage for the following pattern around use of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal retries (added in #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job itself is bounced around.

We do this because we need to fully reconcile across job restarts if the reason for the restart is due to RESTORE-induced errors. This is a bit unfortunate, and if we want to improve on (2), we'd have to persist job state (think "poison pill") that ensures we ignore the persisted checkpoint. As of this PR, the only use of job-persisted checkpoints is in the migrations rolling out this infrastructure. That said, we now have a mechanism to force a full reconciliation attempt:

-- get $job_id
SELECT job_id FROM [SHOW AUTOMATIC JOBS] WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION';

PAUSE JOB $job_id;
RESUME JOB $job_id;

Release note: None
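A hand-wavy sketch of the recovery path that description implies (all names here are hypothetical stand-ins, not the actual change):

package sketch

import (
	"context"

	"github.com/cockroachdb/errors"
)

// Hypothetical stand-ins for the real reconciler machinery.
var (
	errMismatchedDescriptorTypes = errors.New("mismatched descriptor types")
	incrementalReconcile         func(context.Context) error
	fullReconcile                func(context.Context) error
)

// reconcile papers over a RESTORE-induced descriptor-type mismatch: instead
// of failing the reconciliation job on the assertion, it discards incremental
// state (and any checkpoint) and falls back to a full reconciliation, which
// is cheap right after a full cluster restore.
func reconcile(ctx context.Context) error {
	if err := incrementalReconcile(ctx); err != nil {
		if errors.Is(err, errMismatchedDescriptorTypes) {
			return fullReconcile(ctx)
		}
		return err
	}
	return nil
}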
The following run at be1b6c4
contains the following in the log output
I haven't had a chance to dig in more deeply here, as I'm currently hunting down some other issues.
Jira issue: CRDB-12854