spanconfig: assertion failure in sqlconfigwatcher.combine #75831

Closed
stevendanna opened this issue Feb 1, 2022 · 12 comments · Fixed by #80397
Assignees
Labels
A-zone-configs C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered).

Comments

@stevendanna (Collaborator) commented Feb 1, 2022

The following run at commit be1b6c4

bazel test //pkg/ccl/backupccl:backupccl_test --test_env=GOTRACEBACK=all --test_filter=TestRestoreOldBackupMissingOfflineIndexes --test_arg -test.v --test_arg -show-logs --test_output streamed

contains the following in the log output:

I220201 22:18:27.369121 5468 sql/schema_changer.go:2480  [n1,job=732966107046084609] 2293  removing table descriptor and zone config for table 63
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294  job 732966105911525377: running execution encountered retriable error: non-cancelable: cannot combine schema and database
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +(1) attached stack trace                                                                                                                                   
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  -- stack trace:                            
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine                         
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1148                        
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | [...repeated from below...]            
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +Wraps: (2) non-cancelable                                                                                                                                  
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +Wraps: (3) assertion failure                                                                                                                               
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +Wraps: (4) attached stack trace              
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  -- stack trace:                                                                                                                                          
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher.combine           
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher/buffer.go:162
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher.(*buffer).flush
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher/buffer.go:136
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher.(*SQLWatcher).watch
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher.go:165
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher.(*SQLWatcher).WatchForSQLUpdates
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher/sqlwatcher.go:90
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigreconciler.(*incrementalReconciler).reconcile
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigreconciler/reconciler.go:359
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigreconciler.(*Reconciler).Reconcile
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigreconciler/reconciler.go:173
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigjob.(*resumer).Resume
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigjob/job.go:87
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1123
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1124                               
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:401
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:323
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  | runtime.goexit
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +  |      GOROOT/src/runtime/asm_amd64.s:1581
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +Wraps: (5) cannot combine schema and database
I220201 22:18:27.370439 4381 jobs/registry.go:1150  [n1] 2294 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *assert.withAssertionFailure (4) *withstack.withStack (5) *errutil.leafError
E220201 22:18:27.370616 4381 jobs/adopt.go:406  [n1] 2295  job 732966105911525377: adoption completed with error non-cancelable: cannot combine schema and database
I220201 22:18:27.370969 5468 sql/gcjob/table_garbage_collection.go:102  [n1,job=732966107046084609] 2296  clearing data for table 65

I haven't had a chance to dig in more deeply here, as I'm currently hunting down some other issues.

Jira issue: CRDB-12854

@stevendanna stevendanna added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 1, 2022
@irfansharif irfansharif added A-zone-configs C-test-failure Broken test (automatically or manually discovered). labels Feb 1, 2022
@irfansharif (Contributor)

Stashing away in some project board, will either get to it when free (esp. if it comes up more prominently in CI) or during stability next month.

@irfansharif irfansharif self-assigned this Apr 19, 2022
@irfansharif (Contributor)

+cc @arulajmani. Repros pretty easily under stress at the original SHA after adding a fatal at that point. Doesn't repro on master, so I wonder if it was fixed by @adityamaru's #75122. That PR sanitized how we sorted sqlwatcher updates:

func (s events) Less(i, j int) bool {
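The embed above is truncated; for reference, here's a self-contained sketch of that comparator shape -- ordering buffered updates by descriptor ID, then by timestamp, so all events for a given descriptor sit adjacent and get combined in commit order. The event struct and its fields are hypothetical stand-ins, not the actual sqlwatcher buffer types:

package main

import (
	"fmt"
	"sort"
)

// event is a hypothetical stand-in for a buffered sqlwatcher update.
type event struct {
	descID    int64
	timestamp int64 // stands in for an hlc.Timestamp
}

type events []event

func (s events) Len() int      { return len(s) }
func (s events) Swap(i, j int) { s[i], s[j] = s[j], s[i] }

// Less orders by descriptor ID first, then by timestamp, so events for
// the same descriptor are adjacent and in commit order.
func (s events) Less(i, j int) bool {
	if s[i].descID != s[j].descID {
		return s[i].descID < s[j].descID
	}
	return s[i].timestamp < s[j].timestamp
}

func main() {
	evs := events{{51, 2}, {50, 1}, {51, 1}}
	sort.Sort(evs)
	fmt.Println(evs) // [{50 1} {51 1} {51 2}]
}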

$ git diff
diff --git i/pkg/spanconfig/spanconfigsqlwatcher/buffer.go w/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
index 7def45a2da..f81ec3d512 100644
--- i/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
+++ w/pkg/spanconfig/spanconfigsqlwatcher/buffer.go
@@ -298,5 +298,6 @@ func combine(d1 catalog.DescriptorType, d2 catalog.DescriptorType) (catalog.Desc
 	if d2 == catalog.Any {
 		return d1, nil
 	}
+	log.Fatalf(context.TODO(), "cannot combine %s and %s", d1, d2)
 	return catalog.Any, errors.AssertionFailedf("cannot combine %s and %s", d1, d2)
 }

Using the following to repro:

$ dev test pkg/ccl/backupccl -f=TestRestoreOldBackupMissingOfflineIndexes --stress --timeout 1m --ignore-cache

Aside: I understand the impetus of not wanting to crash the server with Fatals for every small thing, but there's something to be said for developing more hardened components by being more liberal with our fatals, given they'd fail tests, post issues, etc. Short of building proper plumbing for errors.AssertionFailedf to also file things, for internal data structures I sleep better at night with fatals littered around.

@ajwerner (Contributor)

Short of building proper plumbing for errors.AssertionFailedf to also file things, for internal data structures I sleep better at night with fatals littered around.

Would this really be such a big lift? It feels like we should do this at the jobs layer. We've already done this at the SQL layer.

@irfansharif (Contributor) commented Apr 20, 2022

Mind filing an issue? What I think I want at some level is errors.AssertionFailed errors failing tests.
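For concreteness, a sketch of the kind of hook that would cover this -- failOnAssertion is a hypothetical helper name, though errors.HasAssertionFailure is the real predicate cockroachdb/errors exposes for errors produced via errors.AssertionFailedf:

package spanconfig_test

import (
	"testing"

	"github.com/cockroachdb/errors"
)

// failOnAssertion promotes any error carrying an assertion failure into
// a test failure, instead of letting it be swallowed as a retriable job
// error the way it was in the log output above.
func failOnAssertion(t *testing.T, err error) {
	t.Helper()
	if err != nil && errors.HasAssertionFailure(err) {
		t.Fatalf("internal assertion failure: %+v", err)
	}
}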

@irfansharif (Contributor)

Bisecting for the fix, so bad = good and good = bad.

$ git bisect start 9246ff5f80ec098165b55 be1b6c4fd4778b784658459f77d 
be1b6c4fd4778b7846584 - good
9246ff5f80ec098165b55 - bad
e2d10082f6fcd380a3873 - bad
fbd07c8b2e18be3f54f18 - bad
a3cfd6345d5a2afda33ca - bad
4a4dfead350f5d334a22e - good
71e8cac55389008e9cef6 - bad
718222f048262a78c32fc - bad
6f4c69b0b3fae65eabb3c - bad
270dca1002697a7213161 - good
fe416fd5a85229652e0ab - bad
d1489492a1bcfc9f198f3 - good
bde18248bb96e441f6b8c - good
bef34a58b2eeff34b695b - bad

Bisect pointed to #76003 as the PR that "fixed" the issue. It probably just masked something else. I'll stop being lazy and actually investigate.

@ajwerner (Contributor)

Mind filing an issue? What I think I want at some level is errors.AssertionFailed errors failing tests.

#80235

irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 20, 2022
Repro for cockroachdb#75831.

  dev test pkg/ccl/backupccl -f=TestRestoreOldBackupMissingOfflineIndexes -v --show-logs
  dev test pkg/ccl/backupccl -f=TestRestoreOldBackupMissingOfflineIndexes --stress --timeout 1m

Over a rangefeed established over `system.descriptor`, we can observe
two descriptors with the same ID but of different types (database and
schema descriptors, respectively).

I220420 12:57:10.625054 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 1984 xxx: received rangefeed event at 1650459430.623719000,0 for descriptor schema:<name:"public" id:51 state:PUBLIC offline_reason:"" modification_time:<wall_time:1650459430623719000 > version:1 parent_id:50 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"public" privileges:516 with_grant_option:0 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"admin" version:2 > >
I220420 12:57:10.747180 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 2059 xxx: received rangefeed event at 1650459430.740418000,0 for descriptor database:<name:"postgres" id:51 modification_time:<wall_time:1650459430740418000 > version:2 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"root" version:2 > schemas:<key:"public" value:<id:68 dropped:false > > state:PUBLIC offline_reason:"" default_privileges:<type:DATABASE > >

Release note: None
@irfansharif (Contributor)

#80239 has a repro. Over a rangefeed established over system.descriptor, post-restore we can observe two descriptors with the same ID but of different types (database and schema descriptors, respectively). Am I missing something? Or was this not a guarantee we could've relied on?

I220420 12:57:10.625054 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 1984 xxx: received rangefeed event at 1650459430.623719000,0 for descriptor schema:<name:"public" id:51 state:PUBLIC offline_reason:"" modification_time:<wall_time:1650459430623719000 > version:1 parent_id:50 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"public" privileges:516 with_grant_option:0 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"admin" version:2 > >
I220420 12:57:10.747180 13320 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:243 [n1,job=754938775291265025,rangefeed=sql-watcher-descriptor-rangefeed] 2059 xxx: received rangefeed event at 1650459430.740418000,0 for descriptor database:<name:"postgres" id:51 modification_time:<wall_time:1650459430740418000 > version:2 privileges:<users:<user_proto:"admin" privileges:2 with_grant_option:2 > users:<user_proto:"root" privileges:2 with_grant_option:2 > owner_proto:"root" version:2 > schemas:<key:"public" value:<id:68 dropped:false > > state:PUBLIC offline_reason:"" default_privileges:<type:DATABASE > >

@irfansharif (Contributor)

It seems we start off with the public schema with ID=51, try restoring a database postgres also with ID=51, the rangefeed event stream sees both (one of which I think is a deletion event for the schema descriptor), and we're not equipped to deal with differently typed descriptors appearing on the same stream with the same ID. If someone (@ajwerner, @stevendanna?) could spell out what exactly happens during restores when descriptor IDs conflict with what's already present, that'd help. Does the existing state get wiped? Possibly replaced with descriptors with the same ID but different types? What are the invariants we can rely on? Do we need to relax the assertion here and just not try to resolve the type?

return catalog.Any, errors.AssertionFailedf("cannot combine %s and %s", d1, d2)
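For concreteness, relaxing that assertion would amount to something like this self-contained sketch (DescriptorType here is a local stand-in for catalog.DescriptorType): on a mismatch, collapse to Any rather than returning an assertion failure.

package main

import "fmt"

// DescriptorType is a stand-in for catalog.DescriptorType in this sketch.
type DescriptorType string

const (
	Any      DescriptorType = "any"
	Database DescriptorType = "database"
	Schema   DescriptorType = "schema"
)

// combineRelaxed mirrors combine, except mismatched descriptor types
// collapse to Any instead of producing errors.AssertionFailedf.
func combineRelaxed(d1, d2 DescriptorType) DescriptorType {
	if d1 == d2 {
		return d1
	}
	if d1 == Any {
		return d2
	}
	if d2 == Any {
		return d1
	}
	return Any // previously: "cannot combine %s and %s" assertion
}

func main() {
	fmt.Println(combineRelaxed(Schema, Database)) // any
}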

But this has implications for the incremental reconciler that filters for missing table IDs in order to delete span configs that no longer apply:

descriptorUpdate := update.GetDescriptorUpdate()
if descriptorUpdate.Type != catalog.Table {

If a table is being dropped due to RESTORE, the incremental reconciler wants to know about it. Unintuitively, the behavior today where we error out is actually saner, but only incidentally so -- we'll fail the reconciliation job and, on a subsequent attempt, start with a full reconciliation (because we're not using persisted checkpoints yet: #73694, at which point this too will be busted).

@ajwerner (Contributor)

The only case where a descriptor gets overwritten at all is full cluster restore. During a full cluster restore, I do think it's possible to overwrite a descriptor, though I don't recall the details. The best thing I think we can do in this case is kick off a new full reconciliation. Given the idea is that before a full cluster restore, the cluster has absolutely zero user-created data, this shouldn't be more expensive than the incremental. What do you think about the idea of adding logic to deal with this case by just restarting a full reconciliation?
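A sketch of that idea, with hypothetical names throughout (incrementalReconcile, and treating an empty checkpoint as "start a full reconciliation") -- this is not the actual reconciler API:

// If incremental reconciliation trips the combine assertion (e.g. a full
// cluster RESTORE rewrote a descriptor ID with a different type), discard
// the checkpoint and retry from a full reconciliation instead of failing
// the job.
func reconcileWithFallback(ctx context.Context, checkpoint hlc.Timestamp) error {
	for {
		err := incrementalReconcile(ctx, checkpoint)
		if err == nil {
			return nil
		}
		if errors.HasAssertionFailure(err) {
			checkpoint = hlc.Timestamp{} // empty => full reconciliation
			continue
		}
		return err
	}
}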

@irfansharif (Contributor)

What’s a reliable way to detect that we’re observing descriptor rewrites from where the span configs infrastructure is placed? Is it this combine attempt? That’ll work though it’s somewhat unintuitive. Probably too late to do something more involved anyway.

@irfansharif (Contributor)

Recap: RESTORE, in the planning stage, will first drop defaultdb and postgres (and the corresponding public schemas) here.

// When running a full cluster restore, we drop the defaultdb and postgres
// databases that are present in a new cluster.
// This is done so that they can be restored the same way any other user
// defined database would be restored from the backup.
if restoreStmt.DescriptorCoverage == tree.AllDescriptors {
	if err := dropDefaultUserDBs(ctx, p.ExecCfg()); err != nil {
		return err
	}
}

It will then write fresh descriptors from the backup image here:

// Write the new descriptors which are set in the OFFLINE state.
if err := ingesting.WriteDescriptors(

It's possible that the descriptors we're writing out in step 2 have the same IDs as the descriptors we dropped above, but are of different types. It's possible, for example, that we're replacing a schema descriptor with ID=X with a database descriptor having the same ID. The span configs infrastructure observes a stream of rangefeed events over system.descriptor and relies on the invariant that descriptors with the same ID are of the same type.

Talked with Arul+Aditya to come up with a few ideas on what we could do, ranging from:
(a) observing a reconciler checkpoint during RESTORE (that would work because we would have flushed out the descriptor deletions and separately handle the descriptor additions -- them having different types would not fire anything);
(b) pausing the reconciliation job before writing the new descriptors, resuming it after, ensuring a full reconciliation on resume;
(c) some other span config internal hacks too sketchy to mention.

I have a sketch PR to fix this using (b) over at #80339. (a) would've been fine too, though it might be undesirable to couple restore success to our ability to observe a reconciler checkpoint. I'm not sure how reasonable that concern is; (a) is simpler to implement and has fewer moving pieces (no need to pause/resume a job), but I implemented (b) anyway.

@irfansharif (Contributor)

I'm also open to not doing anything for 22.1. The "fix" here is invasive, but we're currently recovering from the internal error just fine. Perhaps we could file an issue and re-evaluate for 22.2.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 22, 2022
Fixes cockroachdb#75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, violating an invariant the span configs infrastructure
relies on. This patch simply papers over the mismatch, kicking off a
full reconciliation process to recover if it occurs. Doing something
"better" is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in cockroachdb#80339);
- observing a reconciler checkpoint in the restore job (this would work
  since we would have flushed out RESTORE's descriptor deletions and
  separately handled RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for cockroachdb#73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in cockroachdb#78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints is
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id

Release note: None
@craig craig bot closed this as completed in d6240ac Apr 27, 2022
blathers-crl bot pushed a commit that referenced this issue Apr 27, 2022