spanconfig: introduce spanconfig.Reconciler #71994
Conversation
Force-pushed 56f5bf4 to 0392096
We first introduced spanconfig.StoreWriter in cockroachdb#70287. Here we extend the interface to accept a batch of updates instead of just one.

    type StoreWriter interface {
        Apply(ctx context.Context, dryrun bool, updates ...Update) (
            deleted []roachpb.Span, added []roachpb.SpanConfigEntry,
        )
    }

The implementation is subtle -- we're not processing one update at a time. The semantics we're after are applying a batch of updates atomically on a consistent snapshot of the underlying store. This comes up in the upcoming spanconfig.Reconciler (cockroachdb#71994) -- there, following a zone config/descriptor change, we want to update KV state in a single request/RTT instead of an RTT per descendant table. The intended usage is better captured in the aforementioned PR; let's now talk about what this entails for the data structure.

To apply a single update, we want to find all overlapping spans and clear out just the intersections. If the update is adding a new span config, we'll also want to add the corresponding store entry after. We do this by deleting all overlapping spans in their entirety and re-adding the non-overlapping segments. Pseudo-code:

    for entry in store.overlapping(update.span):
        union, intersection = union(update.span, entry), intersection(update.span, entry)
        pre  = span{union.start_key, intersection.start_key}
        post = span{intersection.end_key, union.end_key}

        delete {span=entry.span, conf=entry.conf}
        if entry.contains(update.span.start_key):
            # First entry overlapping with update.
            add {span=pre, conf=entry.conf} if non-empty
        if entry.contains(update.span.end_key):
            # Last entry overlapping with update.
            add {span=post, conf=entry.conf} if non-empty

    if adding:
        add {span=update.span, conf=update.conf} # add ourselves

When extending to a set of updates, things are more involved. Let's assume that the updates are non-overlapping and sorted by start key. As before, we want to delete overlapping entries in their entirety and re-add the non-overlapping segments.
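To make the single-update case concrete, here's a minimal, self-contained Go sketch of the pseudo-code above. All names and types are illustrative stand-ins (string keys, a slice-backed store), not the real interval-tree-backed spanconfig.Store; the pre/post guards are expressed as simple key comparisons, which is equivalent to the entry.contains checks.

```go
package main

import "fmt"

// Span is a half-open key interval [Start, End). String keys stand in for
// roachpb keys in this sketch.
type Span struct{ Start, End string }

// Entry pairs a span with its config.
type Entry struct {
	Span Span
	Conf string
}

func overlaps(a, b Span) bool { return a.Start < b.End && b.Start < a.End }

// applyOne mirrors the single-update pseudo-code: delete every overlapping
// entry in its entirety, re-add the non-overlapping pre/post segments, and
// finally add the update itself if it's an addition (rather than a deletion).
func applyOne(store []Entry, update Entry, adding bool) []Entry {
	var out []Entry
	for _, e := range store {
		if !overlaps(e.Span, update.Span) {
			out = append(out, e) // untouched
			continue
		}
		// pre: the part of the entry before the update. Non-empty only for
		// the first overlapping entry (the one containing update's start key).
		if e.Span.Start < update.Span.Start {
			out = append(out, Entry{Span{e.Span.Start, update.Span.Start}, e.Conf})
		}
		// post: the part of the entry after the update. Non-empty only for
		// the last overlapping entry (the one containing update's end key).
		if update.Span.End < e.Span.End {
			out = append(out, Entry{Span{update.Span.End, e.Span.End}, e.Conf})
		}
	}
	if adding {
		out = append(out, Entry{update.Span, update.Conf})
	}
	return out
}

func main() {
	// Existing state [b,h):X; apply the addition [a,c):A. The overlap [b,c)
	// is cleared, leaving [c,h):X plus the new [a,c):A.
	store := []Entry{{Span{"b", "h"}, "X"}}
	fmt.Println(applyOne(store, Entry{Span{"a", "c"}, "A"}, true))
}
```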
With multiple updates, it's possible that a segment being re-added will overlap another update. If processing one update at a time in sorted order, we want to only re-add the gap between the consecutive updates.

    keyspace        a  b  c  d  e  f  g  h  i  j
    existing state     [--------X--------)
    updates         [--A--)           [--B--)

When processing [a,c):A, after deleting [b,h):X, it would be incorrect to re-add [c,h):X since we're also looking to apply [g,i):B. Instead of re-adding the trailing segment right away, we carry it forward and process it when iterating over the second, possibly overlapping update. In our example, when iterating over [g,i):B we can subtract the overlap from [c,h):X and only re-add [c,g):X.

It's also possible for the segment to extend past the second update. In the example below, when processing [d,f):B and having [b,h):X carried over, we want to re-add [c,d):X and carry forward [f,h):X to the update after (i.e. [g,i):C).

    keyspace        a  b  c  d  e  f  g  h  i  j
    existing state     [--------X--------)
    updates         [--A--)  [--B--)  [--C--)

One final note: we're iterating through the updates without actually applying any mutations. Going back to our first example, when processing [g,i):B, retrieving the set of overlapping spans would (again) retrieve [b,h):X -- an entry we've already encountered when processing [a,c):A. Re-adding non-overlapping segments naively would re-add [b,g):X -- an entry that overlaps with our last update [a,c):A. When retrieving overlapping entries, we need to exclude any that overlap with the segment that was carried over. Pseudo-code:

    carry-over = <empty>
    for update in updates:
        carried-over, carry-over = carry-over, <empty>
        if update.overlap(carried-over):
            # Fill in the gap between consecutive updates.
            add {span=span{carried-over.start_key, update.start_key}, conf=carried-over.conf}
            # Consider the trailing span after update; carry it forward if non-empty.
            carry-over = {span=span{update.end_key, carried-over.end_key}, conf=carried-over.conf}
        else:
            add {span=carried-over.span, conf=carried-over.conf} if non-empty

        for entry in store.overlapping(update.span):
            if entry.overlap(processed):
                continue # already processed

            union, intersection = union(update.span, entry), intersection(update.span, entry)
            pre  = span{union.start_key, intersection.start_key}
            post = span{intersection.end_key, union.end_key}

            delete {span=entry.span, conf=entry.conf}
            if entry.contains(update.span.start_key):
                # First entry overlapping with update.
                add {span=pre, conf=entry.conf} if non-empty
            if entry.contains(update.span.end_key):
                # Last entry overlapping with update.
                carry-over = {span=post, conf=entry.conf}

        if adding:
            add {span=update.span, conf=update.conf} # add ourselves

    add {span=carry-over.span, conf=carry-over.conf} if non-empty

We've extended the randomized testing suite to generate batches of updates at a time. We've also added a few illustrated datadriven tests.

Release note: None
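The carry-over trimming is the subtle step. Here's a minimal Go sketch of just that step in isolation, using the examples from the commit message; types and names are hypothetical stand-ins, not the real spanconfig.Store code.

```go
package main

import "fmt"

// Span is a half-open key interval [Start, End); Entry pairs it with a config.
type Span struct{ Start, End string }
type Entry struct {
	Span Span
	Conf string
}

func overlaps(a, b Span) bool { return a.Start < b.End && b.Start < a.End }

// trimCarryOver decides what to do with a carried-over segment when reaching
// the next update: re-add the gap before the update, and keep carrying any
// remainder that extends past it. Empty results are signalled via the bools.
func trimCarryOver(carried Entry, next Span) (readd, forward Entry, readdOK, forwardOK bool) {
	if !overlaps(carried.Span, next) {
		// No overlap: the whole carried-over segment can be re-added as is.
		return carried, Entry{}, true, false
	}
	if carried.Span.Start < next.Start {
		// The gap between the previous update and this one.
		readd = Entry{Span{carried.Span.Start, next.Start}, carried.Conf}
		readdOK = true
	}
	if next.End < carried.Span.End {
		// The remainder extends past this update; carry it forward.
		forward = Entry{Span{next.End, carried.Span.End}, carried.Conf}
		forwardOK = true
	}
	return readd, forward, readdOK, forwardOK
}

func main() {
	// First example: [c,h):X carried over from processing [a,c):A, next
	// update is [g,i):B -- re-add only [c,g):X, nothing carries forward.
	readd, _, _, fwdOK := trimCarryOver(Entry{Span{"c", "h"}, "X"}, Span{"g", "i"})
	fmt.Println(readd, fwdOK)

	// Second example: next update is [d,f):B -- re-add [c,d):X and carry
	// [f,h):X forward to the update after (i.e. [g,i):C).
	readd, forward, _, _ := trimCarryOver(Entry{Span{"c", "h"}, "X"}, Span{"d", "f"})
	fmt.Println(readd, forward)
}
```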
The only mutable state across concurrent WatchForSQLUpdate calls was the internal buffer, which does not need to hang off the surrounding struct. This obviates the need for the factory pattern we were using -- callers can set up multiple SQLWatchers concurrently as is (see TestSQLWatcherMultiple). This commit simply hides the factory under the package boundary; in a future commit we'll shed it altogether.

This has the benefit of reducing the number of symbols in pkg/spanconfig and making it symmetric with the other spanconfig dependencies typically found together (KVAccessor, SQLTranslator). It's also ever-so-slightly less verbose to use in the upcoming spanconfig.Reconciler (cockroachdb#71994).

Release note: None
Force-pushed 0eda1a2 to e5e0da5
Reviewed 9 of 35 files at r22, 27 of 27 files at r23, 8 of 8 files at r24, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/migration/migrations/seed_tenant_span_configs.go, line 82 at r23 (raw file):
    // This tenant already has span config entries. It was either
    // already migrated (migrations need to be idempotent) or it was
    // created after PreSeedTenantSpanConfigs was activated. THere's
s/THere's/There's/
pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Leave a note about what happens if someone tries to chain calls to this, like kvaccessor.WithTxn(txn1).WithTxn(txn2).

Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.
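As an illustration of the guard being suggested, here's a hedged, self-contained Go sketch. The types and field names are stand-ins (not the real pkg/spanconfig code), and it panics where the actual suggestion is log.Fatalf, purely so the behavior is observable in a plain program.

```go
package main

import "fmt"

// Txn stands in for *kv.Txn in this sketch.
type Txn struct{ name string }

// KVAccessor stands in for the real spanconfig.KVAccessor.
type KVAccessor struct {
	optionalTxn *Txn // nil when not yet scoped to a transaction
}

// WithTxn returns a copy of the accessor scoped to the given txn. Chaining
// calls (accessor.WithTxn(txn1).WithTxn(txn2)) is a programming error: the
// second call would silently discard txn1, potentially masking subtle bugs,
// so we fail loudly instead of supporting it.
func (k *KVAccessor) WithTxn(txn *Txn) *KVAccessor {
	if k.optionalTxn != nil {
		panic("KVAccessor already scoped to a txn (was WithTxn chained multiple times?)")
	}
	return &KVAccessor{optionalTxn: txn}
}

func main() {
	base := &KVAccessor{}
	scoped := base.WithTxn(&Txn{name: "txn1"})
	fmt.Println(scoped.optionalTxn.name) // the scoped accessor carries txn1

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("chaining rejected:", r)
		}
	}()
	scoped.WithTxn(&Txn{name: "txn2"}) // rejected: already scoped
}
```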
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
Previously, irfansharif (irfan sharif) wrote…
It's waiting for a lease to expire somewhere, right?
No, it's transactionally checking against the job table to make sure that no job exists, and then (and only then) transactionally creating the new job. I'm reading the code snippet linked above and the library method linked below, though I may be misunderstanding something. Does our use of txns against the jobs table not give us the true mutual exclusion we're wanting?
Line 28 in e4caa44
func RunningJobExists(
That depends on what removes jobs from the table. If a node begins running a job and then crashes or otherwise becomes unresponsive, how do we recover and ensure that someone else can adopt the job? I assume there's a time-based lease at some level here or none of this would be fault-tolerant.
Maybe that's what the sqlliveness table stores. I see reference to a crdb_internal.sql_liveness_is_alive builtin function, which is checking a lease expiration. server.sqlliveness.ttl and server.sqlliveness.heartbeat imply a 40 second lease that's extended every 5 seconds.
We don't need to bottom out on this in this PR, but every new job that assumes mutual exclusion should be having these discussions at some point.
pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done (in a stand-alone commit). Called this constructor "Deletion" instead, so it reads store.Apply(..., spanconfig.Deletion(sp)).
Thanks, I think this turned out well. It's much easier to read now.
pkg/sql/tenant.go, line 127 at r20 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Good catch, split the migration across two versions. The first will start allowing newly created tenants to install fresh span configs, the second will install configs for tenants that don't have any. Added appropriate tests, etc.
Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?
What's the hazard with removing this version check entirely? Is it that the spanconfig table might not exist? If so, maybe we should be using transactions to avoid the race. Could we transactionally check whether the table exists yet, and perform the insert if it does? And if it doesn't, we know that the migration to seed the table hasn't run yet, so we rely on that.
pkg/sql/tenant.go, line 128 at r23 (raw file):
    if !execCfg.Settings.Version.IsActive(ctx, clusterversion.PreSeedTenantSpanConfigs) {
        return err

Should this be return nil?
pkg/sql/tenant.go, line 440 at r23 (raw file):
    scKVAccessor := execCfg.SpanConfigKVAccessor.WithTxn(txn)
    entries, err := scKVAccessor.GetSpanConfigEntriesFor(ctx, []roachpb.Span{tenantSpan})

What will happen here if the spanconfigs table hasn't been created yet?
Force-pushed bbbf084 to 74d10e3
PTAL -- should've addressed everything.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 48 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this worth supporting? Could doing so mask very subtle bugs? Consider a log.Fatal in this case instead.
Done.
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
I assume there's a time-based lease at some level here or none of this would be fault-tolerant.
cockroach/pkg/spanconfig/spanconfigmanager/manager.go, lines 147 to 154 in efed816:

    // Periodically check if the span config reconciliation job exists and start
    // it if it doesn't.
    timer := timeutil.NewTimer()
    defer timer.Stop()
    triggerJobCheck()
    for {
        timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))
It's simpler: the span config manager (which runs for every sql pod) simply polls the jobs table periodically to make sure that there's at least one running job, creating it if not. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.
pkg/spanconfig/spanconfigtestutils/recorder.go, line 76 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks, I think this turned out well. It's much easier to read now.
Thanks for the suggestion! I agree.
pkg/sql/tenant.go, line 127 at r20 (raw file):
Is that sufficient to avoid the race? If these are minor versions, do we have a guarantee that the fast node that runs both migrations isn't two minor versions ahead of the old node?
I believe it's sufficient. The migrations infrastructure provides two guarantees:
- All cluster versions, and their associated migrations, are stepped through in order
- Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.
These guarantees let you string together arbitrarily complex migrations.
Is it that the spanconfig table might not exist?
Given the SpanConfigurationsTable migration precedes both these migrations, the table is guaranteed to exist.
What's the hazard with removing this version check entirely?
I don't think there's much of a hazard -- the system table already exists (was baked into 21.2). That said, I think it's also fine to gate new functionality (writing of these records) with an explicit cluster version instead of allowing it to happen unconditionally if create_tenant is executed against a 22.1 node in a mixed-version cluster. It makes it slightly easier to reason about.
pkg/sql/tenant.go, line 128 at r23 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this be return nil?
Done.
pkg/sql/tenant.go, line 440 at r23 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What will happen here if the spanconfigs table hasn't been created yet?
That's not possible, see comment above about the guarantees provided by the migrations infrastructure.
Reviewed 11 of 41 files at r14, 69 of 69 files at r16, 34 of 69 files at r17, 13 of 58 files at r18, 35 of 35 files at r22, 1 of 8 files at r24.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/spanconfig/spanconfig.go, line 56 at r17 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done, went with the unified interface. It does however add a wart -- this interface doesn't really apply to the Connector (just panics in the implementation). I'm happy to change it out to the usual factory pattern if you'd like.
Mild preference for switching this to the usual factory pattern. Like you mentioned, there's the wart with the Connector, and it'll also sidestep Nathan's concern elsewhere about chaining WithTxns.
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 121 at r24 (raw file):
    // checkpoint. For changes to, say, RANGE DEFAULT, the RPC request proto is
    // proportional to the number of schema objects.
    func (r *Reconciler) Reconcile(
Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface, and have the contents of the job do what this function does instead?
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 360 at r24 (raw file):
    }
    toDelete, toUpsert := r.storeWithKVContents.Apply(ctx, false /* dryrun */, updates...)
This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail. I don't think this is a problem with the way things are structured currently, but maybe consider making this more explicit by either passing in the store to this function or making a factory for this struct.

@irfansharif I'm not sure why, but reviewable isn't being great to me on this PR. Even for this second batch of comments, it looks like the line numbers are getting lost in the UI, but you can see them in the comment. I don't have that many, but sorry for the trouble here.
Force-pushed 68aee41 to 59d2b43
TFTR!
reviewable isn't being great to me on this PR [..] sorry for the trouble here
Nothing to be sorry about! Thanks for the review. Perhaps the native github reviews work better?
Mild preference to switching this to using the usual factory pattern.
I also like the factory pattern better; is it alright to do it in a follow-on? I even had it in an earlier revision, it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions, and I'd like to start enabling this stuff on master before everyone's gone for the holidays.
Did you consider having fullReconciler and incrementalReconciler implement the Reconciler interface and have the contents of the job do what this function does instead?
Not really; I think I prefer the current breakdown, especially with it being as decomposed as it is into the various stages. Part of the reason is that we're using the full reconciliation process to also cheaply construct a spanconfig.Store with kvaccessor contents. I also like keeping the job as dumb as it is, focused only on retries, and keeping more business logic here where it's easier to test/inject knobs/etc. See below.
This means that the incrementalReconciler is single use, right? For example, you could've applied updates to the store but the RPC below could fail.
Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already GC-ed, for e.g., the jobs need to start reconciliation all over).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
Force-pushed 59d2b43 to d9ccba1
Reviewed 2 of 11 files at r25, 8 of 8 files at r26, 6 of 42 files at r27, 30 of 33 files at r28, 8 of 8 files at r29, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/spanconfig/spanconfigkvaccessor/kvaccessor.go, line 66 at r29 (raw file):
    func (k *KVAccessor) WithTxn(ctx context.Context, txn *kv.Txn) spanconfig.KVAccessor {
        if k.optionalTxn != nil {
            log.Fatalf(ctx, "KVAccessor already scoped to txn (was .WithTxn(...) chained multiple times?")
nit: missing closing paren.
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I assume there's a time-based lease at some level here or none of this would be fault-tolerant.

cockroach/pkg/spanconfig/spanconfigmanager/manager.go, lines 147 to 154 in efed816:

    // Periodically check if the span config reconciliation job exists and start
    // it if it doesn't.
    timer := timeutil.NewTimer()
    defer timer.Stop()
    triggerJobCheck()
    for {
        timer.Reset(checkReconciliationJobInterval.Get(&m.settings.SV))

It's simpler: the span config manager (which runs for every sql pod) simply polls the jobs table periodically to make sure that there's at least one running job, creating it if not. Perhaps @arulajmani or @ajwerner can correct me if I'm wrong, but if we're holding pkg/jobs right in the spanconfig manager, and creating jobs in the way described above, I believe we already have guaranteed mutual exclusion. If not, I'd love to learn how -- it's a guarantee we're relying on in the Reconciler.
I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?
pkg/sql/tenant.go, line 127 at r20 (raw file):
Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.
Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.
pkg/sql/tenant.go, line 440 at r23 (raw file):
Previously, irfansharif (irfan sharif) wrote…
That's not possible, see comment above about the guarantees provided by the migrations infrastructure.
But we don't have a version gate in this code path. I must be missing something.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it might be helpful to ask the inverse question. What happens if the node running the job fails? Someone else must take over, right? Similarly, what happens if the node running the job hits a minute-long GC pause?
The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.
Here are the safeguards:
- When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
- When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
- Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.
The above mechanisms provide mutual exclusion for updates to the records themselves.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.
So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):
Quoted 4 lines of code…

    // [1]: #73399 proposes a new KV request type that would let us more easily
    //      ascertain when a tenant's sql statements have been reconciled in KV,
    //      which we could then use to only suspend pods after all outstanding work
    //      ("reconciliation") has been done.
I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.
Force-pushed d9ccba1 to e3792a5
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
Previously, ajwerner wrote…
The only true claim of mutual exclusion we have is when we update the job checkpoint, and even then, only if we properly use the job object passed into the resumer. It does seem possible that under the right sort of scenario, the current job may fail or the current pod may lose its lease and then a new pod may start executing. Currently none of the job leasing code considers synchrony assumptions.
Here are the safeguards:
- When a job instance goes to update its status, it ensures that the current lease is the lease under which it was resumed.
- When an underlying sqlliveness session is observed to be expired by another instance, that session's record will be deleted.
- Job registry objects will periodically revoke leases from jobs which are leased by sessions which are known to not be live.
The above mechanisms provide mutual exclusion for updates to the records themselves.

Now, there's also the fact that sql pods will kill themselves if they think that they are expired. Sadly, that doesn't actually provide us with any real protection. Firstly, the process may not detect it is expired until another node has already removed it; session expiration occurs in the MVCC time domain and not the clock time domain. Secondly, I don't think we're that principled about detecting the expiration promptly while making an outstanding request to extend the session. Lastly, the process of receiving from a timer and acting on it is not instantaneous.
So, if we want to make deep claims about mutual exclusion we'll need something deeper than what we've got.
Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and so did the lease expiration (don't ask me how, I see it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 87 at r17 (raw file):
Previously, ajwerner wrote…
    // [1]: #73399 proposes a new KV request type that would let us more easily
    //      ascertain when a tenant's sql statements have been reconciled in KV,
    //      which we could then use to only suspend pods after all outstanding work
    //      ("reconciliation") has been done.
I may be misreading your note. The intention of the issue was not really to more easily ascertain when a tenant's sql statements had been reconciled, but rather to more rapidly trigger reconciliation. The reconciliation state itself will be represented by a single row in KV, so the caller can easily either watch that row or poll it to wait for reconciliation. The more pressing desire is to make reconciliation happen promptly after a write which causes changes. If we could do that, then it'd be reasonable to wait for reconciliation before returning to the client.
Got it, updated. Elsewhere @nvanbenschoten mentioned possibly wanting to make pod suspension a more explicit protocol in the database. A request like proposed in #73399 could help us there too: we could block all new sql txns temporarily right as we're about to suspend, ensure we've reconciled, thus suspending only pods with no outstanding work.
pkg/sql/tenant.go, line 127 at r20 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Before bumping the cluster version X, all nodes will have already observed cluster version bump X-1. This includes newly added nodes, etc. See here.
Got it, thanks for the link. I was under the impression that this was only true of major version numbers, but that must be an outdated understanding.
Yea, it's something we codified with the long-running migrations project.
pkg/sql/tenant.go, line 440 at r23 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
But we don't have a version gate in this code path. I must be missing something.
Doh, didn't notice that was for the GCTenantRecord code path -- added.
(Bump.)
Reviewed 30 of 33 files at r31, 8 of 8 files at r32, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/spanconfig/spanconfigreconciler/reconciler.go, line 85 at r17 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Oh! I was under the mistaken assumption that job resumption entailed touching the job record itself transactionally, and so did the lease expiration (don't ask me how, I see it doesn't make sense). Filed #73789 to make sure it's addressed pre-22.1 (+cc @arulajmani).
Thanks for filing. Let's continue the discussion over in #73789 so it doesn't hold up this PR any longer.
I also like the factory pattern better, is it alright to do it in a follow on? I even had it in an earlier revision, it's a very small change. I keep catching painful (only because of slow UI build times, so not actually that painful) skews because of the new cluster versions and I'd like to start enabling this stuff on master before everyone's gone for the holidays.
Sounds good, fine by me!
Even the fullReconciler is. This is what originally motivated the dryrun option, thinking I'd do diffs := store.apply(..., dryrun=true); issue-rpcs(diffs); store.apply(..., dryrun=false), but it felt unnecessary in the end. In the happy case things will succeed. In the unhappy case, it's more than fine to retry at a higher level (span config job/manager), and preferable IMO because it keeps the retry/recovery logic out of the reconciliation code and at the level where it's going to exist in some form anyway (if the timestamps being watched are already GC-ed, for e.g., the jobs need to start reconciliation all over).
I'm fine with retrying at a higher level, but we should add a comment on the {incremental,full}Reconcilers that these things are meant to be single use. Better yet, maybe we can be defensive about this in code by either using a factory to construct these or checking on some state to ensure {incremental,full}Reconciler.reconcile is only ever called once.
For newly created tenants, we want to ensure hard splits on tenant boundaries. The source of truth for split points in the span configs subsystem is the contents of system.span_configurations. To ensure hard splits, we insert a single key record at the tenant prefix.

In a future commit we'll introduce the spanconfig.Reconciler process, which runs per tenant and governs all config entries under each tenant's purview. This has implications for the initial record we're talking about (the record might get cleaned up, for e.g.); we'll explore them in tests for the Reconciler itself.

Creating a single key record is easy enough -- we could've written directly to system.span_configurations. When a tenant is GC-ed, however, we need to clear out all tenant state transactionally. To that end we plumb a txn-scoped KVAccessor into the planner where crdb_internal.destroy_tenant is executed. This lets us easily delete all abandoned tenant span config records.

Note: We get rid of spanconfig.experimental_kvaccessor.enabled. Access to the span configs infrastructure is already sufficiently gated through the env var. Now that crdb_internal.create_tenant attempts to write through the KVAccessor, it's cumbersome to have to enable the setting manually in every multi-tenant test (increasingly the default) that enables some part of the span configs infrastructure.

---

This commit also needs a migration -- for existing clusters with secondary tenants, when upgrading we need to install this initial record at the tenant prefix for all extant tenants (and make sure to continue doing so for subsequent newly created tenants). This is to preserve the hard-split-on-tenant-boundary invariant we wish to provide. It's possible for an upgraded multi-tenant cluster to have dormant sql pods that have never reconciled. If KV switches over to the span configs subsystem, splitting only on the contents of system.span_configurations, we'll fail to split on all tenant boundaries.

To this end we introduce clusterversion.SeedTenantSpanConfigs, which allows us to seed span config data for secondary tenants. The associated migration seeds entries for existing tenants.

Release note: None
Reconciler is responsible for reconciling a tenant's zone configs (SQL construct) with the cluster's span configs (KV construct). It's the central engine for the span configs infrastructure; a single Reconciler instance is active for every tenant in the system.

```go
type Reconciler interface {
    // Reconcile starts the incremental reconciliation process from
    // the given checkpoint. If it does not find MVCC history going
    // far back enough[1], it falls back to a scan of all
    // descriptors and zone configs before being able to do more
    // incremental work. The provided callback is invoked with
    // timestamps that can be safely checkpointed. A future
    // reconciliation attempt can make use of this timestamp to
    // reduce the amount of necessary work (provided the MVCC
    // history is still available).
    //
    // [1]: It's possible for system.{zones,descriptor} to have been
    // GC-ed away; think suspended tenants.
    Reconcile(
        ctx context.Context,
        checkpoint hlc.Timestamp,
        callback func(checkpoint hlc.Timestamp) error,
    ) error
}
```

Let's walk through what it does. At a high level, we maintain an in-memory data structure that's up-to-date with the contents of KV (at least the subset of spans we have access to, i.e. the keyspace carved out for our tenant ID). We watch for changes to SQL state (descriptors, zone configs), translate the SQL updates to the flattened span+config form, and "diff" the updates against our data structure to see if there are any changes we need to inform KV of. If so, we do, and ensure that our data structure is kept up-to-date. We continue watching for future updates and repeat as necessary.

There's only a single instance of the Reconciler running for a given tenant at a given point in time (mutual exclusion/leasing is provided by the jobs subsystem). We needn't worry about contending writers, or the KV state being changed from underneath us. What we do have to worry about, however, is suspended tenants not reconciling while suspended.
It's possible for a suspended tenant's SQL state to be GC-ed away at older MVCC timestamps; when watching for changes, we could fail to observe tables/indexes/partitions getting deleted. Left as is, this would result in us never issuing corresponding deletion requests for the dropped span configs -- we'd be leaving orphaned span configs lying around (taking up storage space and creating pointless empty ranges). A "full reconciliation pass" is our attempt to find all these extraneous entries in KV and delete them.

We can use our span config data structure here too, one that's pre-populated with the contents of KV. We translate the entire SQL state into constituent spans and configs, and diff against our data structure to generate the KV updates we then apply. We follow this by clearing out all these spans in our data structure, leaving behind only the extraneous entries found in KV -- entries we can then simply issue deletes for. Release note: None
By using a type definition instead of a stand-alone type, we can reduce the amount of code needed to convert from a SpanConfigEntry to an Update (something we do frequently). It also makes the conversion less error-prone. While here, we introduce two helpful constructors for the two kinds of updates we're typically interested in -- additions and deletions. Release note: None
Done. Thanks for all the reviews, this code was fun/exciting to write and iterate on -- looking forward to it flailing! I'm going to close #67679 and break out the remaining small items into individual quick-win issues. bors r+
Build succeeded:
73746: spanconfig: visualize differences with gossiped SystemConfigSpan r=irfansharif a=irfansharif **spanconfig: visualize differences with gossiped SystemConfigSpan** We introduce a test-only package that compares the span configs infrastructure against the gossiped SystemConfigSpan it was designed to replace. The datadriven test spits out a visual "diff" of the two subsystems, indicating exactly where they differ (or not!) in terms of implied range boundaries and the configurations to apply. It does so by having both systems run side-by-side and scanning their contents. These range boundaries and configurations have second-order effects (various actions are executed on behalf of ranges by internal queues), but we don't bother with any of that in the test. Of particular note is the behavior of each system in the presence of secondary tenants that create arbitrary schemas and zone configs. This after all is what motivated the new infrastructure. Here's an example, as a diff, that shows how the system config span is unable to split within an active tenant's boundary unlike the new thing: ```diff configs version=current offset=37 limit=5 ---- ... /Tenant/10 range default /Tenant/11/Table/3 range default /Tenant/11/Table/4 range default /Tenant/11/Table/5 range default /Tenant/11/Table/6 range default ... configs version=legacy offset=43 ---- ... /Tenant/10 range default /Tenant/11 range default diff offset=31 limit=8 ---- --- gossiped system config span (legacy) +++ span config infrastructure (current) ... @@ -43,4 +37,37 @@ /Table/47 range system /Tenant/10 range default -/Tenant/11 range default +/Tenant/11/Table/3 range default +/Tenant/11/Table/4 range default +/Tenant/11/Table/5 range default +/Tenant/11/Table/6 range default ... ``` This commit also gets rid of the ShadowReader. We were using it previously to manually diagnose differences between the two subsystems. With the improved testing, we can get rid of the painter's tape. 
--- **spanconfig: rm benign differences with SystemConfigSpan** In the previous commit we demonstrated visually the differences between the gossip-backed system config span and the new infrastructure. This commit gets rid of the benign differences. We could also just not do anything here, but it's only a minor change. Release note: None --- Only the last two commits are of interest here, the earlier ones are from #71994. Co-authored-by: irfan sharif <[email protected]>
spanconfig: introduce spanconfig.Reconciler
Reconciler is responsible for reconciling a tenant's zone configs (SQL
construct) with the cluster's span configs (KV construct). It's the
central engine for the span configs infrastructure; a single Reconciler
instance is active for every tenant in the system.
Let's walk through what it does. At a high-level, we maintain an
in-memory data structure that's up-to-date with the contents of the KV
(at least the subset of spans we have access to, i.e. the keyspace
carved out for our tenant ID). We watch for changes to SQL state
(descriptors, zone configs), translate the SQL updates to the flattened
span+config form, "diff" the updates against our data structure to see
if there are any changes we need to inform KV of. If so, we do, and
ensure that our data structure is kept up-to-date. We continue watching
for future updates and repeat as necessary.
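The watch-translate-diff loop above, reduced to a toy Go sketch. The `update`/`store` types and string-keyed spans are simplifications invented here, not the real spanconfig types (the actual store is interval-based):

```go
package main

import "fmt"

// update is a simplified span-config update: a span (keyed by its
// start key here, for brevity) and its config; empty conf = deletion.
type update struct {
	span, conf string
}

// store is a toy stand-in for the in-memory span config structure.
type store map[string]string

// diffAndApply compares incoming SQL-derived updates against the
// store, applies the ones that change anything, and returns only the
// updates KV actually needs to hear about.
func (s store) diffAndApply(updates []update) (toKV []update) {
	for _, u := range updates {
		if s[u.span] == u.conf {
			continue // no change; nothing to tell KV
		}
		if u.conf == "" {
			delete(s, u.span)
		} else {
			s[u.span] = u.conf
		}
		toKV = append(toKV, u)
	}
	return toKV
}

func main() {
	s := store{"/Table/50": "num_replicas=3"}
	out := s.diffAndApply([]update{
		{"/Table/50", "num_replicas=3"}, // no-op, filtered out
		{"/Table/51", "num_replicas=5"}, // new entry, forwarded to KV
	})
	fmt.Println(len(out)) // 1
}
```

The key property is that no-op updates never turn into KV writes, which is what keeps the steady state cheap.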
There's only a single instance of the Reconciler running for a given
tenant at a given point in time (mutual exclusion/leasing is provided by
the jobs subsystem). We needn't worry about contending writers, or the
KV state being changed from underneath us. What we do have to worry
about, however, is suspended tenants not reconciling while suspended.
It's possible for a suspended tenant's SQL state to be GC-ed
away at older MVCC timestamps; when watching for changes, we could fail
to observe tables/indexes/partitions getting deleted. Left as is, this
would result in us never issuing corresponding deletion requests for
the dropped span configs -- we'd be leaving orphaned span configs lying
around (taking up storage space and creating pointless empty ranges). A
"full reconciliation pass" is our attempt to find all these extraneous
entries in KV and to delete them.
We can use our span config data structure here too, one that's
pre-populated with the contents of KV. We translate the entire SQL state
into constituent spans and configs, diff against our data structure to
generate KV updates that we then apply. We follow this with clearing out
all these spans in our data structure, leaving behind all extraneous
entries to be found in KV -- entries we can then simply issue deletes
for.
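That full pass, as a toy Go sketch -- string-keyed maps stand in for the real interval-based data structure, and every name here is illustrative rather than the actual API:

```go
package main

import "fmt"

// kvStore is a toy model of the span config entries in KV,
// pre-populated into the in-memory structure; keys are spans.
type kvStore map[string]string

// fullReconcile walks the full SQL state (modeled as the desired
// span->config map), clears each accounted-for span from the store,
// and returns whatever's left over as orphans to delete from KV.
func fullReconcile(store kvStore, desired map[string]string) (orphans []string) {
	for span, conf := range desired {
		if store[span] != conf {
			// would issue a KV update for this span here
		}
		delete(store, span) // clear spans we've accounted for
	}
	for span := range store {
		orphans = append(orphans, span) // extraneous entries -> deletes
	}
	return orphans
}

func main() {
	store := kvStore{
		"/Table/50": "a",
		"/Table/51": "b", // table dropped while the tenant was suspended
	}
	desired := map[string]string{"/Table/50": "a"}
	fmt.Println(fullReconcile(store, desired)) // [/Table/51]
}
```

Anything the SQL translation never touches survives the clearing step, which is exactly what makes it discoverable as an orphan.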
----
server,kvaccessor: record span configs during tenant creation/gc
For newly created tenants, we want to ensure hard splits on tenant
boundaries. The source of truth for split points in the span configs
subsystem is the contents of system.span_configurations. To ensure
hard splits, we insert a single-key record at the tenant prefix. In a
future commit we'll introduce the spanconfig.Reconciler process, which
runs per tenant and governs all config entries under each tenant's
purview. This has implications for this initial record we're talking
about (this record might get cleaned up for e.g.); we'll explore it in
tests for the Reconciler itself.
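What that single-key record amounts to, as a toy Go sketch -- `tenantPrefix` and `seedSpanConfig` are illustrative names, and real CockroachDB keys are binary encoded rather than strings:

```go
package main

import "fmt"

// tenantPrefix is an illustrative stand-in for the real tenant key
// prefix computation.
func tenantPrefix(tenantID uint64) string {
	return fmt.Sprintf("/Tenant/%d", tenantID)
}

// seedSpanConfig returns the single-key span recorded at tenant
// creation: [prefix, prefix.Next()). A record over this span forces a
// hard split at the tenant boundary. Next() is modeled here as
// appending a zero byte.
func seedSpanConfig(tenantID uint64) (startKey, endKey string) {
	prefix := tenantPrefix(tenantID)
	return prefix, prefix + "\x00"
}

func main() {
	start, end := seedSpanConfig(11)
	fmt.Println(start)               // /Tenant/11
	fmt.Println(end == start+"\x00") // true
}
```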
Creating a single-key record is easy enough -- we could've written
directly to system.span_configurations. When a tenant is GC-ed,
however, we need to clear out all tenant state transactionally. To that
end we plumb in a txn-scoped KVAccessor into the planner where
crdb_internal.destroy_tenant is executed. This lets us easily delete
all abandoned tenant span config records.
Note: we get rid of spanconfig.experimental_kvaccessor.enabled. Access
to the span configs infrastructure is already sufficiently gated
through COCKROACH_EXPERIMENTAL_SPAN_CONFIGS. Now that
crdb_internal.create_tenant attempts to write through the KVAccessor,
it's cumbersome to have to enable the setting manually in every
multi-tenant test (increasingly the default) that enables some part of the
span configs infrastructure.
This commit introduces the need for a migration -- for existing clusters
with secondary tenants, when upgrading we need to install this initial
record at the tenant prefix for all extant tenants (and make sure to
continue doing so for subsequent newly created tenants). This is to
preserve the hard-split-on-tenant-boundary invariant we wish to provide.
It's possible for an upgraded multi-tenant cluster to have dormant sql
pods that have never reconciled. If KV switches over to the span configs
subsystem, splitting only on the contents of
system.span_configurations, we'll fail to split on all tenant
boundaries. We'll introduce this migration in a future PR (before
enabling span configs by default).
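What such a migration amounts to, as a toy sketch -- tenant discovery and the existing-record check are modeled with plain slices/maps, and none of these names are the real APIs:

```go
package main

import "fmt"

// seedTenantSpanConfigs sketches the migration: for every extant
// tenant missing a record at its prefix, install the single-key seed
// entry. tenantIDs and existing stand in for catalog/KV reads.
func seedTenantSpanConfigs(tenantIDs []uint64, existing map[string]bool) (seeded []string) {
	for _, id := range tenantIDs {
		prefix := fmt.Sprintf("/Tenant/%d", id)
		if existing[prefix] {
			continue // already has a boundary record; nothing to do
		}
		existing[prefix] = true // install the seed record
		seeded = append(seeded, prefix)
	}
	return seeded
}

func main() {
	got := seedTenantSpanConfigs(
		[]uint64{10, 11},
		map[string]bool{"/Tenant/10": true}, // tenant 10 already seeded
	)
	fmt.Println(got) // [/Tenant/11]
}
```

Being a no-op for already-seeded tenants is what makes the migration safe to run alongside tenants created after the new version, which seed themselves at creation time.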
Release note: None
Only the last two commits here are of interest (first one is from #73531).