rfcs: introduce rfc for multi-tenant zone configs #66348

Merged
merged 1 commit on Dec 6, 2021

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Jun 11, 2021

Link to the RFC text.

Zone configs dictate data placement, replication factor, and GC behavior; they power CRDB's multi-region abstractions. They're disabled for secondary tenants due to scalability bottlenecks in how they're currently stored and disseminated. They also prevent writes before DDL in the same transaction, implement inheritance and key-mapping by coupling KV and SQL in an undesirable manner (making for code that's difficult to read and write), and don't naturally extend to a multi-tenant CRDB.

This RFC proposes a re-work of the zone configs infrastructure to enable its use for secondary tenants, and in doing so addresses the problems above. We introduce a distinction between SQL zone configs (attached to SQL objects, living in the tenant keyspace) and KV zone configs (applied to an arbitrary keyspan, living in the host tenant), re-implement inheritance by introducing unique IDs for each zone configuration object, and notify each node of updates through rangefeeds.
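For illustration only, a rough sketch (hypothetical Go types, not the RFC's actual definitions) of the two notions this separation introduces:

```go
package zcfgsketch

// SQLZoneConfig is attached to a SQL object in the tenant keyspace and
// participates in inheritance via its parent.
type SQLZoneConfig struct {
	ObjectID    uint64 // descriptor ID of the SQL object
	ParentID    uint64 // object whose config fills in unset fields
	NumReplicas *int32 // nil means "inherit from parent"
	GCTTLSecs   *int32 // nil means "inherit from parent"
}

// KVSpanConfig is the flattened form KV acts on: no inheritance, applied to an
// arbitrary keyspan stored in the host tenant.
type KVSpanConfig struct {
	StartKey, EndKey []byte
	NumReplicas      int32
	GCTTLSecs        int32
}
```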

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner June 11, 2021 05:25
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 210610.tenant-zcfgs branch from 79750b2 to 2290bc9 on June 11, 2021 05:25
@irfansharif irfansharif requested review from nvanbenschoten, ajwerner and a team June 11, 2021 05:26
Contributor

@ajwerner ajwerner left a comment


Just a quick first pass. I feel good about where this is headed.

I find the re-use of the term "zone config" in "SQL zone config" and "KV zone config" forces the document to make lots of clarifications. How would you feel about introducing a new term that's easier to disambiguate? What if we called the new KV-level config a "span config" (scfgs), as you do in the API? I have a feeling it would increase readability both here and in the code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


docs/RFCS/20210610_tenant_zone_configs.md, line 80 at r1 (raw file):

   The determination of split points and zone configs is the last remaining
   use of gossip for the `SystemConfigSpan` (we used to depend on it for table
   leases, but that's [no longer the

FYI also cluster settings #58362


docs/RFCS/20210610_tenant_zone_configs.md, line 95 at r1 (raw file):

   apply over a given key range.

With multi-tenancy, we have a few more:

Replication reports, the mechanism by which we understand whether zone configs are fulfilled, could feature here.


docs/RFCS/20210610_tenant_zone_configs.md, line 117 at r1 (raw file):

     split points implied by a tenant's zcfgs.
6. The optimizer uses zcfgs for [locality-aware
   planning](https://github.com/cockroachdb/cockroach/blob/692fa83ce377c86cf1b6f865a7583a383c458ce2/pkg/sql/opt_catalog.go#L455-L466).

nit: consider using "reference-style links", which let you put long URLs at the bottom so they don't break the textual flow. https://www.markdownguide.org/basic-syntax/#reference-style-links


docs/RFCS/20210610_tenant_zone_configs.md, line 145 at r1 (raw file):

When KV decides to promote a given SQL zcfg to a KV one

I find this language confusing. Isn't it SQL that promotes its zone configs to a KV one via the reconciliation process?


docs/RFCS/20210610_tenant_zone_configs.md, line 282 at r1 (raw file):

about it.

## Reconciliation between SQL and KV zone configs

I think this section could use some more tightening in terms of what actually happens upon a rangefeed event. I'd break this down into the various levels of sophistication an implementation could take on.


docs/RFCS/20210610_tenant_zone_configs.md, line 617 at r1 (raw file):

- For sequences, we might want them to be in the same ranges as the tables
  they're most frequently used in tandem with. Now that we're not
  unconditionally splitting on tenant boundaries, we could be smarter about

s/tenant/table/?


docs/RFCS/20210610_tenant_zone_configs.md, line 619 at r1 (raw file):

  unconditionally splitting on tenant boundaries, we could be smarter about
  this.
  - For the host tenant we also unconditionally split around the sequence

One reason I don't think the sequences matter as much is that they are non-transactional; there's no missed 1PC here.

@irfansharif irfansharif force-pushed the 210610.tenant-zcfgs branch from 2290bc9 to 4678925 on June 11, 2021 22:09
Contributor Author

@irfansharif irfansharif left a comment


What if we called the new KV-level config a "span config" (scfgs), as you do in the API?

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)


docs/RFCS/20210610_tenant_zone_configs.md, line 117 at r1 (raw file):

Previously, ajwerner wrote…

nit: consider using "reference style links" which allows you to put long URLs at the bottom to not break the textual flow. https://www.markdownguide.org/basic-syntax/#reference-style-links

Nice, TIL. Done.


docs/RFCS/20210610_tenant_zone_configs.md, line 145 at r1 (raw file):

Previously, ajwerner wrote…
When KV decides to promote a given SQL zcfg to a KV one

I find this language confusing. Isn't it SQL that promotes its zone configs to a KV one via the reconciliation process?

Done.

@irfansharif irfansharif requested a review from dt June 11, 2021 22:12
@irfansharif
Contributor Author

@dt, it'd be useful to have you sanity check the "backup/restore considerations" section, which talks about how zcfgs would be backed up/restored (tl;dr -- maintaining the same semantics as today around db/table restores vs. full cluster restores).

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @irfansharif)


docs/RFCS/20210610_tenant_zone_configs.md, line 95 at r1 (raw file):

Previously, ajwerner wrote…

The fact that replication reports, the mechanism by which we understand whether zone configs are fulfilled etc could feature here.

This does raise an interesting question about what source of truth the replication reports should consult in this new world. The decomposition here creates room for kv to be in conformance with a kv-zcfg but for the kv-zcfg to be lagging behind the corresponding sql-zcfg. Is this considered "conformance"? Probably not. But then we either need conformance reports to be aware of sql-zcfgs or for "conformance" to be redefined as a conjunction between kv-zcfgs being in conformance and coherence between the kv-zcfgs and sql-zcfgs. How does this impact the UX of conformance reports?


docs/RFCS/20210610_tenant_zone_configs.md, line 145 at r1 (raw file):

Previously, ajwerner wrote…
When KV decides to promote a given SQL zcfg to a KV one

I find this language confusing. Isn't it SQL that promotes its zone configs to a KV one via the reconciliation process?

Agreed. I think this should say "When KV is told to ..."


docs/RFCS/20210610_tenant_zone_configs.md, line 152 at r1 (raw file):

etc).

TODO: Create a diagram for this overview?

This would be very helpful


docs/RFCS/20210610_tenant_zone_configs.md, line 191 at r1 (raw file):

is later considered by KV

This is hitting the same confusion as above. We should re-word to make the initiator of the action more clear.


docs/RFCS/20210610_tenant_zone_configs.md, line 230 at r1 (raw file):

sub-zone in the parent table's zone config. Storing sql-zcfgs in descriptors
allows us to get rid of the concept of subzones, an index/partition's zcfg can
be stored in its own descriptor (this is not in the critical path for this RFC).

IndexDescriptors are stored within TableDescriptors. Are you suggesting that we split them out and give indexes unique IDs? Or just that the IndexDescriptor proto will be given its own SQLZoneConfig field and the reconciliation loop will know how to walk through all possible sql-zcfg locations in a given TableDescriptor?


docs/RFCS/20210610_tenant_zone_configs.md, line 235 at r1 (raw file):

slightly larger

Only if a zone is actually set on a given descriptor, right? Otherwise the encoding will be identical and we'll pay a mere 8 bytes for the pointer?


docs/RFCS/20210610_tenant_zone_configs.md, line 245 at r1 (raw file):

[pseudo](https://github.com/cockroachdb/cockroach/blob/ce1c68397db8ebc222ed201fef1f9ca92485ddcd/pkg/keys/constants.go#L379-L385)
descriptor IDs allocated for them. We'll ensure that all instances where we
deal with the set of descriptors work with these placeholders. This change will

Out of curiosity, do we expect the logic necessary to deal with these pseudo-descriptors to be more or less invasive than the logic to deal with the old pseudo-zones?


docs/RFCS/20210610_tenant_zone_configs.md, line 272 at r1 (raw file):

convenient way to determine what zcfgs apply to a given key/key-range, and also
helps us conveniently answer what all the split points are: the set of all
`start_key`s.

Are we planning on creating a kv-zcfg for all tables in the system tenant? If not, will we no longer mandate splits between their tables?


docs/RFCS/20210610_tenant_zone_configs.md, line 317 at r1 (raw file):

The reconciliation loop will only update mismatched entries in kv-zcfgs through
a streaming RPC, with KV maintaining a txn on the other end. The streaming RPC

KV will hold open a transaction while waiting on input from SQL? That may not be the best idea, as it means that SQL could hold resources in KV indefinitely. It's probably better to replace a streaming RPC with a batched RPC, where the lifetime of the corresponding txn on the KV side never outlives a single batch.
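A sketch of the batched shape being suggested here, with hypothetical names (`SpanConfigUpdate`, `updateSpanConfigsFn`): SQL sends bounded batches of diffs, and the KV side would apply each batch within its own short-lived transaction, so nothing in KV waits on SQL.

```go
package reconcile

// SpanConfigUpdate is a single diff computed by the SQL-side reconciliation
// loop; a nil Config means "delete the entry for this span".
type SpanConfigUpdate struct {
	StartKey, EndKey []byte
	Config           []byte
}

// updateSpanConfigsFn stands in for the RPC; on the KV side each call would be
// applied within its own short-lived transaction.
type updateSpanConfigsFn func(batch []SpanConfigUpdate) error

const maxBatchSize = 128

// reconcileDiffs sends the diffs in bounded batches so no KV-side transaction
// ever outlives a single batch.
func reconcileDiffs(diffs []SpanConfigUpdate, update updateSpanConfigsFn) error {
	for len(diffs) > 0 {
		n := maxBatchSize
		if len(diffs) < n {
			n = len(diffs)
		}
		if err := update(diffs[:n]); err != nil {
			return err // surface the error rather than retrying forever
		}
		diffs = diffs[n:]
	}
	return nil
}
```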


docs/RFCS/20210610_tenant_zone_configs.md, line 351 at r1 (raw file):

to what was provided. If the pod finds that the counter was incremented from
underneath it, it must be the case that another pod acquired the lease from
underneath it (we'll abandon the reconciliation attempt).

Want to add this counter to the protos above?


docs/RFCS/20210610_tenant_zone_configs.md, line 383 at r1 (raw file):

`system.span_configurations` directly. The store could also periodically
persist this cache (along with the resolved timestamp); on restarts it would
then be able to continue where it left off.

This section could use a little more detail about the consistency model we expect to expose to clients of the SpanConfig interface. Will updates to spans always be seen in order? Are we using rangefeed checkpoints to guarantee this? Do we no longer need to provide cross-key ordering guarantees because inheritance is computed above this level?


docs/RFCS/20210610_tenant_zone_configs.md, line 404 at r1 (raw file):

to have applied to them the "stale" zcfgs of the parent database. If the
"global" aspect of this fallback kv-zcfg proves to be undesirable, we can later
make this per-tenant.

How does this relate to a tenant's default zone? Does this zone "fill in the gaps" between all concrete kv-zcfgs? Does it do so implicitly or explicitly?


docs/RFCS/20210610_tenant_zone_configs.md, line 430 at r1 (raw file):

best-effort convergence or similar). If we want to allow more fine-grained
control over specific tenant limits, we can consult limits set in another table
writable only by the host tenant.

Have we explored what other limits we will want in the v1 of this work? I'd imagine there are some fairly easy ways for tenants to abuse zone configurations if we're not careful. For instance, setting a replication factor of 10k or dropping the gc.ttl to 1s. These are probably easier limits to enforce, as they are per-zone and don't require coordination to enforce.
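As a sketch of the kind of per-zone check being described (the limit values and field names here are made up, not decided limits):

```go
package zcfglimits

import "fmt"

// Made-up limits of the kind discussed here; real values would be chosen (and
// enforced) by the host tenant.
const (
	maxNumReplicas = 7
	minGCTTLSecs   = 600
)

type zoneConfig struct {
	NumReplicas  int32
	GCTTLSeconds int32
}

// validate sketches the per-zone checks that need no cross-zone coordination
// and could therefore run when the zone config is set.
func validate(z zoneConfig) error {
	if z.NumReplicas > maxNumReplicas {
		return fmt.Errorf("num_replicas %d exceeds the limit of %d", z.NumReplicas, maxNumReplicas)
	}
	if z.GCTTLSeconds < minGCTTLSecs {
		return fmt.Errorf("gc.ttlseconds %d is below the minimum of %d", z.GCTTLSeconds, minGCTTLSecs)
	}
	return nil
}
```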


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r1 (raw file):

to update the schema/sql-zcfgs accordingly?

We'll note that a form of this problem already exists -- it's possible today to

Equating eventual conformance (due to an async process) with a zone being indefinitely rejected by an admin-defined limit seems wrong to me. I think we'll need to find a way to synchronously surface hard rejections of zone config updates to the user when they run a zone config update. The split limit is one example, but I think something like a replication factor limit is more interesting. If we say "tenants can't set a replication factor of more than 5", should we allow a user to run `ALTER TABLE t CONFIGURE ZONE USING num_replicas = 17`? If so, are we going to ask them to check an internal table to find out that we're ignoring them?

We discussed the idea of kicking off an eager reconciliation pass when a sql-zcfg is created/updated/deleted and after its transaction has committed to get access to KV errors. Did that go anywhere?

Either way, this discussion also introduces the question of whether anyone will roll back a faulty sql-zcfg, or whether their SQL pod will just spin indefinitely, trying in vain to update the corresponding kv-zcfg and repeatedly getting rejected.

This also raises the question: if I have one faulty zone config, can I make any other valid changes to other tables, or am I stuck?


docs/RFCS/20210610_tenant_zone_configs.md, line 517 at r1 (raw file):

behavior.

If zcfgs are being moved into descriptors, we'll want to ensure that we clear

Is this what we want? I can see an argument that the decoupling of sql-zcfgs and kv-zcfgs allows us to accomplish what we've wanted all along, which is to retain table-level and database-level zone configurations across backups. We would hold on to them in the backup descriptor, and then run the reconciliation loop upon restore to install them. Perhaps @dt could add some color here?


docs/RFCS/20210610_tenant_zone_configs.md, line 546 at r1 (raw file):

  RPC.
- On the KV side we'll:
  - Introduce `system.span_configurations` on the host tenant

@ajwerner do you think this table needs a stable ID?


docs/RFCS/20210610_tenant_zone_configs.md, line 546 at r1 (raw file):

  RPC.
- On the KV side we'll:
  - Introduce `system.span_configurations` on the host tenant

We'll need to ensure that rangefeeds are enabled on this table, even if the kv.rangefeed.enabled cluster setting is disabled. We do this in other places, IIRC.


docs/RFCS/20210610_tenant_zone_configs.md, line 557 at r1 (raw file):

      (optionally) install configs for subzones into partition/index descs as
      well
  -   Implement generating kv-zcfgs using existing sql-zcfgs representation

Do you see a world where these are not the same proto representation? Is there any need for them to store disjoint sets of information?


docs/RFCS/20210610_tenant_zone_configs.md, line 610 at r1 (raw file):

  pattern we'd want to discourage.

## Future work

How close does this work get us to being able to remove all of the special logic around the system config span?


docs/RFCS/20210610_tenant_zone_configs.md, line 617 at r1 (raw file):

Previously, ajwerner wrote…

s/tenant/table/?

+1

But this also brings up an interesting point. We do still want to split on tenant boundaries. Is this enforced by giving each tenant their own default zone config in the system.span_configurations table? And that's ok, even though the defaults will all be the same?


docs/RFCS/20210610_tenant_zone_configs.md, line 650 at r1 (raw file):

## Unresolved questions

- Should we disallow secondary setting zone configs directly? Leave it only

"secondary tenants"

This is a good question. I'm curious to hear @ajstorm's thoughts on the matter.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @irfansharif, and @nvanbenschoten)


docs/RFCS/20210610_tenant_zone_configs.md, line 95 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This does raise an interesting question about what source of truth the replication reports should consult in this new world. The decomposition here creates room for kv to be in conformance with a kv-zcfg but for the kv-zcfg to be lagging behind the corresponding sql-zcfg. Is this considered "conformance"? Probably not. But then we either need conformance reports to be aware of sql-zcfgs or for "conformance" to be redefined as a conjunction between kv-cfgs being in conformance and coherence between the kv-zcfgs and sql-zcfgs. How does this impact the UX of conformance reports?

My current thinking is that, for the user (and, to some extent, to not break backwards compatibility), conformance reports should be in terms of SQL zone configs (as they are today), and then we should provide sufficient visibility into what might be going wrong when telling KV about the configs.


docs/RFCS/20210610_tenant_zone_configs.md, line 272 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Are we planning on creating a kv-zcfg for all tables in the system tenant? If not, will we no longer mandate splits between their tables?

I think that'd be the idea. Or at least, it's one or the other and I'd hope we'd start with the more conservative approach.


docs/RFCS/20210610_tenant_zone_configs.md, line 286 at r1 (raw file):

In multi-tenant CRDB, each active tenant has one or more SQL pods talking to
the shared KV cluster. Every tenant will have a single "leaseholder" SQL pod
responsible for asynchronously<sup>[3](#f3)</sup> reconciling the tenant's

What if we made the protocol semi-synchronous? It seems reasonable for operations that change zone configs to be able to wait for the relevant reconciliation event and learn about any possible failures to reconcile. This could be a reasonable use case for an "outbox" of our own. Here's a straw-man sketch of what I have in mind.

CREATE TABLE system.zone_config_checkpoint (
    k BOOL AS (true) STORED PRIMARY KEY, -- cheeky way to say it's a single row
    lease BYTES,
    checkpoint DECIMAL
);

CREATE TABLE system.zone_config_events (
    -- have not decided the right primary key here.
    -- could do it as a singleton, probably better not to.
    descriptors INT[]
);

CREATE TABLE system.zone_config_reconciliation_errors (
    timestamp DECIMAL,
    id STRING,   -- some sort of ID that will help the user
    error STRING -- some error
);

The leaseholder registers itself in the first table and then updates it with checkpoints when something happens, or periodically. The writers of new configs will write the relevant descriptors to the "events" table, which is an "outbox" as people are using that word these days. Then, when reconciliation happens, any errors that occur can be written to the errors table. We could imagine having all the nodes watch the checkpoint and errors tables and learn about any errors. A nice thing here is that we can bound what needs to get reconciled and we know how caught up we are. Also, we can make sure only the leaseholder moves the checkpoint.

There are other similar approaches to get at the same thing.
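For concreteness, a sketch of the leaseholder side of this straw-man protocol, with hypothetical interfaces standing in for access to the tables above:

```go
package outbox

// tables abstracts access to the three tables sketched above.
type tables interface {
	// ReadEvents returns descriptor IDs written to the events table after the
	// given checkpoint, along with the new checkpoint.
	ReadEvents(afterCheckpoint string) (descIDs []int64, newCheckpoint string, err error)
	RecordError(checkpoint string, descID int64, msg string) error
	// AdvanceCheckpoint fails if the lease no longer belongs to this pod.
	AdvanceCheckpoint(lease []byte, checkpoint string) error
}

// runOnce drains the outbox once: reconcile every referenced descriptor,
// record any errors, and only then advance the checkpoint.
func runOnce(t tables, lease []byte, checkpoint string, reconcile func(descID int64) error) (string, error) {
	ids, next, err := t.ReadEvents(checkpoint)
	if err != nil {
		return checkpoint, err
	}
	for _, id := range ids {
		if rerr := reconcile(id); rerr != nil {
			// Errors are surfaced via the errors table, not retried forever.
			if err := t.RecordError(next, id, rerr.Error()); err != nil {
				return checkpoint, err
			}
		}
	}
	if err := t.AdvanceCheckpoint(lease, next); err != nil {
		return checkpoint, err
	}
	return next, nil
}
```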


docs/RFCS/20210610_tenant_zone_configs.md, line 317 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

KV will hold open a transaction while waiting on input from SQL? That may not be the best idea, as it means that SQL could hold resources in KV indefinitely. It's probably better to replace a streaming RPC with a batched RPC, where the lifetime of the corresponding txn on the KV side never outlives a single batch.

I like that approach. Importantly, I don't think we should ever, in this scheme, need to send a large batch to ensure we have safe updates of the config. It'd be good to lay out the safety properties here. In the past we've said that we would like it to be the case that keys in KV always have a config that at some point did indeed apply to them.


docs/RFCS/20210610_tenant_zone_configs.md, line 351 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Want to add this counter to the protos above?

This counter smells a bit to me. I think it needs some further justification. On the whole, this section feels like the area we should most focus on sharpening. Let's understand the tradeoffs between fully async and semi-synchronous reconciliation. Let's understand the difference between partial and full reconciliation. Do we need to run full reconciliation in normal operation? Is it just a sanity check?


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r1 (raw file):

whether their SQL pod will just spin indefinitely trying in vain to update the corresponding kv-zcfg and repeatedly being rejected.

This better not happen. We should not be trying more than once. We should think through the various ways the reconciliation loop responds to KV feedback. Another question: what can operators do to adjust span configs while tenants aren't changing anything?

We need to nail down reconciliation as well as the things that can happen when SQL tells KV about changes.


docs/RFCS/20210610_tenant_zone_configs.md, line 546 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@ajwerner do you think this table needs a stable ID?

I don't know about need, but not having a stable ID is likely to mean needing to instead plumb in some container and do a lookup at startup or something. I'd prefer if we did.


docs/RFCS/20210610_tenant_zone_configs.md, line 610 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How close does this work get us to being able to remove all of the special logic around the system config span?

I think this is it. Do you know of any other uses left?


docs/RFCS/20210610_tenant_zone_configs.md, line 197 at r2 (raw file):

## Storing SQL zone configs

My sense is that this migration and this part of the RFC stand alone and could be a separate RFC altogether. Proposing this change at the same time complicates this document a lot. I think this change is also good, but, more or less, it seems independent. That RFC can say that this change is only possible because we will no longer rely on system.zones.

Collaborator

@ajstorm ajstorm left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @irfansharif, and @nvanbenschoten)


docs/RFCS/20210610_tenant_zone_configs.md, line 95 at r1 (raw file):

Previously, ajwerner wrote…

My current thinking is that for the user (and, to some extent, to not break backwards compatibility) that they should be in terms of SQL zone configs (as they are today) and then we should provide sufficient visibility into understanding what might be going wrong telling KV about the configs.

Is the plan that we'll address this UX issue in this RFC (and the first drop of this functionality in 21.2)? If so, we might need to add a new section to this RFC (unless I've missed something).


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r1 (raw file):

Previously, ajwerner wrote…

whether their SQL pod will just spin indefinitely trying in vain to update the corresponding kv-zcfg and repeatedly being rejected.

This better not happen. We should not be trying more than once. We should think through the various ways the reconciliation loops responds to KV feedback. Another question is what can operators do to adjust span configs while tenants are not changing anything?

We need to nail down reconciliation as well as the things that can happen when SQL tells KV about changes.

+1 for surfacing errors synchronously.


docs/RFCS/20210610_tenant_zone_configs.md, line 517 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this what we want? I can see an argument that the decoupling of sql-zcfgs and kv-zcfgs allows us to accomplish what we've wanted all along, which is to retain table-level and database-level zone configurations across backups. We would hold on to them in the backup descriptor, and then run the reconciliation loop upon restore to install them. Perhaps @dt could add some color here?

I agree. We'd likely need to do some light validation to ensure that the zcfg can be applied to the target cluster, but if that validation passes, I'd imagine we'd want to keep the zone configs in the backup image.


docs/RFCS/20210610_tenant_zone_configs.md, line 650 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"secondary tenants"

This is a good question. I'm curious to hear @ajstorm's thoughts on the matter.

That would certainly be my preference. Do we think that this proposal gets us all the way there though? Do we anticipate that tenants may want/need to set gc.ttlseconds? Range limits? Replication factor? I'd imagine in the near term we're going to have to leave open the ability to set zone configs.


docs/RFCS/20210610_tenant_zone_configs.md, line 159 at r2 (raw file):

With multi-tenant zone configs, it's instructive to start thinking about "SQL
zone configs" (just zcfgs) and "KV span configs" (abbrev. scfgs) as two

Nit: might just be me, but I worry that people might get confused with scfgs as the s could be interpreted as referring to SQL. Does it make sense to be slightly more verbose and go with zonecfgs and spancfgs?


docs/RFCS/20210610_tenant_zone_configs.md, line 183 at r2 (raw file):

it. Given its coupling to SQL descriptors, KV ends up also (unwillingly)
adopting the same inheritance complexity found in SQL. This needn't be the
case. Introducing a level of indirection will help us prevent tenants from

Nit: It may not be immediately obvious to the reader why we'd want to prevent zone config changes from being immediately applied. You might want to add a bit of additional context to explain.


docs/RFCS/20210610_tenant_zone_configs.md, line 197 at r2 (raw file):

Previously, ajwerner wrote…

My sense is this migration and this part of the RFC stands alone and can be a separate RFC altogether. Proposing this change at the same time complicates this document a lot. I think this change is also good, but, more or less, seems independent. That RFC can say that this change is only possible because we will no longer rely on system.zones.

+1. Another option is to just move this paragraph further down in the RFC. It hurts readability to open the "Storing..." section with a description of the optimizer's needs as we'd need to store the zcfg in the descriptor regardless.


docs/RFCS/20210610_tenant_zone_configs.md, line 239 at r2 (raw file):

For distinguished zcfgs (`RANGE DEFAULT`, `RANGE LIVENESS`, ...), now that
we're storing them in descriptors directly, we'll want to synthesize special
descriptors also stored in `system.descriptor`. Conveniently, they already have

Do we really need to synthesize these? Can we not have the defaults encapsulated in the code? Or are we worried about the need to migrate defaults at some point down the road?


docs/RFCS/20210610_tenant_zone_configs.md, line 294 at r2 (raw file):

apply to each one. To do so, we'll recurse through the set of SQL descriptors
(within `system.descriptors`), capture their keyspans, and materialize a scfg
for each one by applying inherited attributes following parent links on the

Am I reading this correctly in that we're going to scan all descriptors for a given tenant whenever there's any change? Or will the rangefeed only point to descriptors which have been updated?


docs/RFCS/20210610_tenant_zone_configs.md, line 345 at r2 (raw file):

This SQL pod driven reconciliation loop makes it possible that zcfgs for
inactive tenants are not acted upon. We think that's fine, especially given

When the tenant is re-activated, will the reconciliation occur? If not, the zcfg could never be applied?


docs/RFCS/20210610_tenant_zone_configs.md, line 367 at r2 (raw file):

Consider a naive implementation: for every `<span, scfg>` update, we'll insert
into a sorted tree keyed on `span.start_key`. We could improve the memory
footprint by de-duping away identical scfgs and referring to them by hash. Each

Nit: Will you have to worry about hash collisions here?


docs/RFCS/20210610_tenant_zone_configs.md, line 416 at r2 (raw file):

transactionally consult this table for enforcement and update this table if
accepted. We'll start off simple, with a fixed maximum allowed number of
tenant-defined splits. If the proposed set of scfgs implies a number of splits

Maximum over some period of time? Or maximum for all time? If the user hits this maximum, what would be the recommended mechanism to free up some splits for future use? Drop tables/partitions? Would this counter then get decremented?


docs/RFCS/20210610_tenant_zone_configs.md, line 562 at r2 (raw file):

## Drawbacks

There are a lot of moving parts, though most of it feels necessary.

Was this added as a stub? You've listed drawbacks throughout, which we might want to reiterate down here.

Collaborator

@arulajmani arulajmani left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @dt, @irfansharif, and @nvanbenschoten)


docs/RFCS/20210610_tenant_zone_configs.md, line 230 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

IndexDescriptors are stored within TableDescriptors. Are you suggesting that we split them out and give indexes unique IDs? Or just that the IndexDescriptor proto will be given its own SQLZoneConfig field and the reconciliation loop will know how to walk through all possible sql-zcfg locations in a given TableDescriptors?

The latter -- we would store zcfgs on Index/Partition descriptors. This could reduce some of the complexity around subzone spans that exists today. This isn't in the critical path however.


docs/RFCS/20210610_tenant_zone_configs.md, line 235 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
slightly larger

Only if a zone is actually set on a given descriptor, right? Otherwise the encoding will be identical and we'll pay a mere 8 bytes for the pointer?

Yup!


docs/RFCS/20210610_tenant_zone_configs.md, line 245 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Out of curiosity, do we expect the logic necessary to deal with these pseudo-descriptors to be more or less invasive than the logic to deal with the old psuedo-zones?

I think it'll end up being less invasive. Today, KV has to check if a key matches one of these special ranges and, if it does, we translate it to a pseudo zcfg ID assigned to that range. We won't need to do this anymore once we introduce the system.span_configurations table.

We'll still have to seed system.descriptors similar to how we seed system.zones today.


docs/RFCS/20210610_tenant_zone_configs.md, line 239 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Do we really need to synthesize these? Can we not have the defaults encapsulated in the code? Or are we worried about the need to migrate defaults at some point down the road?

I think the system tenant is allowed to change zcfgs on these ranges so we'd need to store them on disk. We have code that generates defaults for these today which we'll re-use to seed these pseudo-descriptors.

@otan
Copy link
Contributor

otan commented Jun 15, 2021


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r2 (raw file):

set of zcfgs have been fully applied, or to wait until a zcfg has.

In the short term, to surface possible errors, we'll make use of in-memory

(small comment) At the very least, can this be written out to logs so that, e.g., an on-call engineer can look at the issues historically?

Contributor Author

@irfansharif irfansharif left a comment


Following recent discussions, I've updated the RFC to incorporate a few significant changes. Some of the comments below no longer apply, and some responses too are stale (but are included for posterity). The changes are:

  • Moving the whole "stash zcfgs in descriptors" section to future work. It's not necessary for the MVP, and the original motivation for it (the optimizer wants a cache of zcfgs) can be more easily served by a dedicated cache, instead of the larger undertaking that stashing zcfgs in descriptors would entail. This also punts the backup/restore discussions around whether or not we should capture zcfgs in backup images to future work.
  • Doing some conservative, best-effort limit checking before committing zcfgs/schema changes. Instead of surfacing reconciliation errors, it'd be better if we could prevent committing the bound-to-be-rejected changes, and I think we can. This drastically reduces the likelihood that we'll get into the scfgs-rejected-by-kv state being discussed below, which also de-emphasizes the need to synchronously return the reconciliation errors as part of the user txn committing zcfgs/schema changes.
  • "Jobs"-ify the execution of the reconciliation loop, using the existing infrastructure for checkpointing, observability, and leasing.
  • Clarify more of the reconciliation protocol, and eliminate the need to hold O(descriptors) objects in memory. The reconciliation should be "fully paginated" now, PTAL.

Dismissed @ajwerner from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, @dt, @irfansharif, @nvanbenschoten, and @otan)


docs/RFCS/20210610_tenant_zone_configs.md, line 95 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Is the plan that we'll address this UX issue in this RFC (and the first drop of this functionality in 21.2)? If so, we might need to add a new section to this RFC (unless I've missed something).

I've re-worked some ideas around how likely it is we're going to run into rejected-by-kv errors, which should lower the urgency of fleshing out full conformance observability in the MVP. It's still very important, and I've added some text in the Introspection section below highlighting the need to consider both kinds of conformance ("scfgs are conformed to" and "zcfgs have been successfully reconciled"). I'm curious to hear your thoughts on how much of it is work we can pick up post 21.2, and possibly backport.


docs/RFCS/20210610_tenant_zone_configs.md, line 145 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Agreed. I think this should say "When KV is told to ..."

Done.


docs/RFCS/20210610_tenant_zone_configs.md, line 191 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

is later considered by KV

This is hitting the same confusion as above. We should re-word to make the initiator of the action more clear.

Done.


docs/RFCS/20210610_tenant_zone_configs.md, line 272 at r1 (raw file):

Previously, ajwerner wrote…

I think that'd be the idea. Or at least, it's one or the other and I'd hope we'd start with the more conservative approach.

Yup, we'd be creating a scfg (previously kv-zcfg) for all tables in the system tenant. As for whether or not we split on table boundaries for the host tenant, see text below:

We'll post-process this list of non-overlapping spans-to-scfgs to coalesce
adjacent spans with the same effective scfg. This will reduce the number of
splits induced in KV while still conforming to the stated zcfgs. It's worth
noting that this is not how things work today in the host tenant -- we
unconditionally split on all table/index/partition boundaries even if both
sides of the split have the same materialized zcfg. We could skip this
post-processing step for the host tenant to preserve existing behavior, though
perhaps it's a [desirable improvement][66063]. 
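A sketch of the coalescing step described in the quoted text, assuming a sorted, non-overlapping input (hypothetical types):

```go
package coalesce

import "bytes"

type entry struct {
	startKey, endKey []byte
	config           []byte // serialized scfg
}

// coalesceAdjacent merges neighbouring entries with identical configs. It
// assumes the input is sorted by start key and non-overlapping.
func coalesceAdjacent(entries []entry) []entry {
	var out []entry
	for _, e := range entries {
		if n := len(out); n > 0 &&
			bytes.Equal(out[n-1].endKey, e.startKey) && // truly adjacent
			bytes.Equal(out[n-1].config, e.config) { // same effective scfg
			out[n-1].endKey = e.endKey // extend the previous span, no split needed
			continue
		}
		out = append(out, e)
	}
	return out
}
```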

docs/RFCS/20210610_tenant_zone_configs.md, line 282 at r1 (raw file):

Previously, ajwerner wrote…

I think this section could use some more tightening in terms of what actually happens upon rangefeed event. I think I'd break this down into various levels of sophistication that an implementation could take on.

I've given this a first pass, being a bit more specific with the protocol itself. Are there details you think are still missing? I feel like I've made the organization here a bit more difficult to follow by breaking it up so finely, so I'd happily take suggestions for how to organize this better.


docs/RFCS/20210610_tenant_zone_configs.md, line 286 at r1 (raw file):

Previously, ajwerner wrote…

What if we made the protocol semi-synchronous. It seems reasonable for operations which change zone configs to be able to wait for the relevant reconciliation event and to learn about any possible failures to reconcile? This could be a reasonable use case for an "outbox" of our own. Here's a straw-man sketch of what I have in mind.

CREATE TABLE system.zone_config_checkpoint (
    k BOOL AS (true) STORED PRIMARY KEY, -- cheeky way to say it's a single row
    lease BYTES,
    checkpoint DECIMAL,
)

CREATE TABLE system.zone_config_events (
    -- have not decided the right primary key here.
    -- could do it as a singleton, probably better not to.
    descriptors INT[]
)

CREATE TABLE system.zone_config_reconciliation_errors (
    timestamp DECIMAL,
    id STRING -- some sort of ID that will help the user
    error STRING -- some error
)

The leaseholder registers itself in the first table and then update it with checkpoints when something happens or, periodically, based on checkpoints. The writers of new configs will write the relevant descriptors to the "events" table, which is an "outbox" as people are using that word these days. Then, when reconciliation happens, if any errors occur, they can be written to the errors field. We could imagine having all the nodes watching the checkpoint and errors table and learning about any errors. A nice thing here is that we can bound what needs to get reconciled and we know how caught up we are. Also, we can make sure only the leaseholder moves the checkpoint.

There are other similar approaches to get at the same thing.

I've reworked how the reconciliation loop will be run, using just jobs instead. I've also put in best-effort limit checking we could do in the SQL pods to reduce the likelihood of getting into these rejected-by-kv perma-error states, which reduces the importance of surfacing these errors synchronously to the client committing the zcfgs/DDL statements. The thing is, even if we did surface these errors, it's a bit difficult to actually act on them or to figure out what needs to change. Reducing the likelihood of these errors feels much cleaner.

As for observability into these rejected-by-kv errors, I'm assuming we can use existing jobs infrastructure to surface in-progress state. I was light on details because I'm a bit unfamiliar with what that would look like, but I assume it's possible.


docs/RFCS/20210610_tenant_zone_configs.md, line 317 at r1 (raw file):

Previously, ajwerner wrote…

I like that approach. Importantly; I don't think we should ever, in this scheme, need to send a large batch to ensure we have safe updates of the config. It'd be good to lay out what are the safety properties here. In the past we've said that we would like it to be the case that keys in KV always have a config that at some point did indeed apply to them.

I've re-worked this bit to no longer necessitate issuing all the diffs all at once, and loosening up the consistency guarantees accordingly. It's now possible for a zone config to be partially accepted, but with the best-effort error checking we're now doing, we expect it to be extremely unlikely to ever get into that state.


docs/RFCS/20210610_tenant_zone_configs.md, line 351 at r1 (raw file):

Previously, ajwerner wrote…

This counter smells a bit to me. I think it needs some further justification. On the whole, this section feels like the area we should most focus on sharpening. Let's understand tradeoffs between fully async and semi-sychronous reconciliaition. Let's understand the difference between partial and full reconciliation. Do we need to run full reconciliation in normal operation? Is it just a sanity check?

Instead of using the counter for mutual exclusion, we're now using transaction commit deadlines (hadn't known of this at the time). Wrote a bit more about partial vs. full sync, fully async vs. semi-sync reconciliation, PTAL.


docs/RFCS/20210610_tenant_zone_configs.md, line 383 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This section could use a little more detail about the consistency model we expect to expose to clients of the SpanConfig interface. Will updates to spans always be seen in order? Are we using rangefeed checkpoints to guarantee this? Do we no longer need to provide cross-key ordering guarantees because inheritance is computed above this level?

Done. And yup, we'll use rangefeed checkpoints to ensure that what's seen is ordered. It'll also be the case that the multiple scfg updates a zcfg gets denormalized into will all be seen at once.


Added a consistency model section.
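A sketch of what using rangefeed checkpoints for ordering could look like on the store side (hypothetical types; a real implementation would also need locking around the cache):

```go
package spanconfigwatcher

// event is a single scfg update received over the rangefeed, stamped with the
// timestamp it was written at.
type event struct {
	ts    int64
	apply func() // mutates the in-memory span config state
}

type watcher struct {
	buffer []event
}

// onEvent buffers updates rather than applying them immediately.
func (w *watcher) onEvent(e event) {
	w.buffer = append(w.buffer, e)
}

// onCheckpoint applies everything at or below the checkpoint timestamp in one
// go: nothing below ts can still change, so the scfg updates a single zcfg was
// denormalized into become visible together.
func (w *watcher) onCheckpoint(ts int64) {
	var rest []event
	for _, e := range w.buffer {
		if e.ts <= ts {
			e.apply()
		} else {
			rest = append(rest, e)
		}
	}
	w.buffer = rest
}
```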


docs/RFCS/20210610_tenant_zone_configs.md, line 404 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this relate to a tenant's default zone? Does this zone "fill in the gaps" between all concrete kv-zcfgs? Does it do so implicitly or explicitly?

I was thinking we could special-case the tenant's RANGE DEFAULT zcfg, writing it out to span_configurations explicitly even though it's not a concrete key (we could write it out with [start_key, end_key) == [/tenant/id, /tenant/id)). We'd then use just that as a fallback. Added a few more words. I suspect this will be easy enough to do that we'll just do it as part of the MVP.


docs/RFCS/20210610_tenant_zone_configs.md, line 430 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have we explored what other limits we will want in the v1 of this work? I'd imagine there are some fairly easy ways for tenants to abuse zone configurations if we're not careful. For instance, setting a replication factor of 10k or dropping the gc.ttl to 1s. These are probably easier limits to enforce, as they are per-zone and don't require coordination to enforce.

Added some text. This too I assume we'll have to do on the KV side, given our threat model. Ignoring for a second the questions around how these hard errors due to zcfgs can be surfaced, we can similarly validate in SQL to generate informative errors.


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

+1 for surfacing errors synchronously.

See top-level discussion around the likelihood/handling scfgs-rejected-by-KV errors.


docs/RFCS/20210610_tenant_zone_configs.md, line 517 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I agree. We'd likely need to do some light validation to ensure that the zcfg can be applied to the target cluster, but if that validation passes, I'd imagine we'd want to keep the zone configs in the backup image.

Made a note of this. Do you reckon this is necessary for the 21.2 MVP, however? I'd gently push back if so; there's already a fair number of moving parts to power through until stability. How do backup/restores with MR schemas work today? Do we re-generate zone configs post-restore? Given that zcfgs, going forward, will be captured as part of the descriptors, we could introduce the "re-apply zcfgs on restored image" behaviour in 22.1. What do you think?


Actually, having removed the "stash configs in descriptors" portion of this RFC from the critical path, this all gets relegated to future work that's now possible.


docs/RFCS/20210610_tenant_zone_configs.md, line 546 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We'll need to ensure that rangefeeds are enabled on this table, even if the kv.rangefeed.enabled cluster setting is disabled. We do this in other places, IIRC.

Added to #66390.


docs/RFCS/20210610_tenant_zone_configs.md, line 557 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you see a world where these are not the same proto representation? Is there any need for them to store disjoint sets of information?

This line was actually talking about the representation of zcfgs where zcfgs were stored in system.zones, rather than in the descriptors themselves.

But that said, I do think we'd benefit from introducing a different proto representation.
Given span configs have no notion of inheritance, we could drop all the inherited_{constraints,lease_preferences}, null_voter_constraints_is_empty, subzones and subzone_spans fields.

Separately, roachpb can't depend on zonepb due to circular dependency concerns, and I thought it'd be clearer and cleaner to have distinct types in the code/RPC definitions. But I'm also happy to use the same proto and move ZoneConfig out of zonepb if you feel strongly enough.


Typed out the roachpb.SpanConfig proto definition.
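Not the actual definition, but roughly the shape such a flattened config could take once the inheritance-related fields are dropped (the field names here are approximations):

```go
package spanconfig

// SpanConfig here approximates the flattened representation: only the
// materialized attributes, none of the inheritance bookkeeping.
type SpanConfig struct {
	RangeMinBytes    int64
	RangeMaxBytes    int64
	GCTTLSeconds     int32
	GlobalReads      bool
	NumReplicas      int32
	NumVoters        int32
	Constraints      []string // e.g. "+region=us-east1"
	VoterConstraints []string
	LeasePreferences []string
}
```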


docs/RFCS/20210610_tenant_zone_configs.md, line 610 at r1 (raw file):

Previously, ajwerner wrote…

I think this is it. Do you know of any other uses left?

Yup, we should be in the clear after the RFC. Listed now as future work.


docs/RFCS/20210610_tenant_zone_configs.md, line 617 at r1 (raw file):

Is this enforced by giving each tenant their own default zone config in the system.span_configurations table? And that's ok, even though the defaults will all be the same?

Yup. This text is relevant:

Adjacent spans in the table will either have diverging
scfgs, or will belong to different tenants. This schema gives us a convenient
way to determine what attributes apply to a given key/key-range, and also helps
us answer what the set of split points are: all `start_key`s.

Also see the thread above around possibly capturing each tenant's RANGE DEFAULT explicitly; that could also help here. I didn't state this explicitly, but I imagined we'd install one when each tenant is created.


docs/RFCS/20210610_tenant_zone_configs.md, line 650 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

That would certainly be my preference. Do we think that this proposal gets us all the way there though? Do we anticipate that tenants may want/need to set gc.ttlseconds? Range limits? Replication factor? I'd imagine in the near term we're going to have to leave open the ability to set zone configs.

Given that nothing in the RFC fundamentally changes how zone configs are represented, and given our DDL statements don't capture the full expressivity of zone configs, I'm going to suggest we just allow setting direct zcfgs for now. Whatever future migration headache this creates is a headache we already have to address in order to migrate on-prem clusters/the system tenant.


docs/RFCS/20210610_tenant_zone_configs.md, line 183 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: It may not be immediately obvious to the reader why we'd want to prevent zone config changes from being immediately applied. You might want to add a bit of additional context to explain.

I re-worded here slightly to clarify that the KV actions listed below are expensive in terms of I/O and overhead, and something that we'd ideally cost at some point (not part of this RFC). Is there anything additional that you feel would be clarifying?


docs/RFCS/20210610_tenant_zone_configs.md, line 197 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

+1. Another option is to just move this paragraph further down in the RFC. It hurts readability to open the "Storing..." section with a description of the optimizer's needs as we'd need to store the zcfg in the descriptor regardless.

@ajwerner: I don't follow. Because of the optimizer's need to access zcfgs for descriptors, what this section is saying is that we'd need to either introduce a separate cache for zcfgs in the catalog layer, or stash them in the descriptors and get this caching for free, seeing as how the descriptors themselves are cached. I'm proposing we do the latter, since it's good to have in isolation, and it also paves a path forward for the optimizer dependency.

Are you suggesting that for the MVP we work on this separate caching for zcfgs instead? The proposal here was to not do that, and put stashing zcfgs in descriptors in the critical path. Or are you just suggesting we simply re-order this section/pull it into a separate doc?


@ajstorm: Re-{titled,nested} the section.


Looks like from our pod sync you were proposing we'd just settle for the separate config cache and have secondary tenants start using their own system.zones. Updated the text to remove all mentions of this "store configs in descriptors" business, relegating all of that to future work. Happy to remove it from the RFC entirely if you think it'll help with readability.


docs/RFCS/20210610_tenant_zone_configs.md, line 239 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I think the system tenant is allowed to change zcfgs on these ranges so we'd need to store them on disk. We have code that generates defaults for these today which we'll re-use to seed these pseudo-descriptors.

Yup, these distinguished zcfgs aren't immutable. They're already synthesized today in system.zones; what this text is saying is that if we're no longer storing zcfgs in system.zones, we'll have to synthesize descriptors to store the distinguished zcfgs in system.descriptors.


Following recent discussions I've removed this whole section ("stashing zcfgs in descriptors"), relegating all of it for future work.


docs/RFCS/20210610_tenant_zone_configs.md, line 294 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Am I reading this correctly in that we're going to scan all descriptors for a given tenant whenever there's any change? Or will the rangefeed only point to descriptors which have been updated?

It's the latter: the rangefeed will only point us to the descriptors that have been updated, and we'd only need to traverse up and down the tree for those descriptors. Text updated.


docs/RFCS/20210610_tenant_zone_configs.md, line 345 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

When the tenant is re-activated, will the reconciliation occur? If not, the zcfg could never be applied?

Yup.


docs/RFCS/20210610_tenant_zone_configs.md, line 367 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: Will you have to worry about hash collisions here?

I didn't mean what I said when I said "referring to them by hash"; I just meant we could maintain a separate <uniqueID: scfg> mapping in-memory, and have the tree nodes hold onto only the uniqueIDs. Reworded.
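A sketch of that interning scheme (hypothetical types): the sorted structure holds only small IDs keyed by `start_key`, and a side map stores each distinct scfg once.

```go
package scfgcache

import "sort"

type configID int

type node struct {
	startKey string
	id       configID
}

// cache keeps only small IDs in the per-span entries; each distinct scfg is
// stored once in the side map.
type cache struct {
	nodes   []node              // sorted by startKey, non-overlapping spans
	configs map[configID][]byte // interned scfgs shared across nodes
}

// lookup returns the config governing key: the entry with the greatest
// startKey <= key.
func (c *cache) lookup(key string) ([]byte, bool) {
	i := sort.Search(len(c.nodes), func(i int) bool { return c.nodes[i].startKey > key })
	if i == 0 {
		return nil, false
	}
	return c.configs[c.nodes[i-1].id], true
}
```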


docs/RFCS/20210610_tenant_zone_configs.md, line 416 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Maximum over some period of time? Or maximum for all time? If the user hits this maximum, what would be the recommended mechanism to free up some splits for future use? Drop tables/partitions? Would this counter then get decremented?

Maximum for all time. Dropping tables/indexes/partitions would help, yes, and so would removing explicit zcfgs from individual tables/indexes/partitions. And yes, the counter would get correctly decremented using this same reconciliation machinery.


docs/RFCS/20210610_tenant_zone_configs.md, line 442 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

(small comment) at the very least can this be written out to logs so e.g. an oncall can look at the issues historically?

We're using jobs now, so we'll just make use of the usual jobs logging.

Member

@RaduBerinde RaduBerinde left a comment


Very cool RFC (I still have some portions to go through, but the overall idea is very compelling)!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @miretskiy, @nvanbenschoten, @otan, and @pbardea)


docs/RFCS/20210610_tenant_zone_configs.md, line 291 at r3 (raw file):

};

message GetSpanConfigsRequest {

What happens if there are a ton of configs (split points) within this span? How can I paginate through that?


docs/RFCS/20210610_tenant_zone_configs.md, line 296 at r3 (raw file):

message GetSpanConfigsResponse {
  repeated SpanConfigEntry span_configs = 1;

These are all the span configs that overlap with the given span? So the first scfg's start key could be before the one I asked for, and similarly for the last end key?


docs/RFCS/20210610_tenant_zone_configs.md, line 299 at r3 (raw file):

};

message UpdateSpanConfigsRequest {

It's important to flesh out the details of such an API. There should be enough comments here to clarify the following questions:

  • if I try to upsert a config for a range that contains multiple existing configs, is that essentially removing some key splits?
  • if I try to delete a config for a span that starts or ends inside an existing config span, does that introduce a split?
  • I assume the upsert (and, respectively, delete) spans are non-overlapping; do they also need to be sorted?
  • can delete spans overlap with upsert spans? (e.g. first delete all configs inside a tenant and recreate them)

docs/RFCS/20210610_tenant_zone_configs.md, line 357 at r3 (raw file):

In multi-tenant CRDB, each active tenant can have one or more SQL pods talking
to the shared KV cluster. Every tenant will have a single "leaseholder" pod

[nit] We should avoid reusing the term leaseholder since that already has a very clear use in our system. Could be SQL leader/coordinator/supervisor.


docs/RFCS/20210610_tenant_zone_configs.md, line 439 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Yup, exactly. Provided the txn commit time is less than the lease expiration time, future acquisition attempts of that particular token (which would necessarily happen after expiration) would see that txn. This is similar to the txn commit deadline idea we're using for the reconciliation job's lease.

Related to limiting split points: to me, split points seem most closely related to storage, in that a split should cost no more than having 500MB of data lying around (that amount would require a split anyway). So maybe a split should (from a cost/limiting perspective) behave like using some fixed amount of storage (either 500MB or maybe the average range size, 320MB), and have that be enforced as part of enforcing storage (which is currently done through periodic monitoring and corresponding adjustments to the available RUs, rather than hard limits).


docs/RFCS/20210610_tenant_zone_configs.md, line 530 at r3 (raw file):

It's possible for a KV server to request the span configs for a key where that
key is not yet declared in the server's known set of scfgs. The spans captured
`system.span_configurations` are only the "concrete" ones, for known

I think things would be cleaner if the span configs did not allow "holes". Instead of span/config pairs, we would have key/config pairs, where the config applies up to the next key. Operations can always be thought of as introducing or removing split points.

When setting up a tenant, we would introduce a scfg for the tenant range. That config would be naturally retained by all the "gaps" in the key space and would act as the default until any specific scfgs for table ranges are propagated.
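A tiny sketch of the "no holes" model (hypothetical types; a real implementation would keep the keys in a sorted tree rather than a map): every key's config applies up to the next key, and mutations reduce to adding or removing split points.

```go
package noholes

// keyedConfigs maps a split point to the config that applies from that key up
// to the next split point; a real implementation would use a sorted tree, the
// map here just illustrates the invariant that there are no gaps.
type keyedConfigs map[string][]byte

// addSplit introduces a split at key. The new segment starts out with the
// config currently governing key (looked up by the caller), so no hole appears.
func addSplit(m keyedConfigs, key string, governing []byte) {
	if _, ok := m[key]; !ok {
		m[key] = governing
	}
}

// removeSplit drops the split at key; the preceding entry's config now extends
// across the removed segment.
func removeSplit(m keyedConfigs, key string) {
	delete(m, key)
}
```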


docs/RFCS/20210610_tenant_zone_configs.md, line 706 at r3 (raw file):

the observation that most scfgs will be identical, both within a tenant's
keyspace and across. Referring to scfgs through ID or hash would reduce the
size of `system.span_configurations`.

I think this makes a lot of sense. The set of distinct scfgs is likely small enough to always be cached in memory. We could designate the configs as immutable and key them by a SHA of the proto, which would make cache coherency trivial.
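A sketch of that content-addressing idea (hypothetical code): configs are interned under a SHA-256 of their serialized form, so an ID never changes meaning and identical configs are stored once.

```go
package scfgdedup

import (
	"crypto/sha256"
	"encoding/hex"
)

// configStore interns immutable, serialized configs keyed by a SHA-256 of
// their bytes; identical configs (within or across tenants) share one entry.
type configStore struct {
	byHash map[string][]byte
}

// intern returns the content-derived ID for the config, storing it on first use.
func (s *configStore) intern(serialized []byte) string {
	if s.byHash == nil {
		s.byHash = make(map[string][]byte)
	}
	sum := sha256.Sum256(serialized)
	id := hex.EncodeToString(sum[:])
	if _, ok := s.byHash[id]; !ok {
		s.byHash[id] = append([]byte(nil), serialized...)
	}
	return id
}
```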

@ajwerner ajwerner left a comment

@RaduBerinde you should know that a bunch of this has been implemented, though nothing is enable-able in the system tenant in terms of machinery without jumping through some hoops. See #67679.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @miretskiy, @nvanbenschoten, @otan, and @pbardea)

@knz knz left a comment

I like this proposal, but I am surprised (worried?) that the RFC doesn't discuss fault tolerance and performance hotspots much. Some comments to that effect below.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @miretskiy, @nvanbenschoten, @otan, @pbardea, and @RaduBerinde)


docs/RFCS/20210610_tenant_zone_configs.md, line 154 at r3 (raw file):

apply a set of span configs, it will perform relevant safety checks and
rate-limiting. Each KV server will establish a range feed over the keyspan
where all span configs will be stored. Whenever there's an update, each store

What was the rationale for choosing a single keyspan for all span configs, instead of homing the span configs "close" to the keyspace over which they apply?

Also, is there an analysis of what to do if we find a hotspot in the span config range? Presumably, a cluster with many guest tenants will see high zcfg activity due to all the DDL going on concurrently.


docs/RFCS/20210610_tenant_zone_configs.md, line 210 at r3 (raw file):

```sql
CREATE TABLE system.span_configurations (

To follow up on my previous comment: I think we need to split this table, as well as span_configurations_per_tenant below, at tenant boundaries.
There are two reasons for this:

  • we will want to avoid hotspots, which could cause DDL activity by one tenant to contend with DDL activity by another tenant. We want to design to ensure that tenants are relatively independent from each other.

  • we want to design for fault tolerance: if a host system range becomes unavailable, we want to reduce the blast radius and ensure that as few tenants as possible are affected.

by splitting these system tables at tenant boundaries, we make the tenants independent from each other and ensure that unavailability affects only the tenant configured by the unavailable range.

In a related concern, I would also suggest perhaps partitioning these tables somehow, and ensuring (perhaps via a suitably defined zcfg) that each config is homed in the region closest to the "home region" for the tenant being configured. It would be awkward to have a tenant running in region-pacific and have the span configs homed in region-europe.


docs/RFCS/20210610_tenant_zone_configs.md, line 259 at r3 (raw file):

```sql
CREATE TABLE system.spans_configurations_per_tenant (

nit: spans? why plural?


docs/RFCS/20210610_tenant_zone_configs.md, line 269 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm not sure. As we're prototyping to figure out the interfaces it'll be apparent to us what should sit where. Still I suspect this will still sit on the SQL side, I think this over-estimated cost is a different kind of determination than what KV is making.

A related concern: do we trust the SQL layer to count this right? Ideally, we'd like to avoid trusting SQL, to avoid the chance that a compromised SQL pod populates problematic scfgs in the host cluster.


docs/RFCS/20210610_tenant_zone_configs.md, line 432 at r3 (raw file):

have the same materialized zcfg. We could skip this post-processing step for
the host tenant to preserve existing behavior, though perhaps it's a [desirable
improvement][66063].

I think a provision should be made to allow zcfgs to specify "forced" splits even if adjacent configs are equivalent.
Why? For two reasons:

  • fault tolerance. It's the same story as above: if one table's ranges become unavailable, a tenant may desire to preserve availability of adjacent tables.
  • performance. An end-user may wish to use some of their "split budget" so that write-heavy tables become separate ranges that can be served by separate leaseholder nodes.

docs/RFCS/20210610_tenant_zone_configs.md, line 530 at r3 (raw file):

Previously, RaduBerinde wrote…

I think things would be cleaner if the span configs would not allow "holes". Instead of span/config pairs, we would have key/config pairs, where the config applies up to the next key. Operations can always be thought of as introducing or removing split points.

When setting up a tenant, we would introduce a scfg for the tenant range. That config would be naturally retained by all the "gaps" in the key space and would act as the default until any specific scfgs for table ranges are propagated.

I came to this paragraph about to write a comment with the same idea as Radu's :)
I agree that point configs are helpful. It also simplifies the introduction of new split points by avoiding the need to rewrite existing scfgs.


docs/RFCS/20210610_tenant_zone_configs.md, line 551 at r3 (raw file):

### Span config consistency guarantees

For the config applied to a given key, we can summarize it's consistency

nit: its


docs/RFCS/20210610_tenant_zone_configs.md, line 617 at r3 (raw file):

will still be able to set zone configs for their own `DATABASE system`, which
only apply to the tenant's own system tables.

Arguably, a tenant's "own" entries in the host system's system.span_config tables should be configurable to be homed in a particular region of the customer's choice. What mechanism can we offer for this?

@knz knz left a comment

In fact I just noticed (while looking at #70912 / #70912) that the RFC doesn't talk about data domiciliation enough (towards our GDPR compliance roadmap); see below.

Maybe the RFC should link to #70913

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @miretskiy, @nvanbenschoten, @otan, @pbardea, and @RaduBerinde)


docs/RFCS/20210610_tenant_zone_configs.md, line 211 at r3 (raw file):

```sql
CREATE TABLE system.span_configurations (
    start_key     BYTES NOT NULL PRIMARY KEY, -- inclusive

Also, the full data in the key may contain indexed values from the table. For compliance with data domiciliation (GDPR etc) this means that this information must be subject to the same data domiciliation rules as the corresponding partition in the table.

So this is one more argument in favor of splitting this table, and guaranteeing that the span configs that correspond to each table partition are homed in that table's region.

@knz knz left a comment

I just had a chance to discuss this with Arul, who challenged me on a couple of my comments. I am updating my comments below based on that discussion (but don't let these threads hold up further work on the MVP).

Independently from that discussion I am noticing many open comments that don't have an answer yet. What process are we aiming for to move the convo forward? @irfansharif I'd be curious to know what approach you'd rather take. We can discuss separately.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @miretskiy, @nvanbenschoten, @otan, @pbardea, and @RaduBerinde)


docs/RFCS/20210610_tenant_zone_configs.md, line 154 at r3 (raw file):

Previously, knz (kena) wrote…

What was the rationale for choosing a single keyspan for all span configs, instead of homing the span configs "close" to the keyspace over which they apply?

Also, is there an analysis of what to do if we find a hotspot in the span config range? Presumably, a cluster with many guest tenants will see high zcfg activity due to all the DDL going on concurrently.

To give more color on this comment: a single span config range for all tenants, using a single rangefeed, has multiple operational problems:

  • if one tenant has a bunch of zcfgs to be applied, it's going to hold up any zcfg change issued by another tenant (which will be waiting behind in the feed).
  • if the one span config range becomes unavailable, then zone configs break for all tenants simultaneously.
  • if the scfgs are in a different region than the tenant, range relocations will incur cross-region traffic.
  • it's going to make it algorithmically less easy to bill the users (via RUs) for their scfg-allocator constraint resolutions. (This is, interestingly, already recognized as an unresolved question at the end of the document.)

I understand that there are concerns with the alternative, which is to make the scfgs "decentralized" (see alternate section below). I'll add comments there too.


docs/RFCS/20210610_tenant_zone_configs.md, line 210 at r3 (raw file):

Previously, knz (kena) wrote…

To follow up on my previous comment: I think we need to split this table, as well as span_configurations_per_tenant below, at tenant boundaries.
There are two reasons for this:

  • we will want to avoid hotspots, which could cause DDL activity by one tenant to contend with DDL activity by another tenant. We want to design to ensure that tenants are relatively independent from each other.

  • we want to design for fault tolerance: if a host system range becomes unavailable, we want to reduce the blast radius and ensure that as few tenants as possible are affected.

by splitting these system tables at tenant boundaries, we make the tenants independent from each other and ensure that unavailability affects only the tenant configured by the unavailable range.

In a related concern, I would also suggest perhaps partitioning these tables somehow, and ensuring (perhaps via a suitably defined zcfg) that each config is homed in the region closest to the "home region" for the tenant being configured. It would be awkward to have a tenant running in region-pacific and have the span configs homed in region-europe.

My interest in "distributed scfgs" notwithstanding, if we keep a single scfg table, splitting the span_configurations table would be an intermediate solution that would restore some of the good properties:

  • we would be able to home the scfg data close(r) to its tenant, to avoid cross-region traffic
  • we would be able to avoid unavailability for one tenant affecting another (assuming no scfg processing operation requires full table scans, and that we use one rangefeed per partition)

docs/RFCS/20210610_tenant_zone_configs.md, line 530 at r3 (raw file):

Previously, knz (kena) wrote…

I came to this paragraph about to write a comment with the same idea as Radu's :)
I agree that point configs are helpful. It also simplifies the introduction of new split points by avoiding the need to rewrite existing scfgs.

After discussing this with Arul, we found out that using point configs was undesirable, because it would cause a scfg for one table to automatically "leak" to every table after it in creation order that didn't get a scfg defined yet. Given the async nature of zcfg propagation, we'd see the data for those tables (possibly an unbounded number of them) move around as the zcfgs get incrementally applied.
It would also be semantically strange for the user to see tables appearing to be subject to zone configs that were never set on them.

(We accept that there may be transient state around a table that's undergoing a zcfg change, with both the before-state and the after-state being non-deterministically applied; however, we really wish to never see a state that's not logically related to the zcfg being applied.)

Coming back to the concern that both Radu and I have (I am presuming Radu's opinion): the text of the RFC as currently written suggests that the fallback is a global, cluster-wide static scfg. IMHO we want to provide tenants control over the placement of data in the "holes".
Therefore, the RFC should be updated to mention that there will be a configurable per-tenant fallback, and not a "global, static" fallback.

Arul points out that the KV layer does not currently know about a tenant default scfg in this proposal. Although there is an initial scfg populated for each tenant, it gets split up as tables get added, and remains the scfg applicable for holes (by construction, copies of it remain in-between all the scfgs that were assigned by zcfgs: when a table is added, the initial scfg is duplicated into a pre-table and post-table segment, and the table scfg is inserted in-between). However, we do not have an easy way to modify all the copies of the initial scfg that have been propagated to the holes, after it's been split up by tables. If we want to introduce per-tenant fallbacks, we'd need an API between SQL and KV to update all the scfgs in-between those that were derived from zcfgs. This is a new mechanism that we hadn't considered to be in-scope.

TBD: discuss this in the next group meeting around this RFC.


docs/RFCS/20210610_tenant_zone_configs.md, line 676 at r3 (raw file):

If zone configs are spread across the keyspace, in each tenant, we'd have to
establish T rangefeeds per store -- which feels excessive and
cost-prohibitive. We'd also need each store to establish rangefeeds for

"T rangefeeds per store": note that these would be mostly idle - the span configs do not see activity beyond the application of zcfgs, which is arguably a very rare event per tenant.
I don't recall that we have particular misgivings about mostly-idle rangefeeds? (Would like to learn more)

If we can afford T mostly-idle rangefeeds, then the need for (and uncertainties around) gossip can be side-stepped entirely.

To re-iterate my interest in this direction, coming back from my comment above:

  • it would ensure that unavailability of a scfg range does not affect all tenants simultaneously
  • it ensures that scfg changes by one tenant can make progress wrt replica placement regardless of activity in other tenants
  • it would allow us to "home" the scfgs close to the tenant's data, and avoid the risk of seeing cross-region traffic to inspect scfgs
  • it would enable us to bill scfg->replica config application activity per tenant more easily

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @knz, @miretskiy, @nvanbenschoten, @otan, @pbardea, and @RaduBerinde)


docs/RFCS/20210610_tenant_zone_configs.md, line 154 at r3 (raw file):

if one tenant has a bunch of zcfgs to be applied, it's going to hold up any zcfg change issued by another tenant (which will be waiting behind in the feed).

I think the answer to this should be a rate-limiting mechanism of some form on the writes to the span configs. We definitely, absolutely must have a limiting mechanism on the number of span configs per tenant. That alone doesn't prevent the problem you describe here, but it's possible that we can convince ourselves that the largest write we'd allow a tenant would not incur that much delay.

if the one span config range becomes unavailable, then zone configs break for all tenants simultaneously.

Same can be said for node liveness and meta1, and, for better or for worse, the tail of system.rangelog. I'm not convinced at all that this is a problem. System ranges are highly available. If we aren't getting enough fault-tolerance out of them, we should bump up their replication factors.

if the scfgs are in a different region than the tenant, range relocations will incur cross-region traffic.

Do you mean when updating them or when reading/subscribing? I'd hope that we can replicate the system.spanconfig table enough such that there's a full copy in every region. Rangefeeds can and do run on followers.

it's going to make it algorithmically less easy to bill the users (via RUs) for their scfg-allocator constraint resolutions. (This is, interestingly, already recognized as an unresolved question at the end of the document.)

This is true, especially without the originally proposed inheritance scheme: a single row write to the tenant cluster can amplify to touch many spans in the host cluster. If there was a 1:1 mapping, I wouldn't care so much. Yes, we'd be off, but not by too much. There's talk of reviving the inheritance proposals, so maybe that'll come into play. More importantly, however, we're definitely going to need to build limits for span configs. Ideally that conversation will cover both rates as well as absolute numbers. Assuming that limiting infrastructure can talk about rates, we can imagine it also charging for those writes. We're not optimizing for latency when applying span configs. I could very much see a desire to have the host-side implementation of the KVAccessor's writing calls increment some RU counters somewhere.
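For concreteness, one shape such a per-tenant cap could take; the names and numbers below are made up, and the real limiting design is left to follow-up work:

```go
package main

import "fmt"

// A hypothetical guard the host cluster could apply before accepting a
// tenant's span config updates: reject writes that would push the tenant past
// a fixed cap on the number of configs (and hence split points) it induces.
// Names and the cap are illustrative only.
type tenantLimiter struct {
	maxConfigsPerTenant int
	counts              map[uint64]int // tenant ID -> current config count
}

func (l *tenantLimiter) checkUpdate(tenantID uint64, deleted, upserted int) error {
	proposed := l.counts[tenantID] - deleted + upserted
	if proposed > l.maxConfigsPerTenant {
		return fmt.Errorf("tenant %d would exceed %d span configs (proposed %d)",
			tenantID, l.maxConfigsPerTenant, proposed)
	}
	l.counts[tenantID] = proposed
	return nil
}

func main() {
	l := &tenantLimiter{maxConfigsPerTenant: 5000, counts: map[uint64]int{}}
	fmt.Println(l.checkUpdate(10, 0, 4000)) // <nil>
	fmt.Println(l.checkUpdate(10, 0, 2000)) // error: would exceed 5000
}
```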

@irfansharif irfansharif force-pushed the 210610.tenant-zcfgs branch 2 times, most recently from 7a7de4c to 1e5a3dd Compare December 4, 2021 05:45
@irfansharif irfansharif left a comment

With #71994, we've basically implemented everything from this RFC. There's sure to be some paper cuts here and there, but let's merge this as done. I lightly edited the text to reflect some minor changes in our thinking where there were any; everywhere else I annotated it with the actual interfaces we settled on. Having to re-read this uber-verbose text again, I applaud all of your patience -- this could've been a third of the size knowing what we know now.

What process are we aiming for to move the convo forward? @irfansharif I'd be curious to know what approach you'd rather take

I've responded to everything and we've discussed most things in other forums. I'd sort of treated this as a dead document until now because so much legwork happened elsewhere (sorry!).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajstorm, @arulajmani, @awoods187, @dt, @irfansharif, @knz, @miretskiy, @otan, @pbardea, and @RaduBerinde)


docs/RFCS/20210610_tenant_zone_configs.md, line 152 at r1 (raw file):

Previously, awoods187 (Andy Woods) wrote…

+1

Done (stole it from the internal presentation a few weeks ago).


docs/RFCS/20210610_tenant_zone_configs.md, line 286 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

As for observability into these rejected-by-kv errors, I'm assuming we can use existing jobs infrastructure to surface in-progress state. I was light on details because I'm a bit unfamiliar with what that would look like, but I assume it's possible.

I don't think we'll be able to do this. Jobs surface errors by transitioning to the "failed" state and writing out the error to system.jobs. Our reconciliation task is slightly different from "regular" jobs, in that it can't be cancelled or fail. As such, my plan for the prototype is to initially start off by logging errors and swallowing them. Long term though, we probably want to persist these. I think we should persist them in a new system table instead of using the Progress payload. Mostly because building introspection on top of the progress proto sounds painful.

Same as other limit related threads; tracked here: #70555.


docs/RFCS/20210610_tenant_zone_configs.md, line 546 at r1 (raw file):

Previously, ajwerner wrote…

This raises an interesting idea -- should rangefeed enablement be a property of the zone config?

Absolutely it should, tracked in #70614.


docs/RFCS/20210610_tenant_zone_configs.md, line 562 at r2 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Was this added as a stub? You've listed drawbacks throughout, which we might want to reiterate down here.

Lazily avoiding touching the RFC text any further.


docs/RFCS/20210610_tenant_zone_configs.md, line 23 at r3 (raw file):

Previously, ajwerner wrote…

cc @pbardea @adityamaru @dt @miretskiy

This came up right now when we're talking about protected timestamps. Namely, it would be totally amazing to be able to protect something like a database. The problem, however, is that if we set this timestamp on a database it's going to be O(N) writes to all the partitions, and that might take a long time, which is sort of sad.

If we did have inheritance, this would solve many problems and make protected timestamps pretty chill.

Protected timestamps for secondary tenants is coming to a future RFC theater near you!


docs/RFCS/20210610_tenant_zone_configs.md, line 154 at r3 (raw file):

Previously, ajwerner wrote…

if one tenant has a bunch of zcfgs to be applied, it's going to hold up any zcfg change issued by another tenant (which will be waiting behind in the feed).

I think the answer to this should be a rate-limiting mechanism of some form on the writes to the span configs. We definitely, absolutely must have a limiting mechanism on the number of span configs per tenant. That alone doesn't prevent the problem you describe here, but it's possible that we can convince ourselves that the largest write we'd allow a tenant would not incur that much delay.

if the one span config range becomes unavailable, then zone configs break for all tenants simultaneously.

Same can be said for node liveness and meta1, and, for better or for worse, the tail of system.rangelog. I'm not convinced at all that this is a problem. System ranges are highly available. If we aren't getting enough fault-tolerance out of them, we should bump up their replication factors.

if the scfgs are in a different region than the tenant, range relocations will incur cross-region traffic.

Do you mean when updating them or when reading/subscribing? I'd hope that we can replicate the system.spanconfig table enough such that there's a full copy in every region. Rangefeeds can and do run on followers.

it's going to make it algorithmically less easy to bill the users (via RUs) for their scfg-allocator constraint resolutions. (This is, interestingly, already recognized as an unresolved question at the end of the document.)

This is true, especially without the originally proposed inheritance scheme: a single row write to the tenant cluster can amplify to touch many spans in the host cluster. If there was a 1:1 mapping, I wouldn't care so much. Yes, we'd be off, but not by too much. There's talk of reviving the inheritance proposals, so maybe that'll come into play. More importantly, however, we're definitely going to need to build limits for span configs. Ideally that conversation will cover both rates as well as absolute numbers. Assuming that limiting infrastructure can talk about rates, we can imagine it also charging for those writes. We're not optimizing for latency when applying span configs. I could very much see a desire to have the host-side implementation of the KVAccessor's writing calls increment some RU counters somewhere.

Going to again link the tenant span config limits issue here: #70555. Clearly something we need to get to soon.

As for reviving inheritance in KV -- I hope we avoid opening that can of worms. It brings with it a lot of implementation complexity for possibly reducing write amp for frequently updated zone configs. For that, I'd first love to see actual benchmarks demonstrating the extent of the problem. One candidate source of such write amp, now that we're considering folding protected timestamps into zone configs, is using this span configs infrastructure as the "transport" mechanism to inform KV of what ranges are protected at what timestamps. Even then, I think it's fairly easy (desirable even! and non-invasive!) to avoid fully generalized inheritance by providing fast paths to protect large swaths of the keyspace at a time. I'll make a plug for this once we have that RFC published.


docs/RFCS/20210610_tenant_zone_configs.md, line 210 at r3 (raw file):

Previously, knz (kena) wrote…

My interest for "distributed scfgs" notwithstanding, if we keep a single scfg table, splitting the span_configurations table would be an intermediate solution, that would restore a little bit of good properties:

  • we would be able to home the scfg data close(r) to its tenant, to avoid cross-region traffic
  • we would be able to avoid unavailability for 1 tenant affecting that of another (assuming no scfg processing operation requires full table scans), assuming we use one rangefeed per partition

I tried talking to some of these questions in other threads.


docs/RFCS/20210610_tenant_zone_configs.md, line 211 at r3 (raw file):

Previously, knz (kena) wrote…

Also, the full data in the key may contain indexed values from the table. For compliance with data domiciliation (GDPR etc) this means that this information must be subject to the same data domiciliation rules as the corresponding partition in the table.

So this is one more argument in favor of splitting this table, and guaranteeing that the span configs that correspond to each table partition are homed in that table's region.

This was a non-goal, considering this table is hardly unique in having the exact problem you're pointing out. We'll have to audit more thoroughly to build towards this fully-compliant ideal (whatever that means); this schema is one of many we'll have to find a general solution for.


docs/RFCS/20210610_tenant_zone_configs.md, line 269 at r3 (raw file):

Previously, knz (kena) wrote…

A related concern: do we trust the SQL layer to count this right? Ideally, we'd like to avoid trusting SQL, to avoid the chance that a compromised SQL pod populate problematic scfgs in the host cluster.

It wouldn't be counted in SQL. In any case, this part of the design was not done during this RFC's implementation; it'll be tracked here: #70555.


docs/RFCS/20210610_tenant_zone_configs.md, line 291 at r3 (raw file):

Previously, RaduBerinde wrote…

What happens if there are a ton of configs (split points) within this span? How can I paginate through that?

We didn't implement pagination yet, but I imagine it'd be a "GetSpanConfigsRequest" with a prescribed limit, with the response indicating whether there are more entries to be requested; callers would continue pagination with a new start key.
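A sketch of what such a client-side pagination loop might look like, with hypothetical request/response shapes (the eventual RPC may well differ):

```go
package main

import "fmt"

// A minimal, hypothetical sketch of paginating span configs: each call
// returns at most `limit` entries starting at `start`, plus a resume key if
// there may be more. None of these names are the actual API.
type entry struct {
	startKey, endKey, config string
}

func getPage(all []entry, start string, limit int) (page []entry, resume string) {
	for _, e := range all {
		if e.startKey >= start {
			page = append(page, e)
			if len(page) == limit {
				return page, e.endKey // caller resumes from here
			}
		}
	}
	return page, "" // no more pages
}

func main() {
	all := []entry{
		{"a", "c", "cfg1"}, {"c", "f", "cfg2"}, {"f", "k", "cfg3"},
	}
	var results []entry
	start := "a"
	for {
		page, resume := getPage(all, start, 2)
		results = append(results, page...)
		if resume == "" {
			break
		}
		start = resume
	}
	fmt.Println(len(results)) // 3
}
```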


docs/RFCS/20210610_tenant_zone_configs.md, line 296 at r3 (raw file):

Previously, RaduBerinde wrote…

These are all the span configs that overlap with the given span? So the first scfg's start key could be before the one I asked for, and similarly for the last end key?

Yup. The specifics of this API are found in the implementation linked below; I'm lazily avoiding touching the RFC text any further.
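For illustration only (hypothetical types, not the actual API), a minimal sketch of the overlap semantics described above:

```go
package main

import "fmt"

// Span and SpanConfigEntry are simplified stand-ins for the proto messages
// discussed above; they are not the actual CRDB types.
type Span struct{ Start, End string } // [Start, End)

type SpanConfigEntry struct {
	Span   Span
	Config string // placeholder for the actual SpanConfig payload
}

// overlapping returns the stored entries that overlap the query span. Note
// that the first returned entry may start before query.Start and the last may
// end after query.End -- callers clamp if they need to.
func overlapping(stored []SpanConfigEntry, query Span) []SpanConfigEntry {
	var out []SpanConfigEntry
	for _, e := range stored {
		if e.Span.Start < query.End && query.Start < e.Span.End {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	stored := []SpanConfigEntry{
		{Span{"a", "c"}, "cfg1"},
		{Span{"c", "f"}, "cfg2"},
		{Span{"f", "k"}, "cfg3"},
	}
	// Querying [b, g) returns all three entries; the first starts at "a"
	// (before "b") and the last ends at "k" (after "g").
	fmt.Println(overlapping(stored, Span{"b", "g"}))
}
```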


docs/RFCS/20210610_tenant_zone_configs.md, line 299 at r3 (raw file):

Previously, RaduBerinde wrote…

It's important to flesh out the details of such an API. There should be enough comments here to clarify the following questions:

  • if I try to upsert a config for a range that contains multiple existing configs, is that essentially removing some key splits?
  • if I try to delete a config for a span that starts or ends inside an existing config span, does that introduce a split?
  • I assume the upsert (and respectively delete) spans are non-overlapping.. do they also need to be sorted?
  • can delete spans overlap with upsert spans? (e.g. first delete all configs inside a tenant and recreate them)

All these details (and more) were fleshed out in the actual implementation. I've just linked to snippets below instead of spending more time on this text.
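As an illustration of two of the questions above (sortedness within each list, and deletes overlapping upserts because deletes apply first), a rough sketch with made-up types; the actual KVAccessor contract lives in the linked implementation:

```go
package main

import "fmt"

type sp struct{ start, end string } // [start, end)

// validateTargets sketches one plausible contract for an
// UpdateSpanConfigs-style request: spans within each list must be sorted and
// non-overlapping, while a deletion may overlap an upsert because deletions
// are applied first. Illustration only.
func validateTargets(toDelete, toUpsert []sp) error {
	checkSortedDisjoint := func(name string, spans []sp) error {
		for i := 1; i < len(spans); i++ {
			if spans[i].start < spans[i-1].end {
				return fmt.Errorf("%s spans %v and %v overlap or are unsorted",
					name, spans[i-1], spans[i])
			}
		}
		return nil
	}
	if err := checkSortedDisjoint("delete", toDelete); err != nil {
		return err
	}
	return checkSortedDisjoint("upsert", toUpsert)
}

func main() {
	// Delete the tenant's entire old span and re-write two finer-grained ones;
	// the overlap across the two lists is fine under delete-before-upsert.
	toDelete := []sp{{"/Tenant/5", "/Tenant/6"}}
	toUpsert := []sp{
		{"/Tenant/5/Table/52", "/Tenant/5/Table/53"},
		{"/Tenant/5/Table/53", "/Tenant/5/Table/60"},
	}
	fmt.Println(validateTargets(toDelete, toUpsert)) // <nil>
}
```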


docs/RFCS/20210610_tenant_zone_configs.md, line 357 at r3 (raw file):

Previously, RaduBerinde wrote…

[nit] We should avoid reusing the term leaseholder since that already has a very clear use in our system. Could be SQL leader/coordinator/supervisor.

Done; we have a "manager" thing in the actual code.


docs/RFCS/20210610_tenant_zone_configs.md, line 432 at r3 (raw file):

Previously, knz (kena) wrote…

I think a provision should be made to allow zcfgs to specify "forced" splits even if adjacent configs are equivalent.
Why? For two reasons:

  • fault tolerance. It's the same story as above: if one table's ranges become unavailable, a tenant may desire to preserve availability of adjacent tables.
  • performance. An end-user may wish to use some of their "split budget" to make write-heavy workloads to separate tables become separate ranges so they can be split across separate leaseholder nodes.

Yup, did you happen to see #65903? I believe it captures what you're describing (and is linked ~ somewhere in this giant blob of an RFC).


docs/RFCS/20210610_tenant_zone_configs.md, line 439 at r3 (raw file):

Previously, RaduBerinde wrote…

Related to limiting split points - to me split points seem most closely related to storage, in that a split should cost no more than having 500MB of data lying around (this amount would require a split anyway). So maybe a split should (from cost/limiting perspective) behave like using some fixed amount of storage (either 500MB or maybe the average range size 320MB), and have that be enforced as part of enforcing storage (which is currently done through periodic monitoring and corresponding adjustments to the available RUs, rather than hard limits).

That's an interesting idea. Good thing we're talking next week, I was planning to discuss this in more detail. For now, closing this thread in favour of #70555.


docs/RFCS/20210610_tenant_zone_configs.md, line 530 at r3 (raw file):

Previously, knz (kena) wrote…

After discussing this with Arul, we found out that using point configs was undesirable, because it would cause a scfg for one table to automatically "leak" to every table after it in creation order that didn't get a scfg defined yet. Given the async nature of zcfg propagation, we'd see the data for those tables (possibly an unbounded number of them) move around as the zcfgs get incrementally applied.
It would also be semantically strange for the user to see tables appearing to be subject to zone configs that were never set on them.

(We accept that there may be transient state around a table that's undergoing a zcfg change, with both the before-state and the after-state being non-deterministically applied; however, we really wish to never see a state that's not logically related to the zcfg being applied.)

Coming back to the concern that both Radu and I have (I am presuming Radu's opinion): the text of the RFC as currently written suggests that the fallback is a global, cluster-wide static scfg. IMHO we want to provide tenants control over the placement of data in the "holes".
Therefore, the RFC should be updated to mention that there will be a configurable per-tenant fallback, and not a "global, static" fallback.

Arul points out that the KV layer does not currently know about a tenant default scfg in this proposal. Although there is an initial scfg populated for each tenant, it gets split up as tables get added, and remains the scfg applicable for holes (by construction, copies of it remain in-between all the scfgs that were assigned by zcfgs: when a table is added, the initial scfg is duplicated into a pre-table and post-table segment, and the table scfg is inserted in-between). However, we do not have an easy way to modify all the copies of the initial scfg that have been propagated to the holes, after it's been split up by tables. If we want to introduce per-tenant fallbacks, we'd need an API between SQL and KV to update all the scfgs in-between those that were derived from zcfgs. This is a new mechanism that we hadn't considered to be in-scope.

TBD: discuss this in the next group meeting around this RFC.

I played around with the idea of defining configs up until the next key, but it was complicated because of exactly these "trailing tables unintentionally applying an earlier key's config" reasons.

As for "config between the gaps", it's fairly easy to do it even without configs mapped only to keys -- we just haven't because there hasn't been a product ask for it. When writing configs to KV, we're acutely aware of where the gaps are (check out spanconfig.StoreWriter -- it can tell you where the new "holes" are when writing only to specific spans within a larger one); for now we just decided to not write anything in those gaps and fall back to a static config. We can make this more sophisticated as actual requirements do as well.


docs/RFCS/20210610_tenant_zone_configs.md, line 617 at r3 (raw file):

Previously, knz (kena) wrote…

Arguably, a tenant's "own" entries in the host systems' system.span_config tables should be configurable to be homed in a particular region of the customer's choice. What mechanism can we offer for this?

I'm not imagining them as the tenant's own entries at all; the tenant's completely unaware of their existence. It's more like KV book-keeping data for what configs it needs to apply to what spans; it just so happens that they're tenant spans.


docs/RFCS/20210610_tenant_zone_configs.md, line 676 at r3 (raw file):

it would ensure that unavailability of a scfg range does not affect all tenants simultaneously

The SPOF for system.span_configurations is concerning, but I don't know if partitioning the storage is how I'd address it. It's also not the only "centralized" state we maintain in the system. In fact, the contents of that table are reconstructable from the individual tenant system.zones tables, so it's actually better than the other SPOFs (node liveness, range meta, etc.)

it would allow us to "home" the scfgs close to the tenant's data, and avoid the risk of seeing cross-region traffic to inspect scfgs

See other thread; I'm seeing span configs less as "tenant-owned" and more as "kv-owned". I don't think there's risk of seeing cross-region traffic to inspect scfgs; KV nodes learn about scfgs asynchronously through rangefeeds and it's never in the critical path for requests.

it ensures that scfg changes by one tenant can make progress wrt replica placement regardless of activity in other tenants

I also don't know that this is any better solved with a more decentralized approach. I would argue the opposite -- a thin central API would let us rate limit/admission control tenants more effectively. This API is not called directly through tenant activity; each pod maintains a background "reconciliation process" that does it. It's not in the critical path for requests by design.

Zone configs dictate data placement, replication factor, and GC
behavior; they power CRDB's multi-region abstractions. They're disabled
for secondary tenants due to scalability bottlenecks in how they're
currently stored and disseminated. They also prevent writes before DDL
in the same transaction, implement inheritance and key-mapping by
coupling KV and SQL in an undesirable manner (making for code that's
difficult to read and write), and don't naturally extend to a
multi-tenant CRDB.

This RFC proposes a re-work of the zone configs infrastructure to enable
its use for secondary tenants, and in doing so addresses the problems
above. We introduce a distinction between SQL zone configs (attached to
SQL objects, living in the tenant keyspace) and KV zone configs (applied
to an arbitrary keyspan, living in the host tenant), re-implement
inheritance by introducing unique IDs for each zone configuration
object, and notify each node of updates through rangefeeds.

Release note: None
@irfansharif

bors r+

@craig

craig bot commented Dec 6, 2021

Build succeeded:

@craig craig bot merged commit d466b09 into cockroachdb:master Dec 6, 2021
@irfansharif irfansharif deleted the 210610.tenant-zcfgs branch December 6, 2021 18:10