- Feature Name: Region Liveness
- Status: draft
- Authors: Jeff Swenson
- Cockroach Issue: https://github.com/cockroachdb/cockroach/issues/103727

# Summary

This RFC proposes a region liveness subsystem for the database. The region
liveness subsystem is backed by the `system.region_liveness` table and is used
to determine when a region is unavailable. Tracking region liveness allows SQL
servers to store session state, such as descriptor leases, in tables configured
to survive only zone failure, without allowing the failure of a region to
impact cluster-level operations like schema changes.

## Requirements

* Serverless clusters can survive region failures without paying a startup
  latency cost compared to clusters with a survive-zone system database.
* If a region is unavailable, schema operations that do not depend on the
  unavailable region can eventually make forward progress.
* If a region is unavailable, jobs that were claimed by SQL servers in the
  unavailable region can be reclaimed by servers in available regions.

# Motivation

[As of 23.1](https://github.com/cockroachdb/cockroach/issues/85612), it is
possible for the `root` user to configure the system database as multi-region.
In general, tables read during the cold start process are configured as global
tables, and tables written during the cold start process are configured as
regional by row tables. Partitioning tables by read and write access in this
way allows SQL servers with a survive-zone system database to start up without
executing any cross-region read or write operations.

If a user configures any of their databases as survive region, then their
[system database is automatically promoted to survive
region](https://github.com/cockroachdb/cockroach/issues/102134). This means
writes to the regional by row tables are forced to cross region boundaries, so
configuring any database as survive region slows down serverless cold starts.

Below is an exhaustive list of tables written to by the SQL server in the cold
start process.

* `system.sqlliveness` is used to lease a `session_id`. The `session_id` is
  used to lease `system.sql_instances` rows and `system.jobs` rows.
  ```sql
  CREATE TABLE public.sqlliveness (
      session_id BYTES NOT NULL,
      expiration DECIMAL NOT NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, session_id ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```
* `system.sql_instances` is used to lease an `id` for the SQL server, which is
  used to lease descriptors in the `system.lease` table. `system.sql_instances`
  is also used for SQL server to SQL server service discovery. The `id` column
  is guaranteed to be globally unique and relatively small. IDs are
  pre-allocated so that they can be claimed during start up with a region-local
  query.
  ```sql
  CREATE TABLE public.sql_instances (
      id INT8 NOT NULL,
      addr STRING NULL,
      session_id BYTES NULL,
      locality JSONB NULL,
      sql_addr STRING NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      binary_version STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, id ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```
* `system.lease` is used to track which version of each descriptor is leased
  by each node.
  ```sql
  CREATE TABLE public.lease (
      "descID" INT8 NOT NULL,
      version INT8 NOT NULL,
      "nodeID" INT8 NOT NULL,
      expiration TIMESTAMP NOT NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, "descID" ASC,
          version ASC, expiration ASC, "nodeID" ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```

For optimal cold start performance, the regional by row tables must be
configured as survive zone. A key property of these tables is that each row
belongs to an individual SQL server. Therefore, if a region is known to be
unavailable, it contains no healthy SQL servers, and its rows in these tables
can be treated as if they do not exist.

For all other system tables, there is no cold start impact to making them
survive region, so they will be configured as SURVIVE REGION even if the
cluster contains only SURVIVE ZONE databases.

# Proposal

A new `system.region_liveness` table will be created to track which regions
are available. The table has the following schema:

```sql
CREATE TABLE system.public.region_liveness (
    crdb_region system.public.crdb_internal_region NOT NULL,
    unavailable_at TIMESTAMP NULL,
    CONSTRAINT region_liveness_pkey PRIMARY KEY (crdb_region ASC)
) LOCALITY GLOBAL
```

A region is unavailable if `unavailable_at` happens before the logical
timestamp of the transaction. If a region is unavailable, servers may treat the
region's rows in the `system.sqlliveness`, `system.sql_instances`, and
`system.lease` tables as if they do not exist.
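
The following is a minimal sketch of this predicate in Go, using `time.Time` in
place of CRDB's HLC timestamps; the package and row type are hypothetical, and
the later sketches in this RFC reuse them:

```go
package regionliveness

import "time"

// Row mirrors a system.region_liveness row. A NULL unavailable_at
// column is represented by a nil pointer.
type Row struct {
	Region        string
	UnavailableAt *time.Time
}

// UnavailableFor reports whether a transaction reading at txnTime must
// treat the region as unavailable: true iff unavailable_at is set and
// happens before the transaction's timestamp.
func (r Row) UnavailableFor(txnTime time.Time) bool {
	return r.UnavailableAt != nil && r.UnavailableAt.Before(txnTime)
}
```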

If a SQL server thinks a region is unavailable, the server sets the
`unavailable_at` column for that region. The server picks a timestamp for
`unavailable_at` that is far enough in the future to guarantee all existing
`system.sqlliveness` sessions will have expired by the time the region is
considered unavailable.

When a server wishes to create or extend a `system.sqlliveness` session, it
must consult the `system.region_liveness` table. If `unavailable_at` IS NOT
NULL, then the server must pick a `sqlliveness` expiration timestamp that comes
before the `unavailable_at` timestamp.

If `unavailable_at` is not null but set to a timestamp after the transaction
timestamp, a SQL server can attempt to set it to null in a transaction
modifying only the `region_liveness` table. If `unavailable_at` is before the
transaction timestamp, then the server must delete all of the region's rows in
the `system.sqlliveness`, `system.sql_instances`, and `system.lease` tables in
the transaction that sets `unavailable_at` to `NULL`. The deletion is necessary
because servers in other regions may have acted as if the rows in the region
did not exist.
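
Below is a sketch of this recovery transaction, continuing the hypothetical
package above. The `Txn` interface and its methods are stand-ins for the raw KV
operations, not CRDB's actual client API:

```go
// Txn abstracts the handful of hypothetical operations the sketches
// need from a KV transaction.
type Txn interface {
	Now() time.Time                                      // transaction timestamp
	ReadUnavailableAt(region string) (*time.Time, error) // nil when available
	ClearUnavailableAt(region string) error              // set the column to NULL
	DeleteRegionRows(table, region string) error         // wipe a region's rows
}

// RecoverRegion marks a fenced region as available again. If the fence
// has already taken effect, the region's rows in the session, instance,
// and lease tables must be deleted in the same transaction, because
// servers in other regions may have acted as if those rows did not exist.
func RecoverRegion(txn Txn, region string) error {
	unavailableAt, err := txn.ReadUnavailableAt(region)
	if err != nil || unavailableAt == nil {
		return err // error, or nothing to recover
	}
	if unavailableAt.Before(txn.Now()) {
		// The fence has taken effect: other regions may have ignored
		// these rows, so they must be wiped before the fence is lifted.
		for _, table := range []string{
			"system.sqlliveness", "system.sql_instances", "system.lease",
		} {
			if err := txn.DeleteRegionRows(table, region); err != nil {
				return err
			}
		}
	}
	return txn.ClearUnavailableAt(region)
}
```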

# Details

## Region Liveness

### How is the `unavailable_at` timestamp picked?

When writing the `unavailable_at` timestamp, an expiration time must be
picked that is far enough in the future that all `system.sqlliveness` sessions
will have expired by the time the region is treated as unavailable. This
decision must be made without reading from the `system.sql_instances` table,
because if the region is unavailable, reading the unavailable region's rows in
the `system.sql_instances` table will time out.

The simplest option is to pick an `unavailable_at` timestamp that is the
transaction timestamp +
[server.sqlliveness.ttl](https://github.com/cockroachdb/cockroach/blob/37e61e9e245ac623b461e7917761c27d0223f1bb/pkg/sql/sqlliveness/slinstance/slinstance.go#L42).
It's not necessary to account for clock skew when picking `unavailable_at`
because the logic is based on transaction timestamps, and the SQL servers
writing to the `system.sqlliveness` table read from the
`system.region_liveness` table in the same transaction.
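
Continuing the sketch, the computation is simple arithmetic over the fencing
transaction's timestamp and the session TTL read from the setting:

```go
// PickUnavailableAt returns the fencing timestamp: the fencing
// transaction's timestamp plus the session TTL, which guarantees that
// every session created or extended before this transaction committed
// has expired by the time the region is treated as unavailable. No
// clock-skew padding is needed because both sides compare transaction
// timestamps.
func PickUnavailableAt(txnTime time.Time, sessionTTL time.Duration) time.Time {
	return txnTime.Add(sessionTTL)
}
```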

There is one danger in picking the `unavailable_at` timestamp based on
`server.sqlliveness.ttl`: it is a setting that users may change.

Consider the following sequence, with a session TTL that starts at 30 seconds:

* T0: a session is created in region-a that expires at T0+30 = T30.
* T1: the session expiration setting is lowered to 5 seconds.
* T2: region-a is fenced with `unavailable_at` = T2+5.

In this example, region-a will be treated as unavailable at T2+5, which can be
well before T30, while there is still an active session in the region. This
violates the `system.region_liveness` table's invariant.

Here are a few ways the risk of changing the `server.sqlliveness.ttl` setting
could be mitigated:
* Option 1: Look at historical values of the setting and pick the max value
when fencing a region. This is the most work, but the infrastructure for
reading multiple versions of a setting would also be useful for improving the
behavior of cluster setting version gates.
* Option 2: Pick something like 2x the setting value as the region expiration
  timestamp, and validate the setting to prevent users from increasing it by
  more than 2x at a time. Validation prevents an individual setting change from
  introducing an unsafe configuration, but multiple setting changes can still
  introduce one unless the validation looks at all historical values. At best
  this and `Option 3` are half measures: they avoid the most likely causes of
  unsafe configuration, but do not avoid all of them.
* Option 3: Introduce another setting for region expiration. Have validation to
enforce region expiration is greater than the session expiration value.
* Option 4: Remove the setting and use a hard coded value instead.

For the initial implementation, this problem will be ignored. The multi-region
system database is only used by serverless deployments, and the serverless team
will use `ALTER TENANT ALL SET CLUSTER SETTING` to override
`server.sqlliveness.ttl` and prevent users from changing it.

### Avoiding Transaction Refreshes when Setting `unavailable_at`

The CRDB commit protocol contains a hazard that makes it easy to violate the
constraint that `txn_timestamp + server.sqlliveness.ttl < unavailable_at`. When
committing a transaction, if the timestamp cache for a row read by the
transaction is newer than the transaction timestamp, the commit process will
refresh the reads and push the commit timestamp into the future. This can be
avoided by setting a commit deadline equal to the transaction timestamp in the
transaction that writes the `unavailable_at` timestamp. That way, if the
transaction would push the commit timestamp, the commit fails instead. The code
attempting to set `unavailable_at` can retry the transaction, but will need to
pick a new `unavailable_at` timestamp.
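
Here is a sketch of the fencing write with the deadline applied.
`SetCommitDeadline` and `WriteUnavailableAt` are hypothetical additions to the
`Txn` interface from the earlier sketch, standing in for the KV client's
commit-deadline mechanism:

```go
// FencingTxn extends the hypothetical Txn with the two operations the
// fencing write needs.
type FencingTxn interface {
	Txn
	SetCommitDeadline(deadline time.Time) error
	WriteUnavailableAt(region string, at time.Time) error
}

// MarkUnavailable writes unavailable_at with a commit deadline equal to
// the transaction timestamp. If the commit timestamp would be pushed
// past the deadline, the commit fails instead of silently shrinking the
// txn_timestamp + TTL margin; the caller retries with a fresh
// transaction and a newly picked unavailable_at.
func MarkUnavailable(txn FencingTxn, region string, sessionTTL time.Duration) error {
	now := txn.Now()
	if err := txn.SetCommitDeadline(now); err != nil {
		return err
	}
	return txn.WriteUnavailableAt(region, PickUnavailableAt(now, sessionTTL))
}
```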

### Using the KV API

Because `system.sqlliveness`, `system.sql_instances`, and `system.lease` are
involved in bootstrapping the SQL layer, the subsystems interacting with these
tables use the raw KV API instead of the SQL execution layer. Likewise, code
writing to the new `system.region_liveness` table will use the KV API directly.
It is okay to use SQL to interact with the `system.region_liveness` table as
long as the SQL is not used during the SQL bootstrap process.

### Why `unavailable_at` Instead of `expiration`?

An alternative to tracking an `unavailable_at` timestamp for each region is a
table containing an `expiration` timestamp. SQL servers would periodically
advance the `expiration` timestamp, and it would be used as an upper bound for
the `system.sqlliveness` `expiration` column. The `expiration` design mirrors
the behavior of the `system.sqlliveness` table and is a little easier to reason
about.

The problem with heartbeating an `expiration` column in
`system.region_liveness` is that SQL servers are not always running, so the
`expiration` column would always be in the past during the cold start process.
Advancing `expiration` during cold start would require a cross-region write,
which defeats the whole purpose of making the tables regional by row and
survive zone.

### Why Use a Global Table Instead of Regional by Table?

Most systems interact with the `system.region_liveness` table via reads.
Making `system.region_liveness` a global table allows transactions to read from
a local replica. It would be possible to make `system.region_liveness` regional
by table if the `system.sqlliveness` heartbeating process used stale reads, but
stale reads make the correctness more difficult to reason about.

### When is a Region Marked as Unavailable?

The interface for the `system.region_liveness` table will contain a
`ProbeLiveness(region string)` function that attempts to read from the
`system.sql_instances` table in the probed region. If a region cannot be
reached within 15 seconds, `unavailable_at` is set to start the process of
fencing out the region.

15 seconds is chosen because it gives enough time for leases to transfer in
case the leaseholder of the `system.sql_instances` table died. The lease
transfer latency is controlled by `COCKROACH_RANGE_LEASE_DURATION`, which
defaults to 6 seconds.

Code interacting with the `system.sqlliveness` and `system.lease` tables will
trigger a liveness probe if a query of a region's rows times out.
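
A sketch of the probe follows, continuing the hypothetical package; the
region-scoped read and the fencing write are injected as function values, and
this part additionally imports `context` and `errors`:

```go
// probeTimeout exceeds the default range lease transfer time so that a
// lease transfer is not mistaken for a region failure.
const probeTimeout = 15 * time.Second

// ProbeLiveness reads the probed region's system.sql_instances rows. If
// the read does not complete within probeTimeout, it begins fencing the
// region by writing unavailable_at.
func ProbeLiveness(
	ctx context.Context,
	region string,
	readInstanceRows func(context.Context, string) error,
	beginFencing func(context.Context, string) error,
) error {
	probeCtx, cancel := context.WithTimeout(ctx, probeTimeout)
	defer cancel()
	err := readInstanceRows(probeCtx, region)
	if err == nil {
		return nil // the region responded: it is live
	}
	if !errors.Is(err, context.DeadlineExceeded) {
		return err // a non-timeout failure: do not fence the region
	}
	return beginFencing(ctx, region)
}
```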

### `system.region_liveness` Needs a Fixed Descriptor ID

There is a limited number of remaining static descriptor IDs in the system
database. Since `system.region_liveness` will be involved in bootstrapping the
schema subsystem, it would be difficult to use a dynamically allocated
descriptor ID. Since CRDB is effectively out of static descriptor IDs, an ID
that was used, but then removed from the system database's initial schema, will
be used for `system.region_liveness`.

## SQL Liveness Details

### SQL Liveness Heart Beating

Whenever the `system.sqlliveness.expiration` column is extended, the
`system.region_liveness.unavailable_at` column for the region should be checked
to ensure the region is still available at the timestamp selected for the
session. If the region would be unavailable at the desired timestamp, the SQL
liveness subsystem will clamp the expiration to the instant the region becomes
unavailable, as sketched below.
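
A minimal sketch of the clamp, where `unavailableAt` is nil when no fence is
pending for the session's region:

```go
// ClampExpiration enforces the region liveness invariant when creating
// or extending a session: a session may not outlive a pending
// unavailable_at fence for its region.
func ClampExpiration(desired time.Time, unavailableAt *time.Time) time.Time {
	if unavailableAt != nil && unavailableAt.Before(desired) {
		return *unavailableAt // the session must expire before the fence
	}
	return desired
}
```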

### `sqlliveness.Reader.IsAlive`

The main API to the `system.sqlliveness` table is the `IsAlive` method that
returns `true` if a session may be alive and `false` if a session is dead. Once
a session is considered dead, it must stay dead forever.

`IsAlive` will have two key changes made to it:
1. If the region is unavailable, all sessions from the region will be treated
as if they are dead. Every time a new `session_id` from a region is
encountered, `sqlliveness` will read the `system.region_liveness` table to
ensure the region is still down.
2. If `IsAlive` fails to read from a region, it will trigger a liveness probe
   for the region. Both changes are sketched below.
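
The following sketch combines both changes; the helpers are injected as
function values because the real logic lives in CRDB's `sqlliveness` package
(the signatures are hypothetical):

```go
// IsAlive returns true if the session may be alive and false once it is
// known dead. Sessions in fenced regions are dead; an unreachable
// region triggers a probe while the session is conservatively treated
// as possibly alive.
func IsAlive(
	ctx context.Context,
	sessionID, sessionRegion string,
	regionUnavailable func(context.Context, string) (bool, error),
	sessionExists func(context.Context, string) (bool, error),
	triggerProbe func(region string),
) (bool, error) {
	unavailable, err := regionUnavailable(ctx, sessionRegion)
	if err != nil {
		return false, err
	}
	if unavailable {
		return false, nil // every session in a fenced region is dead
	}
	exists, err := sessionExists(ctx, sessionID)
	if errors.Is(err, context.DeadlineExceeded) {
		triggerProbe(sessionRegion) // start a background liveness probe
		return true, nil            // possibly alive until the probe resolves
	}
	if err != nil {
		return false, err
	}
	return exists, nil
}
```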

## Descriptor Leasing Details

### Lease Expiration vs `session_id` Expiration

There are two timestamps that are relevant for leases and lease expiration.

1. The `system.lease.expiration` column.
2. The `system.sqlliveness.expiration` column. The relationship between a lease
and the sqlliveness session is indirect. The `system.lease.nodeID` column
stores a `system.sql_instances.id`, which is leased by a
`system.sqlliveness.session_id`.

The `system.lease.expiration` timestamp will usually exceed the
`system.sqlliveness.expiration` time. But the lease subsystem is programmed
such that a server only assumes the lease lives until the SQL liveness
session's expiration timestamp. This means it is safe to ignore leases in an
unavailable region, because creating a `system.sqlliveness.session_id` to claim
the `sql_instance.id` row that owns the unexpired lease will require marking
the region as available, which will delete all stale leases in the region
before servers are allowed to start up.

### Pseudo Code for Advancing Descriptor Version

A descriptor version is allowed to advance to version n+1 as long as there are
no outstanding leases for version n-1. When searching for descriptor leases,
the lease subsystem will skip unavailable regions.

```
txn := BeginTxn()
regionLiveness := ReadRegionLiveness(txn)
for region in AllRegions():
    unavailableAt, exists := regionLiveness[region]
    if exists and unavailableAt.Before(txn.Time()):
        continue // skip unavailable regions
    if hasLease(region, descriptor, version):
        return true
return false
```

## SQL Instances

### ID Allocation

The `system.sql_instances.id` allocation loop will be rewritten to ignore
unavailable regions. This is safe because instances in the region will be
deleted by the transaction that restores region availability.
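
A sketch of the rewritten loop, with region availability and the per-region
claim injected as hypothetical helpers:

```go
// AllocateInstanceID claims a pre-allocated sql_instances ID, skipping
// regions currently fenced by system.region_liveness. Skipping is safe
// because a fenced region's rows are deleted by the transaction that
// restores its availability.
func AllocateInstanceID(
	regions []string,
	regionUnavailable func(region string) bool,
	claimFreeID func(region string) (id int64, ok bool),
) (int64, bool) {
	for _, region := range regions {
		if regionUnavailable(region) {
			continue // this region's rows will be wiped on recovery
		}
		if id, ok := claimFreeID(region); ok {
			return id, true
		}
	}
	return 0, false // no pre-allocated IDs are free in available regions
}
```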

### Instance Range Feed

There is a rangefeed over the `system.sql_instances` table. The rangefeed
maintains an instance cache that is used to find servers for distsql. There are
two changes required to the cache that the rangefeed maintains:
1. The cache might see duplicate instance IDs if a region is unavailable and an
   instance ID used by the unavailable region is claimed by an available
   region. An easy way to work around this is to keep the
   `system.sql_instances` row with the more recent transaction timestamp, as
   sketched after this list.
2. The rangefeed should be split by region so that the cache can be initialized
   even when a region is unavailable.
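
A sketch of the timestamp-based resolution, with a simplified row type and
cache shape:

```go
// InstanceRow is a simplified system.sql_instances row as observed by
// the rangefeed, tagged with the commit timestamp of the event.
type InstanceRow struct {
	ID        int64
	Region    string
	Addr      string
	Timestamp time.Time
}

// Upsert resolves duplicate instance IDs across regions by keeping the
// row written at the more recent transaction timestamp: if a fenced
// region's ID was reclaimed by an available region, the newer claim wins.
func Upsert(cache map[int64]InstanceRow, row InstanceRow) {
	if existing, ok := cache[row.ID]; ok && existing.Timestamp.After(row.Timestamp) {
		return // the cached row is newer; ignore the stale event
	}
	cache[row.ID] = row
}
```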

## Jobs

The job subsystem runs a periodic scan to reclaim jobs acquired by a dead
`system.sqlliveness.session_id`. The jobs claim loop uses
`sqlliveness.Reader.IsAlive`, which will be enhanced to consult
`system.region_liveness`, so no job-system-specific changes are needed to
benefit from the `system.region_liveness` table.

## System Database

Once region liveness is available, we need to configure most tables in the
system database so that they survive region failures. The regional by row
tables will be configured so that they only survive zone failures.

## Backporting

The current plan is to backport this functionality to 23.2. This is reasonably
safe because the functionality will be controlled by a setting and will only be
enabled in multi-region serverless clusters.

The overall release process is:
1. 23.2 will include the `system.region_liveness` table.
2. A minor release will include the logic to consult the
   `system.region_liveness` table; the `ProbeLiveness` function will be guarded
   by a cluster setting.
3. Once the minor release is fully deployed, the serverless team will enable
the `ProbeLiveness` feature. If the serverless team needs to roll back the
release, they will first have to disable the `ProbeLiveness` feature.
