- Feature Name: Region Liveness
- Status: draft
- Authors: Jeff Swenson
- Cockroach Issue: https://github.com/cockroachdb/cockroach/issues/103727

# Summary

This RFC proposes a region liveness subsystem for the database. The region
liveness subsystem is backed by the `system.region_liveness` table and is used
to determine when a region is unavailable. Tracking region liveness allows SQL
servers to store session state, such as descriptor leases, in tables configured
to survive only zone failure, without allowing the failure of a region to
impact cluster-level operations like schema changes.

## Requirements

* Serverless clusters can survive region failures without paying a startup
  latency cost compared to clusters with a survive-zone system database.
* If a region is unavailable, schema operations that do not depend on the
  unavailable region can eventually make forward progress.
* If a region is unavailable, jobs that were claimed by SQL servers in the
  unavailable region can be reclaimed by servers in available regions.

# Motivation

[As of 23.1](https://github.com/cockroachdb/cockroach/issues/85612), it is
possible for the `root` user to configure the system database as multi-region.
In general, tables read during the cold start process are configured as global
tables, and tables written during the cold start process are configured as
regional by row tables. Partitioning tables by read and write access in this
way allows SQL servers with a survive-zone system database to start up without
executing any cross-region read or write operations.

If a user configures any of their databases as survive region, then their
[system database is automatically promoted to survive
region](https://github.com/cockroachdb/cockroach/issues/102134). This means
writes to the regional by row tables are forced to cross region boundaries, so
configuring any database as survive region slows down serverless cold starts.

Below is an exhaustive list of tables written to by the SQL server in the cold
start process.

* `system.sqlliveness` is used to lease a `session_id`. The `session_id` is
  used to lease `system.sql_instances` rows and `system.jobs` rows.
  ```sql
  CREATE TABLE public.sqlliveness (
      session_id BYTES NOT NULL,
      expiration DECIMAL NOT NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, session_id ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```
* `system.sql_instances` is used to lease an `id` for the SQL server, which is
  used to lease descriptors in the `system.lease` table. `system.sql_instances`
  is also used for SQL server to SQL server service discovery. The `id` column
  is guaranteed to be globally unique and relatively small. IDs are
  pre-allocated so that they can be claimed during start up with a region-local
  query.
  ```sql
  CREATE TABLE public.sql_instances (
      id INT8 NOT NULL,
      addr STRING NULL,
      session_id BYTES NULL,
      locality JSONB NULL,
      sql_addr STRING NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      binary_version STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, id ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```
* `system.lease` is used to track which version of each descriptor is leased
  by each node.
  ```sql
  CREATE TABLE public.lease (
      "descID" INT8 NOT NULL,
      version INT8 NOT NULL,
      "nodeID" INT8 NOT NULL,
      expiration TIMESTAMP NOT NULL,
      crdb_region system.public.crdb_internal_region NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (crdb_region ASC, "descID" ASC,
          version ASC, expiration ASC, "nodeID" ASC)
  ) LOCALITY REGIONAL BY ROW AS crdb_region
  ```

For optimal cold start performance, the regional by row tables must be
configured as survive zone. A key property of these tables is that each row
belongs to an individual SQL server. Therefore, if a region is known to be
unavailable, it contains no healthy SQL servers, and its rows in these tables
can be treated as if they do not exist.

For all other system tables, there is no cold start impact to making them
survive region, so they will be configured as SURVIVE REGION even if the
cluster contains only SURVIVE ZONE databases.

# Proposal

A new `system.region_liveness` table will be created to track which regions
are available. The table has the following schema:

```sql
CREATE TABLE system.public.region_liveness (
    crdb_region system.public.crdb_internal_region NOT NULL,
    unavailable_at TIMESTAMP NULL,
    CONSTRAINT region_liveness_pkey PRIMARY KEY (crdb_region ASC)
) LOCALITY GLOBAL
```

A region is unavailable if `unavailable_at` happens before the logical
timestamp of the transaction. If a region is unavailable, servers may treat the
region's rows in the `system.sqlliveness`, `system.sql_instances`, and
`system.lease` tables as if they do not exist.
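
The following is a minimal sketch of this predicate in Go, using `time.Time` in
place of CRDB's HLC timestamps; the package and row type are hypothetical, and
the later sketches in this RFC reuse them:

```go
package regionliveness

import "time"

// Row mirrors a system.region_liveness row. A NULL unavailable_at
// column is represented by a nil pointer.
type Row struct {
	Region        string
	UnavailableAt *time.Time
}

// UnavailableFor reports whether a transaction reading at txnTime must
// treat the region as unavailable: true iff unavailable_at is set and
// happens before the transaction's timestamp.
func (r Row) UnavailableFor(txnTime time.Time) bool {
	return r.UnavailableAt != nil && r.UnavailableAt.Before(txnTime)
}
```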

If a SQL server thinks a region is unavailable, the server sets the
`unavailable_at` column for that region. The server picks a timestamp for
`unavailable_at` that is far enough in the future to guarantee all existing
`system.sqlliveness` sessions will have expired by the time the region is
considered unavailable.

When a server wishes to create or extend a `system.sqlliveness` session, it
must consult the `system.region_liveness` table. If `unavailable_at` IS NOT
NULL, then the server must pick a `sqlliveness` expiration timestamp that comes
before the `unavailable_at` timestamp.

If `unavailable_at` is not null but set to a timestamp after the transaction
timestamp, a SQL server can attempt to set it to null in a transaction
modifying only the `region_liveness` table. If `unavailable_at` is before the
transaction timestamp, then the server must delete all of the region's rows in
the `system.sqlliveness`, `system.sql_instances`, and `system.lease` tables in
the transaction that sets `unavailable_at` to `NULL`. The deletion is necessary
because servers in other regions may have acted as if the rows in the region
did not exist.
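
Below is a sketch of this recovery transaction, continuing the hypothetical
package above. The `Txn` interface and its methods are stand-ins for the raw KV
operations, not CRDB's actual client API:

```go
// Txn abstracts the handful of hypothetical operations the sketches
// need from a KV transaction.
type Txn interface {
	Now() time.Time                                      // transaction timestamp
	ReadUnavailableAt(region string) (*time.Time, error) // nil when available
	ClearUnavailableAt(region string) error              // set the column to NULL
	DeleteRegionRows(table, region string) error         // wipe a region's rows
}

// RecoverRegion marks a fenced region as available again. If the fence
// has already taken effect, the region's rows in the session, instance,
// and lease tables must be deleted in the same transaction, because
// servers in other regions may have acted as if those rows did not exist.
func RecoverRegion(txn Txn, region string) error {
	unavailableAt, err := txn.ReadUnavailableAt(region)
	if err != nil || unavailableAt == nil {
		return err // error, or nothing to recover
	}
	if unavailableAt.Before(txn.Now()) {
		// The fence has taken effect: other regions may have ignored
		// these rows, so they must be wiped before the fence is lifted.
		for _, table := range []string{
			"system.sqlliveness", "system.sql_instances", "system.lease",
		} {
			if err := txn.DeleteRegionRows(table, region); err != nil {
				return err
			}
		}
	}
	return txn.ClearUnavailableAt(region)
}
```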

# Details

## Region Liveness

### How is the `unavailable_at` timestamp picked?

When writing the `unavailable_at` timestamp, an expiration time must be
picked that is far enough in the future that all `system.sqlliveness` sessions
will have expired by the time the region is treated as unavailable. This
decision must be made without reading from the `system.sql_instances` table,
because if the region is unavailable, reading the unavailable region's rows in
the `system.sql_instances` table will time out.

The simplest option is to pick an `unavailable_at` timestamp that is the
transaction timestamp +
[server.sqlliveness.ttl](https://github.com/cockroachdb/cockroach/blob/37e61e9e245ac623b461e7917761c27d0223f1bb/pkg/sql/sqlliveness/slinstance/slinstance.go#L42).
It's not necessary to account for clock skew when picking `unavailable_at`
because the logic is based on transaction timestamps, and the SQL servers
writing to the `system.sqlliveness` table read from the
`system.region_liveness` table in the same transaction.
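
Continuing the sketch, the computation is simple arithmetic over the fencing
transaction's timestamp and the session TTL read from the setting:

```go
// PickUnavailableAt returns the fencing timestamp: the fencing
// transaction's timestamp plus the session TTL, which guarantees that
// every session created or extended before this transaction committed
// has expired by the time the region is treated as unavailable. No
// clock-skew padding is needed because both sides compare transaction
// timestamps.
func PickUnavailableAt(txnTime time.Time, sessionTTL time.Duration) time.Time {
	return txnTime.Add(sessionTTL)
}
```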

There is one danger in picking the `unavailable_at` timestamp based on
`server.sqlliveness.ttl`: it is a setting that users may change.

Consider the following sequence, with a session TTL that starts at 30 seconds:

* T0: a session is created in region-a that expires at T0+30 = T30.
* T1: the session expiration setting is lowered to 5 seconds.
* T2: region-a is fenced with `unavailable_at` = T2+5.

In this example, region-a will be treated as unavailable at T2+5, which can be
well before T30, while there is still an active session in the region. This
violates the `system.region_liveness` table's invariant.

Here are a few ways the risk of changing the `server.sqlliveness.ttl` setting
could be mitigated:
* Option 1: Look at historical values of the setting and pick the max value
when fencing a region. This is the most work, but the infrastructure for
reading multiple versions of a setting would also be useful for improving the
behavior of cluster setting version gates.
* Option 2: Pick something like 2x the setting value as the region expiration
  timestamp, and validate the setting to prevent users from increasing it by
  more than 2x at a time. Validation prevents an individual setting change from
  introducing an unsafe configuration, but multiple setting changes can still
  introduce one unless the validation looks at all historical values. At best
  this and `Option 3` are half measures: they avoid the most likely causes of
  unsafe configuration, but do not avoid all of them.
* Option 3: Introduce another setting for region expiration. Have validation to
enforce region expiration is greater than the session expiration value.
* Option 4: Remove the setting and use a hard coded value instead.

For the initial implementation, this problem will be ignored. The multi-region
system database is only used by serverless deployments, and the serverless team
will use `ALTER TENANT ALL SET CLUSTER SETTING` to override
`server.sqlliveness.ttl` and prevent users from changing it.

### Avoiding Transaction Refreshes when Setting `unavailable_at`

The CRDB commit protocol contains a hazard that makes it easy to violate the
constraint that `txn_timestamp + server.sqlliveness.ttl < unavailable_at`. When
committing a transaction, if the timestamp cache for a row read by the
transaction is newer than the transaction timestamp, the commit process will
refresh the reads and push the commit timestamp into the future. This can be
avoided by setting a commit deadline equal to the transaction timestamp in the
transaction that writes the `unavailable_at` timestamp. That way, if the
transaction would push the commit timestamp, the commit fails instead. The code
attempting to set `unavailable_at` can retry the transaction, but will need to
pick a new `unavailable_at` timestamp.
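
Here is a sketch of the fencing write with the deadline applied.
`SetCommitDeadline` and `WriteUnavailableAt` are hypothetical additions to the
`Txn` interface from the earlier sketch, standing in for the KV client's
commit-deadline mechanism:

```go
// FencingTxn extends the hypothetical Txn with the two operations the
// fencing write needs.
type FencingTxn interface {
	Txn
	SetCommitDeadline(deadline time.Time) error
	WriteUnavailableAt(region string, at time.Time) error
}

// MarkUnavailable writes unavailable_at with a commit deadline equal to
// the transaction timestamp. If the commit timestamp would be pushed
// past the deadline, the commit fails instead of silently shrinking the
// txn_timestamp + TTL margin; the caller retries with a fresh
// transaction and a newly picked unavailable_at.
func MarkUnavailable(txn FencingTxn, region string, sessionTTL time.Duration) error {
	now := txn.Now()
	if err := txn.SetCommitDeadline(now); err != nil {
		return err
	}
	return txn.WriteUnavailableAt(region, PickUnavailableAt(now, sessionTTL))
}
```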

### Using the KV API

Because `system.sqlliveness`, `system.sql_instances`, and `system.lease` are
involved in bootstrapping the SQL layer, the subsystems interacting with these
tables use the raw KV API instead of the SQL execution layer. Likewise, code
writing to the new `system.region_liveness` table will use the KV API directly.
It is okay to use SQL to interact with the `system.region_liveness` table as
long as the SQL is not used during the SQL bootstrap process.

### Why `unavailable_at` Instead of `expiration`?

An alternative to tracking an `unavailable_at` timestamp for each region is a
table containing an `expiration` timestamp. SQL servers would periodically
advance the `expiration` timestamp, and it would be used as an upper bound for
the `system.sqlliveness` `expiration` column. The `expiration` design mirrors
the behavior of the `system.sqlliveness` table and is a little easier to reason
about.

The problem with heartbeating an `expiration` column in
`system.region_liveness` is that SQL servers are not always running, so the
`expiration` column would always be in the past during the cold start process.
Advancing `expiration` during cold start would require a cross-region write,
which defeats the whole purpose of making the tables regional by row and
survive zone.

### Why Use a Global Table Instead of Regional by Table?

Most systems interact with the `system.region_liveness` table via reads.
Making `system.region_liveness` a global table allows transactions to read from
a local replica. It would be possible to make `system.region_liveness` regional
by table if the `system.sqlliveness` heartbeating process used stale reads, but
stale reads make the correctness more difficult to reason about.

### When is a Region Marked as Unavailable?

The interface for the `system.region_liveness` table will contain a
`ProbeLiveness(region string)` function that attempts to read from the
`system.sql_instances` table in the probed region. If a region cannot be
reached within 15 seconds, `unavailable_at` is set to start the process of
fencing out the region.

15 seconds is chosen because it gives enough time for leases to transfer in
case the leaseholder of the `system.sql_instances` table died. The lease
transfer latency is controlled by `COCKROACH_RANGE_LEASE_DURATION`, which
defaults to 6 seconds.

Code interacting with the `system.sqlliveness` and `system.lease` tables will
trigger a liveness probe if a query of a region's rows times out.
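
A sketch of the probe follows, continuing the hypothetical package; the
region-scoped read and the fencing write are injected as function values, and
this part additionally imports `context` and `errors`:

```go
// probeTimeout exceeds the default range lease transfer time so that a
// lease transfer is not mistaken for a region failure.
const probeTimeout = 15 * time.Second

// ProbeLiveness reads the probed region's system.sql_instances rows. If
// the read does not complete within probeTimeout, it begins fencing the
// region by writing unavailable_at.
func ProbeLiveness(
	ctx context.Context,
	region string,
	readInstanceRows func(context.Context, string) error,
	beginFencing func(context.Context, string) error,
) error {
	probeCtx, cancel := context.WithTimeout(ctx, probeTimeout)
	defer cancel()
	err := readInstanceRows(probeCtx, region)
	if err == nil {
		return nil // the region responded: it is live
	}
	if !errors.Is(err, context.DeadlineExceeded) {
		return err // a non-timeout failure: do not fence the region
	}
	return beginFencing(ctx, region)
}
```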

### `system.region_liveness` Needs a Fixed Descriptor ID

There is a limited number of remaining static descriptor IDs in the system
database. Since `system.region_liveness` will be involved in bootstrapping the
schema subsystem, it would be difficult to use a dynamically allocated
descriptor ID. Since CRDB is effectively out of static descriptor IDs, an ID
that was used, but then removed from the system database's initial schema, will
be used for `system.region_liveness`.

## SQL Liveness Details

### SQL Liveness Heart Beating

Whenever the `system.sqlliveness.expiration` column is extended, the
`system.region_liveness.unavailable_at` column for the region should be checked
to ensure the region is still available at the timestamp selected for the
session. If the region would be unavailable at the desired timestamp, the SQL
liveness subsystem will clamp the expiration to the instant the region becomes
unavailable, as sketched below.
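
A minimal sketch of the clamp, where `unavailableAt` is nil when no fence is
pending for the session's region:

```go
// ClampExpiration enforces the region liveness invariant when creating
// or extending a session: a session may not outlive a pending
// unavailable_at fence for its region.
func ClampExpiration(desired time.Time, unavailableAt *time.Time) time.Time {
	if unavailableAt != nil && unavailableAt.Before(desired) {
		return *unavailableAt // the session must expire before the fence
	}
	return desired
}
```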

### `sqlliveness.Reader.IsAlive`

The main API to the `system.sqlliveness` table is the `IsAlive` method that
returns `true` if a session may be alive and `false` if a session is dead. Once
a session is considered dead, it must stay dead forever.

`IsAlive` will have two key changes made to it:
1. If the region is unavailable, all sessions from the region will be treated
as if they are dead. Every time a new `session_id` from a region is
encountered, `sqlliveness` will read the `system.region_liveness` table to
ensure the region is still down.
2. If `IsAlive` fails to read from a region, it will trigger a liveness probe
   for the region. Both changes are sketched below.
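
The following sketch combines both changes; the helpers are injected as
function values because the real logic lives in CRDB's `sqlliveness` package
(the signatures are hypothetical):

```go
// IsAlive returns true if the session may be alive and false once it is
// known dead. Sessions in fenced regions are dead; an unreachable
// region triggers a probe while the session is conservatively treated
// as possibly alive.
func IsAlive(
	ctx context.Context,
	sessionID, sessionRegion string,
	regionUnavailable func(context.Context, string) (bool, error),
	sessionExists func(context.Context, string) (bool, error),
	triggerProbe func(region string),
) (bool, error) {
	unavailable, err := regionUnavailable(ctx, sessionRegion)
	if err != nil {
		return false, err
	}
	if unavailable {
		return false, nil // every session in a fenced region is dead
	}
	exists, err := sessionExists(ctx, sessionID)
	if errors.Is(err, context.DeadlineExceeded) {
		triggerProbe(sessionRegion) // start a background liveness probe
		return true, nil            // possibly alive until the probe resolves
	}
	if err != nil {
		return false, err
	}
	return exists, nil
}
```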

## Descriptor Leasing Details

### Lease Expiration vs `session_id` Expiration

There are two timestamps that are relevant for leases and lease expiration.

1. The `system.lease.expiration` column.
2. The `system.sqlliveness.expiration` column. The relationship between a lease
and the sqlliveness session is indirect. The `system.lease.nodeID` column
stores a `system.sql_instances.id`, which is leased by a
`system.sqlliveness.session_id`.

The `system.lease.expiration` timestamp will usually exceed the
`system.sqlliveness.expiration` time. But the lease subsystem is programmed
such that a server only assumes the lease lives until the SQL liveness
session's expiration timestamp. This means it is safe to ignore leases in an
unavailable region, because creating a `system.sqlliveness.session_id` to claim
the `sql_instance.id` row that owns the unexpired lease will require marking
the region as available, which will delete all stale leases in the region
before servers are allowed to start up.

### Pseudo Code for Advancing Descriptor Version

A descriptor version is allowed to advance to version n+1 as long as there are
no outstanding leases for version n-1. When searching for descriptor leases,
the lease subsystem will skip unavailable regions.

```
txn := BeginTxn()
regionLiveness := ReadRegionLiveness(txn)
for region in AllRegions():
    unavailableAt, exists := regionLiveness[region]
    if exists and unavailableAt.Before(txn.Time()):
        continue // skip unavailable regions
    if hasLease(region, descriptor, version):
        return true
return false
```

## SQL Instances

### ID Allocation

The `system.sql_instances.id` allocation loop will be rewritten to ignore
unavailable regions. This is safe because instances in the region will be
deleted by the transaction that restores region availability.
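
A sketch of the rewritten loop, with region availability and the per-region
claim injected as hypothetical helpers:

```go
// AllocateInstanceID claims a pre-allocated sql_instances ID, skipping
// regions currently fenced by system.region_liveness. Skipping is safe
// because a fenced region's rows are deleted by the transaction that
// restores its availability.
func AllocateInstanceID(
	regions []string,
	regionUnavailable func(region string) bool,
	claimFreeID func(region string) (id int64, ok bool),
) (int64, bool) {
	for _, region := range regions {
		if regionUnavailable(region) {
			continue // this region's rows will be wiped on recovery
		}
		if id, ok := claimFreeID(region); ok {
			return id, true
		}
	}
	return 0, false // no pre-allocated IDs are free in available regions
}
```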

### Instance Range Feed

There is a rangefeed over the `system.sql_instances` table. The rangefeed
maintains an instance cache that is used to find servers for distsql. There are
two changes required to the cache that the rangefeed maintains:
1. The cache might see duplicate instance IDs if a region is unavailable and an
   instance ID used by the unavailable region is claimed by an available
   region. An easy way to work around this is to keep the
   `system.sql_instances` row with the more recent transaction timestamp, as
   sketched after this list.
2. The rangefeed should be split by region so that the cache can be initialized
   even when a region is unavailable.
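
A sketch of the timestamp-based resolution, with a simplified row type and
cache shape:

```go
// InstanceRow is a simplified system.sql_instances row as observed by
// the rangefeed, tagged with the commit timestamp of the event.
type InstanceRow struct {
	ID        int64
	Region    string
	Addr      string
	Timestamp time.Time
}

// Upsert resolves duplicate instance IDs across regions by keeping the
// row written at the more recent transaction timestamp: if a fenced
// region's ID was reclaimed by an available region, the newer claim wins.
func Upsert(cache map[int64]InstanceRow, row InstanceRow) {
	if existing, ok := cache[row.ID]; ok && existing.Timestamp.After(row.Timestamp) {
		return // the cached row is newer; ignore the stale event
	}
	cache[row.ID] = row
}
```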

## Jobs

The job subsystem runs a periodic scan to reclaim jobs acquired by a dead
`system.sqlliveness.session_id`. The jobs claim loop uses
`sqlliveness.Reader.IsAlive`, which will be enhanced to consult
`system.region_liveness`, so no job-system-specific changes are needed to
benefit from the `system.region_liveness` table.

## System Database

Once region liveness is available, we need to configure most tables in the
system database so that they survive region failures. The regional by row
tables will be configured so that they only survive zone failures.

## Backporting

The current plan is to backport this functionality to 23.2. This is reasonably
safe because the functionality will be controlled by a setting and will only be
enabled in multi-region serverless clusters.

The overall release process is:
1. 23.2 will include the `system.region_liveness` table.
2. A minor release will include the logic to consult the
   `system.region_liveness` table; the `ProbeLiveness` function will be guarded
   by a cluster setting.
3. Once the minor release is fully deployed, the serverless team will enable
the `ProbeLiveness` feature. If the serverless team needs to roll back the
release, they will first have to disable the `ProbeLiveness` feature.
