From 4b112c0329f8faaf44e1024aeac61eecf03e660a Mon Sep 17 00:00:00 2001 From: Spencer Kimball Date: Thu, 29 Sep 2016 23:38:20 -0400 Subject: [PATCH] Annotate design document update --- docs/design.md | 105 +++++++++++++++++++++++++++++++++++++------------ 1 file changed, 79 insertions(+), 26 deletions(-) diff --git a/docs/design.md b/docs/design.md index 0367d2e14be1..1aee51260bb9 100644 --- a/docs/design.md +++ b/docs/design.md @@ -4,6 +4,9 @@ by Spencer Kimball from early 2014. # Overview +[Spencer to edit overview. This stuff is so ripe it's starting to +stink: "structured data? WTF is that?] + Cockroach is a distributed key:value datastore (SQL and structured data layers of cockroach have yet to be defined) which supports **ACID transactional semantics** and **versioned values** as first-class @@ -14,6 +17,10 @@ disruption and **no manual intervention**. Cockroach nodes are symmetric; a design goal is **homogeneous deployment** (one binary) with minimal configuration. +[Start at top of the new stack here: SQL, mention of distributed SQL, +then talk about decomponsition into KV commands, then to monolithic +sorted map, etc. Use diagram] + Cockroach implements a **single, monolithic sorted map** from key to value where both keys and values are byte strings (not unicode). Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of @@ -29,6 +36,7 @@ in disparate datacenters for survivability (e.g. `{ US-East, US-West, Japan }`, `{ Ireland, US-East, US-West}`, `{ Ireland, US-East, US-West, Japan, Australia }`). +[Batches, update?] Single mutations to ranges are mediated via an instance of a distributed consensus algorithm to ensure consistency. We’ve chosen to use the [Raft consensus algorithm](https://raftconsensus.github.io); all consensus @@ -40,6 +48,7 @@ logical mutation fall within the same range, atomicity and consistency are guaranteed by Raft; this is the **fast commit path**. Otherwise, a **non-locking distributed commit** protocol is employed between affected ranges. +[Link to blog post] Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and serializable snapshot isolation (SSI) semantics, allowing **externally @@ -55,6 +64,7 @@ providing ordering for any observer or chain of observers. Similar to [Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf) directories, Cockroach allows configuration of arbitrary zones of data. +[locality info, GC prefs, etc...] This allows replication factor, storage device type, and/or datacenter location to be chosen to optimize performance and/or availability. Unlike Spanner, zones are monolithic and don’t allow movement of fine @@ -64,8 +74,10 @@ grained data on the level of entity groups. Cockroach implements a layered architecture. The highest level of abstraction is the SQL layer (currently unspecified in this document). +[FIX] It depends directly on the [*structured data API*](#structured-data-api), which provides familiar relational concepts +[Blend these structured data concepts into the SQL description] such as schemas, tables, columns, and indexes. The structured data API in turn depends on the [distributed key value store](#key-value-api), which handles the details of range addressing to provide the abstraction @@ -73,6 +85,14 @@ of a single, monolithic key value store. The distributed KV store communicates with any number of physical cockroach nodes. Each node contains one or more stores, one per physical device. +![SQL] + + + +![Distributed SQL] + + + ![Architecture](media/architecture.png) Each store contains potentially many ranges, the lowest-level unit of @@ -83,6 +103,7 @@ raft. The color coding shows associated range replicas. ![Ranges](media/ranges.png) +[Update RoachNode concept] Each physical node exports a RoachNode service. Each RoachNode exports one or more key ranges. RoachNodes are symmetric. Each has the same binary and assumes identical roles. @@ -101,12 +122,14 @@ Up to `F` failures can be tolerated, where the total number of replicas `N = 2F # Cockroach Client +[Woah, big update required here!] In order to support diverse client usage, Cockroach clients connect to any node via HTTPS using protocol buffers or JSON. The connected node proxies involved client work including key lookups and write buffering. # Keys +[Update referencing concepts in keys/constants.go required here] Cockroach keys are arbitrary byte arrays. If textual data is used in keys, utf8 encoding is recommended (this helps for cleaner display of values in debugging tools). User-supplied keys are encoded using an @@ -124,7 +147,8 @@ time to return the most recent writes prior to the snapshot timestamp. Older versions of values are garbage collected by the system during compaction according to a user-specified expiration interval. In order to support long-running scans (e.g. for MapReduce), all versions have a -minimum expiration. +minimum expiration of 1 day, though arbitrary key ranges can set the +GC TTL value as necessary. Versioned values are supported via modifications to RocksDB to record commit timestamps and GC expirations per key. @@ -137,6 +161,30 @@ the low water mark of the cache appropriately. If a new range lease holder is elected, it sets the low water mark for the cache to the current wall time + ε (ε = 99th percentile clock skew). +**Time in CockroachDB** + +Each cockroach node maintains a hybrid logical clock (HLC) as discussed +in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf). +HLC time uses timestamps which are composed of a physical component (thought of +as and always close to local wall time) and a logical component (used to +distinguish between events with the same physical component). HLC allows +tracking causality for related events, similarly to vector clocks, but with less +overhead. In practice, it works much like other logical clocks: When events +are received by a node, the event's timestamp (provided by the sender) is +used to update the receiving node's HLC clock. + +For a more in depth description of HLC please read the paper. Our +implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go). + +Cockroach picks a Timestamp for a transaction using HLC time. Throughout this +document, *timestamp* always refers to the HLC time which is a singleton +on each node. The HLC is updated by every read/write event on the node, and +the HLC time >= wall time. A read/write timestamp received in a cockroach request +from another node is not only used to version the operation, but also updates +the HLC on the node. This is useful in guaranteeing that all data read/written +on a node is at a timestamp < next HLC time. + +[Matt to review following two sections] # Lock-Free Distributed Transactions Cockroach provides distributed transactions without locks. Cockroach @@ -181,30 +229,6 @@ never be decreased. The core difference between the two isolation levels SI and SSI is that the former allows the transaction's candidate timestamp to increase and the latter does not. -**Hybrid Logical Clock** - -Each cockroach node maintains a hybrid logical clock (HLC) as discussed -in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf). -HLC time uses timestamps which are composed of a physical component (thought of -as and always close to local wall time) and a logical component (used to -distinguish between events with the same physical component). It allows us to -track causality for related events similar to vector clocks, but with less -overhead. In practice, it works much like other logical clocks: When events -are received by a node, it informs the local HLC about the timestamp supplied -with the event by the sender, and when events are sent a timestamp generated by -the local HLC is attached. - -For a more in depth description of HLC please read the paper. Our -implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go). - -Cockroach picks a Timestamp for a transaction using HLC time. Throughout this -document, *timestamp* always refers to the HLC time which is a singleton -on each node. The HLC is updated by every read/write event on the node, and -the HLC time >= wall time. A read/write timestamp received in a cockroach request -from another node is not only used to version the operation, but also updates -the HLC on the node. This is useful in guaranteeing that all data read/written -on a node is at a timestamp < next HLC time. - **Transaction execution flow** Transactions are executed in two phases: @@ -377,6 +401,7 @@ An exploration of retries with contention and abort times with abandoned transaction is [here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing). +[Update (conceptually) s/table/records/g] **Transaction Table** Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`. @@ -418,6 +443,7 @@ Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/ma two phase locking. Aborts and retries increase read and write traffic, increase latency and decrease throughput. +[refurbish next two sections] **Choosing a Timestamp** A key challenge of reading data in a distributed system with clock skew @@ -568,10 +594,13 @@ T2 (T2start) after the completion of transaction T1 (T1end), will have commit timestamps such thats1 \< s2. +[Link to "Living Without Atomic Clocks" blog post.] + # Logical Map Content +[Update the system info] Logically, the map contains a series of reserved system key / value -pairs covering accounting, range metadata and node accounting +pairs covering accounting, range metadata and node accounting before the actual key / value pairs for non-system data (e.g. the actual meat of the map). @@ -609,6 +638,7 @@ for further details. # Node Storage +[Updates necessary below] Nodes maintain a separate instance of RocksDB for each disk. Each RocksDB instance hosts any number of ranges. RPCs arriving at a RoachNode are multiplexed based on the disk name to the appropriate @@ -633,6 +663,7 @@ A really good reference on tuning Linux installations with RocksDB is # Range Metadata +[Peter: did you actually have numbers for the size of meta2 records?] The default approximate size of a range is 64M (2\^26 B). In order to support 1P (2\^50 B) of logical data, metadata is needed for roughly 2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata @@ -660,6 +691,7 @@ is sparse, the successor key is defined as the next key which is present. The found using the same process. The *meta2* record identifies the range containing `key1`, which is again found the same way (see examples below). +[Updates required below] Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null characters provide for the desired sorting behaviour. Thus, `key1`'s *meta1* record will reside at the successor key to `\0\0\meta1`. @@ -671,6 +703,7 @@ key *after* the meta indexing record we’re looking for, which would result in having to back the iterator up, an option which is both less efficient and not available in all cases. +[Review and cleanup required below?] The following example shows the directory structure for a map with three ranges worth of data. Ellipses indicate additional key/value pairs to fill an entire range of data. Except for the fact that splitting ranges @@ -776,6 +809,7 @@ client evicts the stale entries and possibly does a new lookup. # Raft - Consistency of Range Replicas +[Raft experts review this nonsense below] Each range is configured to consist of three or more replicas, as specified by their ZoneConfig. The replicas in a range maintain their own instance of a distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io) @@ -805,6 +839,7 @@ Future optimizations may include two-phase elections and quiescent ranges # Range Leases +[Cleanup, add section on epoch-based range leases and motivation & link to range lease RFC] As outlined in the Raft section, the replicas of a Range are organized as a Raft group and execute commands from their shared commit log. Going through Raft is an expensive operation though, and there are tasks which should only be @@ -840,6 +875,7 @@ offset. ## Relationship to Raft leadership +[Is this all still accurate?] The range lease is completely separate from Raft leadership, and so without further efforts, Raft leadership and the Range lease may not be represented by the same replica most of the time. This is convenient semantically since it decouples @@ -859,6 +895,8 @@ transitions. ## Command Execution Flow +[background on motivation here necessary. Did we already explain why you +need the command queue? Is it well explained below?] This subsection describes how a lease holder replica processes a read/write command in more details. Each command specifies (1) a key (or a range of keys) that the command accesses and (2) the ID of a range which the @@ -903,6 +941,7 @@ expired, the command will be rejected by the replica. # Splitting / Merging Ranges +[Accurate still? Needs review.] RoachNodes split or merge ranges based on whether they exceed maximum or minimum thresholds for capacity or load. Ranges exceeding maximums for either capacity or load are split; ranges below minimums for *both* @@ -980,8 +1019,11 @@ spare capacity is chosen in the same datacenter and a special-case split is done which simply duplicates the data 1:1 and resets the range configuration metadata. +[NEED a section on snapshots] + # Range-Spanning Binary Tree +[Remove this section I think.] A crucial enhancement to the organization of range metadata is to augment the bi-level range metadata lookup with a minimum spanning tree, implemented as a left-leaning red-black tree over all ranges in the map. @@ -1032,6 +1074,9 @@ tasks. # Node Allocation (via Gossip) +[This section is a damn mess. Need to at least make it accurate for +current code base. Ultimately need to incorporate the ideas behind +Tristan's changes.] New nodes must be allocated when a range is split. Instead of requiring every RoachNode to know about the status of all or even a large number of peer nodes --or-- alternatively requiring a specialized curator or @@ -1091,6 +1136,8 @@ The gossip protocol itself contains two primary components: # Node Accounting +[Replace with description of MVCC, store-level aggregation, recorder, +and time series merging] The gossip protocol discussed in the previous section is useful to quickly communicate fragments of important information in a decentralized manner. However, complete accounting for each node is also @@ -1111,6 +1158,9 @@ fanout)) + 1`. # Key-prefix Accounting and Zones +[Key prefix accounting no longer a reality; nix all references. Zone +configs might already be mentioned enough elsewhere. Probably delete +section -- needs review.] Arbitrarily fine-grained accounting is specified via key prefixes. Key prefixes can overlap, as is necessary for capturing hierarchical relationships. For illustrative purposes, let’s say keys @@ -1225,6 +1275,7 @@ it discovers differences, it reconfigures ranges in the same way that it rebalances away from busy nodes, via special-case 1:1 split to a duplicate range comprising the new configuration. +[Rest of this document is tossable] # Key-Value API see the protobufs in [roachpb/](https://github.com/cockroachdb/cockroach/blob/master/roachpb), @@ -1305,6 +1356,7 @@ write-optimized (HBase, Cassandra, SQLite3/LSM, CockroachDB). ## Architecture +[Keep this diagram; update if necessary] CockroachDB implements a layered architecture, with various subdirectories implementing layers as appropriate. The highest level of abstraction is the [SQL layer][5], which depends @@ -1331,6 +1383,7 @@ replicas. ## Client Architecture +[Update this diagram] RoachNodes serve client traffic using a fully-featured SQL API which accepts requests as either application/x-protobuf or application/json. Client implementations consist of an HTTP sender (transport) and a transactional sender which implements a simple