Design doc updates #9646
@@ -209,27 +209,27 @@ on a node is at a timestamp < next HLC time.

 Transactions are executed in two phases:

-1. Start the transaction by writing a new entry to the system
-transaction table (keys prefixed by *\0tx*) with state “PENDING”. In
+1. Start the transaction by selecting a range which is likely to be
+heavily involved in the transaction and writing a new transaction
+record to a reserved area of that range with state "PENDING". In
 parallel write an "intent" value for each datum being written as part
 of the transaction. These are normal MVCC values, with the addition of
 a special flag (i.e. “intent”) indicating that the value may be
 committed after the transaction itself commits. In addition,
 the transaction id (unique and chosen at tx start time by client)
-is stored with intent values. The tx id is used to refer to the
-transaction table when there are conflicts and to make
+is stored with intent values. The txn id is used to refer to the
+transaction record when there are conflicts and to make
 tie-breaking decisions on ordering between identical timestamps.
 Each node returns the timestamp used for the write (which is the
 original candidate timestamp in the absence of read/write conflicts);
 the client selects the maximum from amongst all write timestamps as the
 final commit timestamp.

-2. Commit the transaction by updating its entry in the system
-transaction table (keys prefixed by *\0tx*). The value of the
-commit entry contains the candidate timestamp (increased as
-necessary to accommodate any latest read timestamps). Note that
-the transaction is considered fully committed at this point and
-control may be returned to the client.
+2. Commit the transaction by updating its transaction record. The value
+of the commit entry contains the candidate timestamp (increased as
+necessary to accommodate any latest read timestamps). Note that the
+transaction is considered fully committed at this point and control
+may be returned to the client.

 In the case of an SI transaction, a commit timestamp which was
 increased to accommodate concurrent readers is perfectly
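The two-phase flow described by the revised steps 1 and 2 can be illustrated with a small Python sketch. This is not CockroachDB code: `Range`, `TxnRecord`, `run_txn`, and the `route` dictionary mapping keys to ranges are hypothetical names, the caller simply passes in the anchor range that holds the transaction record, intents are written one after another rather than in parallel, and pushing a conflicting write one tick past the latest read is an assumed rule.

```python
from dataclasses import dataclass, field
from enum import Enum


class TxnStatus(Enum):
    PENDING = "PENDING"
    COMMITTED = "COMMITTED"


@dataclass
class TxnRecord:
    txn_id: str
    status: TxnStatus
    timestamp: int


@dataclass
class Range:
    """Hypothetical in-memory stand-in for a range."""
    records: dict = field(default_factory=dict)      # txn_id -> TxnRecord
    intents: dict = field(default_factory=dict)      # key -> (timestamp, value, txn_id)
    latest_read: dict = field(default_factory=dict)  # key -> latest read timestamp

    def write_intent(self, key, value, txn_id, candidate_ts):
        # Without a read/write conflict the candidate timestamp is used as is.
        # Assumed rule: a later read pushes the write just past that read.
        latest = self.latest_read.get(key)
        ts = candidate_ts if latest is None or latest < candidate_ts else latest + 1
        self.intents[key] = (ts, value, txn_id)
        return ts


def run_txn(txn_id, writes, anchor, route, candidate_ts):
    # Phase 1: write the PENDING record on the anchor range, then the intents.
    anchor.records[txn_id] = TxnRecord(txn_id, TxnStatus.PENDING, candidate_ts)
    write_ts = [route[key].write_intent(key, value, txn_id, candidate_ts)
                for key, value in writes.items()]
    # The client takes the maximum write timestamp as the commit timestamp.
    commit_ts = max([candidate_ts] + write_ts)
    # Phase 2: a single update of the transaction record commits the txn.
    anchor.records[txn_id] = TxnRecord(txn_id, TxnStatus.COMMITTED, commit_ts)
    return commit_ts


r1, r2 = Range(), Range()
r2.latest_read["b"] = 12  # a reader already saw "b" at timestamp 12
commit_ts = run_txn("txn-1", {"a": 1, "b": 2}, anchor=r1,
                    route={"a": r1, "b": r2}, candidate_ts=10)
print(commit_ts, r1.records["txn-1"].status.name)
```

In this run the write to `b` is pushed from 10 to 13 by the earlier read at 12, so the commit timestamp recorded in the transaction record is 13. Only the record update in phase 2 is needed to commit; the intents are left as they are.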
@@ -267,7 +267,7 @@ encounter data that necessitate conflict resolution

 When a transaction restarts, it changes its priority and/or moves its
 timestamp forward depending on data tied to the conflict, and
-begins anew reusing the same tx id. The prior run of the transaction might
+begins anew reusing the same txn id. The prior run of the transaction might
 have written some write intents, which need to be deleted before the
 transaction commits, so as to not be included as part of the transaction.
Review comment: Not correct, epochs take care of this.
Review comment: The whole section from here down is bogus.
 These stale write intent deletions are done during the reexecution of the
@@ -281,12 +281,12 @@ a NOOP.
 ***Transaction abort:***

 This is the case in which a transaction, upon reading its transaction
-table entry, finds that it has been aborted. In this case, the
-transaction can not reuse its intents; it returns control to the client
-before cleaning them up (other readers and writers would clean up
-dangling intents as they encounter them) but will make an effort to
-clean up after itself. The next attempt (if applicable) then runs as a
-new transaction with **a new tx id**.
+record, finds that it has been aborted. In this case, the transaction
Review comment: s/upon/usually upon/ (because abort span, etc)
+can not reuse its intents; it returns control to the client before
Review comment: It never really reuses intents. The intents' primary role is that of preventing starvation by forcing conflicts.
+cleaning them up (other readers and writers would clean up dangling
+intents as they encounter them) but will make an effort to clean up
+after itself. The next attempt (if applicable) then runs as a new
+transaction with **a new txn id**.

 ***Transaction interactions:***

@@ -313,7 +313,7 @@ There are several scenarios in which transactions interact:
 stamp" below.

 - **Reader encounters write intent with older timestamp**: the reader
-must follow the intent’s transaction id to the transaction table.
+must follow the intent’s transaction id to the transaction record.
 If the transaction has already been committed, then the reader can
 just read the value. If the write transaction has not yet been
 committed, then the reader has two options. If the write conflict
@@ -363,9 +363,9 @@ are upgraded to committed. In the event a transaction is aborted, all written
 intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

 In the event the client proxy restarts before the pending transaction is
-committed, the dangling transaction would continue to live in the
-transaction table until aborted by another transaction. Transactions
-heartbeat the transaction table every five seconds by default.
+committed, the dangling transaction would continue to "live" until
+aborted by another transaction. Transactions periodically heartbeat
+their transaction record to maintain liveness.
 Transactions encountered by readers or writers with dangling intents
 which haven’t been heartbeat within the required interval are aborted.
 In the event the proxy restarts after a transaction commits but before
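The heartbeat rule in this hunk can be sketched in Python as follows. The names `TxnRecord`, `heartbeat`, and `maybe_abort_abandoned` are hypothetical, time is modelled as plain seconds passed in by the caller, and the required interval is a parameter because the revised text only says "periodically" (the removed text mentioned five seconds by default).

```python
from dataclasses import dataclass


@dataclass
class TxnRecord:
    txn_id: str
    status: str            # "PENDING", "COMMITTED", or "ABORTED"
    last_heartbeat: float  # seconds, on whatever clock the caller uses


def heartbeat(record, now):
    # The coordinator refreshes its record only while the txn is pending.
    if record.status == "PENDING":
        record.last_heartbeat = now


def maybe_abort_abandoned(record, now, required_interval):
    """Called by a reader or writer that ran into one of the txn's intents."""
    expired = now - record.last_heartbeat > required_interval
    if record.status == "PENDING" and expired:
        record.status = "ABORTED"
    return record.status


rec = TxnRecord("txn-1", "PENDING", last_heartbeat=0.0)
heartbeat(rec, now=4.0)
still_alive = maybe_abort_abandoned(rec, now=8.0, required_interval=5.0)
abandoned = maybe_abort_abandoned(rec, now=10.0, required_interval=5.0)
print(still_alive, abandoned)
```

A committed record is never touched by `maybe_abort_abandoned`, which matches the text: only pending transactions that stopped heartbeating are aborted by whoever encounters their dangling intents.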
@@ -377,7 +377,7 @@ An exploration of retries with contention and abort times with abandoned
 transaction is
 [here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

-**Transaction Table**
+**Transaction Records**

 Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

@@ -389,7 +389,7 @@ Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/ma
 abort.
 - Lower latency than traditional 2PC commit protocol (w/o contention)
 because second phase requires only a single write to the
-transaction table instead of a synchronous round to all
+transaction record instead of a synchronous round to all
 transaction participants.
 - Priorities avoid starvation for arbitrarily long transactions and
 always pick a winner from between contending transactions (no
@@ -570,41 +570,21 @@ timestamps such that s<sub>1</sub> \< s<sub>2</sub>.

 # Logical Map Content

-Logically, the map contains a series of reserved system key / value
-pairs covering accounting, range metadata and node accounting
-before the actual key / value pairs for non-system data
-(e.g. the actual meat of the map).
+Logically, the map contains a series of reserved system key/value
+pairs preceding the actual user data (which is managed by the SQL
+subsystem).

-- `\0\0meta1` Range metadata for location of `\0\0meta2`.
-- `\0\0meta1<key1>` Range metadata for location of `\0\0meta2<key1>`.
+- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This a "meta1" key.
 - ...
-- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
-- `\0\0meta2`: Range metadata for location of first non-range metadata key.
-- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
+- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This a "meta1" key.
+- `\x03<key1>`: Range metadata for range ending `<key1>`. This a "meta2" key.
 - ...
-- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
-- `\0acct<key0>`: Accounting for key prefix key0.
-- ...
-- `\0acct<keyN>`: Accounting for key prefix keyN.
-- `\0node<node-address0>`: Accounting data for node 0.
-- ...
-- `\0node<node-addressN>`: Accounting data for node N.
-- `\0tx<tx-id0>`: Transaction record for transaction 0.
-- ...
-- `\0tx<tx-idN>`: Transaction record for transaction N.
-- `\0zone<key0>`: Zone information for key prefix key0.
-- ...
-- `\0zone<keyN>`: Zone information for key prefix keyN.
-- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
-- ...
-- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
-- `<key0>`: `<value0>` The first user data key.**
-- ...
-- `<keyN>`: `<valueN>` The last user data key.**
-
-There are some additional system entries sprinkled amongst the
-non-system keys. See the Key-Prefix Accounting section in this document
-for further details.
+- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This a "meta2" key.
+- `\x04{desc,node,range,store}-idegen`: ID generation oracles for various component types.
+- `\x04status-node-<varint encoded Store ID>`: Store runtime metadata.
+- `\x04tsd<key>`: Time-series data key.
Review comment: Where did zones go? Are they in one of the magic tables?
+- `<key>`: A user key. In practice, these keys are managed by the SQL
+subsystem, which employs its own key anatomy.

 # Node Storage

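The prefix scheme in the revised key list can be illustrated with a short Python sketch. The helper `meta_key` is a hypothetical name, not a CockroachDB function, and the sketch assumes that user keys begin with a byte greater than `\x04` so that plain byte-wise ordering puts the reserved entries first.

```python
META1_PREFIX = b"\x02"
META2_PREFIX = b"\x03"
SYSTEM_PREFIX = b"\x04"


def meta_key(end_key: bytes) -> bytes:
    """Return the key of the metadata entry for the range ending at end_key."""
    if end_key.startswith(META1_PREFIX):
        # meta1 is the top level of the index; nothing addresses it.
        raise ValueError("meta1 keys have no metadata key above them")
    if end_key.startswith(META2_PREFIX):
        # A meta2 range ending at \x03<key> is described by \x02<key>.
        return META1_PREFIX + end_key[1:]
    # A range ending at <key> is described by the meta2 entry \x03<key>.
    return META2_PREFIX + end_key


meta2 = meta_key(b"m")   # entry for the range ending at "m"
meta1 = meta_key(meta2)  # entry for the meta2 range ending at \x03m
keys = [b"user/a", SYSTEM_PREFIX + b"tsd" + b"cpu", meta2, meta1]
ordered = sorted(keys)   # bytes compare lexicographically
print(ordered)
```

Sorting places the meta1 entry first, then the meta2 entry, then the `\x04` system key, and the user key last, which is the ordering the list above relies on.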
Review comment: s/is likely to be heavily involved in the transaction/involved in the first write of the transaction/
Review comment: Why? Isn't this what we just discussed as being the more-likely-to-rot variant?
Review comment: Yeah, agree that this is too specific.