Title: DEP-0008: Multi-Writer
Short Name: 0008-multiwriter
Type: Standard
Status: Draft (as of 2018-07-06)
Github PR: Draft
Authors: Bryan Newbold, Stephen Whitmore, Mathias Buus
Multi-Writer is a set of schema, API, and feature extensions to allow multiple agents (users, devices, or software) to write to the same hyperdb database. By building on top of this abstraction layer, future versions of hyperdrive and Dat will gain these features.
Mechanisms for distributed consistency and granting trust are specified here; the need for merge conflict algorithms and secure key distribution are mentioned but specific solutions are not specified.
This DEP forms the second half of the hyperdb specification; the first half covered only the key/value database aspects of hyperdb.
The current hypercore/Dat ecosystem currently lacks solutions for two fundamental use cases:
- individual users should be able to modify distributed archives under their control from multiple devices, at a minimum to prevent loss of control of content if a single device (containing secret keys) is lost
- contributions from and collaboration between multiple users on a single archive or database should be possible, with appropriate trust and access control semantics
Access to a single secret key is currently required to make any change to a hypercore feed, and it is broadly considered best practice not to distribute secret keys between multiple users or multiple devices. In fact, the current hypercore implementation has no mechanism to resolve disputes or recover if multiple agents used the same secret key to append to the same feed.
Solutions to these two use cases are seen as essential for many current and future Dat ecosystem applications.
The multi-writer features of hyperdb are implemented by storing and replicating the contributions of each writer in a separate hypercore feed. This specification concerns itself with the details of how changes from multiple feeds (which may be written and replicated concurrently or asynchronously) are securely combined to present a unified key/value interface.
The following related concerns are explicitly left to application developers to design and implement:
- secure key distribution and authentication (eg, if a friend should be given write access to a hyperdb database, how is that friend's feed key found and verified?)
- merge conflict resolution (using the provided API), potentially using application-layer semantics
Before we go any further, a few definitions:
Feed: A hypercore feed: an append-only log of Entries, which can be arbitrary data blobs.
Database: in this context, a Hyperdb key/value database. Built from several Feeds (two Feeds per Writer).
Writer: a user (or user controlled device or software agent) that has a distinct feed with a public/private key pair, and thus the ability to append hypercore entries and "write" changes to their version of the database.
Original Writer: the writer who created a given hyperdb database in the form of the Original Feed. The public key of the original feed is the one used to reference the database as a collection of feeds (eg, for the purpose of discovery).
At a high level, multi-writer hyperdb works by having existing authorized writers (starting with the original writer) include authorization of new writers by appending metadata to their own feed which points to the new feeds (by public key). Each entry in each writer's feed contains "clock" metadata that records the known state of the entire database (all writers) seen from the perspective of that writer at the time they created the entry, in the form of "clock" version pointers. This metadata (a "vector clock") can be used by other writers to resolve (or at least identify) conflicting content in the database. The technical term for this type of system is a "Conflict-free replicated data type" (CRDT), and specifically an "Operation-based" (as opposed to "State-based") CRDT.
A "node" is a data structure representing a database entry, including the
key
, value
, and feed that the entry is committed to.
db.get(key)
(as described in the hyperdb DEP)
returns an array of nodes. If it is unambiguous what the consistent state of a
key is, the array will have only that one value. If there is a conflict
(ambiguity), multiple nodes will be returned. If a key has unambiguously been
removed from the database, a "null" or empty datatype is returned. If one
branch of a conflict has a deletion (but at least one of the others does not),
a node with the deleted
flag will be returned; note that such "tombstone"
nodes can still have a value
field, which may contain application-specific
metadata (such as a self-reported timestamp), which may help resolve the
conflict.
If multiple nodes are returned from a get
, the writer should attempt to merge
the values (or chose one over the other) and write the result to the database
with db.put(key, value)
. Client libraries can make this process more
ergonomic by accepting a helper function to select between multiple nodes.
Libraries can also offer an option to either directly return the value of a
single node (instead of the node itself), or raise an error; this is likely to
be more ergonomic for applications which do not intend to support multiple
writers per database.
db.authorize(key)
will write metadata to the local feed authorizing the new
feed (corresponding to key
) to be included in the database. Once authorized,
a feed may further authorize additional feeds (recursively).
db.authorized(key)
(returning a boolean) indicates whether the given key
is
an authorized writer to the hyperdb database.
At the time of this DEP there is no mechanism for revoking authorization.
There is some overhead associated with each "writer" added to the feed, impacting the number of files on disk, memory use, and the computational cost of some lookup operations. The design should easily accommodate dozens of writers, and should scale to 1,000 writers without too much additional overhead. Note that a large number of writers also implies a larger number and rate of append operations, and additional network connections, which may cause scaling issues on their own. More real-world experience and benchmarking is needed in this area.
The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" message types (as specified in the hyperdb DEP) are:
message Entry {
required string key = 1;
optional bytes value = 2;
optional bool deleted = 3;
required bytes trie = 4;
repeated uint64 clock = 5;
optional uint64 inflate = 6;
}
message InflatedEntry {
message Feed {
required bytes key = 1;
}
required string key = 1;
optional bytes value = 2;
optional bool deleted = 3;
required bytes trie = 4;
repeated uint64 clock = 5;
optional uint64 inflate = 6;
repeated Feed feeds = 7;
optional bytes contentFeed = 8;
}
The fields of interest for multi-writer are:
clock
: a "vector clock" to record observed state at the time of writing the entry. Included in every Entry and InflatedEntry.inflate
: a back-pointer to the entry index of the most recent InflatedEntry (containing a feed metadata change). Included in every Entry and InflatedEntry. Should not be included for the very first entry in a feed (which is an InflatedEntry).feeds
: list of public keys for each writer's feed. Only included in InflatedEntry, and only when feeds have changed. Does include a self-reference to the current (local) Feed's key, always as the first element.
When serialized on disk in a SLEEP directory:
source/
: the original feed, as created or cloned by this writerlocal/
: if "local" feed is different from "source", it goes herepeers/<discover-key>/
: all other writers' feed go under this directory (the discovery key is lower-case hex-encoded)content/<discovery-key>/
: if a higher-level protocol is being used that uses multiple linked hypercore feeds (eg, hyperdrive), the linked "content" feeds all go under this directory
At any point in time, each writer has a potentially unique view of the "current" state of the database as a whole; this is the nature of real-world distributed systems. For example, a given write might have the most recent appends from one peer (eg, only seconds old), but be missing much older appends from another (eg, days or weeks out of date). By having each writer include metadata about their perceived state of the system as a whole in operations to their Feed, all writers are able to collectively converge on an "eventually consistent" picture of the database as whole (this process will be described in the next section).
A writer's "current known state" of the database consists of the set of active Feeds, and for each the most recent entry sequence number ("version"). This state can be serialized as an array of integers, referred to as a vector clock.
Each put()
operation on the database appends a node to the writer's local
feed, and contains the writer's vector clock as observed at that time.
InflatedEntry
nodes also contain a list of all known authorized Feeds;
inflated nodes only need to be written when the Feed list changes. Every
non-inflated entry contains a pointer back to the most recent inflated entry;
inflated entries themselves contain a pointer back to the previous inflated
entry (the first inflated entry has a null pointer). Elements of a vector clock
are ordered by the Feed list from the corresponding Inflated entry.
By convention, the order of Feed lists is to start with the writer's local feed first, then proceed by the order in which Feeds were discovered. Note that this ordering is not consistent across writers, only within the same feed.
As an example, if a node (non-inflated entry) had a vector clock of [1, 3, 6]
, that would mean:
- when this node was written, the largest seq # in the writer's local fed was 1
- when this node was written, the largest seq # in the second known feed was 3
- when this node was written, the largest seq # in the third known feed was 6
The hyperdb DEP specifies the procedures for lookup (get()
)
and append (put()
) operations to the database, as well as binary encoding
schemes for entry messages.
Note that the trie encoding specifies pointers in a (feed, entry)
pair
format. The feed
integer is an index into the most recent Feed list (found in
the most recent inflated entry; see the last section). When working with a
multi-writer hyperdb database, simply look up entries in the appropriate feed,
instead of only looking in the current feed. The next section ("Consistent
History") describes which entry (or entries) to start with instead of simply
assuming the most recent entry from the local feed.
The set of all appended nodes in all feeds of a hyperdb, and all the vector clock pointers between them, forms a "directed acyclic graph" (DAG). Any node which does not have another node pointing to it is called a "head" (this terminology is similar to that used in git). At any point in time, an observed copy of a database has one or more heads, each representing the top of a tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is always only a single head: the most recent entry in the local feed. Just after appending to the local feed, there is also always a single head, because that node's vector clock will reference all know most recent entries from other feeds. It is only when nodes are appended by separate writers who did not know of the others' concurrent action (and then these changes are replicated) that there are multiple heads.
When operating on the database (eg, executing a get()
operation), all heads
must be considered. The lookup procedure documented in the hyperdb
DEP must be repeated for each head, and nodes returned
representing the set of all unique values.
The situation where a get()
operation multiple heads returns different values
for the same key is called a "conflict" and requires a "merge" to resolve. Some
writer (either a human being or application-layer code) must decide on the
correct value for the key and write that value as a new entry (with a vector
clock that includes the previous heads). The procedure for choosing the best
value to use in a given conflict is sometimes easy to determine, but is
impossible to determine algorithmically in the general case. See the "Usage"
section for more details.
Every hyperdb hypercore feed starts with a protocol header (of type
HypercoreHeader
) at entry index 0, as specified in DEP-0007. Feed entry
indexes (aka, sequence numbers), are zero-indexed. With the header message at
index 0, hyperdb entries start at index 1.
Let's say Alice starts a new hyperdb and writes two key/value entries to it:
// Operations
Alice: db.put('/foo/bar', 'baz')
Alice: db.put('/foo/2', '{"json":3}')
// Alice's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/foo/bar', value='baz',
vector_clock=[1], inflated=null, feeds=['a11ce...']) (InflatedEntry)
2 (key='/foo/2', value='{"json":3}',
vector_clock=[1], inflated=0)
// Graph
Alice: 1 <--- 2
The vector clock at seq=2
points back to seq=1
.
Next Alice authorizes Bob to write to the database. Internally, this means Alice writes an Inflated entry to her feed that contains Bob's Feed (identified by his public key) in her feed list.
// Operations
Alice: db.authorize('b0b123...')
// Alice's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/foo/bar', value='baz',
vector_clock=[1], inflated=null, feeds=['a11ce...']) (InflatedEntry)
2 (key='/foo/2', value='{"json":3}',
vector_clock=[1], inflated=1) (Entry)
3 (key=null, value=null,
vector_clock=[2], inflated=1, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)
// Graph
Alice: 1 <--- 2 <--- 3
Bob writes a value to his feed, and then Alice and Bob sync. The result is:
// Operations
Bob: db.put('/a/b', '12)
// Alice's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/foo/bar', value='baz',
vector_clock=[1], inflated=null, feeds=['a11ce...']) (InflatedEntry)
2 (key='/foo/2', value='{"json":3}',
vector_clock=[1], inflated=1) (Entry)
3 (key=null, value=null,
vector_clock=[2], inflated=1, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)
// Bob's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/a/b', value='12',
vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry))
// Graph
Alice: 1 <--- 2 <--- 3
Bob : 1
Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has
written any entries to their feeds since the two became aware of each other.
Right now there are two "heads" of the graph: Alice's feed at seq 3, and Bob's
feed at seq 1. Any get()
operations would need to descend from both heads,
though in this situation there would be no conflicts as the keys in the two
feeds are disjoint.
Next, Alice writes a new value, and her latest entry will refer to Bob's:
// Operations
Alice: db.put('/foo/hup', 'beep')
// Alice's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/foo/bar', value='baz',
vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry)
2 (key='/foo/2', value='{"json":3}',
vector_clock=[0], inflated=0) (Entry)
3 (key=null, value=null,
vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry)
4 (key='/foo/hup', value='beep',
vector_clock=[2,0], inflated=2) (Entry)
// Bob's Feed
0 (HypercoreHeader, protocol=hyperdb)
1 (key='/a/b', value='12',
vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry))
// Graph
Alice: 1 <--- 2 <--- 3 <--/ 4
Bob : 1 <-------------------/
Alice's latest feed entry now points to Bob's latest feed entry, and there is
only one "head" in the database. This means that any get()
operations only
need to run once, starting at seq=4
in Alice's feed.
As noted above, there is no existing mechanism for removing authorization for a feed once added, and an authorized feed may recursively authorize additional feeds. There is also no mechanism to restrict the scope of an authorized feed's actions (eg, limit to only a specific path prefix). This leaves application designers and users with few tools to control trust or access ("all or nothing"). Care must be taken in particular if self-mutating software is being distributed via hyperdb, or when action may be taken automatically based on the most recent content of a database (eg, bots or even third-party tools may publish publicly, or even take real-world action like controlling an electrical relay).
There is no mechanism to remove malicious history (or any history for that matter); if an authorized (but hostile) writer appends a huge number of key operations (bloating hyperdb metadata size), or posts offensive or illegal content to a database, there is no way to permanently remove the data without creating an new database.
The read semantics of hyperdb are unchanged from hypercore: an actor does not need to be "authorized" (for writing) to read the full history of a database, they only need the public key.
As noted in other DEPs, a malicious writer can potentially execute a denial of service (DoS) attack by appending hyperdb entries that for a cyclic loop of references.
Mutli-writer capability incurs a non-trivial increase in library, application,
and user experience complexity. For many applications, collaboration is an
essential feature, and the complexity is easily justified. To minimize
complexity for applications which do not need multi-writer features,
implementation authors should consider configuration modes which hide the
complexity of unused features. For example, by having an option to returning a
single node for a get()
(and throw an error if there is a conflict), or a
flag to throw an error if a database unexpectedly contains more than a single
feed.
Two tasks (conflict merges and secure key distribution) are left to application developers. Both of these are Hard Problems. The current design mitigates the former by reducing the number of merge conflicts that need to be handled by an application (aka, only the non-trivial ones need to be handled), and implementation authors are encouraged to provide an ergonomic API for writing conflict resolvers. The second problem (secure key distribution) is out of scope for this DEP. It is hoped that at least one pattern or library will emerge from the Dat ecosystem such that each application author doesn't need to invent a solution from scratch.
Design goals for hyperdb (including the multi-writer feature) included:
- ability to execute operations (get, put) with a sparse (partial) replication of the database, using as few additional network requests as possible
- minimal on-disk and on-wire overhead
- implemented on top of an append-only log (to build on top of hypercore)
If a solution for core use cases like collaboration and multi-device synchronization is not provided at a low level (as this DEP provides), each application will need to invent a solution at a higher level, incurring duplicated effort and a higher risk of bugs.
As an alternative to CRDTs, Operational Transformation (OT) has a reputation for being more difficult to understand and implement.
What is the actual on-disk layout (folder structure), if not what is documented here?
The local feed's sequence number could skipped from vector clocks, because it's implied by the sequence number of the hypercore entry itself. Same with the key in the feed list (for inflated entries). In both cases, the redundant data is retained for simplicity.
If there are multiple heads, but they all return the same value
for a get()
operation, how is it decided which node
will be returned? AKA, values are the
same, but node metadata might not be (order of vector clock, etc).
Suspect that some details are off in the example: shouldn't the InflatedEntry authorizing a new feed include a vector clock reference to a specific seq in that feed? Should new local (not yet authorized) feeds reference their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first InflatedEntry in a feed point to itself in it's vector clock?
As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb nodejs module on github, which includes multi-writer features and is the basis for this DEP.
Jim Pick (@jimpick) has been an active contributor working out multi-writer details.
- 2017-12-06: @noffle publishes
ARCHITECTURE.md
overview in the hyperdb github repo - 2018-05-02: First round of public review
- 2018-05-23: hyperdb 3.0.0 node.js implementation released
- 2018-06-10: Second draft submitted for review
- 2018-07-06: Accepted with Draft status (after edits)
- 2018-11-17: Clarify example indexing with HypercoreHeaders