
Simple Replicated Key-Value Database

A key-value store is a very simple form of database. Its entries are key-value pairs: the key acts as a unique identifier and the value is arbitrary data. The current implementation is based on Akka Actors and can therefore only be used via message communication with actors. This constraint might be removed in the future.

In the current version, the system includes a primary node (the primary replica), which is responsible for replicating all changes to a set of secondary nodes (the secondary replicas); replica nodes may join and leave at arbitrary times.

The primary replica is the only node that accepts modification events (insertions and removals), and it replicates its current state to the secondaries. Both the primary and the secondary replicas accept lookup (read) events, although the secondary nodes are allowed to give results that are "out of date", since it takes time for the replicas to keep up with the changes on the primary replica.

Overview of the system components

The key-value store and its environment consist of the following components:

The clustered key-value store:

A set of nodes that store key-value pairs in a distributed fashion, cooperating to maintain a certain set of guarantees (specified in the section “System Behavior - Consistency guarantees”). This cluster of nodes consists of replicas and the provided Arbiter and Persistence modules:

  • Primary replica: A distinguished node in the cluster that accepts updates to keys and propagates the changes to the secondary replicas.
  • Secondary replicas: Nodes that are in contact with the primary replica, accepting updates from it and serving clients for read-only operations.
  • Arbiter: A subsystem that is provided in this exercise and which assigns the primary or secondary roles to your nodes.
  • Persistence: A subsystem that is provided in this exercise and which offers to persist updates to stable storage, but might fail spuriously.
  • Clients: Entities that communicate with one of the replicas to update and read key-value pairs.

Clients and the KV Protocol

Clients are external entities contacting the replica set (your cluster) for reading and writing key-value pairs.

The KV Protocol itself contains two subsets of operations:

  1. Update operations (insertion and removal).
  2. Lookup operation.

Clients contacting the primary node directly can use all operations on the key-value store, while clients contacting the secondaries can only use lookups.

The two sets of operations in detail are:

Update Commands

  1. Insert(key, value, id) - This message instructs the primary to insert the (key, value) pair into the storage and replicate it to the secondaries; id is a client-chosen unique identifier for this request.
  2. Remove(key, id) - This message instructs the primary to remove the key (and its corresponding value) from the storage and then remove it from the secondaries.

A successful Insert or Remove results in a reply to the client in the form of an OperationAck(id) message where the id field matches the corresponding id field of the operation that has been acknowledged. A failed Insert or Remove command results in an OperationFailed(id) reply. A failure is defined as the inability to confirm the operation within 1 second. See the sections on replication and persistence below for more details.
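For illustration, these messages and their replies could be modelled as Scala case classes along the following lines; the field types (String keys and values, Long ids) are an assumption, only the message and field names come from the description above.

// Sketch of the update protocol described above; field types are assumptions.
sealed trait Operation { def key: String; def id: Long }
case class Insert(key: String, value: String, id: Long) extends Operation
case class Remove(key: String, id: Long) extends Operation

sealed trait OperationReply { def id: Long }
case class OperationAck(id: Long) extends OperationReply
case class OperationFailed(id: Long) extends OperationReply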

Lookup

  1. Get(key, id) - Instructs the replica to look up the "current" value assigned to the key in the storage (what "current" means is described in detail in the next section) and reply with the stored value. A Get operation results in a GetResult(key, valueOption, id) message where:
    • id matches the value in the id field of the corresponding Get message.
    • valueOption equals None if the key is not present in the replica or Some(value) if a value is currently assigned to the given key in that replica.
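In the same sketched style as above (again, the field types are assumptions):

// Sketch of the lookup protocol; a missing key is represented by None.
case class Get(key: String, id: Long)
case class GetResult(key: String, valueOption: Option[String], id: Long)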

System Behavior - Consistency guarantees

Let's assume a scenario in which one client issues the following commands to the primary replica (starting from empty storage), waiting for successful acknowledgement of each operation before proceeding with the next (see further below for the case of not awaiting confirmation):

Insert("key1", "a")
Insert("key2", "1")
Insert("key1", "b")
Insert("key2", "2")

Ordering guarantees for clients contacting the primary replica

A second client reading directly from the primary is not allowed to see:

  • key1 containing b and then containing a (since a was written before b for key1)
  • key2 containing 2 and then containing 1 (since 1 was written before 2 for key2)

In other words, this second client sees the updates in order.

In contrast, the second client may observe

  • key1 containing b and then key2 containing 1
  • key2 containing 2 and then key1 containing a

This means that the ordering guarantee only applies between reads and writes to the same key, not across keys. The store may choose to provide stronger semantics that respect ordering across different keys, but clients will not be able to rely on this; the reason is that lifting the restriction of having only one non-failing primary replica would require breaking these stronger guarantees.

Ordering guarantees for clients contacting a secondary replica

For a second client reading from one of the secondary replicas (the replica does not change during a conversation), the exact same requirements apply as if that client were reading from the primary, with the following addition:

  • It is guaranteed that a client reading from a secondary replica will eventually see the following (at some point in the future):
    • key1 containing b
    • key2 containing 2

Ordering guarantees for clients contacting different replicas

If a second client asks different replicas for the same key, it may observe different values during the time window when an update is disseminated. The client asking for key1 might see

  • answer b from one replica
  • and subsequently answer a from a different replica

As per the rule stated in the previous section, and assuming that the client keeps asking repeatedly, eventually all reads will result in the value b if no other updates are done on key1.

_Eventual consistency means that, given enough time, all replicas settle on the same view._

Durability guarantees of updates for clients contacting the primary replica

The previous two sections prescribed possible transitions that the clients are allowed to experience on key updates. In this section, we will see what guarantees acknowledgement messages obey (on the primary replica).

Whenever the primary replica receives an update operation (either Insert or Remove) it replies with an OperationAck(id) or OperationFailed(id) message, to be sent at most 1 second after the update command was processed.

A positive OperationAck reply must be sent as soon as both of the following hold:

  • The change in question has been handed down to the (provided) Persistence module and a corresponding acknowledgement has been received from it.
  • Replication of the change has been initiated and all of the secondary replicas have acknowledged the replication of the update. If replicas leave the cluster, which is signalled by sending a new Replicas message to the primary (this shall be changed to go via the Arbiter in future versions), then outstanding acknowledgements from these replicas are waived. This can lead to an OperationAck being triggered indirectly by the Replicas message.

A negative OperationFailed reply will be sent if the conditions for sending an OperationAck are not met within the 1 second maximum response time.
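A minimal sketch of the bookkeeping these rules imply on the primary might look as follows; the names and types here are assumptions, only the observable behaviour described above is prescribed.

import akka.actor.ActorRef

// Hypothetical per-operation state kept by the primary while an update is in flight.
final case class PendingAck(
  client: ActorRef,                // requester expecting OperationAck or OperationFailed
  persisted: Boolean,              // has the local Persistence module confirmed the change?
  awaitingReplicas: Set[ActorRef]  // secondaries that have not acknowledged the replication yet
) {
  // A positive OperationAck may be sent as soon as both conditions hold.
  def canAck: Boolean = persisted && awaitingReplicas.isEmpty

  // When a Replicas message signals departed nodes, their outstanding
  // acknowledgements are waived, which can complete the operation and
  // indirectly trigger the OperationAck.
  def withoutDeparted(gone: Set[ActorRef]): PendingAck =
    copy(awaitingReplicas = awaitingReplicas -- gone)
}

If canAck does not become true within the 1 second window, the primary answers with OperationFailed(id) instead.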

Consistency in the case of failed replication or persistence

Assuming in the above scenario that the last write fails (i.e. an OperationFailed is returned), replication to some replicas may have been successful while it failed on others. Therefore, to maintain consistency in this case, the following is done:

  1. A Remove is sent for the same key to the replicas on which the write succeeded.
  2. If this rollback succeeds, an OperationAck is sent back to the client.
  3. If the rollback fails, no more attempts are made and an OperationFailed is sent back to the client. (This might leave the database in an inconsistent state.)
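Sketched very roughly (all names below are hypothetical stand-ins for the real messaging; only the three steps above come from the description), reusing the reply messages from the update-protocol sketch:

import akka.actor.ActorRef

// Hypothetical outline of the compensation steps above; `removeFromReplica`
// and `replyToClient` are illustrative stand-ins, not part of the protocol.
def compensateFailedWrite(
    acknowledgedReplicas: Set[ActorRef],
    key: String,
    id: Long,
    removeFromReplica: (ActorRef, String) => Boolean,  // did the rollback Remove succeed?
    replyToClient: Any => Unit
): Unit = {
  // Step 1: send a Remove for the same key to every replica that applied the failed write;
  // forall short-circuits, so no further attempts are made once one rollback fails.
  val rollbackOk = acknowledgedReplicas.forall(r => removeFromReplica(r, key))
  if (rollbackOk) replyToClient(OperationAck(id))  // step 2
  else replyToClient(OperationFailed(id))          // step 3: state may remain inconsistent
}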

Which value to expect while an update is outstanding?

Sending an update request for a key followed by a Get request for the same key without waiting for the acknowledgement of the update is allowed to return either the old or the new value (or a third value if another client concurrently updates the same key). An example, assuming only this one client at this time:

Insert("key1", "a")
<await confirmation>
Insert("key1", "b")
Get("key1")

The replies for the last two requests may arrive in any order, and the reply for the Get request may either contain a or b.

The Arbiter

The Arbiter is an external subsystem following a simple protocol:

New replicas must first send a Join message to the Arbiter, signaling that they are ready to be used. The Join message will be answered by either a JoinedPrimary or JoinedSecondary message indicating the role of the new node. The first node to join gets the primary role; all subsequent nodes are assigned the secondary role.
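This protocol can be sketched with the following message types; the Replicas message mentioned earlier, which carries the current set of replica nodes, is included as well (its payload type is an assumption).

import akka.actor.ActorRef

// Sketch of the Arbiter protocol messages described above.
case object Join             // sent by a new replica to the Arbiter
case object JoinedPrimary    // Arbiter's reply to the first node that joins
case object JoinedSecondary  // Arbiter's reply to every later node
case class Replicas(replicas: Set[ActorRef])  // current set of replica nodes, sent to the primary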

The Replicas

When the Replica actor starts, it sends a Join message to the Arbiter and then chooses between primary and secondary behavior according to the Arbiter's reply to the Join message (a JoinedPrimary or JoinedSecondary message).
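A minimal sketch of that handshake, reusing the message names above; the constructor argument and the empty leader/replica behaviours are placeholders.

import akka.actor.{Actor, ActorRef}

// Sketch of a replica choosing its behaviour from the Arbiter's answer.
class Replica(arbiter: ActorRef) extends Actor {
  arbiter ! Join  // announce readiness as soon as the actor starts

  def receive: Receive = {
    case JoinedPrimary   => context.become(leader)   // first node to join
    case JoinedSecondary => context.become(replica)  // all subsequent nodes
  }

  // Primary behaviour: accepts updates and lookups (omitted here).
  def leader: Receive = { case _ => () }

  // Secondary behaviour: serves lookups and applies replicated updates (omitted here).
  def replica: Receive = { case _ => () }
}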

Persistence

Each replica will submit incoming updates to the local Persistence actor and wait for its acknowledgement before confirming the update to the requester. In the case of the primary, the requester is a client which sent an Insert or Remove request and the confirmation is an OperationAck, whereas in the case of a secondary the requester is a Replicator sending a Snapshot and expecting a SnapshotAck back.

The message types used are:

  • Persist(key, valueOption, id) is sent to the Persistence actor to request the given state to be persisted (with the same field description as for the Replicate message above).
  • Persisted(key, id) is sent by the Persistence actor as reply in case the corresponding request was successful; no reply is sent otherwise. The provided implementation of this persistence service is a mock in the true sense, since it is rather unreliable: every now and then it will fail with an exception and not acknowledge the current request. The Replica actor appropriately supervises the Persistence actor and restarts it in case of failure.
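A sketch of those persistence messages and of the supervision just mentioned; the field types and the persistenceProps constructor argument are assumptions.

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}

// Sketch of the persistence protocol described above; field types are assumptions.
case class Persist(key: String, valueOption: Option[String], id: Long)
case class Persisted(key: String, id: Long)

// Sketch of a replica owning and supervising its (unreliable) Persistence actor.
class PersistingReplica(persistenceProps: Props) extends Actor {
  val persistence: ActorRef = context.actorOf(persistenceProps)

  // Restart the Persistence actor whenever it fails with an exception,
  // so that unacknowledged Persist requests can be retried.
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy() {
    case _: Exception => SupervisorStrategy.Restart
  }

  def receive: Receive = {
    case Persisted(key, id) => ()  // confirm the update to the original requester here
  }
}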

Please note that in this exercise the nodes (i.e. actors) are executed within the same JVM, but the problems and their solutions apply equally when distributing the system across multiple network hosts.