The figure above provides an infographic summary of what happens when our Raft implementation and its test components start.
In the figure above, the green arrow shows a typical client request. Clients send their requests to any of the servers; if the server is a follower, it forwards the request to the leader. Only the leader responds to the client, and only once the request has been committed (by the majority). The client then remembers the leader's address and sends subsequent requests directly to the leader. The blue arrow shows the debugging interface: servers report updates to the Monitor, and the Monitor can ask a server to print out its entire log. The leader also informs Raft when it becomes leader, allowing Disaster to act directly on the leader, for example by crashing it.
Our implementation splits the Server Elixir process into three stages. Each upper stage hands control to the stage below it and waits for it to finish. The first stage is Role Switch, where a server checks its role and runs the corresponding Follower, Candidate or Leader start function. When a role exits, it clears the Elixir mailbox, leaving behind only what is important for the next role. The second stage is Role Acting, where the server executes the Raft algorithm; when the Raft process mentioned in the [previous sub section](#Instance Initiation and Flow of Message) dispatches a disaster, the server enters the third stage. In this third stage, Disaster Simulation, the server drops all incoming messages until it exits the stage. The differences between disasters are defined in section 1.2.4.
As mentioned earlier, the server's control is split into three stages: an upper stage gives control to the stage below by calling its function and waiting for it to finish. When the system is running normally, execution is mainly in the second, Role Acting, stage. Events such as role or term changes cause these modules to exit, and the server module decides which role to switch to next.
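The sketch below illustrates this hand-off in the Role Switch stage. The placeholder `run_role` clauses stand in for the real Follower, Candidate and Leader start functions; the module name and the specific transitions shown are ours for illustration only.

```elixir
# A deliberately simplified Role Switch loop. Each run_role clause stands in
# for the real role module, which runs its role until a role or term change
# and returns {next_role, updated_state}.
defmodule RoleSwitchSketch do
  defp run_role(:follower, state),  do: {:candidate, state}  # e.g. election timeout fired
  defp run_role(:candidate, state), do: {:leader, state}     # e.g. won the election
  defp run_role(:leader, state),    do: {:follower, state}   # e.g. saw a higher term

  # Stage one: keep handing control to the current role and wait for it to exit.
  def loop(role, state) do
    {next_role, next_state} = run_role(role, state)
    # The mailbox would be flushed here before the next role takes over.
    loop(next_role, next_state)
  end
end
```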
In these modules, initialization is done in the start function, with everything else handled by the next function. As our implementation uses message passing for communication, timeouts are implemented through messages and the send_after function. The check_term_and function performs term guarding; the rest of the message handling only executes if this term check passes.
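A hedged sketch of the send_after timeout and term-guard pattern is shown below; the message shapes and the `curr_term` field name are assumptions rather than the exact ones in our modules.

```elixir
defmodule TermGuardSketch do
  # Schedule the election timeout as a message to self. Raft recommends a
  # randomised timeout (here 150-300ms) to reduce the chance of split votes.
  def reset_election_timer do
    timeout = 150 + :rand.uniform(150)
    Process.send_after(self(), :election_timeout, timeout)
  end

  # check_term_and: only run the handler if the message's term is current.
  # Stale terms are ignored; a higher term means the server should update its
  # term (and step down to follower, not shown here).
  def check_term_and(state, msg_term, handler) do
    cond do
      msg_term < state.curr_term -> state
      msg_term > state.curr_term -> %{state | curr_term: msg_term}
      true -> handler.(state)
    end
  end
end
```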
When servers switch between these states, the role's message queue is flushed, with the exception of the Will message, which is kept and sent to itself before transitioning to the next state.
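A minimal sketch of this flush is given below, assuming Will messages are tagged with a `:will` atom; the actual tag in our code may differ.

```elixir
defmodule FlushSketch do
  # Drain the process mailbox before switching roles, keeping only Will
  # messages, which are re-sent to self so the next role still receives them.
  def flush(kept \\ []) do
    receive do
      {:will, _payload} = will -> flush([will | kept])
      _other -> flush(kept)
    after
      0 ->
        # Mailbox is empty: re-deliver the preserved Will messages in order.
        kept |> Enum.reverse() |> Enum.each(&send(self(), &1))
        :ok
    end
  end
end
```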
The State module stores the server state, and is common between the leader, follower and candidate states. Servers are able to read and modify this state regardless of which of the three states they are in.
Every server in the network keeps a list of server ids for all servers in the network, including itself; a server's id is its index in this list. Ids thus range from 0 to the size of the network minus 1.
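The struct below gives an indicative shape of this shared state; the field names are illustrative rather than the exact ones used in state.ex.

```elixir
defmodule StateSketch do
  defstruct servers: [],      # addresses of all servers, including this one
            id: 0,            # this server's id = its index in `servers`
            curr_term: 0,
            voted_for: nil,
            log: [],          # list of {payload, term} entries
            commit_index: 0,
            last_applied: 0

  # A server's id is simply its position in the shared server list,
  # so ids range from 0 to length(servers) - 1.
  def id_of(%StateSketch{servers: servers}, server),
    do: Enum.find_index(servers, &(&1 == server))
end
```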
The logs are stored in State, and Raft ensures they stay consistent across servers. The log is a linked list, and each entry is a tuple of the form {payload, term}, created when the leader receives a client message. The payload is a dictionary consisting of the client address, the command and a UID. When the leader commits a log entry, the command is executed by the database and the response is sent to the client address together with the UID.
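The sketch below shows one way to append and commit a log entry in this format; the database call and the reply shape are placeholders, not our actual implementation.

```elixir
defmodule LogSketch do
  # Each entry is a {payload, term} tuple; the payload holds the client
  # address, the command for the database, and the request UID.
  def append(log, client, command, uid, term) do
    log ++ [{%{client: client, cmd: command, uid: uid}, term}]
  end

  # Once an entry is known to be committed, execute the command and reply
  # to the stored client address, echoing the UID so the client can match it.
  def commit({%{client: client, cmd: cmd, uid: uid}, _term}) do
    result = {:ok, cmd}                 # placeholder for the real database call
    send(client, {:reply, uid, result})
    result
  end
end
```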
As described in the previous section, the disaster module handles various changes to the system; a sketch of how a server could react to these events follows the list below.
- Offline refers to a server becoming unavailable, with no state being lost. It is analogous to a loss of network connectivity, where the server itself is still functioning.
- Crash refers to a server becoming unavailable and losing its current state. It is analogous to a server restarting or a complete loss of the server.
- Timeout refers to a server hitting the election timeout, switching to candidate and starting a new election.
- Online refers to a server becoming available again. It can happen from both the crashed and offline states.
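The sketch below shows one possible handling of these disaster messages during the Disaster Simulation stage; the message tags are assumptions, not the exact ones used by our Disaster module.

```elixir
defmodule DisasterSketch do
  # Offline: drop every incoming message until told to come back online;
  # the server's state is kept intact.
  def offline(state) do
    receive do
      :online -> state
      _other  -> offline(state)
    end
  end

  # Crash: same as offline, but volatile state is wiped first, so the server
  # restarts from scratch and must re-learn the log from the leader.
  def crash(fresh_state), do: offline(fresh_state)

  # Timeout: behave as if the election timer fired, switching to candidate.
  def timeout(state) do
    send(self(), :election_timeout)
    state
  end
end
```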
In order to recover a crashed/wiped server, we either have to know its current last_applied state or start from scratch. To keep our implementation simple, we chose the latter, as it avoids implementing persistent state. This keeps it similar to the real-world scenario where a server crashes before its data has been written to persistent storage. The network configuration is assumed to be relatively static and saved on persistent storage, allowing the recovered server to rejoin the system. The server then has to collect all logs from the beginning and apply them.
In the event that a server is not yet aware of the leader, client requests are simply ignored, and it is up to the client to retry these requests at a later time. This situation can happen when servers have just joined the network, or during a network partition where some servers cannot communicate with the leader.
All client requests carry a unique id (UID), which is used to look up entries in the log. When a request is received, its UID is checked against the log to see if it is a duplicate: if a matching entry exists and is committed, a reply is sent back, otherwise the duplicate is ignored. New requests are appended to the log. This process could be optimised through various methods, such as log compaction, compression or UID indexing; these were not done in order to reduce implementation complexity.
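A minimal sketch of this duplicate check is shown below, reusing the {payload, term} entry format from earlier; treating every entry below a commit index as committed is a simplification for illustration.

```elixir
defmodule DedupSketch do
  # Look for an existing entry with the same UID. If it is already committed,
  # reply immediately; if it is still uncommitted, ignore the request;
  # otherwise append it as a new entry.
  def handle_request(log, commit_index, term, %{uid: uid, client: client} = payload) do
    case Enum.find_index(log, fn {p, _term} -> p.uid == uid end) do
      nil ->
        {:append, log ++ [{payload, term}]}

      idx when idx < commit_index ->
        send(client, {:reply, uid, :ok})
        {:duplicate, log}

      _idx ->
        {:pending, log}
    end
  end
end
```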
As the message passing between clients and servers is asynchronous, our implementation does not allow clients to recall messages. Also, in the event of the client address changing, the reply will not be successfully delivered to the client. It is up to the client to wait for the reply timeout and resend the request from its new address.
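The client-side retry behaviour implied here could look like the sketch below; the timeout value and message tags are assumptions.

```elixir
defmodule ClientRetrySketch do
  @reply_timeout 1_000   # assumed reply timeout in milliseconds

  # Send a request and wait for the matching reply. If nothing arrives in time
  # (e.g. the reply went to an old address), resend with the same UID so the
  # servers can deduplicate the request.
  def request(server, cmd, uid) do
    send(server, {:client_request, %{client: self(), cmd: cmd, uid: uid}})

    receive do
      {:reply, ^uid, result} -> {:ok, result}
    after
      @reply_timeout -> request(server, cmd, uid)
    end
  end
end
```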
To make the code more readable and avoid reinventing the wheel, the code is organised by functionality rather than by the instance it belongs to. In particular, state.ex acts as a library of state operations, invoked heavily by follower.ex and candidate.ex, while monitor.ex contains all the debug-printing code and is invoked by almost every other file.
While developing the Server, we used unit tests and iex to test basic functionality, with debugging messages placed at various levels. We also placed warnings on routines the server should not reach, and halt once the first serious fault is suspected. monitor.ex is used as a centralised location for all debug functions, allowing easy printing with a single function call.
The system was implemented with an emphasis on ease of testing. Configurations can be fed to tests, overriding the default parameters set in the dac module. Sample configurations are stored under the testplan folder, and the README.md file contains detailed instructions on how to write test cases. We created a test script, script/test.sh, to run test cases and save their output to a log file, allowing easy extension without modifying source code. This also improves test repeatability, reducing the chance of regression bugs.
In this test, the system was run normally, and the state was checked after several iterations, with no failures being simulated. The output can be seen in output/1_c2s5_normal_ape.log. Here, a leader is successfully elected and the entire test consists of a single term, with all servers successfully committing all 20 client requests by the end of the test.
This test case shows what happens when a single follower crashes, and
then recovers a while later. The system should continue operating
normally with one less node, and should be able to update its entries
from the leader node once it has recovered. The output can be seen in
output/2_c2s5_single_failure_follower_ape.log.
The testing sequence is as follows:
- time = 0ms, Node 0 times out to force an election and is elected leader
- time = 1000ms, Node 1 crashes
- time = 2000ms, Node 1 recovers
From this test, we see that the system continues to operate with the leader node and the 3 remaining follower nodes. When the crashed node recovers, it recognizes the current leader and rejoins as a follower. At the end of the log, we also see that the crashed node is able to update its entries from the leader.
This test case shows what happens when the leader node crashes, and then
recovers a while later. The leader crashing should force other servers
to eventually timeout and elect a new leader, after which normal system
operation should continue. The crashed node should then be able to
rejoin as a follower, and update its log from the leader node once it
has recovered. The output can be seen in
output/3_c2s5_single_failure_leader_ape.log.
The testing sequence is as follows:
- time = 1000ms, Current leader node crashes
- time = 2000ms, Crashed node recovers
In one test run, we saw that when the current leader (server 4) crashed, server 3 was the first to time out, becoming a candidate, and was successfully elected as the new leader, allowing system operation to continue.
In this case, 2 server nodes are set to fail, with the current leader
failing each time. The system should be able to replace the leader in
both situations, and continue operating without them. The output can be
seen in output/4_c2s5_two_failures.log.
The testing sequence is as follows:
- time = 0ms, Node 0 times out to force an election and is elected leader
- time = 1000ms, Current leader node goes offline
- time = 1500ms, Current leader node crashes

From the log, we see that the system is successful in replacing both failed leaders, continuing operation with 3 nodes.
In this case, 3 nodes fail in total, with the current leader node
failing each time, followed by a failed node recovering. The system
should continue to function until the third server fails, at which point
it should be unable to elect a new leader. The recovery of one of the
failed servers should allow the system to resume operation. The output
can be seen in output/5_c2s5_voting_with_majority_crash.log.
The testing sequence is as follows:
- time = 0ms, Node 0 times out to force an election and is elected leader
- time = 1000ms, Current leader node goes offline
- time = 1500ms, Current leader node crashes
- time = 2000ms, Current leader node goes offline
- time = 2500ms, Crashed node (crashed at t=1500ms) recovers
In this case, a network partition is simulated by taking nodes offline
and online in groups of 2 and 3. This lets us simulate each of the
partitions. The system performed as expected, continuing to run in the
partition with 3 nodes, and being unable to elect a leader in the
partition with 2 nodes. When the network partition was removed, the
system was able to continue operating, with one of the nodes from the
larger partition becoming the leader. The output can be seen in
output/8_c2s5_partition.log.
The testing sequence is as follows:
- time = 100ms, Nodes 0,1 go offline
- time = 2000ms, Nodes 2,3,4 go offline
- time = 4000ms, Nodes 0,1 go online
- time = 6000ms, Nodes 2,3,4 go online
The Raft paper states that the system should remain functional as long as the following relationship is maintained, with at least an order of magnitude between each part:
broadcastTime << electionTimeout << Mean Time Between Failure(MTBF)
Our testing focuses on the second part of the relationship by adjusting the MTBF.
The electionTimeout is set to be between 150-300ms as specified by the Raft paper, and a server is set to fail every 250ms, with the previously failed server recovering, resulting in an MTBF for a single server of 1000ms.
Despite this, the system continues functioning and responding to client requests. One thing to note is that, as the servers fail at fixed intervals, this does not accurately simulate a real-world scenario where failures can happen at any time, even if the mean failure rate is the same.
To push the above case to an extreme, the MTBF is further reduced to 300ms, identical to the maximum electionTimeout. In this case, we see that the system performs as expected by the Raft paper, making very little progress while nodes are failing, as so much time is spent conducting elections that the next leader is likely to fail before progress can be made.
In this test, we ran the system with 10 clients and 5 servers, to simulate a much higher client request load. No failures were simulated. The output can be seen in output/11_c10s5_normal_ape_load_test.log. The system is able to continue functioning despite the high load.
Overall, we see that our implementation of Raft, while not very feature-rich, performs as expected, tolerating minority failures well and failing only when the tests force it to.