Make raft leader resume probing after snapshot crash #2707

manishrjain · 2018-10-30T16:20:32Z

When a raft follower is considered to be falling behind, the leader sends it a snapshot. The follower then opens a streaming connection to the leader and asks it to send the snapshot. If the follower crashes while receiving and applying the snapshot, it is left in permanent limbo, receiving no more raft updates. This happened because the leader pauses its probing until it hears back from the follower, and that paused state "persists" through the follower's crash and restart.

This PR makes the snapshot streaming bi-directional, so the follower can send an ACK back to the leader once it has successfully applied the snapshot. If the leader gets an error instead, it marks the snapshot as a failure and resumes probing.
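As a rough illustration of the leader side of this change, here is a minimal sketch, not the code in this PR. etcd's raft `Node` does expose `ReportSnapshot(id, status)` with `SnapshotFinish` and `SnapshotFailure`, and the local `SnapshotStatus` type below only mirrors that shape so the example compiles on its own. Everything else (`reporter`, `ackStream`, `streamSnapshot`, the fakes) is a hypothetical name invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// SnapshotStatus is a local stand-in mirroring the shape of etcd raft's snapshot status.
type SnapshotStatus int

const (
	SnapshotFinish SnapshotStatus = iota + 1
	SnapshotFailure
)

// reporter is a hypothetical stand-in for the leader's raft node.
type reporter interface {
	ReportSnapshot(id uint64, status SnapshotStatus)
}

// ackStream is a hypothetical stand-in for the leader's end of the
// bi-directional stream: it sends chunks and then waits for the follower's ACK.
type ackStream interface {
	SendChunk(data []byte) error
	RecvAck() error
}

// streamSnapshot sends every chunk, then blocks for the ACK. Whatever happens,
// it reports an outcome, so the leader never stays paused waiting on a
// follower that crashed mid-transfer.
func streamSnapshot(r reporter, s ackStream, follower uint64, chunks [][]byte) error {
	status := SnapshotFailure // assume failure until the ACK arrives
	defer func() { r.ReportSnapshot(follower, status) }()

	for _, c := range chunks {
		if err := s.SendChunk(c); err != nil {
			return fmt.Errorf("send chunk: %w", err)
		}
	}
	if err := s.RecvAck(); err != nil {
		return fmt.Errorf("wait for ack: %w", err)
	}
	status = SnapshotFinish
	return nil
}

// Fakes used only to exercise the sketch.
type fakeNode struct{ last SnapshotStatus }

func (n *fakeNode) ReportSnapshot(_ uint64, st SnapshotStatus) { n.last = st }

type fakeStream struct{ ackErr error }

func (s *fakeStream) SendChunk([]byte) error { return nil }
func (s *fakeStream) RecvAck() error         { return s.ackErr }

var errFollowerCrashed = errors.New("follower crashed")

func main() {
	node := &fakeNode{}
	// Simulate a follower that dies before acknowledging the snapshot.
	err := streamSnapshot(node, &fakeStream{ackErr: errFollowerCrashed}, 2, [][]byte{[]byte("kv")})
	fmt.Println(err, node.last == SnapshotFailure)
}
```

Reporting from a deferred function is a design choice in this sketch: every exit path, including a broken stream, ends with exactly one report to the raft node.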

The end effect is that if the follower crashes while receiving a snapshot, the leader resumes probing, so when the follower comes back up, it requests a snapshot from the leader again.

Fixes #2698.

P.S. I wish etcd's raft lib had better documentation to warn us about this, via the function. I'll create another PR to improve their godocs.

srfrog

Reviewed 6 of 6 files at r1.
Reviewable status: complete! all files reviewed, all discussions resolved

* Change the order in which the snapshot is retrieved and the WAL is written via `SaveToStorage`. So, if a follower is unable to retrieve the snapshot, it won't store it in the WAL. This allows future probing to work correctly.
* Some exploration to determine a good way to send ack of writes back to the leader.
* Recovers correctly from a node crash during snapshot streaming.
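The reordering on the follower side can be sketched roughly as follows. This is a simplified illustration under assumptions, not the PR's code: `snapshotMeta`, `wal`, `applySnapshot`, and the one-argument `SaveToStorage` are hypothetical stand-ins. The point is only the ordering: retrieve first, persist second.

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotMeta is a hypothetical stand-in for the snapshot a follower is asked to apply.
type snapshotMeta struct{ Index uint64 }

// wal is a hypothetical in-memory stand-in for the follower's write-ahead log.
type wal struct{ snapshots []snapshotMeta }

// SaveToStorage here is a simplified stand-in that only records the snapshot.
func (w *wal) SaveToStorage(s snapshotMeta) { w.snapshots = append(w.snapshots, s) }

var errStream = errors.New("stream broke")

// applySnapshot retrieves the snapshot data first and records the snapshot in
// the WAL only if retrieval succeeded. A failed retrieval leaves the WAL
// untouched, so after a restart the follower does not claim a snapshot it
// never actually received.
func applySnapshot(w *wal, s snapshotMeta, retrieve func(snapshotMeta) error) error {
	if err := retrieve(s); err != nil {
		return fmt.Errorf("retrieve snapshot at index %d: %w", s.Index, err)
	}
	w.SaveToStorage(s)
	return nil
}

func main() {
	w := &wal{}
	failing := func(snapshotMeta) error { return errStream }
	succeeding := func(snapshotMeta) error { return nil }

	// The first attempt fails mid-stream; the second one succeeds.
	fmt.Println(applySnapshot(w, snapshotMeta{Index: 10}, failing), len(w.snapshots))
	fmt.Println(applySnapshot(w, snapshotMeta{Index: 10}, succeeding), len(w.snapshots))
}
```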

manishrjain added 5 commits October 29, 2018 12:29

Change the order in which Snapshot is retrieved.

f08bdaa

Some exploration to determine a good way to send ack of writes back to the leader.

6117c2b

Recovers correctly from a node crash during snapshot streaming.

f783375

Recompile proto on desktop.

6c8f047

Self Review

b2a83a4

manishrjain requested a review from srfrog October 30, 2018 16:59

srfrog approved these changes Oct 30, 2018

Self review

77ae84e

manishrjain merged commit 165817a into master Oct 30, 2018

manishrjain deleted the mrjn/reorder-snapshot branch October 30, 2018 20:07
