Discussion: follower panics due to a regression in the commit index. #197

Open
shaj13 opened this issue Apr 23, 2024 · 0 comments

Comments

@shaj13

shaj13 commented Apr 23, 2024

It seems there is some confusion about why PR #25 was introduced. To clarify, I opened it for discussion rather than as a final solution.

The issue arises when a follower loses its state, for example after a disk failure, and is restarted with a fresh state at index 0, while the leader keeps sending the last index it recorded for that follower, say index 10. This causes the follower to panic at https://github.com/etcd-io/raft/blob/main/log.go#L320. Currently, the only resolution is manual intervention: remove the follower and re-add it to the cluster as a new member.

A potential solution is for the follower to reject a heartbeat that carries a higher index than its own and to return a hint index instead. The leader can then decrease the follower's progress, and the system can resume as expected. This approach is already implemented for MsgApp: https://github.com/etcd-io/raft/blob/main/raft.go#L1382.
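To make the idea concrete, here is a minimal sketch of the reject-with-hint flow for heartbeats. It is a toy model, not the etcd raft code: `Heartbeat`, `HeartbeatResp`, `Follower`, `Progress`, and their methods are hypothetical names made up for illustration, and the real message and progress types differ.

```go
package main

import "fmt"

// Heartbeat is a hypothetical, simplified stand-in for a leader heartbeat.
type Heartbeat struct {
	Commit uint64 // commit index the leader believes the follower can advance to
}

// HeartbeatResp is a hypothetical response that can carry a rejection hint.
type HeartbeatResp struct {
	Reject     bool
	RejectHint uint64 // follower's last log index; meaningful only when Reject is true
}

// Follower models only the two indexes relevant to this discussion.
type Follower struct {
	lastIndex uint64
	committed uint64
}

// HandleHeartbeat rejects a commit index that lies beyond the local log
// instead of panicking, and reports the local last index as a hint.
func (f *Follower) HandleHeartbeat(m Heartbeat) HeartbeatResp {
	if m.Commit > f.lastIndex {
		return HeartbeatResp{Reject: true, RejectHint: f.lastIndex}
	}
	if m.Commit > f.committed {
		f.committed = m.Commit
	}
	return HeartbeatResp{}
}

// Progress is a hypothetical, simplified view the leader keeps per follower.
type Progress struct {
	Match uint64
	Next  uint64
}

// HandleHeartbeatResp lowers the leader's view of the follower when the
// follower rejects the heartbeat with a smaller hint index.
func (p *Progress) HandleHeartbeatResp(r HeartbeatResp) {
	if !r.Reject {
		return
	}
	if r.RejectHint < p.Match {
		p.Match = r.RejectHint
		p.Next = r.RejectHint + 1
	}
}

func main() {
	f := &Follower{}                    // restarted with fresh state, index 0
	p := &Progress{Match: 10, Next: 11} // leader still remembers index 10

	resp := f.HandleHeartbeat(Heartbeat{Commit: p.Match})
	p.HandleHeartbeatResp(resp)

	fmt.Println(resp.Reject, resp.RejectHint, p.Match, p.Next) // prints "true 0 0 1"
}
```

After the rejection, the leader's progress for the follower is back at index 0, so normal replication can resend the log from the start without any membership change.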

The event log suggested at #25 (comment) is useful for other cases where this panic can occur during other operations. However, it does not resolve this issue, because the user has no way to reconcile the follower's progress on the leader node. The better solution is to use a rejection hint, similar to what is done for MsgApp.

The same approach as in #25 can be taken, but instead of recovering from the panic, the code that currently panics (https://github.com/etcd-io/raft/blob/main/log.go#L320) returns an error, and the panic happens only where it is still needed.
When handling the heartbeat, capture the error and check its reason. If the reason is one the handler knows how to deal with, send a rejection; if the error is not handled, it still panics. This approach is more idiomatic because it reuses Go's error handling. In the future, it can adopt the event log for handling other operations that panic.

That solves the regression and allows etcd raft to run in memory without durable state. This is useful for applications that only need a replicated in-memory raft log, such as Docker Swarm Secrets, where members can restart and follow the leader again to replicate encryption or security data that is never written to disk.

cc: @ahrtr @pav-kv
