contrib/raftexample: adding new members after snapshot taken fails #13741
Comments
@spzala ptal 🤗
Thanks for raising this issue. I just fixed it in pull/13760. Please see the explanation in the PR. Please also kindly let me know whether it works for you.
Thank you very much! Late last week I'd finally gotten far enough into the etcd code to determine this was likely the solution, but I was not yet certain. This clears it up.
Thanks @ahrtr !
Thanks @ahrtr I will be reviewing the changes today. |
What happened?
Using the contrib/raftexample code, adding a new node to the cluster while a snapshot is outstanding fails with an error.
What did you expect to happen?
The node should be successfully added to the cluster, the snapshot restored, and the log replicated past it.
How can we reproduce it (as minimally and precisely as possible)?
1. In raftexample/raft.go, find the line `var defaultSnapshotCount uint64 = 10000` and set the number to 3 (a sketch of this change follows the list).
2. Following the steps in README.md, start a cluster and write enough entries to trigger a snapshot.
3. Attempt to add a new node to the cluster, again as described in README.md.
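For concreteness, this is the one-line change in contrib/raftexample/raft.go described in step 1; the variable is quoted above, and only its value changes:

```go
// contrib/raftexample/raft.go
// Lowering the threshold from 10000 to 3 makes the cluster take a snapshot
// after only a few entries, so the failure reproduces quickly.
var defaultSnapshotCount uint64 = 3
```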
Anything else we need to know?
I am a maintainer of Docker Swarmkit, working on upgrading many old dependencies. Upgrading our etcd/raft dependency past 3.3.x does not work. I have tracked down our problem to the error message listed in the first section: attempting to add a new member fails after a snapshot has been taken, because when the joining node attempts to restore the snapshot, it is not listed in the snapshot's ConfState, and so it will not restore the snapshot.
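For reference, the raft library's restore path contains a defense-in-depth guard that refuses a snapshot whose ConfState does not list the recipient. The following is a simplified, self-contained sketch of that check (paraphrased, not a verbatim copy of raft.go):

```go
// Paraphrased sketch of the guard in etcd raft's snapshot-restore path:
// a node discards a snapshot whose ConfState does not contain its own ID.
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3/raftpb"
)

// snapshotListsNode reports whether the snapshot's ConfState contains id.
// etcd's raft performs an equivalent check before restoring a snapshot,
// logging a warning and returning early when the check fails.
func snapshotListsNode(s raftpb.Snapshot, id uint64) bool {
	cs := s.Metadata.ConfState
	for _, set := range [][]uint64{cs.Voters, cs.Learners, cs.VotersOutgoing} {
		for _, member := range set {
			if member == id {
				return true
			}
		}
	}
	return false
}

func main() {
	// A snapshot taken before node 4 was added: its ConfState lists only 1-3.
	snap := raftpb.Snapshot{
		Metadata: raftpb.SnapshotMetadata{
			ConfState: raftpb.ConfState{Voters: []uint64{1, 2, 3}},
		},
	}
	fmt.Println(snapshotListsNode(snap, 4)) // false: raft would ignore the snapshot
}
```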
From where I am, with my current understanding, the problem looks like this: the joining node receives a snapshot whose ConfState does not include it, and therefore refuses to restore it. I understand that this is not a bug in itself -- the error lies somewhere in Swarmkit's use of the raft library. There's something we must do before or in the course of restoring the received snapshot. I just don't know what it is.
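One detail that may help frame the problem: the ConfState embedded in a snapshot is whatever the application passed in when it created the snapshot, so a snapshot taken before a member joined cannot list that member. Below is a minimal, hypothetical sketch of this pattern with etcd's raft library (illustrative names only; this is not the raftexample or Swarmkit source):

```go
// Hypothetical sketch of how an application using go.etcd.io/etcd/raft/v3
// tracks the ConfState it later embeds in snapshots. Any snapshot created
// before a member's ConfChange is applied will not list that member.
package raftapp

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

type appNode struct {
	node      raft.Node
	storage   *raft.MemoryStorage
	confState raftpb.ConfState // last ConfState returned by ApplyConfChange
}

// applyConfChange is called for each committed EntryConfChange entry.
func (a *appNode) applyConfChange(ent raftpb.Entry) error {
	var cc raftpb.ConfChange
	if err := cc.Unmarshal(ent.Data); err != nil {
		return err
	}
	// Remember the new ConfState so snapshots created from now on
	// include the newly added member.
	a.confState = *a.node.ApplyConfChange(cc)
	return nil
}

// maybeSnapshot embeds the current ConfState in the snapshot; members added
// after this point are, by construction, absent from it.
func (a *appNode) maybeSnapshot(appliedIndex uint64, data []byte) error {
	_, err := a.storage.CreateSnapshot(appliedIndex, &a.confState, data)
	return err
}
```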
My first instinct is to rely on the raft example to determine what I am doing wrong. However, with the example broken in an identical way, I'm left trying to plow through walls of the main etcd server code to locate the analogous sections and figure out how etcd handles this problem of snapshots during cluster joins.
Fixing this issue with contrib/raftexample allows me (and others, potentially) to understand what the issue is in our usage of etcd/raft, how to correctly join a new member when a snapshot is involved, and ultimately how to fix this issue in the Swarmkit code base.
This issue was previously raised in #12473, which was closed for being stale, and a fix was attempted in #13578, which was more-or-less deemed incorrect. I apologize if opening a new issue is incorrect.
Etcd version (please run commands below)
N/A, but the exact version of the codebase being used is v3.5.2.
Etcd configuration (command line flags or environment variables)
N/A
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
No response
Relevant log output
No response