contrib/raftexample: adding new members after snapshot taken fails #13741

Closed
dperny opened this issue Feb 24, 2022 · 5 comments · Fixed by #13760

dperny commented Feb 24, 2022

What happened?

Using the contrib/raftexample code, adding a new node to the cluster after a snapshot has been taken fails with:

2 attempted to restore snapshot but it is not in the ConfState {[1] [] [] [] false []}; should never happen

What did you expect to happen?

The new node should be added to the cluster successfully: the snapshot should be restored and the log replicated past it.

How can we reproduce it (as minimally and precisely as possible)?

  1. First, alter the example so that the snapshot threshold is very low. In raftexample/raft.go, find the line var defaultSnapshotCount uint64 = 10000 and set the number to 3 (see the sketch after this list).
  2. Build and run the raftexample binary as listed in README.md
  3. Put a few entries into the cluster to trigger creation of a snapshot.
  4. Attempt to add a new member to the cluster through the commands listed in README.md.
  5. Observe the failure.
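
For step 1, the change is a one-line edit to contrib/raftexample/raft.go. A minimal sketch of the edit (only the variable named above is assumed; the rest of the file is unchanged):

```go
// contrib/raftexample/raft.go
// Lower the snapshot threshold so a snapshot is created after only a few
// proposals, making the failure easy to reproduce.
var defaultSnapshotCount uint64 = 3 // originally 10000
```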

Anything else we need to know?

I am a maintainer of Docker Swarmkit, working on upgrading many old dependencies. Upgrading our etcd/raft dependency past 3.3.x does not work.

I have tracked down our problem to the error message listed in the first section. Adding a new member fails after a snapshot has been taken, because when the new node attempts to restore the received snapshot, its ID is not listed in the snapshot's ConfState, so the snapshot is not restored.

From where I am, with my current understanding, the problem looks like this:

  1. Node 2 needs to join the cluster.
  2. To join the cluster, Node 2 needs to replicate the log.
  3. To replicate the log, Node 2 must first start from the snapshot.
  4. Node 2 cannot start from the snapshot, because the ConfState in the snapshot metadata does not include Node 2, as the snapshot comes before the log entry adding Node 2 to the ConfState.
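
Reading the printed ConfState {[1] [] [] [] false []} against what I believe is raftpb.ConfState's field order (Voters, Learners, VotersOutgoing, LearnersNext, AutoLeave), the voter set contains only node 1. Here is a rough sketch of the membership check as I understand it -- an illustration, not the library's actual code:

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3/raftpb"
)

// inConfState reports whether a node ID appears in a snapshot's ConfState,
// either as a voter or as a learner. raft refuses to restore a snapshot on a
// node whose ID is absent, which produces the "should never happen" error
// above: the snapshot predates the conf change entry that adds Node 2.
func inConfState(id uint64, cs raftpb.ConfState) bool {
	for _, set := range [][]uint64{cs.Voters, cs.Learners} {
		for _, member := range set {
			if member == id {
				return true
			}
		}
	}
	return false
}

func main() {
	// The ConfState from the error message: only node 1 is a voter.
	cs := raftpb.ConfState{Voters: []uint64{1}}
	fmt.Println(inConfState(1, cs)) // true  -- node 1 could restore this snapshot
	fmt.Println(inConfState(2, cs)) // false -- node 2 rejects it
}
```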

I understand that this is not a bug in itself -- the error lies somewhere in Swarmkit's use of the raft library. There's something we must do before or in the course of restoring the received snapshot. I just don't know what it is.

My first instinct is to rely on the raft example to determine what I am doing wrong. However, with the example broken in an identical way, I'm left trying to plow through walls of the main etcd server code to locate the analogous sections and figure out how etcd handles this problem of snapshots during cluster joins.

Fixing this issue with contrib/raftexample allows me (and others, potentially) to understand what the issue is in our usage of etcd/raft, how to correctly join a new member with snapshot, and how to ultimately fix this issue in the Swarmkit code base.

This issue was previously raised in #12473, which was closed for being stale, and a fix was attempted in #13578, which was more-or-less deemed incorrect. I apologize if opening a new issue is incorrect.

Etcd version (please run commands below)

N/A, but the exact version of the codebase being used is v3.5.2

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

@thaJeztah

@spzala ptal 🤗


ahrtr commented Mar 7, 2022

Thanks for raising this issue. I just fixed it in pull/13760. Please see the explanation in the PR. Please also kindly let me know whether it works for you.


dperny commented Mar 7, 2022

Thank you very much! Late last week I'd finally gotten far enough into the etcd code to determine this was likely the solution, but I was not yet certain. This clears it up.

@thaJeztah

Thanks @ahrtr !


spzala commented Mar 7, 2022

Thanks @ahrtr, I will be reviewing the changes today.
