-
Notifications
You must be signed in to change notification settings - Fork 616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcd 3.5 and upgrade grpc/protobuf #3051
Conversation
Files in GOPATH have read-only persmissions, after simply copying them can result in an access error. Signed-off-by: Ilya Dmitrichenko <[email protected]> (cherry picked from commit 1646833)
Signed-off-by: Ilya Dmitrichenko <[email protected]> (cherry picked from commit 4ee4f47)
Signed-off-by: Ilya Dmitrichenko <[email protected]> (cherry picked from commit 25e88b5)
Signed-off-by: Ilya Dmitrichenko <[email protected]> (cherry picked from commit 4b579ac)
Signed-off-by: CrazyMax <[email protected]>
Signed-off-by: CrazyMax <[email protected]>
I've attempted etcd update earlier, I do recall it worked before I went on introducing However, at the time I was aiming to make etcd 3.4 work with a newer version of gRPC, that was to avoid making major changes in SwarmKit code. |
dcbee72
to
01ba48a
Compare
@dperny Only 3 tests failing now if you have an idea: https://app.circleci.com/pipelines/github/docker/swarmkit/385/workflows/91ffe05e-57f8-4435-969b-f4327a02cbaf/jobs/10750?invite=true#step-108-156
|
I started tracking this test error down. Specifically, the problem occurs in this line: By modifying this to output the actual values, I see that specifically the problem is this:
so, Eliding some of the intermediate steps, the value of Digging through the changes, this code get shuffled around and changed a lot between these updates, but most notably, this is the first time the actual behavior of the code changes independent of refactoring: See This is introduced in etcd-io/etcd#10063. In the comments, the PR author says this:
This leads me to believe, even with almost no understanding of the raft library, that something about this change may be the root of the issue. I will have to spend more time digging into understanding the Raft library to truly understand if this is the case and if so what the fix is, though. |
Hm... I'm definitely not an expert in this area myself. I did a bit of searching through git history for this code. Looks like this check was originally added by @LK4D4 in 92a3aa6 (part of #180)
Some tweaks/changes in this area were made after that by @aaronlehmann in 2b62941 amd 547188b (#255) to address #182 Those commits;
(sorry for the ping, @LK4D4 @aaronlehmann - just in case you recall more about "how this works", and/or have ideas on the things that @dperny mentioned) |
Ticket opened in etcd; etcd-io/etcd#13741 |
@crazy-max The fix, it turns out, is like 2 lines. Here is the link to the comparison. The commit message has the explanation. |
Ok thx! Was thinking dbd9df1 would be ok. |
Your commit works as well, either way is fine. |
Codecov Report
@@ Coverage Diff @@
## master #3051 +/- ##
==========================================
+ Coverage 62.18% 62.29% +0.10%
==========================================
Files 155 155
Lines 24533 24539 +6
==========================================
+ Hits 15257 15286 +29
+ Misses 7678 7643 -35
- Partials 1598 1610 +12 |
Signed-off-by: CrazyMax <[email protected]>
Signed-off-by: CrazyMax <[email protected]>
Signed-off-by: CrazyMax <[email protected]>
92449d7
to
12b19ee
Compare
PTAL @dperny @thaJeztah |
After 3.3.x, etcd made a small change to the raft library that broke Swarmkit. It also, as it turns out, broke their raft example. The core issue is that a snapshot has an embedded ConfState from when the snapshot is created. This ConfState, as it turns out, is not supposed to be the one from when the snapshot was made. It should be the one from when the snapshot is sent, the current ConfState. When adding a new node to the quorum, the node must be caught up using a snapshot. Previously, we were sending the snapshot exactly as it was taken. However, because the snapshot predates the node's membership in the cluster, the ConfState does not have the new node in it. The change to the raft library was the raft library began checking the snapshot ConfState, and rejecting snapshots where the node was missing from the ConfState. The fix is just, as mentioned above, to overwrite the ConfState from the snapshot with the current ConfState before sending. Signed-off-by: Drew Erny <[email protected]> (cherry picked from commit b7c49a6)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
follow-up #3039
closes #2911
closes #3039
closes #3043
update etcd to 3.5 to make grpc/protobuf compatible.