-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client/metadata: fix crasher caused by AllowStale = false #16549
Conversation
Fixes #16517 Given a 3 Server cluster with at least 1 Client connected to Follower 1: If a NodeMeta.{Apply,Read} for the Client request is received by Follower 1 with `AllowStale = false` the Follower will forward the request to the Leader. The Leader, not being connected to the target Client, will forward the RPC to Follower 1. Follower 1, seeing AllowStale=false, will forward the request to the Leader. The Leader, not being connected to... well hoppefully you get the picture: an infinite loop occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor question, but I don't think it's actually a problem 👍
// | ||
// BREAKING: This method will have the following signature in 1.6.0 | ||
// func (a *Allocations) Restart(allocID string, taskName string, allTasks bool, w *WriteOptions) (*WriteMeta, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mystery solved I guess 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't believe how many hints we were given that this would happen. I know I even hit it in the e2e cluster at least once but couldn't reproduce it and just assumed it was some terrible thing I had done to my e2e cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chesterton's Fence!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had never heard of that before and love it!
Fixes #16517
Given a 3 Server cluster with at least 1 Client connected to Follower 1:
If a NodeMeta.{Apply,Read} for the Client request is received by Follower 1 with
AllowStale = false
the Follower will forward the request to the Leader.The Leader, not being connected to the target Client, will forward the RPC to Follower 1.
Follower 1, seeing AllowStale=false, will forward the request to the Leader.
The Leader, not being connected to... well hoppefully you get the picture: an infinite loop occurs.
Manually testing looks good, but
automated testing is still WIP.Update: I'm pretty pleased with how the manual test turned out! It's pretty big but reliably exercises the bug without the patch. With the fix the test runs in <400ms locally which isn't bad for running 7 agents and waiting for each one to become ready!
If you're feeling brave (or like watching the world burn), you can exercise the bug by commenting out one of the
AllowStale = true
lines and running the test withgo test -v -run NodeMeta
in thenomad/
directory.I highly recommend doing that in a container or VM so you don't mess up your machine. I used: