Client RPCs are sticky to server #3738
Conversation
Force-pushed from 0f7a941 to f27a7da
client/client.go (Outdated)

// We can wait a bit and retry!
if time.Since(firstCheck) < c.config.RPCHoldTimeout {
	jitter := lib.RandomStagger(5 * time.Second / nomad.JitterFraction)
Why do we use JitterFraction instead of just saying 300ms (~5s / 16)? I don't understand the point of JitterFraction.
We may want to jitter on NumNodes, as a server outage could cause a stampede to the other servers. The more clients, the more we need to spread out the herd.
It's meant to be a percentage (16 corresponds to ~6.25%) that we want to apply as a wait time before retrying, with a random stagger added. So it will wait ~6.25% of the given overall hold timeout before retrying.
Also, I think this line should be lib.RandomStagger(c.config.RPCHoldTimeout / jitterFraction).
Not sure I follow the argument for using the number of clients as the fraction. That makes the percentage non-deterministic; it could be anywhere from 100% to a very small number.
@schmichael To add a bit more color to Preetha's response: the jitter allows you to get some retrying within an overall allowed window. So if you have 5 seconds for something to occur, this jitter lets you try a few times within that window.
@preetapan Yep good catch. Both those were hardcoded to 5s before I plumbed the config and missed one. Not sure what you mean about the # of clients btw.
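To make the retry math concrete, here is a small self-contained sketch of the pattern being discussed. The names (rpcWithRetry, randomStagger) are invented for illustration; randomStagger just mirrors what lib.RandomStagger does (a random duration in [0, intv)), and the hold timeout bounds the overall retry window while the jitter spreads retries out.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// jitterFraction is the constant discussed above: each retry waits a random
// slice of up to holdTimeout/16 (~6.25%, roughly 300ms for a 5s hold timeout).
const jitterFraction = 16

// randomStagger returns a random duration in [0, intv); it stands in for
// lib.RandomStagger in this sketch.
func randomStagger(intv time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(intv)))
}

// rpcWithRetry is a hypothetical caller showing how the hold timeout bounds
// the overall retry window while the jitter keeps a fleet of clients from
// retrying in lockstep.
func rpcWithRetry(holdTimeout time.Duration, try func() error) error {
	firstCheck := time.Now()
	for {
		err := try()
		if err == nil {
			return nil
		}
		// We can wait a bit and retry, as long as we are still inside the
		// overall hold timeout window.
		if time.Since(firstCheck) < holdTimeout {
			time.Sleep(randomStagger(holdTimeout / jitterFraction))
			continue
		}
		return err
	}
}

func main() {
	err := rpcWithRetry(5*time.Second, func() error { return errors.New("no known servers") })
	fmt.Println("gave up:", err)
}
```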
client/client.go (Outdated)

if err != nil {
	merr.Errors = append(merr.Errors, fmt.Errorf("Server at address %s failed ping: %v", addr, err))
	continue
} else if !ok {
This will never be hit, as ok is only false if err is non-nil.
I'd prefer only returning an error from Ping, since a nil error implies ok=true and a non-nil error implies ok=false.
Done
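A minimal sketch of the error-only shape being suggested. The Pinger interface and firstHealthy helper are invented for illustration, and the real ConnPool.Ping signature may take more arguments; the point is just that a nil error is the single signal of success.

```go
package client

import (
	"errors"
	"fmt"
	"net"
)

// Pinger is a hypothetical stand-in for the connection-pool dependency. Ping
// returns only an error, so there is no separate ok bool to fall out of sync
// with it.
type Pinger interface {
	Ping(addr net.Addr) error
}

// firstHealthy pings each candidate address and returns the first one that
// answers, accumulating the failures for the caller.
func firstHealthy(p Pinger, addrs []net.Addr) (net.Addr, error) {
	var errs []error
	for _, addr := range addrs {
		if err := p.Ping(addr); err != nil {
			errs = append(errs, fmt.Errorf("Server at address %s failed ping: %v", addr, err))
			continue
		}
		return addr, nil
	}
	return nil, errors.Join(errs...)
}
```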
client/client.go (Outdated)

}
if len(servers) == 0 {
if len(nomadServers) == 0 {
	return fmt.Errorf("server returned no valid servers")
I never realized it before, but this is a rrrreally weird error.
Improved it
That manager.go code is my nemesis. I think I finally understand the point of the locks+atomics, but I hope we never have to touch it again as it's extremely fragile in its current state.
client/servers/manager.go (Outdated)

//
// NOTE(sean@): We are explicitly relying on the fact that serverList will
// be copied onto the stack. Please keep this structure light.
type serverList struct {
The comments are confusing and possibly incorrect. I don't believe it matters whether a serverList or *serverList is stored in atomic.Value. Even though every caller to getServerList will receive a unique copy of serverList and the slice header, the slice's backing array containing the Server pointers will be shared among all callers until cycleServer is called (which creates a new slice/backing array) or SetServers is called.
If cycleServer modified the backing array in place (which would be an equally valid-looking implementation and would avoid a heap allocation), it would create race conditions!
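To make the sharing concrete, here is a simplified sketch of the pattern described above (the type and field names are assumptions for the sketch, not the PR's actual code): every reader gets its own copy of the slice header, but all copies point at the same backing array until a writer builds a brand-new slice.

```go
package servers

import (
	"net"
	"sync"
	"sync/atomic"
)

// Server and serverList are simplified stand-ins for the real types.
type Server struct {
	Addr net.Addr
}

type serverList struct {
	servers []*Server
}

type Manager struct {
	listValue atomic.Value // always holds a serverList value
	listLock  sync.Mutex   // serializes writers
}

func New() *Manager {
	m := &Manager{}
	m.listValue.Store(serverList{})
	return m
}

// getServerList returns a copy of the serverList value. The copy has its own
// slice header, but it still points at the same backing array as every other
// reader's copy.
func (m *Manager) getServerList() serverList {
	return m.listValue.Load().(serverList)
}

// cycleServers moves the first server to the end. Crucially it builds a new
// slice with a new backing array; rotating the old backing array in place
// would race with readers holding copies of the old slice header.
func (m *Manager) cycleServers() {
	m.listLock.Lock()
	defer m.listLock.Unlock()

	l := m.getServerList()
	if len(l.servers) < 2 {
		return
	}
	newServers := make([]*Server, 0, len(l.servers))
	newServers = append(newServers, l.servers[1:]...)
	newServers = append(newServers, l.servers[0])
	m.listValue.Store(serverList{servers: newServers})
}
```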
client/servers/manager.go (Outdated)

type Manager struct {
	// listValue manages the atomic load/store of a Manager's serverList
	listValue atomic.Value
	listLock  sync.Mutex
It is very difficult to figure out the proper way to use these. There are no tests that exercise concurrent access to this struct, so I couldn't use -race to detect races. However, I think it all checks out? (Note I ignored AddServer and RemoveServer in my analysis/review because they're not used.)
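For what it's worth, a test along these lines (written here against the simplified Manager from the sketch above, not the real one) is the kind of thing that would give go test -race something to chew on:

```go
package servers

import (
	"sync"
	"testing"
)

// TestManager_ConcurrentAccess hammers a reader path and a writer path from
// several goroutines so the race detector can observe the interleavings.
// It asserts nothing beyond "no race, no panic".
func TestManager_ConcurrentAccess(t *testing.T) {
	m := New()
	m.listValue.Store(serverList{servers: []*Server{{}, {}, {}}})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				m.cycleServers()
			}
		}()
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = m.getServerList()
			}
		}()
	}
	wg.Wait()
}
```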
client/servers/manager.go (Outdated)

m.connPoolPinger = connPoolPinger // can't pass *Nomad.ConnPool: import cycle
m.rebalanceTimer = time.NewTimer(clientRPCMinReuseDuration)
m.shutdownCh = shutdownCh
atomic.StoreInt32(&m.offline, 1)
No need to use an atomic store in a New func since nothing can be accessing this concurrently (but as mentioned later I think this field can be removed outright).
Done
client/servers/manager.go (Outdated)

	m.refreshServerRebalanceTimer()

case <-m.shutdownCh:
	m.logger.Printf("[INFO] manager: shutting down")
DEBUG maybe? Seems like this won't mean anything to end users as "managers" is awfully generic.
Done
client/servers/manager.go (Outdated)

func (m *Manager) SetServers(servers []*Server) {
	// Hot Path
	if len(servers) == 0 {
The above comment is incorrect and this code is unreachable in the current implementation, as we always check for len(servers) == 0 and return an error before calling SetServers.
Enforcing an invariant that the server list can never be cleared seems ok, but we should probably mention that in a comment on the method itself, in case someone changes code at the call sites without realizing this internal behavior.
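Something along these lines on the simplified Manager sketched earlier is roughly what's being asked for (the comment wording is mine, not the PR's):

```go
// SetServers replaces the manager's server list with the given servers.
//
// NOTE: an empty list is treated as a no-op rather than clearing the current
// servers. Callers are expected to have validated len(servers) > 0; passing
// nothing will not wipe the list here.
func (m *Manager) SetServers(servers []*Server) {
	if len(servers) == 0 {
		return
	}
	m.listLock.Lock()
	defer m.listLock.Unlock()
	m.listValue.Store(serverList{servers: servers})
}
```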
client/servers/manager.go (Outdated)

}

// saveServerList is a convenience method which hides the locking semantics
// of atomic.Value from the caller.
"hides the locking semantics of atomic.Value" is a weird phrase, but whatever. getServerList
hides the type assertion nicely, and saveServerList
makes a good companion api.
client/servers/manager.go (Outdated)

}

// NumServers takes out an internal "read lock" and returns the number of
// servers. numServers includes both healthy and unhealthy servers.
NumServers returns the total number of known servers whether healthy or not.
Done
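For example, applied to the simplified Manager sketched earlier (field names assumed), the suggested wording reads:

```go
// NumServers returns the total number of known servers, healthy or not.
func (m *Manager) NumServers() int {
	return len(m.getServerList().servers)
}
```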
client/servers/manager.go (Outdated)

}

// GetServers returns a copy of the current list of servers.
func (m *Manager) GetServers() []*Server {
Remove or move to tests where it's used? (See NotifyFailedServer comment)
Used by client
client/servers/manager.go (Outdated)

func (s *Server) String() string {
	if s.addr == "" {
		s.addr = s.Addr.String()
Racy. Either remove memoization or add a lock.
Done
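A standalone sketch of the lock-based option, using the Server fields shown further down in this review (Addr, addr, and an embedded sync.Mutex); the simpler alternative is to drop the memoization and return s.Addr.String() directly.

```go
package servers

import (
	"net"
	"sync"
)

// Server mirrors the fields shown later in this review: the resolved
// address, a memoized string form, and an embedded mutex.
type Server struct {
	sync.Mutex
	Addr net.Addr
	addr string
}

// String memoizes Addr.String(); holding the mutex makes the lazy write safe
// when several goroutines format the same *Server concurrently.
func (s *Server) String() string {
	s.Lock()
	defer s.Unlock()
	if s.addr == "" {
		s.addr = s.Addr.String()
	}
	return s.addr
}
```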
client/servers/manager.go (Outdated)

// the receiver's serverList. Returns false if the first server does not exist
// in the list. Newly added servers are appended to the list and other missing
// servers are removed from the list.
func (m *Manager) reconcileServerList(l *serverList) bool {
I do not understand this method at all. The server list we pass in is a potentially shuffled version of the one we're reconciling against here (except for a small window where SetServers could have overwritten one). I think maybe it was useful in Consul, which also reconciles servers with Serf? That would make sense.
I'm fairly certain this entire method can be removed and replaced by a call directly to m.saveServerList(l) above.
Done
Force-pushed from a231d3e to c55e030
A little concerned about that mutex copy but otherwise looks 💯
	// Addr is the resolved address of the server
	Addr net.Addr
	addr string
	sync.Mutex
What if we just used a NewServer func and always initialized the addr? I'm worried about what the Server copy in Manager.GetServers does to the Mutex. Alternatively, GetServers could just manually copy Addr & DC into a new struct instead of copying the whole struct.
@schmichael The two mutexes were put in place originally because Serf was handling the add/remove notifications and populating/replacing the server list independently of the server list's use via the RPC pool. Basically: producer == serf/consul discovery, consumer == rpc pool. This may be out of date, but that was the history.
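Building on the simplified Manager sketched earlier (with a DC field assumed on Server, per the comment above), the manual-copy alternative would look roughly like this; copying only the exported fields means the embedded mutex is never copied.

```go
// GetServers returns fresh Server values that carry only the exported
// fields, so the embedded sync.Mutex is never copied along with them.
func (m *Manager) GetServers() []*Server {
	l := m.getServerList()
	out := make([]*Server, 0, len(l.servers))
	for _, s := range l.servers {
		out = append(out, &Server{Addr: s.Addr, DC: s.DC})
	}
	return out
}
```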
client/servers/manager.go (Outdated)

if !foundHealthyServer {
	m.logger.Printf("[DEBUG] manager: No healthy servers during rebalance, aborting")
	return
Don't need this return, and the word aborting in the log line is a bit misleading. We're not actually aborting anything here; we just couldn't find a healthy server while rebalancing.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.