Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client RPCs are sticky to server #3738

Merged
merged 11 commits into from
Jan 26, 2018
Merged

Client RPCs are sticky to server #3738

merged 11 commits into from
Jan 26, 2018

Conversation

dadgar
Copy link
Contributor

@dadgar dadgar commented Jan 10, 2018

No description provided.

client/client.go Outdated

// We can wait a bit and retry!
if time.Since(firstCheck) < c.config.RPCHoldTimeout {
jitter := lib.RandomStagger(5 * time.Second / nomad.JitterFraction)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use JitterFraction instead of just saying 300ms (~5s / 16)? I don't understand the point of JitterFraction.

We may want to jitter on NumNodes as a server outage could cause a stampede to the other servers. The more clients, the more we need to spread out the herd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its meant to be a percentage (16 corresponds to 6.25%) that we want to apply as a wait time before retrying, with a random stagger added. So it will wait ~6.25% of the given overall holdout time before retrying.

also I think. this line should be lib.RandomStagger(c.config.RPCHoldTimeout/jitterFraction)

Not sure I follow the argument of using the number of clients as the fraction. That makes the percentage non deterministic, could be anywhere from 100% to a very small number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schmichael to add a bit more color to Preetha's response. The jitter allows you to get some retrying within an overall allowed window. So if you have 5 seconds for something to occur this jitter lets you try a few times within that window.

@preetapan Yep good catch. Both those were hardcoded to 5s before I plumbed the config and missed one. Not sure what you mean about the # of clients btw.

client/client.go Outdated
if err != nil {
merr.Errors = append(merr.Errors, fmt.Errorf("Server at address %s failed ping: %v", addr, err))
continue
} else if !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will never be hit as ok is only false if err is non-nil.

I'd prefer only returning an error from Ping since a nil error implies ok=true and a non-file error ok=false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

client/client.go Outdated
}
if len(servers) == 0 {
if len(nomadServers) == 0 {
return fmt.Errorf("server returned no valid servers")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never realized it before but this is a rrrreally weird error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved it

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That manager.go code is my nemesis. I think I finally understand the point of the locks+atomics, but I hope we never have to touch it again as it's extremely fragile in its current state.

//
// NOTE(sean@): We are explicitly relying on the fact that serverList will
// be copied onto the stack. Please keep this structure light.
type serverList struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments are confusing and possibly incorrect. I don't believe it matters whether a serverList or *serverList is stored in atomic.Value. Even though every caller to getServerList will receive a unique copy of serverList and the slice header, the slice's backing array containing the Server pointers will be shared among all callers until a cycleServer is called which creates a new slice/backing-array or SetServers is called.

If cycleServer modified the backing area in place (which would be an equally valid looking implementation and avoid a heap allocation), it would create race conditions!

type Manager struct {
// listValue manages the atomic load/store of a Manager's serverList
listValue atomic.Value
listLock sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very difficult to figure out the proper way to use these. There are no tests that exercise concurrent access to this struct, so I couldn't use -race to detect races. However, I think it all checks out? (Note I ignored AddServer and RemoverServer in my analysis/review because they're not used.)

m.connPoolPinger = connPoolPinger // can't pass *Nomad.ConnPool: import cycle
m.rebalanceTimer = time.NewTimer(clientRPCMinReuseDuration)
m.shutdownCh = shutdownCh
atomic.StoreInt32(&m.offline, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use an atomic store in a New func since nothing can be accessing this concurrently (but as mentioned later I think this field can be removed outright).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

m.refreshServerRebalanceTimer()

case <-m.shutdownCh:
m.logger.Printf("[INFO] manager: shutting down")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEBUG maybe? Seems like this won't mean anything to end users as "managers" is awfully generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


func (m *Manager) SetServers(servers []*Server) {
// Hot Path
if len(servers) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment is incorrect and this code is unreachable in the current implementation as we always check for len(servers) == 0 and return an error before calling SetServers.

Enforcing an invariant that the server list can never be cleared seems ok, but we should probably mention that in a comment on the method itself in case someone changes code at the callsites without realizing this internal behavior.

}

// saveServerList is a convenience method which hides the locking semantics
// of atomic.Value from the caller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"hides the locking semantics of atomic.Value" is a weird phrase, but whatever. getServerList hides the type assertion nicely, and saveServerList makes a good companion api.

}

// NumServers takes out an internal "read lock" and returns the number of
// servers. numServers includes both healthy and unhealthy servers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumServers returns the total number of known servers whether healthy or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// GetServers returns a copy of the current list of servers.
func (m *Manager) GetServers() []*Server {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove or move to tests where it's used? (See NotifyFailedServer comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used by client


func (s *Server) String() string {
if s.addr == "" {
s.addr = s.Addr.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Racy. Either remove memoization or add a lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// the receiver's serverList. Returns false if the first server does not exist
// in the list. Newly added servers are appended to the list and other missing
// servers are removed from the list.
func (m *Manager) reconcileServerList(l *serverList) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this method at all. The server list we pass in is a potentially shuffled version of the one we're reconciling against here (except for a small window where SetServers could have overwritten one). I think maybe it was useful in Consul which also reconciles servers with Serf? That would make sense.

I'm fairly certain this entire method can be removed and replaced by a call directly to m.saveServerList(l) above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little concerned about that mutex copy but otherwise looks 💯

// Addr is the resolved address of the server
Addr net.Addr
addr string
sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just used a NewServer func and always initialized the addr? I'm worried about what the Server copy in Manager.GetServers does to the Mutex. Alternatively GetServers could just manually copy Addr & DC into a new struct instead of copying the whole struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schmichael The two mutexes were put in place originally because Serf was handling the add/remove notifications and populating/replacing the server list independently of the server lists use via the RPC pool. Basically: producer == serf/consul discovery, consumer == rpc pool. This may be out of date, but that was the history.


if !foundHealthyServer {
m.logger.Printf("[DEBUG] manager: No healthy servers during rebalance, aborting")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this return and the word aborting in the log line is a bit misleading. We're not actually aborting anything here, we just couldn't find a healthy server while rebalancing.

@dadgar dadgar merged commit 47d8fc5 into f-tunnel Jan 26, 2018
@dadgar dadgar deleted the f-sticky-clients branch January 26, 2018 21:30
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants