Remove unnecessary locking and serverlist syncing in heartbeats #5654
Conversation
This removes an unnecessary shared lock between discovery and heartbeating which was causing heartbeats to be missed upon retries when a single server fails. Also made a drive-by fix to call the periodic server shuffler goroutine.

I took some of the approach in #4688, but have kept the triggerDiscovery logic upon failure, with the change that the server list is not retouched if the discovered servers are the same as the existing servers.

Fixes #4666

Prior to this change, running a three-server cluster where one of the servers is arbitrarily paused caused many unnecessary reschedule events due to the clients being considered lost. With this change, I see a dramatic difference, with the caveat that if a server is paused when a client is about to send its heartbeat and there's not enough grace period left for it to retry another server, it will still be considered a missed heartbeat.
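The core idea, as a minimal sketch with made-up names (`server`, `keepExistingList`) rather than the real client/servers types: only swap in the discovered list when it actually differs from the list already in use.

```go
// Minimal sketch of the check described above; names are illustrative,
// not the real Nomad client types.
package client

import "log"

type server struct{ addr string }

// keepExistingList reports whether every discovered server is already in the
// current list (and the counts match), in which case the client leaves its
// server list untouched instead of resetting it on every discovery pass.
func keepExistingList(current, discovered []server) bool {
	if len(current) != len(discovered) {
		return false
	}
	known := make(map[string]bool, len(current))
	for _, s := range current {
		known[s.addr] = true
	}
	allFound := true
	for _, s := range discovered {
		if !known[s.addr] {
			allFound = false
			break
		}
	}
	if allFound {
		log.Println("not replacing server list, discovered servers are identical to the current ones")
	}
	return allFound
}
```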
client/client.go (Outdated)
@@ -2432,6 +2427,26 @@ DISCOLOOP:

	consulLogger.Info("discovered following servers", "servers", nomadServers)

	// Check if the list of servers discovered is identical to the list we already have
Hm, I think this code would be better placed in servers.SetServers. That way it could be done inside the server manager's lock and prevent concurrent RPCs from rotating the list and making this comparison stale.

Not a big deal as there's no data race here, just the possibility of a comparison against stale data which shouldn't do much harm.
(It would be easily unit testable in server manager as well)
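A rough sketch of what that suggestion could look like, assuming a simplified Manager with its own mutex rather than the real server manager API: the equality check runs inside SetServers, under the lock.

```go
// Rough sketch of the reviewer's suggestion with simplified, assumed types:
// compare inside SetServers, under the manager's lock, so a concurrent RPC
// can't rotate the list while the comparison runs.
package servers

import "sync"

type Server struct{ Addr string }

type Manager struct {
	mu      sync.Mutex
	servers []Server
}

// SetServers replaces the tracked servers only when the new list differs,
// and reports whether the list actually changed.
func (m *Manager) SetServers(newServers []Server) bool {
	m.mu.Lock()
	defer m.mu.Unlock()

	if equalIgnoringOrder(m.servers, newServers) {
		return false // keep the current (possibly rotated) list untouched
	}
	m.servers = newServers
	return true
}

// equalIgnoringOrder treats both lists as multisets of addresses.
func equalIgnoringOrder(a, b []Server) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, s := range a {
		counts[s.Addr]++
	}
	for _, s := range b {
		if counts[s.Addr] == 0 {
			return false
		}
		counts[s.Addr]--
	}
	return true
}
```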
The method for comparing whether the two server lists are equal in servers.SetServers does a sort:

nomad/client/servers/manager.go, line 206 in 9da25ad:
	m.servers.Sort()

This approach avoids that sort step/modification. I can move over the code though and change the logic in SetServers to avoid the sort of the existing servers, which modifies the observable server list.
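A sketch of the non-mutating comparison being described here, with assumed types (`Servers`, `equalUnordered`): both inputs are copied before sorting, so the manager's observable server order never changes just to answer an equality question.

```go
// Illustrative sketch, not the real servers package: compare two lists by
// sorting copies so neither caller-visible slice is reordered.
package servers

import "sort"

type Server struct{ Addr string }

type Servers []Server

// copySorted returns a sorted duplicate, leaving the receiver untouched.
func (s Servers) copySorted() Servers {
	dup := make(Servers, len(s))
	copy(dup, s)
	sort.Slice(dup, func(i, j int) bool { return dup[i].Addr < dup[j].Addr })
	return dup
}

// equalUnordered reports whether both lists contain the same addresses,
// regardless of order, without mutating either input.
func equalUnordered(a, b Servers) bool {
	if len(a) != len(b) {
		return false
	}
	as, bs := a.copySorted(), b.copySorted()
	for i := range as {
		if as[i].Addr != bs[i].Addr {
			return false
		}
	}
	return true
}
```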
client/client.go (Outdated)
		}
	}
	if allFound && len(nomadServers) == len(serverList) {
		c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
Debug log level at most. This should only matter when debugging heartbeating which is hopefully very rare!
Suggested change:
-	c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
+	c.logger.Debug("Not replacing server list, current server list is identical to servers discovered in Consul")
dev/cluster/server1.hcl (Outdated)
@@ -16,5 +16,5 @@ server {
}

# Self-elect, should be 3 or 5 for production
- bootstrap_expect = 3
+ bootstrap_expect = 1
This is the dev cluster config, so bootstrap_expect should match server2's and server3's.

You can override the value on the command line to use this config for a single-server cluster, or we could add a new config for single-server clusters.
	// Determine if they are equal
-	equal := servers.Equal(m.servers)
+	equal := m.serversAreEqual(servers)
@schmichael this is ready for another look
@@ -315,6 +315,9 @@ func NewClient(cfg *config.Config, consulCatalog consul.CatalogAPI, consulServic
	// Initialize the server manager
	c.servers = servers.New(c.logger, c.shutdownCh, c)

	// Start server manager rebalancing go routine
	go c.servers.Start()
Curious how this call was dropped? Was it a regression?
I don't think it was ever there, so for long-running clients only failed RPCs or discovering new servers would cause reshuffling.
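For context, a hypothetical version of what a periodic rebalancing loop can look like (the real `Manager.Start` and its interval differ): shuffle the tracked servers on a timer until the shutdown channel closes.

```go
// Hypothetical sketch of a periodic rebalancing goroutine; the interval and
// types are illustrative, not Nomad's actual server manager.
package servers

import (
	"math/rand"
	"sync"
	"time"
)

type Manager struct {
	mu         sync.Mutex
	servers    []string
	shutdownCh chan struct{}
}

// Start runs until shutdownCh is closed, reshuffling the server list on a
// fixed interval so long-running clients spread load across servers even
// without failed RPCs or new discoveries.
func (m *Manager) Start() {
	ticker := time.NewTicker(5 * time.Minute) // illustrative interval
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			m.shuffle()
		case <-m.shutdownCh:
			return
		}
	}
}

// shuffle randomizes the server order under the manager's lock.
func (m *Manager) shuffle() {
	m.mu.Lock()
	defer m.mu.Unlock()
	rand.Shuffle(len(m.servers), func(i, j int) {
		m.servers[i], m.servers[j] = m.servers[j], m.servers[i]
	})
}
```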
	copy.Sort()
	servers.Sort()

	return copy.Equal(servers)
Isn't it better to compare server lists omitting their DC, only by IP addresses? Servers coming from the consulDiscovery() method would be without any DC, whereas the current servers would have it.

See https://github.com/hashicorp/nomad/blob/master/client/client.go#L2436
Great catch! As far as I can tell the code to prefer the local DC's server has been broken since v0.8.0. I'm going to remove the DC field and add an issue+TODO.
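A tiny sketch of the comparison being settled on here, with assumed field and type names: equality decided by address only, so an empty DC on the Consul-discovered side doesn't make otherwise identical servers look different.

```go
// Illustrative only; field and type names are assumptions, not the real
// client/servers types. Equality ignores the DC field entirely.
package servers

type Server struct {
	Addr string
	DC   string // may be empty for Consul-discovered servers
}

// equalByAddr treats both lists as multisets of addresses and ignores DC.
func equalByAddr(a, b []Server) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[string]int, len(a))
	for _, s := range a {
		counts[s.Addr]++
	}
	for _, s := range b {
		if counts[s.Addr] == 0 {
			return false
		}
		counts[s.Addr]--
	}
	return true
}
```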
This ensures that server-to-server streaming RPC calls use the tls-wrapped connections. Prior to this, the `streamingRpcImpl` function used tls for setting the header and invoking the rpc method, but returned the unwrapped tls connection. Thus, streaming writes failed with tls errors.

This tls streaming bug has existed since 0.8.0 [1], but PR #5654 [2] exacerbated it in 0.9.2. Prior to PR #5654, the nomad client used to shuffle servers at every heartbeat -- `servers.Manager.setServers` [3] always shuffled servers and was called by the heartbeat code [4]. Shuffling servers meant that a nomad client would eventually heartbeat and establish a connection against all nomad servers. When handling streaming RPC calls, nomad servers used these local connections to communicate directly with the client, so the server-to-server forwarding logic was left mostly unexercised.

PR #5654 means that a nomad client may connect to a single server only, which caused the server-to-server streaming RPC forwarding code to get exercised more and unearthed the problem.

[1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515
[2] #5654
[3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216
[4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603
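A generic sketch of the class of bug that commit message describes, not Nomad's actual rpc.go: negotiation can happen over TLS, but unless the wrapped connection is the one handed back to the caller, later streaming writes go over the raw connection and fail.

```go
// Generic, assumed example of returning the TLS-wrapped connection from a
// streaming setup path; TLSWrapper and streamingConn are illustrative names.
package rpc

import "net"

// TLSWrapper upgrades a plain connection to TLS; in Nomad this role is
// played by the configured tls wrap function.
type TLSWrapper func(net.Conn) (net.Conn, error)

// streamingConn returns the connection the caller should use for streaming
// writes. Returning conn instead of wrapped here is the bug pattern
// described above: the header/method exchange happens over TLS, but
// subsequent writes would go over the unwrapped connection.
func streamingConn(conn net.Conn, wrap TLSWrapper) (net.Conn, error) {
	if wrap == nil {
		return conn, nil
	}
	wrapped, err := wrap(conn)
	if err != nil {
		conn.Close()
		return nil, err
	}
	return wrapped, nil
}
```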
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.