Retry all servers on RPC call failure #1735
Conversation
Force-pushed from c958d3c to ce76aef
rpcproxy is refactored into serverlist, which prioritizes good servers over servers in a remote DC or that have had a failure. Registration, heartbeating, and alloc status updating will retry faster when new servers are discovered. Consul discovery will be retried more quickly when no servers are available (e.g. on startup or during an outage).
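For illustration only, a hypothetical sketch of the ordering described above; the server type and field names here are invented, not the PR's actual serverlist structs:

package main

import (
	"fmt"
	"sort"
)

// server is an invented stand-in for a serverlist entry.
type server struct {
	addr     string
	remoteDC bool // in a different datacenter than the client
	failures int  // RPC failures observed against this server
}

// sortServers sketches the priority: fewer failures first, and among
// equals, local-DC servers before remote-DC ones.
func sortServers(list []*server) {
	sort.SliceStable(list, func(i, j int) bool {
		a, b := list[i], list[j]
		if a.failures != b.failures {
			return a.failures < b.failures
		}
		return !a.remoteDC && b.remoteDC
	})
}

func main() {
	list := []*server{
		{addr: "10.0.0.3:4647", remoteDC: true},
		{addr: "10.0.0.2:4647", failures: 2},
		{addr: "10.0.0.1:4647"},
	}
	sortServers(list)
	for _, s := range list {
		fmt.Println(s.addr) // 10.0.0.1, then 10.0.0.3, then 10.0.0.2
	}
}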
Force-pushed from ce76aef to 64ac9b9
Remove use of periodic consul handlers in the client and just use goroutines. Consul discovery is now triggered with a chan instead of a timer and deadline. Once discovery completes, a chan is ticked so all goroutines waiting for servers will run. Should speed up bootstrapping and recovery while decreasing spinning on timers.
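One plausible shape for the chan-based trigger (the non-blocking send is an assumption; the field name follows the review comments below):

// triggerDiscovery asks the consulDiscovery goroutine to run a pass.
// The send is non-blocking so callers never stall: if a discovery is
// already pending, the extra trigger is simply dropped.
func (c *Client) triggerDiscovery() {
	select {
	case c.triggerDiscoveryCh <- struct{}{}:
	default:
	}
}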
Test asserted the endpoint appended servers, but the new/desired behavior is for the endpoint to set/overwrite servers.
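A hedged sketch of what the updated assertion might look like (GetServers as the accessor is an assumption):

// SetServers should overwrite rather than append: after two calls only
// the servers from the second call remain.
if err := client.SetServers([]string{"10.0.0.1:4647"}); err != nil {
	t.Fatalf("err: %v", err)
}
if err := client.SetServers([]string{"10.0.0.2:4647"}); err != nil {
	t.Fatalf("err: %v", err)
}
if n := len(client.GetServers()); n != 1 {
	t.Fatalf("expected 1 server, got %d", n)
}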
@@ -44,6 +43,10 @@ const (
	// datacenters looking for the Nomad server service.
	datacenterQueryLimit = 9

	// consulReaperIntv is the interval at which the consul reaper will
consul -> Consul
Fixed everywhere (I think)
@@ -154,14 +156,20 @@ type Client struct {
	vaultClient vaultclient.VaultClient
}

var noServers = errors.New("no servers")
Comment and put in a var () block
noServers -> noServersErr
Fixed
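Combining both suggestions, the declaration presumably ends up something like this (the comment wording is illustrative):

var (
	// noServersErr is returned when an RPC is attempted but the client
	// doesn't yet know of any Nomad servers.
	noServersErr = errors.New("no servers")
)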
	heartbeatLock sync.Mutex

	// doDisco triggers consul discovery; see triggerDiscovery
	doDisco chan struct{}
doDisco -> triggerDiscoveryCh
Fixed! (Much clearer!)
	// discovered will be ticked whenever consul discovery completes
	// successfully
	discovered chan struct{}
discovered -> serversDiscoveredCh
Fixed
	c.rpcProxy.AddPrimaryServer(serverAddr)
	if len(c.configCopy.Servers) > 0 {
		if err := c.SetServers(c.configCopy.Servers); err != nil {
			logger.Printf("[WARN] None of the configured servers are valid: %v", err)
Style for logging is fmt.Sprintf("[%s] %s: %s", level, subSystem, msg). So please add [WARN] client:
Fixed
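Applying that style to the line above gives something like:

logger.Printf("[WARN] client: none of the configured servers are valid: %v", err)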
func (c *Client) consulDiscovery() {
	for {
		select {
		case c.doDisco <- struct{}{}:
Change to be case <-c.doDisco:
Fixed
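With the rename and the receive fix applied, a sketch of the loop (the shutdownCh case and c.logger field are assumptions based on the usual client goroutine pattern):

// consulDiscovery runs Consul server discovery whenever triggered.
func (c *Client) consulDiscovery() {
	for {
		select {
		case <-c.triggerDiscoveryCh:
			if err := c.consulDiscoveryImpl(); err != nil {
				c.logger.Printf("[ERR] client: error discovering nomad servers: %v", err)
			}
		case <-c.shutdownCh:
			return
		}
	}
}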
	}
}

func (c *Client) doConsulDisco() error {
Comment plus change name to be consulDiscoveryImpl
Fixed
	// Notify waiting rpc calls. Wait briefly in case initial rpc
	// just failed but the calling goroutine isn't selecting on
	// discovered yet.
	const dur = 50 * time.Millisecond
There is always the case that they will be on that border, so we can remove this.
Removed.
I think this was a clumsy workaround for startup races early on that, AFAICT (both from re-reading the code and testing manually), is no longer necessary.
	d := consul.NewExecutorDomain(allocID, taskName)
	domains = append(domains, d)
}

func (c *Client) consulReaper() {
Comment on method
I would prefer you initialize the ticker here
Fixed.
Good call; there was no need for the ticker to be on the struct at all!
	c.consulSyncer.AddPeriodicHandler("Nomad Client Services Sync Handler", consulServicesReaperFn)
}

func (c *Client) doConsulReap() error {
Comments and rename to consulReapImpl
Fixed
Also fix log line formatting
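Putting the reviewer's points together (ticker initialized locally, impl split out), the reaper presumably ends up shaped like this (shutdownCh and the log wording are assumptions):

// consulReaper periodically reaps stale services and checks from Consul.
func (c *Client) consulReaper() {
	ticker := time.NewTicker(consulReaperIntv)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := c.consulReapImpl(); err != nil {
				c.logger.Printf("[ERR] client: error reaping services in consul: %v", err)
			}
		case <-c.shutdownCh:
			return
		}
	}
}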
It's better to just let goroutines fall back to their longer retry intervals than to try to be clever here.
@@ -892,9 +948,6 @@ func (c *Client) registerNode() error {
	}
	var resp structs.NodeUpdateResponse
	if err := c.RPC("Node.Register", &req, &resp); err != nil {
		if time.Since(c.start) > registerErrGrace {
Why did you remove this?
It merely wrapped an error with a tiny bit more context in certain circumstances.
Instead I just added that tiny bit more context to the log line that handles this error, so there's no need to wrap it: https://github.com/hashicorp/nomad/pull/1735/files/ec81fc9baf776efb30bd752f64688ad562d915c6#diff-1cd41a2611da49ce63630d5a51bd82c5L876
It also kind of seemed like dead code since the error wrapping only affected logging. It seemed like maybe once upon a time it affected retry logic, but it doesn't anymore on master or in this PR.
If you look at the comment on registerErrGrace, it does have a point: it's there to avoid error logs showing up in the -dev case. I would add it back.
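For context, the grace-period pattern under discussion, reconstructed from the hunk above (the exact wrapping message is an assumption):

if err := c.RPC("Node.Register", &req, &resp); err != nil {
	// Only add extra context once registerErrGrace has elapsed since
	// startup; early failures are expected in -dev mode while the
	// embedded server is still coming up, so they stay low-key.
	if time.Since(c.start) > registerErrGrace {
		err = fmt.Errorf("failed to register node: %v", err)
	}
	return err
}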
@@ -4,6 +4,7 @@ IMPROVEMENTS:
 * core: Introduce node SecretID which can be used to minimize the available
   surface area of RPCs to malicious Nomad Clients [GH-1597]
 * cli: `nomad alloc-status` shows allocation creation time [GH-1623]
 * client: Retry all servers on RPC call failure [GH-1735]
Failed RPCs are retried on all servers
@@ -17,6 +18,8 @@ BUG FIXES:
 * client/fingerprint: Fix inconsistent CPU MHz fingerprinting [GH-1366]
 * discovery: Fix old services not getting removed from consul on update
   [GH-1668]
 * discovery: Fix client flapping when server was in a different datacenter
was -> is
+/- different datacenter as the client
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.