Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry all servers on RPC call failure #1735

Merged
merged 14 commits into from
Sep 27, 2016
Merged

Retry all servers on RPC call failure #1735

merged 14 commits into from
Sep 27, 2016

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Sep 22, 2016

rpcproxy is refactored into serverlist which prioritizes good servers
over servers in a remote DC or who have had a failure.

Registration, heartbeating, and alloc status updating will retry faster
when new servers are discovered.

Consul discovery will be retried more quickly when no servers are
available (eg on startup or an outage).

rpcproxy is refactored into serverlist which prioritizes good servers
over servers in a remote DC or who have had a failure.

Registration, heartbeating, and alloc status updating will retry faster
when new servers are discovered.

Consul discovery will be retried more quickly when no servers are
available (eg on startup or an outage).
@schmichael schmichael changed the title Refactor rpcproxy into a simpler serverlist Retry all servers on RPC call failure Sep 23, 2016
Remove use of periodic consul handlers in the client and just use
goroutines. Consul Discovery is now triggered with a chan instead of
using a timer and deadline to trigger.

Once discovery is complete a chan is ticked so all goroutines waiting
for servers will run.

Should speed up bootstraping and recovery while decreasing spinning on
timers.
Test asserted endpoint appendend servers, but the new/desired behavior
is for the endpoint to set/overwrite servers.
@@ -44,6 +43,10 @@ const (
// datacenters looking for the Nomad server service.
datacenterQueryLimit = 9

// consulReaperIntv is the interval at which the consul reaper will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consul -> Consul

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed everywhere (I think)

@@ -154,14 +156,20 @@ type Client struct {
vaultClient vaultclient.VaultClient
}

var noServers = errors.New("no servers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment and put in a var () block

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noServers -> noServersErr

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

heartbeatLock sync.Mutex

// doDisco triggers consul discovery; see triggerDiscovery
doDisco chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doDisco -�> triggerDiscoveryCh

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! (Much clearer!)


// discovered will be ticked whenever consul discovery completes
// succesfully
discovered chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discovered -> serversDiscoveredCh

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

c.rpcProxy.AddPrimaryServer(serverAddr)
if len(c.configCopy.Servers) > 0 {
if err := c.SetServers(c.configCopy.Servers); err != nil {
logger.Printf("[WARN] None of the configured servers are valid: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style for logging is fmt.Sprintf("[%s] %s: %s", level, subSystem, msg). So please add [WARN] client:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

func (c *Client) consulDiscovery() {
for {
select {
case c.doDisco <- struct{}{}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to be case <-c.doDisco:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}
}

func (c *Client) doConsulDisco() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment plus change name to be consulDiscoveryImpl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// Notify waiting rpc calls. Wait briefly in case initial rpc
// just failed but the calling goroutine isn't selecting on
// discovered yet.
const dur = 50 * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is always the case that they will be on that border. We can remove this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

I think this was a clumsy workaround to get around startup races early on that AFAICT (both looking at the code again and testing it manually) is no longer necessary.

d := consul.NewExecutorDomain(allocID, taskName)
domains = append(domains, d)
}
func (c *Client) consulReaper() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on method

I would prefer you initialize the ticker here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Good call; there was no need for the ticker to be on the struct at all!

c.consulSyncer.AddPeriodicHandler("Nomad Client Services Sync Handler", consulServicesReaperFn)
}

func (c *Client) doConsulReap() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments and rename to consulReapImpl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -892,9 +948,6 @@ func (c *Client) registerNode() error {
}
var resp structs.NodeUpdateResponse
if err := c.RPC("Node.Register", &req, &resp); err != nil {
if time.Since(c.start) > registerErrGrace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It merely wrapped an error with a tiny bit more context in certain circumstances.

Instead I just added that tiny bit more context to the log line that handles this error, so there's no need to wrap it: https://github.com/hashicorp/nomad/pull/1735/files/ec81fc9baf776efb30bd752f64688ad562d915c6#diff-1cd41a2611da49ce63630d5a51bd82c5L876

It also kind of seemed like dead code since the error wrapping only affected logging. It seemed like maybe once upon a time it affected retry logic, but it doesn't anymore on master or this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the comment on the registerErrGrace, it does have a point. It is to avoid error logs showing up in the -dev case. I would add it back

@@ -4,6 +4,7 @@ IMPROVEMENTS:
* core: Introduce node SecretID which can be used to minimize the available
surface area of RPCs to malicious Nomad Clients [GH-1597]
* cli: `nomad alloc-status` shows allocation creation time [GH-1623]
* client: Retry all servers on RPC call failure [GH-1735]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed RPCs are retried on all servers

@@ -17,6 +18,8 @@ BUG FIXES:
* client/fingerprint: Fix inconsistent CPU MHz fingerprinting [GH-1366]
* discovery: Fix old services not getting removed from consul on update
[GH-1668]
* discovery: Fix client flapping when server was in a different datacenter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was -> is

+/- different datacenter as the client

@schmichael schmichael merged commit 95af837 into master Sep 27, 2016
@schmichael schmichael deleted the b-bootstrap-flapping branch September 27, 2016 23:33
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants