-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Consul Syncer into new ServiceClient #2467
Conversation
307426f
to
6bfab4b
Compare
4196022
to
99e7126
Compare
command/agent/consul/client.go
Outdated
client: consulClient, | ||
logger: logger, | ||
retryInterval: defaultSyncInterval, //TODO what should this default to?! | ||
syncInterval: defaultSyncInterval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused?
command/agent/consul/client.go
Outdated
lastOk = false | ||
c.logger.Printf("[WARN] consul: failed to update services in Consul: %v", err) | ||
} | ||
//TODO Log? and jitter/backoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exp backoff from 1...30s
command/agent/consul/client.go
Outdated
c.logger.Printf("[DEBUG] consul: registered %d services / %d checks; deregisterd %d services / %d checks", regServiceN, regCheckN, deregServiceN, deregCheckN) | ||
return nil | ||
|
||
//TODO Labels and gotos are nasty; move to a function? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add helper functions to get a snapshot of the maps and another function to merge outstanding functions back in
command/agent/consul/script.go
Outdated
} | ||
|
||
if err != nil { | ||
//FIXME Backoff? Retry faster? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this
client/task_runner.go
Outdated
|
||
//FIXME is there a better place to do this? used to be in executor | ||
// Prepare services | ||
interpolateServices(r.getTaskEnv(), r.task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this
command/agent/consul/client.go
Outdated
// ScriptExecutor is the interface the ServiceClient uses to execute script | ||
// checks inside a container. | ||
type ScriptExecutor interface { | ||
Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be defined on the DriverHandle interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a new interface that is defined there that wraps the DriverHandle. I don't think that is the best approach because for plugin versions of drivers there will not be a good way to do that.
@@ -390,6 +388,10 @@ func (h *javaHandle) Update(task *structs.Task) error { | |||
return nil | |||
} | |||
|
|||
func (h *javaHandle) Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) { | |||
return execChroot(ctx, h.taskDir, cmd, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java driver won't always be chroot'ed. Only on linux
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled in #2554
@@ -250,6 +245,10 @@ func (h *rawExecHandle) Update(task *structs.Task) error { | |||
return nil | |||
} | |||
|
|||
func (h *rawExecHandle) Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) { | |||
return execChroot(ctx, "", cmd, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really an appropriate name...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled in #2554
// execChroot executes cmd with args inside chroot if set and returns the | ||
// output, exit code, and error. If chroot is an empty string the command is | ||
// executed on the host. | ||
func execChroot(ctx context.Context, chroot, name string, args []string) ([]byte, int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this. I feel like the executor should be used to execute commands and it should launch them into the same cgroup/chroot/namespaces etc that it already has context for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added #2554 to handle this in a followup PR.
client/task_runner.go
Outdated
@@ -1361,6 +1426,9 @@ func (r *TaskRunner) handleUpdate(update *structs.Allocation) error { | |||
// given limit. It returns whether the task was destroyed and the error | |||
// associated with the last kill attempt. | |||
func (r *TaskRunner) handleDestroy() (destroyed bool, err error) { | |||
// Remove from Consul |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behavior is to deregister services as soon as they exit and re-register once that have have started. This will leave the service registered even as the task may be restarting with an arbitrarily long delay
nomad/server.go
Outdated
defaultConsulDiscoveryInterval time.Duration = 9 * time.Second | ||
|
||
// defaultConsulDiscoveryIntervalRetry is how often to poll Consul for | ||
// new servers if there is no leader and the last Consul query failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the opposite of the behavior we want. This makes it so under error condition we speed up instead of backing off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
command/agent/consul/client.go
Outdated
"github.com/hashicorp/nomad/nomad/structs" | ||
) | ||
|
||
var mark = struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
command/agent/consul/client.go
Outdated
// ScriptExecutor is the interface the ServiceClient uses to execute script | ||
// checks inside a container. | ||
type ScriptExecutor interface { | ||
Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a new interface that is defined there that wraps the DriverHandle. I don't think that is the best approach because for plugin versions of drivers there will not be a good way to do that.
command/agent/consul/client.go
Outdated
type ServiceClient struct { | ||
client AgentAPI | ||
logger *log.Logger | ||
retryInterval time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
command/agent/consul/script.go
Outdated
type scriptHandle struct { | ||
// cancel the script | ||
cancel func() | ||
done chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exitCh
command/agent/consul/script.go
Outdated
case context.DeadlineExceeded: | ||
// Log deadline exceeded every time, but flip last check to false | ||
s.lastCheckOk = false | ||
s.logger.Printf("[WARN] consul.checks: check %q timed out (%s)", s.check.Name, s.check.Timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have the allocation ID and task name.
command/agent/consul/script.go
Outdated
} | ||
if err != nil { | ||
state = api.HealthCritical | ||
output = []byte(err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string -> []byte -> string. Instead:
var outputMsg string
if err != nil {
outputMsg = err.Error()
} else {
outputMsg = string(output)
}
command/agent/consul/script_test.go
Outdated
|
||
func (b *blockingScriptExec) Exec(ctx context.Context, _ string, _ []string) ([]byte, int, error) { | ||
b.running <- mark | ||
cmd := exec.CommandContext(ctx, "/bin/sleep", "9000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command/agent/consul/script_test.go
Outdated
"github.com/hashicorp/consul/api" | ||
"github.com/hashicorp/nomad/nomad/structs" | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the normal case tests:
- Getting correct exit code
- Getting output
- Getting exec errors
d95713b
to
1b453be
Compare
f0f1691
to
d295329
Compare
client/task_runner.go
Outdated
var mErr multierror.Error | ||
r.handleLock.Lock() | ||
if r.handle != nil { | ||
// Need to check driver abilities for updating services |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale comment?
@@ -1061,8 +1072,12 @@ func (r *TaskRunner) run() { | |||
} | |||
} | |||
|
|||
// cleanup calls Driver.Cleanup when a task is stopping. Errors are logged. | |||
// cleanup removes Consul entries and calls Driver.Cleanup when a task is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't get called through restarts which means the check is registered during the restart sleep. (the wait channel firing in the run loop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! Fixed and tested: 9625e19
Ensured test failed without the RemoveTask call when restarting.
command/agent/agent.go
Outdated
} | ||
if !conf.TLSConfig.EnableHTTP { | ||
consulServices[consul.GenerateServiceKey(httpServ)] = httpServ | ||
consulServices = append(consulServices, httpServ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump! Just want to make sure this doesn't slip through
c.logger.Printf("[WARN] consul.sync: failed to update services in Consul: %v", err) | ||
} | ||
failures++ | ||
if !retryTimer.Stop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment here what you are doing with the timer. I know it is the recommended practice but still not super common
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added. I hate the timer API.
command/agent/consul/client.go
Outdated
} | ||
|
||
// commit operations and returns false if shutdown signalled before committing. | ||
func (c *ServiceClient) commit(ops *operations) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really using the return value anywhere in that meaningful of a way. Remove?
command/agent/consul/client.go
Outdated
} | ||
|
||
// Service exists and wasn't updated, don't add it later | ||
delete(newIDs, existingID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the port changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally missed the fact that port changes are excluded from service and check IDs, so great catch. Added a huge test to exercise port changes and did lots of fixing: 970979d
command/agent/consul/mock.go
Outdated
import ( | ||
"log" | ||
|
||
"github.com/hashicorp/consul/api" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have been following the habit of _testing.go to put importable stubs/mocks/etc
command/agent/consul/script.go
Outdated
return | ||
case context.DeadlineExceeded: | ||
// Log deadline exceeded every time, but flip last check to false | ||
s.lastCheckOk = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't the code == 0
in this case?
cancel() | ||
|
||
state := api.HealthCritical | ||
switch code { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.consul.io/docs/agent/checks.html#check-scripts
Think you want a default case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
state
is initialized to the default case (critical/failing)
os.Exit(m.Run()) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test:
- exit code 0
- exit code 1
- exit code 2
- Capturing output
- Unknown command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added except "Unknown command" which depends on the Exec implementation to return an error (which is tested): 9c14e8b
Exec implementations are tested directly for example: https://github.com/hashicorp/nomad/pull/2467/files#diff-1eb2911dd4ed9b612194897284c68251R304
2713281
to
9866a15
Compare
Ready for another review. Note that #2218 isn't supported for non-agent checks. Will do that in a follow up PR. |
command/agent/agent.go
Outdated
@@ -680,3 +703,59 @@ func (a *Agent) setupConsul(consulConfig *config.ConsulConfig) error { | |||
go a.consulService.Run() | |||
return nil | |||
} | |||
|
|||
// consulSupportsTLSSkipVerify returns true if Consul supports TLSSkipVerify. | |||
func consulSupportsTLSSkipVerify(self map[string]map[string]interface{}) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command/agent/consul/script_test.go
Outdated
} | ||
} | ||
|
||
// Test exit codes with errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use go 1.8 table driven test style
command/agent/consul/unit_test.go
Outdated
@@ -242,6 +260,188 @@ func TestConsul_ChangeTags(t *testing.T) { | |||
} | |||
} | |||
|
|||
// TestConsul_ChangePorts asserts that changing the ports on a service updates | |||
// it in Consul. Since ports are part of the service ID this is a slightly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ports are not part of the service ID
Also fix the diff tests: https://travis-ci.org/hashicorp/nomad/builds/223416717#L5545 |
@@ -2074,6 +2074,7 @@ type ServiceCheck struct { | |||
Interval time.Duration // Interval of the check | |||
Timeout time.Duration // Timeout of the response from the check before consul fails the check | |||
InitialStatus string // Initial status of the check | |||
TLSSkipVerify bool // Skip TLS verification when Protocol=https |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also document on both HCL and JSON job spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to implement it for non-agent checks in a followup PR (and doc it then), but that seems silly. Fixing.
Fixes #2478 #2474 #1995 #2294 The new client only handles agent and task service advertisement. Server discovery is mostly unchanged. The Nomad client agent now handles all Consul operations instead of the executor handling task related operations. When upgrading from an earlier version of Nomad existing executors will be told to deregister from Consul so that the Nomad agent can re-register the task's services and checks. Drivers - other than qemu - now support an Exec method for executing abritrary commands in a task's environment. This is used to implement script checks. Interfaces are used extensively to avoid interacting with Consul in tests that don't assert any Consul related behavior.
...and still protect against leaking agent entries in Consul on shutdown.
Previous implementation assumed all struct fields were included in service and check IDs. Service IDs never include port labels and check IDs *optionally* include port labels, so lots of things had to change. Added a really big test to exercise this.
Support for TLSSkipVerify in other checks coming soon!
@@ -1900,20 +1900,22 @@ func TestTaskGroupDiff(t *testing.T) { | |||
|
|||
func TestTaskDiff(t *testing.T) { | |||
cases := []struct { | |||
Name string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bless your soul
152ed01
to
a5dcf6b
Compare
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Fixes #2478 #2474 #1995 #2294
The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.
The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.
Drivers - other than qemu - now support an Exec method for executing
abritrary commands in a task's environment. This is used to implement
script checks.
Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.