Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Consul Syncer into new ServiceClient #2467

Merged
merged 38 commits into from
Apr 19, 2017
Merged

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Mar 22, 2017

Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
abritrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.

@schmichael schmichael force-pushed the f-consul-refactor branch 3 times, most recently from 307426f to 6bfab4b Compare March 29, 2017 23:38
@schmichael schmichael closed this Mar 29, 2017
@schmichael schmichael reopened this Mar 29, 2017
@schmichael schmichael changed the title WIP Consul Refactor Refactor Consul Syncer into new ServiceClient Mar 29, 2017
@schmichael schmichael force-pushed the f-consul-refactor branch 2 times, most recently from 4196022 to 99e7126 Compare March 30, 2017 18:35
client: consulClient,
logger: logger,
retryInterval: defaultSyncInterval, //TODO what should this default to?!
syncInterval: defaultSyncInterval,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused?

lastOk = false
c.logger.Printf("[WARN] consul: failed to update services in Consul: %v", err)
}
//TODO Log? and jitter/backoff
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exp backoff from 1...30s

c.logger.Printf("[DEBUG] consul: registered %d services / %d checks; deregisterd %d services / %d checks", regServiceN, regCheckN, deregServiceN, deregCheckN)
return nil

//TODO Labels and gotos are nasty; move to a function?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add helper functions to get a snapshot of the maps and another function to merge outstanding functions back in

}

if err != nil {
//FIXME Backoff? Retry faster?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this


//FIXME is there a better place to do this? used to be in executor
// Prepare services
interpolateServices(r.getTaskEnv(), r.task)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this

// ScriptExecutor is the interface the ServiceClient uses to execute script
// checks inside a container.
type ScriptExecutor interface {
Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be defined on the DriverHandle interface

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a new interface that is defined there that wraps the DriverHandle. I don't think that is the best approach because for plugin versions of drivers there will not be a good way to do that.

@@ -390,6 +388,10 @@ func (h *javaHandle) Update(task *structs.Task) error {
return nil
}

func (h *javaHandle) Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) {
return execChroot(ctx, h.taskDir, cmd, args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java driver won't always be chroot'ed. Only on linux

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in #2554

@@ -250,6 +245,10 @@ func (h *rawExecHandle) Update(task *structs.Task) error {
return nil
}

func (h *rawExecHandle) Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error) {
return execChroot(ctx, "", cmd, args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really an appropriate name...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in #2554

// execChroot executes cmd with args inside chroot if set and returns the
// output, exit code, and error. If chroot is an empty string the command is
// executed on the host.
func execChroot(ctx context.Context, chroot, name string, args []string) ([]byte, int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this. I feel like the executor should be used to execute commands and it should launch them into the same cgroup/chroot/namespaces etc that it already has context for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #2554 to handle this in a followup PR.

@@ -1361,6 +1426,9 @@ func (r *TaskRunner) handleUpdate(update *structs.Allocation) error {
// given limit. It returns whether the task was destroyed and the error
// associated with the last kill attempt.
func (r *TaskRunner) handleDestroy() (destroyed bool, err error) {
// Remove from Consul
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is to deregister services as soon as they exit and re-register once that have have started. This will leave the service registered even as the task may be restarting with an arbitrarily long delay

nomad/server.go Outdated
defaultConsulDiscoveryInterval time.Duration = 9 * time.Second

// defaultConsulDiscoveryIntervalRetry is how often to poll Consul for
// new servers if there is no leader and the last Consul query failed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the opposite of the behavior we want. This makes it so under error condition we speed up instead of backing off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

"github.com/hashicorp/nomad/nomad/structs"
)

var mark = struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment

// ScriptExecutor is the interface the ServiceClient uses to execute script
// checks inside a container.
type ScriptExecutor interface {
Exec(ctx context.Context, cmd string, args []string) ([]byte, int, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a new interface that is defined there that wraps the DriverHandle. I don't think that is the best approach because for plugin versions of drivers there will not be a good way to do that.

type ServiceClient struct {
client AgentAPI
logger *log.Logger
retryInterval time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment

type scriptHandle struct {
// cancel the script
cancel func()
done chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exitCh

case context.DeadlineExceeded:
// Log deadline exceeded every time, but flip last check to false
s.lastCheckOk = false
s.logger.Printf("[WARN] consul.checks: check %q timed out (%s)", s.check.Name, s.check.Timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have the allocation ID and task name.

}
if err != nil {
state = api.HealthCritical
output = []byte(err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string -> []byte -> string. Instead:

var outputMsg string
if err != nil {
  outputMsg = err.Error()
} else {
  outputMsg = string(output)
}


func (b *blockingScriptExec) Exec(ctx context.Context, _ string, _ []string) ([]byte, int, error) {
b.running <- mark
cmd := exec.CommandContext(ctx, "/bin/sleep", "9000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"github.com/hashicorp/consul/api"
"github.com/hashicorp/nomad/nomad/structs"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the normal case tests:

  • Getting correct exit code
  • Getting output
  • Getting exec errors

var mErr multierror.Error
r.handleLock.Lock()
if r.handle != nil {
// Need to check driver abilities for updating services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment?

@@ -1061,8 +1072,12 @@ func (r *TaskRunner) run() {
}
}

// cleanup calls Driver.Cleanup when a task is stopping. Errors are logged.
// cleanup removes Consul entries and calls Driver.Cleanup when a task is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't get called through restarts which means the check is registered during the restart sleep. (the wait channel firing in the run loop)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! Fixed and tested: 9625e19

Ensured test failed without the RemoveTask call when restarting.

}
if !conf.TLSConfig.EnableHTTP {
consulServices[consul.GenerateServiceKey(httpServ)] = httpServ
consulServices = append(consulServices, httpServ)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump! Just want to make sure this doesn't slip through

c.logger.Printf("[WARN] consul.sync: failed to update services in Consul: %v", err)
}
failures++
if !retryTimer.Stop() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment here what you are doing with the timer. I know it is the recommended practice but still not super common

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added. I hate the timer API.

}

// commit operations and returns false if shutdown signalled before committing.
func (c *ServiceClient) commit(ops *operations) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really using the return value anywhere in that meaningful of a way. Remove?

}

// Service exists and wasn't updated, don't add it later
delete(newIDs, existingID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the port changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally missed the fact that port changes are excluded from service and check IDs, so great catch. Added a huge test to exercise port changes and did lots of fixing: 970979d

import (
"log"

"github.com/hashicorp/consul/api"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have been following the habit of _testing.go to put importable stubs/mocks/etc

return
case context.DeadlineExceeded:
// Log deadline exceeded every time, but flip last check to false
s.lastCheckOk = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't the code == 0 in this case?

cancel()

state := api.HealthCritical
switch code {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state is initialized to the default case (critical/failing)

os.Exit(m.Run())
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test:

  • exit code 0
  • exit code 1
  • exit code 2
  • Capturing output
  • Unknown command

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added except "Unknown command" which depends on the Exec implementation to return an error (which is tested): 9c14e8b

Exec implementations are tested directly for example: https://github.com/hashicorp/nomad/pull/2467/files#diff-1eb2911dd4ed9b612194897284c68251R304

@schmichael schmichael force-pushed the f-consul-refactor branch 2 times, most recently from 2713281 to 9866a15 Compare April 19, 2017 04:31
@schmichael
Copy link
Member Author

Ready for another review. Note that #2218 isn't supported for non-agent checks. Will do that in a follow up PR.

@@ -680,3 +703,59 @@ func (a *Agent) setupConsul(consulConfig *config.ConsulConfig) error {
go a.consulService.Run()
return nil
}

// consulSupportsTLSSkipVerify returns true if Consul supports TLSSkipVerify.
func consulSupportsTLSSkipVerify(self map[string]map[string]interface{}) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

// Test exit codes with errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use go 1.8 table driven test style

@@ -242,6 +260,188 @@ func TestConsul_ChangeTags(t *testing.T) {
}
}

// TestConsul_ChangePorts asserts that changing the ports on a service updates
// it in Consul. Since ports are part of the service ID this is a slightly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ports are not part of the service ID

@dadgar
Copy link
Contributor

dadgar commented Apr 19, 2017

Also fix the diff tests: https://travis-ci.org/hashicorp/nomad/builds/223416717#L5545

@@ -2074,6 +2074,7 @@ type ServiceCheck struct {
Interval time.Duration // Interval of the check
Timeout time.Duration // Timeout of the response from the check before consul fails the check
InitialStatus string // Initial status of the check
TLSSkipVerify bool // Skip TLS verification when Protocol=https
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also document on both HCL and JSON job spec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to implement it for non-agent checks in a followup PR (and doc it then), but that seems silly. Fixing.

Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
abritrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.
@@ -1900,20 +1900,22 @@ func TestTaskGroupDiff(t *testing.T) {

func TestTaskDiff(t *testing.T) {
cases := []struct {
Name string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bless your soul

@github-actions
Copy link

github-actions bot commented Apr 2, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants