Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: do not restart restored tasks until server is contacted #5669

Merged
merged 6 commits into from
May 14, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,10 @@ __BACKWARDS INCOMPATIBILITIES:__
to update your code. [[GH-5536](https://github.com/hashicorp/nomad/pull/5536)]
* client: The format of check IDs in Consul has changed. If you rely upon
Nomad's check IDs you will need to update your code. [[GH-5536](https://github.com/hashicorp/nomad/pull/5536)]
* client: On startup a client will reattach to running tasks as before but
will not restart exited tasks. Exited tasks will be restarted only after the
client has reestablished communication with servers. System jobs will always
be restarted. [[GH-5669](https://github.com/hashicorp/nomad/pull/5669)]

FEATURES:

Expand Down
7 changes: 7 additions & 0 deletions client/allocrunner/alloc_runner.go
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,11 @@ type allocRunner struct {
// driverManager is responsible for dispensing driver plugins and registering
// event handlers
driverManager drivermanager.Manager

// serversContactedCh is passed to TaskRunners so they can detect when
// servers have been contacted for the first time in case of a failed
// restore.
serversContactedCh chan struct{}
}

// NewAllocRunner returns a new allocation runner.
Expand Down Expand Up @@ -167,6 +172,7 @@ func NewAllocRunner(config *Config) (*allocRunner, error) {
prevAllocMigrator: config.PrevAllocMigrator,
devicemanager: config.DeviceManager,
driverManager: config.DriverManager,
serversContactedCh: config.ServersContactedCh,
}

// Create the logger based on the allocation ID
Expand Down Expand Up @@ -205,6 +211,7 @@ func (ar *allocRunner) initTaskRunners(tasks []*structs.Task) error {
DeviceStatsReporter: ar.deviceStatsReporter,
DeviceManager: ar.devicemanager,
DriverManager: ar.driverManager,
ServersContactedCh: ar.serversContactedCh,
}

// Create, but do not Run, the task runner
Expand Down
4 changes: 4 additions & 0 deletions client/allocrunner/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -51,4 +51,8 @@ type Config struct {

// DriverManager handles dispensing of driver plugins
DriverManager drivermanager.Manager

// ServersContactedCh is closed when the first GetClientAllocs call to
// servers succeeds and allocs are synced.
ServersContactedCh chan struct{}
}
49 changes: 48 additions & 1 deletion client/allocrunner/taskrunner/task_runner.go
Original file line number Diff line number Diff line change
Expand Up @@ -193,6 +193,15 @@ type TaskRunner struct {
// maxEvents is the capacity of the TaskEvents on the TaskState.
// Defaults to defaultMaxEvents but overrideable for testing.
maxEvents int

// serversContactedCh is passed to TaskRunners so they can detect when
// GetClientAllocs has been called in case of a failed restore.
serversContactedCh <-chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more precisely: fetched expected running allocations from server - not heartbeat/registration/etc. Not sure if it's worth making the distinction or where to add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we've never had good nomenclature for this. The code base often talks about "syncing" allocs, but since client->server and server->client updates are completely independent code paths it's always ambiguous to say "syncing".

  • syncedFromServersCh
  • allocsPulledCh
  • allocsRetrievedCh
  • unblockedRestoredCh - we could go with a name related to its use instead of the event/state it represents, but that doesn't seem as good.

Any thoughts/preferences? allocsPulledCh might be my preference as we use the term pulled in related code/logging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with clarifying in comment with keeping the field name the same.


// waitOnServers defaults to false but will be set true if a restore
// fails and the Run method should wait until serversContactedCh is
// closed.
waitOnServers bool
}

type Config struct {
Expand Down Expand Up @@ -222,6 +231,10 @@ type Config struct {
// DriverManager is used to dispense driver plugins and register event
// handlers
DriverManager drivermanager.Manager

// ServersContactedCh is closed when the first GetClientAllocs call to
// servers succeeds and allocs are synced.
ServersContactedCh chan struct{}
}

func NewTaskRunner(config *Config) (*TaskRunner, error) {
Expand Down Expand Up @@ -270,6 +283,7 @@ func NewTaskRunner(config *Config) (*TaskRunner, error) {
devicemanager: config.DeviceManager,
driverManager: config.DriverManager,
maxEvents: defaultMaxEvents,
serversContactedCh: config.ServersContactedCh,
}

// Create the logger based on the allocation ID
Expand Down Expand Up @@ -381,6 +395,19 @@ func (tr *TaskRunner) Run() {
// - should be handled serially.
go tr.handleUpdates()

// If restore failed wait until servers are contacted before running.
// #1795
if tr.waitOnServers {
tr.logger.Info("task failed to restore; waiting to contact server before restarting")
select {
case <-tr.killCtx.Done():
case <-tr.shutdownCtx.Done():
return
case <-tr.serversContactedCh:
tr.logger.Trace("server contacted; unblocking waiting task")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a timeout after which we allow the task to restart? Some people may be relying on the fact that disconnected Nomad nodes will continue to run their tasks. thinking

Alternatively we could have a client config and/or jobspec parameter for controlling this behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure - but I'd probably avoid adding more knobs to maintain until we hear clear demand.

To me the problem is that some tasks are not restarted and having partially running alloc while the nomad agent isn't running, potentially for hours. I find it easier to reason that in recovery a client that starts without being able to connect to server has the same behavior as no client starting. Adding a timeout seems to complicate the story for such a narrow case, without really addressing the big problem of having partially failing/running alloc while client is gone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one big concern is that when this is released it will break the ability for disconnected nodes to restart.

Right now completely disconnected nodes can fully reboot and will restart anything they were running before regardless of whether or not contact is ever made with servers.

It's difficult to know if anyone is relying on this behavior because it is much more of an emergent property of the client's code than an intentional and documented design choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline and decided to block indefinitely. If a client can't contact servers in a timely fashion (seconds) it's likely to have its non-system allocs marked as lost and rescheduled elsewhere anyway so there's even less reason to restart at that point.

}

MAIN:
for !tr.Alloc().TerminalStatus() {
select {
Expand Down Expand Up @@ -858,8 +885,28 @@ func (tr *TaskRunner) Restore() error {
if taskHandle := tr.localState.TaskHandle; taskHandle != nil {
//TODO if RecoverTask returned the DriverNetwork we wouldn't
// have to persist it at all!
tr.restoreHandle(taskHandle, tr.localState.DriverNetwork)
restored := tr.restoreHandle(taskHandle, tr.localState.DriverNetwork)

// If the handle could not be restored, the alloc is
// non-terminal, and the task isn't a system job: wait until
// servers have been contacted before running. #1795
if restored {
return nil
}

alloc := tr.Alloc()
if alloc.TerminalStatus() || alloc.Job.Type == structs.JobTypeSystem {
return nil
}

tr.logger.Trace("failed to reattach to task; will not run until server is contacted")
tr.waitOnServers = true

ev := structs.NewTaskEvent(structs.TaskRestoreFailed).
SetDisplayMessage("failed to restore task; will not run until server is contacted")
tr.UpdateState(structs.TaskStatePending, ev)
}

return nil
}

Expand Down
Loading