client: do not restart restored tasks until server is contacted #5669
Changes from all commits: 846b482, e7042b6, 4b854cc, 6a2792a, 796c05b, abd809d
```diff
@@ -193,6 +193,15 @@ type TaskRunner struct {
 	// maxEvents is the capacity of the TaskEvents on the TaskState.
 	// Defaults to defaultMaxEvents but overrideable for testing.
 	maxEvents int
+
+	// serversContactedCh is passed to TaskRunners so they can detect when
+	// GetClientAllocs has been called in case of a failed restore.
+	serversContactedCh <-chan struct{}
+
+	// waitOnServers defaults to false but will be set true if a restore
+	// fails and the Run method should wait until serversContactedCh is
+	// closed.
+	waitOnServers bool
 }

 type Config struct {
```
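Why a closed channel rather than a message per runner: every receive on a closed channel returns immediately, so any number of TaskRunners can block on the same `serversContactedCh` and a single `close` unblocks all of them at once. A minimal standalone sketch of that broadcast property (all names here are invented for illustration, not taken from the PR):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	contacted := make(chan struct{})
	var wg sync.WaitGroup

	// Several "task runners" all block on the same channel.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-contacted // blocks until the channel is closed
			fmt.Printf("runner %d unblocked\n", id)
		}(i)
	}

	close(contacted) // one close wakes every waiter
	wg.Wait()
}
```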
```diff
@@ -222,6 +231,10 @@ type Config struct {
 	// DriverManager is used to dispense driver plugins and register event
 	// handlers
 	DriverManager drivermanager.Manager
+
+	// ServersContactedCh is closed when the first GetClientAllocs call to
+	// servers succeeds and allocs are synced.
+	ServersContactedCh chan struct{}
 }
```
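The diff only shows the consumer side. For the channel to ever be closed, the client that owns `ServersContactedCh` must close it exactly once after its first successful `GetClientAllocs` round trip. A sketch of what that producer side might look like; the `Client` type, `allocsSynced` helper, and `serversContactedOnce` field are assumptions for illustration, not code from this PR:

```go
package main

import "sync"

// Client is a stand-in for Nomad's client struct; only the two fields
// relevant here are shown, and their names are assumed.
type Client struct {
	serversContactedCh   chan struct{}
	serversContactedOnce sync.Once
}

// allocsSynced would be called after the first successful GetClientAllocs
// response. sync.Once guards the close because closing a channel twice
// panics, and alloc pulls happen repeatedly over the client's lifetime.
func (c *Client) allocsSynced() {
	c.serversContactedOnce.Do(func() {
		close(c.serversContactedCh)
	})
}

func main() {
	c := &Client{serversContactedCh: make(chan struct{})}
	c.allocsSynced()
	c.allocsSynced()        // safe: the close only happens once
	<-c.serversContactedCh // returns immediately once closed
}
```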
```diff
@@ -270,6 +283,7 @@ func NewTaskRunner(config *Config) (*TaskRunner, error) {
 		devicemanager:      config.DeviceManager,
 		driverManager:      config.DriverManager,
 		maxEvents:          defaultMaxEvents,
+		serversContactedCh: config.ServersContactedCh,
 	}

 	// Create the logger based on the allocation ID
```
```diff
@@ -381,6 +395,19 @@ func (tr *TaskRunner) Run() {
 	// - should be handled serially.
 	go tr.handleUpdates()

+	// If restore failed wait until servers are contacted before running.
+	// #1795
+	if tr.waitOnServers {
+		tr.logger.Info("task failed to restore; waiting to contact server before restarting")
+		select {
+		case <-tr.killCtx.Done():
+		case <-tr.shutdownCtx.Done():
+			return
+		case <-tr.serversContactedCh:
+			tr.logger.Trace("server contacted; unblocking waiting task")
+		}
+	}
+
 MAIN:
 	for !tr.Alloc().TerminalStatus() {
 		select {
```

Review discussion on this hunk:

> Should there be a timeout after which we allow the task to restart? Some people may be relying on the fact that disconnected Nomad nodes will continue to run their tasks. Alternatively we could have a client config and/or jobspec parameter for controlling this behavior.

> I'm not sure, but I'd probably avoid adding more knobs to maintain until we hear clear demand. To me the problem is that some tasks are not restarted, leaving a partially running alloc while the Nomad agent isn't running, potentially for hours. I find it easier to reason about recovery if a client that starts without being able to connect to a server behaves the same as a client that never starts. Adding a timeout seems to complicate the story for such a narrow case without really addressing the big problem of a partially failing/running alloc while the client is gone.

> My one big concern is that when this is released it will break the ability for disconnected nodes to restart. Right now completely disconnected nodes can fully reboot and will restart anything they were running before, regardless of whether contact is ever made with servers. It's difficult to know if anyone is relying on this behavior because it is much more an emergent property of the client's code than an intentional and documented design choice.

> Discussed offline and decided to block indefinitely. If a client can't contact servers in a timely fashion (seconds), it's likely to have its non-system allocs marked as lost and rescheduled elsewhere anyway, so there's even less reason to restart at that point.
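For comparison, the timeout alternative raised above (and ultimately rejected in favor of blocking indefinitely) could look roughly like the helper below. This is purely a hypothetical sketch; `waitForServers` and its parameters are invented names, not part of the PR:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForServers blocks until serversContactedCh is closed, ctx is canceled,
// or timeout elapses. It reports whether servers were actually contacted,
// so a caller could choose to restart the task regardless after the timeout.
func waitForServers(ctx context.Context, serversContactedCh <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ctx.Done():
		return false
	case <-serversContactedCh:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	ch := make(chan struct{})
	// Nothing ever closes ch, so this falls through after the timeout.
	fmt.Println(waitForServers(context.Background(), ch, 50*time.Millisecond))
}
```

Blocking forever keeps the failure mode simpler to reason about: a client that cannot reach servers behaves the same as a client that never started.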
```diff
@@ -858,8 +885,28 @@ func (tr *TaskRunner) Restore() error {
 	if taskHandle := tr.localState.TaskHandle; taskHandle != nil {
 		//TODO if RecoverTask returned the DriverNetwork we wouldn't
 		//     have to persist it at all!
-		tr.restoreHandle(taskHandle, tr.localState.DriverNetwork)
+		restored := tr.restoreHandle(taskHandle, tr.localState.DriverNetwork)
+
+		// If the handle could not be restored, the alloc is
+		// non-terminal, and the task isn't a system job: wait until
+		// servers have been contacted before running. #1795
+		if restored {
+			return nil
+		}
+
+		alloc := tr.Alloc()
+		if alloc.TerminalStatus() || alloc.Job.Type == structs.JobTypeSystem {
+			return nil
+		}
+
+		tr.logger.Trace("failed to reattach to task; will not run until server is contacted")
+		tr.waitOnServers = true
+
+		ev := structs.NewTaskEvent(structs.TaskRestoreFailed).
+			SetDisplayMessage("failed to restore task; will not run until server is contacted")
+		tr.UpdateState(structs.TaskStatePending, ev)
 	}

 	return nil
 }
```
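Putting the two halves together: `Restore()` sets `waitOnServers` when reattachment fails, and `Run()` then parks until the channel is closed. The toy program below models that gate; the stub `taskRunner` type and `run` signature are assumptions made so the example is self-contained, not Nomad code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// taskRunner is a toy stand-in for Nomad's TaskRunner; only the gate added
// in this PR is modeled. Field names mirror the diff above.
type taskRunner struct {
	waitOnServers      bool
	serversContactedCh <-chan struct{}
	shutdownCtx        context.Context
}

func (tr *taskRunner) run(done chan<- struct{}) {
	defer close(done)
	if tr.waitOnServers {
		fmt.Println("task failed to restore; waiting to contact server before restarting")
		select {
		case <-tr.shutdownCtx.Done():
			return
		case <-tr.serversContactedCh:
			fmt.Println("server contacted; unblocking waiting task")
		}
	}
	fmt.Println("MAIN loop would start here")
}

func main() {
	serversCh := make(chan struct{})
	done := make(chan struct{})
	tr := &taskRunner{
		waitOnServers:      true, // as if Restore() failed to reattach
		serversContactedCh: serversCh,
		shutdownCtx:        context.Background(),
	}
	go tr.run(done)

	// Simulate the client's first successful alloc pull.
	time.Sleep(50 * time.Millisecond)
	close(serversCh)
	<-done
}
```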
Review discussion on naming `serversContactedCh`:

> This is more precisely "fetched expected running allocations from server", not heartbeat/registration/etc. Not sure if it's worth making the distinction or where to add it.

> Yeah, we've never had good nomenclature for this. The code base often talks about "syncing" allocs, but since `client->server` and `server->client` updates are completely independent code paths it's always ambiguous to say "syncing". Some alternatives:
>
> - `syncedFromServersCh`
> - `allocsPulledCh`
> - `allocsRetrievedCh`
> - `unblockedRestoredCh` (a name related to its use instead of the event/state it represents, but that doesn't seem as good)
>
> Any thoughts/preferences? `allocsPulledCh` might be my preference, as we use the term `pulled` in related code/logging.

> I'd be OK with clarifying in a comment while keeping the field name the same.