-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docker: fix bug where network pause containers would be erroneously reconciled #16352
Conversation
359b9b2
to
dc84d23
Compare
dc84d23
to
aea6e83
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left one last note about threading a context thru but that's fairly straightforward
return | ||
} | ||
|
||
containers, listErr := dockerClient.ListContainers(docker.ListContainersOptions{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ListContainersOptions
has a Context
field we could thread from the Driver.ctx
. This is one of the places for the Docker client where it seems unambiguously correct to bail out.
how soon can we get a release with this fix? |
Hi @gbolo we're currently expecting to release v1.5.1 with this fix, plus a few other important things on Monday the 13th. |
@shoenig will this be back ported to v1.4.x? Seems like https://github.com/hashicorp/nomad/blob/v1.4.5/drivers/docker/network.go#L128 |
yes, good catch @gbolo! |
When we added recovery of pause containers in #16352 we called the recovery function from the plugin factory function. But in our plugin setup protocol, a plugin isn't ready for use until we call `SetConfig`. This meant that recovering pause containers was always done with the default config. Setting up the Docker client only happens once, so setting the wrong config in the recovery function also means that all other Docker API calls will use the default config. Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error handling so that we return any error but also don't log when the context is canceled, which happens twice during normal startup as we fingerprint the driver.
When we added recovery of pause containers in #16352 we called the recovery function from the plugin factory function. But in our plugin setup protocol, a plugin isn't ready for use until we call `SetConfig`. This meant that recovering pause containers was always done with the default config. Setting up the Docker client only happens once, so setting the wrong config in the recovery function also means that all other Docker API calls will use the default config. Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error handling so that we return any error but also don't log when the context is canceled, which happens twice during normal startup as we fingerprint the driver.
This PR adds tracking of pause containers to the docker driver, fixing a bug introduced by #15898 where the containers are now subject to dangling container reconciliation. The pause container created for allocs with docker tasks making use of bridge networking is not created in the same flow as a normal Task - which have a
TaskHandle
state. The set of tasks not to reconcile was identified by scanning the set of these states, which does not include pause containers.To remedy this, this PR now tracks pause containers in their own little store. Since the Nomad Client may be restarted, we scan existing running containers on startup to reload the store from existing running containers.
Fixes: #16338