docker: fix bug where network pause containers would be erroneously reconciled #16352

Merged: 2 commits, Mar 7, 2023

Member

@shoenig shoenig commented Mar 6, 2023

This PR adds tracking of pause containers to the docker driver, fixing a bug introduced by #15898 where pause containers became subject to dangling container reconciliation. The pause container created for allocs with docker tasks that use bridge networking is not created in the same flow as a normal Task, which has a TaskHandle state. The set of tasks not to reconcile was identified by scanning the set of these states, which does not include pause containers.

To remedy this, this PR tracks pause containers in their own small store. Since the Nomad Client may be restarted, we scan the running containers on startup to reload the store.
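
To make the idea concrete, here is a minimal sketch of such a store and how reconciliation could consult it. This is not the code from the PR: `pauseStore`, `container`, `recoverPauseContainers`, and `dangling` are hypothetical names, and the `IsPause` flag stands in for whatever the driver actually uses to recognize a pause container.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pauseStore is a hypothetical, minimal set of pause container IDs.
// A mutex guards it because a driver may touch it from several goroutines.
type pauseStore struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func newPauseStore() *pauseStore {
	return &pauseStore{ids: make(map[string]struct{})}
}

func (s *pauseStore) add(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ids[id] = struct{}{}
}

func (s *pauseStore) remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.ids, id)
}

func (s *pauseStore) contains(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.ids[id]
	return ok
}

// container is a stand-in for an entry returned by a Docker list call.
// Assumption: pause containers can be recognized somehow (modeled as a flag).
type container struct {
	ID      string
	IsPause bool
}

// recoverPauseContainers reloads the store from containers that are already
// running, which is what a restarted client would need to do.
func recoverPauseContainers(s *pauseStore, running []container) {
	for _, c := range running {
		if c.IsPause {
			s.add(c.ID)
		}
	}
}

// dangling returns the IDs of running containers that are neither known
// tasks nor tracked pause containers, i.e. candidates for reconciliation.
func dangling(running []container, tasks map[string]struct{}, s *pauseStore) []string {
	var out []string
	for _, c := range running {
		if _, ok := tasks[c.ID]; ok {
			continue
		}
		if s.contains(c.ID) {
			continue
		}
		out = append(out, c.ID)
	}
	sort.Strings(out)
	return out
}

func main() {
	running := []container{{ID: "task-1"}, {ID: "pause-1", IsPause: true}, {ID: "stray-1"}}
	store := newPauseStore()
	recoverPauseContainers(store, running)
	tasks := map[string]struct{}{"task-1": {}}
	fmt.Println(dangling(running, tasks, store)) // only the stray container remains
}
```

Without the store lookup in `dangling`, `pause-1` would be reported alongside `stray-1`, which is the failure mode described above.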

Fixes: #16338

@shoenig shoenig force-pushed the pause-reconciliation branch from 359b9b2 to dc84d23 Compare March 6, 2023 20:28
@shoenig shoenig changed the title docker: fix bug where network pause containers would be erroneously gc'd docker: fix bug where network pause containers would be erroneously reconciled Mar 6, 2023
drivers/docker/driver.go (outdated)
Member

@tgross tgross left a comment


LGTM. Left one last note about threading a context through, but that's fairly straightforward.

return
}

containers, listErr := dockerClient.ListContainers(docker.ListContainersOptions{

ListContainersOptions has a Context field we could thread from the Driver.ctx. This is one of the places for the Docker client where it seems unambiguously correct to bail out.
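
A rough sketch of what threading the context through could look like. To keep the example self-contained it does not import the Docker client: `listOptions` and `fakeClient` are hypothetical stand-ins that only mimic an options struct carrying a `Context` field, and `listWithDriverContext` is an illustrative name rather than anything in the driver.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// listOptions mimics only the relevant part of a list-containers options
// struct: a Context field. It is a stand-in, not the Docker client's type.
type listOptions struct {
	All     bool
	Context context.Context
}

// fakeClient stands in for a Docker client. It checks the context before
// doing any work, so a cancelled context makes the call bail out.
type fakeClient struct {
	ids []string
}

func (c fakeClient) listContainers(opts listOptions) ([]string, error) {
	if opts.Context != nil {
		if err := opts.Context.Err(); err != nil {
			return nil, err
		}
	}
	return c.ids, nil
}

// listWithDriverContext threads the driver's context into the list call.
func listWithDriverContext(ctx context.Context, c fakeClient) ([]string, error) {
	return c.listContainers(listOptions{All: false, Context: ctx})
}

// demo lists once with a live context and once after cancellation.
func demo() (liveCount int, cancelled bool) {
	client := fakeClient{ids: []string{"pause-1", "pause-2"}}
	ctx, cancel := context.WithCancel(context.Background())

	ids, _ := listWithDriverContext(ctx, client)

	cancel()
	_, err := listWithDriverContext(ctx, client)

	return len(ids), errors.Is(err, context.Canceled)
}

func main() {
	n, cancelled := demo()
	fmt.Println(n, cancelled)
}
```

The point of the sketch is only that once the driver's context is cancelled, the list call returns early with `context.Canceled` instead of continuing to talk to the daemon.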


gbolo commented Mar 8, 2023

how soon can we get a release with this fix?

Member Author

shoenig commented Mar 8, 2023

Hi @gbolo, we're currently expecting to release v1.5.1 with this fix, plus a few other important things, on Monday the 13th.


gbolo commented Mar 10, 2023

@shoenig will this be backported to v1.4.x? It seems 1.4.5 is also affected:

https://github.com/hashicorp/nomad/blob/v1.4.5/drivers/docker/network.go#L128

Member Author

shoenig commented Mar 10, 2023

yes, good catch @gbolo!

tgross added a commit that referenced this pull request Mar 29, 2023
When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.

Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
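
The ordering and error handling described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the driver's code: the `driver` struct, its `endpoint` field (standing in for the plugin config), the `recoverFn` hook, and the in-memory `logs` slice are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// driver is a hypothetical, stripped-down plugin. endpoint stands in for
// the real config, and logs collects messages instead of using a logger.
type driver struct {
	ctx       context.Context
	endpoint  string
	logs      []string
	recoverFn func(ctx context.Context, endpoint string) error
}

// SetConfig applies the config first and only then recovers pause
// containers, so recovery never runs against the default config.
// Any error is returned, but a cancelled context is not logged.
func (d *driver) SetConfig(endpoint string) error {
	d.endpoint = endpoint
	err := d.recoverFn(d.ctx, d.endpoint)
	if err != nil && !errors.Is(err, context.Canceled) {
		d.logs = append(d.logs, "failed to recover pause containers: "+err.Error())
	}
	return err
}

// demo exercises three cases: success, cancelled context, and a real failure.
func demo() (seen string, logsAfterCancel int, cancelReturned bool, logsAfterFailure int) {
	d := &driver{ctx: context.Background()}
	d.recoverFn = func(ctx context.Context, endpoint string) error {
		seen = endpoint // record which config recovery actually saw
		return ctx.Err()
	}
	_ = d.SetConfig("unix:///custom.sock")

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	d.ctx = ctx
	err := d.SetConfig("unix:///custom.sock")
	logsAfterCancel = len(d.logs)
	cancelReturned = errors.Is(err, context.Canceled)

	d.ctx = context.Background()
	d.recoverFn = func(context.Context, string) error { return errors.New("daemon unreachable") }
	_ = d.SetConfig("unix:///custom.sock")
	logsAfterFailure = len(d.logs)
	return
}

func main() {
	fmt.Println(demo())
}
```

Recovery sees the endpoint passed to `SetConfig` rather than a default, the cancelled case returns its error without adding a log line, and the genuine failure is both returned and logged.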
Labels
backport/1.3.x, backport/1.4.x, backport/1.5.x
Development

Successfully merging this pull request may close these issues.

Tasks using Docker drivers frequently failed to restart
3 participants