Docker images immediately cleaned up #9045
I noticed this behaviour and tried to add the `"docker.cleanup.image.delay" = "96h"` option to the Nomad client options, but it did not change the behaviour and the images were not reused inside this time interval.
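(For context: `docker.cleanup.image.delay` is one of the legacy client options; since the task drivers were split into plugins, the equivalent setting lives in the docker plugin's `gc` block. A sketch of both forms, assuming a standard client config file — verify against the docs for your Nomad version:)

```hcl
# Legacy client options form (pre-plugin syntax):
client {
  options {
    "docker.cleanup.image.delay" = "96h"
  }
}

# Plugin-style equivalent on newer Nomad versions:
plugin "docker" {
  config {
    gc {
      image       = true
      image_delay = "96h"
    }
  }
}
```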
I did a quick scan of the code (please excuse me, I am VERY unfamiliar with the Nomad code base), but from what I can see the image cleanup setting is never used. As I trace through the following files/lines, I just see the task counter being checked for whether there are any running tasks for an image; if there are none, the image is deleted. I could be way off base, but I'd be curious to hear if I'm on the right track.

- nomad/drivers/docker/driver.go, line 1365 (commit c14c616)
- nomad/drivers/docker/driver.go, line 1377 (commit c14c616)
- nomad/drivers/docker/coordinator.go, line 293 (commit 42de704)
- nomad/drivers/docker/coordinator.go, line 302 (commit 42de704)
Just to make sure we're testing the right thing here, can you provide your client configuration file, @idrennanvmware?
Hi @tgross. We don't have anything fancy configured, but I've attached our config file(s) for Nomad clients to verify. We have them broken up as attached. I also included the Docker config file referenced (it's just an ECR helper). Thanks!
Okay, so here's your client configuration for Docker:

```hcl
plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}
```

So the Docker GC config is the default:
```shell
# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "da025578"
    Evaluation triggered by job "example"
    Evaluation within deployment: "c45cc5ba"
    Allocation "eabe5e5c" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "da025578" finished with status "complete"

# edit the job
$ emacs example.nomad

# re-register the edited job
$ nomad job run ./example.nomad
==> Monitoring evaluation "96feca81"
    Evaluation triggered by job "example"
    Evaluation within deployment: "a897aa34"
    Allocation "44d88b57" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "96feca81" finished with status "complete"

# image was removed while task was stopped!
$ docker images | grep redis

# wait for new task to start...
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB
```

Let's change the config to set the GC behaviour explicitly:

```hcl
plugin "docker" {
  config {
    gc {
      image = false
      image_delay = "3m"
    }
  }
}
```

Then try this again, fully stopping a job this time:

```shell
# nothing up our sleeves
$ docker images | grep redis

# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "a53edacb"
    Evaluation triggered by job "example"
    Allocation "85810658" created: node "734adbda", group "cache"
    Evaluation within deployment: "7f5903df"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a53edacb" finished with status "complete"

# image is there as expected
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB

# stop the job
$ nomad job stop example
==> Monitoring evaluation "7cadb920"
    Evaluation triggered by job "example"
    Evaluation within deployment: "7e00e5b6"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7cadb920" finished with status "complete"

# image is still there
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB

# wait 3+ minutes...
$ date
Thu Oct 15 15:01:16 UTC 2020
$ date
Thu Oct 15 15:04:46 UTC 2020

# image is still there
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB
```

So it looks like the explicit `gc` configuration is honored, but the defaults are not.
Quick patch to add some debug logging:

```diff
diff --git a/drivers/docker/coordinator.go b/drivers/docker/coordinator.go
index dd16215dd..91cc82047 100644
--- a/drivers/docker/coordinator.go
+++ b/drivers/docker/coordinator.go
@@ -306,6 +306,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
 		// We have been cancelled
 		return
 	case <-time.After(d.removeDelay):
+		d.logger.Debug("remove delay expired", "delay", d.removeDelay)
 	}
 
 	// Ensure we are suppose to delete. Do a short check while holding the lock
@@ -325,6 +326,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
 	for i := 0; i < 3; i++ {
 		err := d.client.RemoveImage(id)
 		if err == nil {
+			d.logger.Debug("removed image", "image_id", id)
 			break
 		}
```

The relevant logs show a 0s delay!
But if I set a real value instead of relying on the defaults:

```hcl
plugin "docker" {
  config {
    gc {
      image = true
      image_delay = "3m"
    }
  }
}
```

Then I get the following in the logs:
So it looks like the default values aren't being applied.
Ohhh, that's a good enough workaround for now! We are more than happy to explicitly add that to our configs. Thank you @tgross
Fixed in #9101, which will ship in the upcoming 1.0.
Legends <3
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.12.3 (2db8abd)
Operating system and Environment details
PhotonOS 3
Issue
We are seeing Docker images immediately removed when a job is deployed with an update.
We deployed a job (Telegraf in this case) and, as expected, on the first run the Docker image was pulled to the machine. We then changed the resourcing for the job (see screenshot), and when that was deployed the job upgraded as expected but the image was downloaded again (the job was deployed to the same machine). While the deployment was underway, I was able to verify with `docker image ls` that the image was no longer there.
We have seen other similar issues where Nomad appears to be aggressively cleaning up images, which causes unnecessary pull operations.
We were able to replicate this repeatedly in another exercise where we changed the config to have a hard CPU limit.
There are no changes to the default Docker image retention policies in Nomad. Per the documentation, we would expect the cleanup delay to default to 3m.