Docker images immediately cleaned up #9045
I noticed this behaviour and tried to add the `"docker.cleanup.image.delay" = "96h"` option to the Nomad client options, but it did not change the behaviour and the images were not reused inside this time interval.
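(For context: `docker.cleanup.image.delay` is one of the legacy client options; since the task drivers were split into plugins, the equivalent setting lives in the docker plugin's `gc` block. A sketch of both forms, assuming a standard client config file — verify against the docs for your Nomad version:)

```hcl
# Legacy client options form (pre-plugin syntax):
client {
  options {
    "docker.cleanup.image.delay" = "96h"
  }
}

# Plugin-style equivalent on newer Nomad versions:
plugin "docker" {
  config {
    gc {
      image       = true
      image_delay = "96h"
    }
  }
}
```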
I did a quick scan of the code (please excuse me, I am VERY unfamiliar with the Nomad code base), but from what I can see the image cleanup setting is never used. As I trace through the following files/lines, I just see the task counter being checked for whether there are any running tasks for an image; if there are none, the image is deleted. I could be way off base, but I'd be curious to hear if I'm on the right track.

- nomad/drivers/docker/driver.go, line 1365 (commit c14c616)
- nomad/drivers/docker/driver.go, line 1377 (commit c14c616)
- nomad/drivers/docker/coordinator.go, line 293 (commit 42de704)
- nomad/drivers/docker/coordinator.go, line 302 (commit 42de704)
Just to make sure we're testing the right thing here, can you provide your client configuration file, @idrennanvmware?
Hi @tgross. We don't have anything fancy configured, but I've attached our config file(s) for Nomad clients to verify. We have them broken up as attached. I also included the Docker config file referenced (it's just an ECR helper). Thanks!
Okay, so here's your client configuration for Docker:

```hcl
plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}
```

So the Docker GC config is the default:
```shell
# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "da025578"
    Evaluation triggered by job "example"
    Evaluation within deployment: "c45cc5ba"
    Allocation "eabe5e5c" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "da025578" finished with status "complete"

# edit the job
$ emacs example.nomad

# re-register the edited job
$ nomad job run ./example.nomad
==> Monitoring evaluation "96feca81"
    Evaluation triggered by job "example"
    Evaluation within deployment: "a897aa34"
    Allocation "44d88b57" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "96feca81" finished with status "complete"

# image was removed while task was stopped!
$ docker images | grep redis

# wait for new task to start...
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB
```

Let's change the config to set the GC behaviour explicitly:

```hcl
plugin "docker" {
  config {
    gc {
      image = false
      image_delay = "3m"
    }
  }
}
```

Then try this again, fully stopping a job this time:

```shell
# nothing up our sleeves
$ docker images | grep redis

# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "a53edacb"
    Evaluation triggered by job "example"
    Allocation "85810658" created: node "734adbda", group "cache"
    Evaluation within deployment: "7f5903df"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a53edacb" finished with status "complete"

# image is there as expected
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB

# stop the job
$ nomad job stop example
==> Monitoring evaluation "7cadb920"
    Evaluation triggered by job "example"
    Evaluation within deployment: "7e00e5b6"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7cadb920" finished with status "complete"

# image is still there
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB

# wait 3+ minutes...
$ date
Thu Oct 15 15:01:16 UTC 2020
$ date
Thu Oct 15 15:04:46 UTC 2020

# image is still there
$ docker images | grep redis
redis    3.2    87856cc39862    2 years ago    76MB
```

So it looks like the explicit `gc` configuration is honored, but the defaults are not.
Quick patch to add some debug logging:

```diff
diff --git a/drivers/docker/coordinator.go b/drivers/docker/coordinator.go
index dd16215dd..91cc82047 100644
--- a/drivers/docker/coordinator.go
+++ b/drivers/docker/coordinator.go
@@ -306,6 +306,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
 		// We have been cancelled
 		return
 	case <-time.After(d.removeDelay):
+		d.logger.Debug("remove delay expired", "delay", d.removeDelay)
 	}
 
 	// Ensure we are suppose to delete. Do a short check while holding the lock
@@ -325,6 +326,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
 	for i := 0; i < 3; i++ {
 		err := d.client.RemoveImage(id)
 		if err == nil {
+			d.logger.Debug("removed image", "image_id", id)
 			break
 		}
```

The relevant logs show a 0s delay!
But if I set a real value instead of relying on the defaults:

```hcl
plugin "docker" {
  config {
    gc {
      image = true
      image_delay = "3m"
    }
  }
}
```

Then I get the following in the logs:
So it looks like the default values aren't being applied.
Ohhh, that's a good enough workaround for now! We are more than happy to explicitly add that to our configs. Thank you @tgross
Fixed in #9101, which will ship in the upcoming 1.0.
Legends <3
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.12.3 (2db8abd)
Operating system and Environment details
PhotonOS 3
Issue
We are seeing Docker images immediately removed when a job is deployed with an update.
We deployed a job (Telegraf in this case) and, as expected, on the first run the Docker image was pulled to the machine. We then changed the resourcing for the job (see screenshot), and when that was deployed the job upgraded as expected but the image was downloaded again (the job was deployed to the same machine). While the deployment was underway, I was able to verify with `docker image ls` that the image was no longer there.
We have seen other similar issues where Nomad appears to be aggressively cleaning up images, which causes unnecessary pull operations.
We were able to replicate this repeatedly in another exercise where we changed the config to have a hard CPU limit.
There are no changes to the default Docker image retention policies in Nomad. Per the documentation, we would expect the cleanup delay to default to 3m.