Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker images immediately cleaned up #9045

Closed
idrennanvmware opened this issue Oct 7, 2020 · 10 comments · Fixed by #9101
Closed

Docker images immediately cleaned up #9045

idrennanvmware opened this issue Oct 7, 2020 · 10 comments · Fixed by #9101
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/docker type/bug
Milestone

Comments

@idrennanvmware
Copy link
Contributor

idrennanvmware commented Oct 7, 2020

Nomad version

Nomad v0.12.3 (2db8abd)

Operating system and Environment details

PhotonOS3

Issue

We are seeing docker images immediately removed when a job is deployed with an update.

We deployed a job (telegraf in this case) and as expected on the first run the docker image was pulled to the machine. We then changed the resourcing for the job (see screenshot) and when that was deployed the job upgraded as expected but the image was downloaded again (job deployed to the same machine). While the deployment was underway I was able to verify the image was no longer there with "docker image ls"

We have seen other similar issues where Nomad appears to be aggressively cleaning up the images which are causing pull operations to happen unnecessarily.

We were able to replicate this repeatedly in another exercise where we changed the config to have a hard CPU limit

Screen Shot 2020-10-07 at 3 08 36 PM

There are no changes to the default docker image retention policies in nomad. Per the documentation we would expect this to be a default of 3m

@carlosrbcunha
Copy link

I noticed this behaviour and tried to add "docker.cleanup.image.delay" = "96h" option to the nomad client options but it did not changed the behaviour and the images were not reused inside this time interval.

@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented Oct 14, 2020

I did a quick scan of the code (please excuse that I am VERY unfamiliar with the Nomad code base) but from what I can see the image cleanup setting is never used. As I track through the following files/lines - I just see the task counter being checked that there are no running tasks for an image - and if none, then it's deleted.

I could be way off base, but I'd be curious to hear if I'm on the right track.

if err := d.cleanupImage(h); err != nil {

func (d *Driver) cleanupImage(handle *taskHandle) error {

go d.removeImageImpl(imageID, ctx)

func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {

@tgross
Copy link
Member

tgross commented Oct 14, 2020

Just to make sure we're testing the right thing here, can you provide your client configuration file, @idrennanvmware ?

@idrennanvmware
Copy link
Contributor Author

Hi @tgross

We don't have anything fancy configured but I've attached our config file(s) for Nomad Clients to verify. We have them broken up as attached.
Archive.zip

I also included the docker config file referenced (its just an ECR helper)

Thanks!
Ian

@tgross
Copy link
Member

tgross commented Oct 15, 2020

Okay, so here's your client configuration for Docker:

plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}

So the Docker GC config is the default:

image - Defaults to true. Changing this to false will prevent Nomad from removing images from stopped tasks.

image_delay - A time duration, as defined here, that defaults to 3m. The delay controls how long Nomad will wait between an image being unused and deleting it. If a tasks is received that uses the same image within the delay, the image will be reused.

# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "da025578"
    Evaluation triggered by job "example"
    Evaluation within deployment: "c45cc5ba"
    Allocation "eabe5e5c" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "da025578" finished with status "complete"

# edit the job
$ emacs example.nomad

# re-register the edited job
$ nomad job run ./example.nomad
==> Monitoring evaluation "96feca81"
    Evaluation triggered by job "example"
    Evaluation within deployment: "a897aa34"
    Allocation "44d88b57" created: node "6a3518cd", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "96feca81" finished with status "complete"

# image was removed while task was stopped!
$ docker images | grep redis

# wait for new task to start...
$ docker images | grep redis
redis                                  3.2                 87856cc39862        2 years ago         76MB

Let's change image = false:

plugin "docker" {

  config {
    gc {
      image       = false
      image_delay = "3m"
    }
  }
}

Then try this again, fully stopping a job this time:

# nothing up our sleeves
$ docker images | grep redis

# run the job
$ nomad job run ./example.nomad
==> Monitoring evaluation "a53edacb"
    Evaluation triggered by job "example"
    Allocation "85810658" created: node "734adbda", group "cache"
    Evaluation within deployment: "7f5903df"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a53edacb" finished with status "complete"

# image is there as expected
$ docker images | grep redis
redis                                  3.2                 87856cc39862        2 years ago         76MB

# stop the job
$ nomad job stop example
==> Monitoring evaluation "7cadb920"
    Evaluation triggered by job "example"
    Evaluation within deployment: "7e00e5b6"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7cadb920" finished with status "complete"

# image is still there
$ docker images | grep redis
redis                                  3.2                 87856cc39862        2 years ago         76MB

# wait 3+ minutes...
$ date
Thu Oct 15 15:01:16 UTC 2020

$ date
Thu Oct 15 15:04:46 UTC 2020

# image is still there
$ docker images | grep redis
redis                                  3.2                 87856cc39862        2 years ago         76MB

So it looks like the gc.image field is being respected correctly, but not the gc.image_delay.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug and removed stage/needs-investigation labels Oct 15, 2020
@tgross
Copy link
Member

tgross commented Oct 15, 2020

Quick patch (from master) to add some logging:

$ git diff
diff --git a/drivers/docker/coordinator.go b/drivers/docker/coordinator.go
index dd16215dd..91cc82047 100644
--- a/drivers/docker/coordinator.go
+++ b/drivers/docker/coordinator.go
@@ -306,6 +306,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
                // We have been cancelled
                return
        case <-time.After(d.removeDelay):
+               d.logger.Debug("remove delay expired", "delay", d.removeDelay)
        }

        // Ensure we are suppose to delete. Do a short check while holding the lock
@@ -325,6 +326,7 @@ func (d *dockerCoordinator) removeImageImpl(id string, ctx context.Context) {
        for i := 0; i < 3; i++ {
                err := d.client.RemoveImage(id)
                if err == nil {
+                       d.logger.Debug("removed image", "image_id", id)
                        break
                }

The relevant logs show 0s delay!:

2020-10-15T15:27:00.853Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916 references=0
2020-10-15T15:27:00.853Z [DEBUG] client.driver_mgr.docker: remove delay expired: driver=docker delay=0s
...
2020-10-15T15:27:00.916Z [DEBUG] client.driver_mgr.docker: cleanup removed downloaded image: driver=docker image_id=sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916

But if I set a real value instead of relying on the defaults:

plugin "docker" {

  config {
    gc {
      image       = true
      image_delay = "3m"
    }
  }
}

Then I get the following in the logs:

2020-10-15T15:30:38.134Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916 references=0
...
2020-10-15T15:33:38.136Z [DEBUG] client.driver_mgr.docker: remove delay expired: driver=docker delay=3m0s
2020-10-15T15:33:38.215Z [DEBUG] client.driver_mgr.docker: removed image: driver=docker image_id=sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916
2020-10-15T15:33:38.215Z [DEBUG] client.driver_mgr.docker: cleanup removed downloaded image: driver=docker image_id=sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916

So it looks like the image_delay is being respected, but only if explicitly set and not relying on the default. Probably a simple configuration parsing/validation bug.

@idrennanvmware
Copy link
Contributor Author

Ohhh, that's a good enough work around for now! We are more than happy to explicitly add that to our configs. Thank you @tgross

@tgross
Copy link
Member

tgross commented Oct 15, 2020

Fixed in #9101, which will ship in the upcoming 1.0.

@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented Oct 15, 2020

Legends <3

@github-actions
Copy link

github-actions bot commented Nov 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/docker type/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants