Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker driver consistently fails to pull image from GCE repo, but docker pull succeeds #1119

Closed
bradobro opened this issue Apr 22, 2016 · 7 comments

Comments

@bradobro
Copy link

bradobro commented Apr 22, 2016

Nomad versions

  • Nomad v0.3.2-rc2 ('e6be454154c2296a563a4c2319e49eff1484e839+CHANGES')
  • Nomad v0.3.1
  • Nomad v0.3.0

The output and experience detailed below is from Nomad v0.3.2-rc2.

Operating system and Environment details

Google Compute Engine clusters of fresh Debian Jessie VMs: one cluster with three Nomad servers and five Nomad agents, the other with one server and three agent vms.

Docker container repository is the provided by Google Cluster Engine with each project. I see some similar issues pulling from the AWS container repository.

Docker version info for all machines:

# docker version
Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:22:53 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:22:53 2016
 OS/Arch:      linux/amd64

Issue

Nomad consistently returns 403 errors trying to pull new images while docker pull succeeds.

This may be related to fsouza/go-dockerclient#498

Reproduction steps

  1. Run a job requiring a docker image from a Google Cluster Engine registry that is not on any agents.
  2. The job will fail because it can't pull the image.
  3. SSH to the worker where the allocation failed.
  4. Run docker pull <image tag>.
  5. The pull succeeds.

Nomad Server logs (if appropriate)

Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] http: Request /v1/jobs (7.985992ms)
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] worker: dequeued evaluation 1adef29e-c0d9-1c13-37a3-75d4136ab20b
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] sched: <Eval '1adef29e-c0d9-1c13-37a3-75d4136ab20b' JobID: 'api3-develop'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0)
Apr 22 17:57:10: http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b (114.149µs)
Apr 22 17:57:10: http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b/allocations (142.03µs)
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b (114.149µs)
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b/allocations (142.03µs)
Apr 22 17:57:10: worker: submitted plan for evaluation 1adef29e-c0d9-1c13-37a3-75d4136ab20b
Apr 22 17:57:10: sched: <Eval '1adef29e-c0d9-1c13-37a3-75d4136ab20b' JobID: 'api3-develop'>: setting status to complete
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] worker: submitted plan for evaluation 1adef29e-c0d9-1c13-37a3-75d4136ab20b
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] sched: <Eval '1adef29e-c0d9-1c13-37a3-75d4136ab20b' JobID: 'api3-develop'>: setting status to complete
Apr 22 17:57:10: worker: updated evaluation <Eval '1adef29e-c0d9-1c13-37a3-75d4136ab20b' JobID: 'api3-develop'>
Apr 22 17:57:10: worker: ack for evaluation 1adef29e-c0d9-1c13-37a3-75d4136ab20b
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] worker: updated evaluation <Eval '1adef29e-c0d9-1c13-37a3-75d4136ab20b' JobID: 'api3-develop'>
Apr 22 17:57:10: 2016/04/22 17:57:10 [DEBUG] worker: ack for evaluation 1adef29e-c0d9-1c13-37a3-75d4136ab20b
Apr 22 17:57:11: http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b (137.587µs)
Apr 22 17:57:11: 2016/04/22 17:57:11 [DEBUG] http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b (137.587µs)
Apr 22 17:57:11: http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b/allocations (196.941µs)
Apr 22 17:57:11: 2016/04/22 17:57:11 [DEBUG] http: Request /v1/evaluation/1adef29e-c0d9-1c13-37a3-75d4136ab20b/allocations (196.941µs)
Apr 22 17:58:12: worker: dequeued evaluation d239a0b6-8eda-817a-aec3-2177392d506f
Apr 22 17:58:12: sched.core: eval GC: scanning before index 90871 (1h0m0s)
Apr 22 17:58:12: worker: ack for evaluation d239a0b6-8eda-817a-aec3-2177392d506f
Apr 22 17:58:12: worker: dequeued evaluation 65e26b01-5a74-727b-00ef-e95845cfcd7d
Apr 22 17:58:12: sched.core: job GC: scanning before index 79126 (4h0m0s)
Apr 22 17:58:12: worker: ack for evaluation 65e26b01-5a74-727b-00ef-e95845cfcd7d
Apr 22 17:58:12: worker: dequeued evaluation a6279102-f9a8-a691-091c-05d0286cb6f7
Apr 22 17:58:12: sched.core: node GC: scanning before index 73764 (24h0m0s)
Apr 22 17:58:12: worker: ack for evaluation a6279102-f9a8-a691-091c-05d0286cb6f7
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: dequeued evaluation d239a0b6-8eda-817a-aec3-2177392d506f
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] sched.core: eval GC: scanning before index 90871 (1h0m0s)
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: ack for evaluation d239a0b6-8eda-817a-aec3-2177392d506f
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: dequeued evaluation 65e26b01-5a74-727b-00ef-e95845cfcd7d
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] sched.core: job GC: scanning before index 79126 (4h0m0s)
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: ack for evaluation 65e26b01-5a74-727b-00ef-e95845cfcd7d
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: dequeued evaluation a6279102-f9a8-a691-091c-05d0286cb6f7
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] sched.core: node GC: scanning before index 73764 (24h0m0s)
Apr 22 17:58:12: 2016/04/22 17:58:12 [DEBUG] worker: ack for evaluation a6279102-f9a8-a691-091c-05d0286cb6f7

Nomad Client logs (if appropriate)

Apr 22 17:54:20 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:20 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 7)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: 4 evaluations triggered by node update
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: state updated to ready
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: 4 evaluations triggered by node update
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: state updated to ready
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: updated allocations at index 96907 (pulled 1) (filtered 7)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: allocs: (added 1) (removed 0) (updated 0) (ignore 7)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: updated allocations at index 96907 (pulled 1) (filtered 7)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 7)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: starting task runners for alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf'
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: starting task runners for alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf'
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: starting task context for 'api3-develop' (alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf')
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: starting task context for 'api3-develop' (alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf')
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: updated allocations at index 96911 (pulled 0) (filtered 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: allocs: (added 0) (removed 0) (updated 0) (ignore 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: updated allocations at index 96911 (pulled 0) (filtered 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: driver.docker: failed pulling container gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7: Error: Status 403 trying to pull repository kindrid-stage/api3: "Unable to access the repository: kindrid-stage/api3; please verify that it exists and you have permission to access it (no valid credential was supplied)."
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: failed to start task 'api3-develop' for alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf': failed to create image: Failed to pull `gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7`: Error: Status 403 trying to pull repository kindrid-stage/api3: "Unable to access the repository: kindrid-stage/api3; please verify that it exists and you have permission to access it (no valid credential was supplied)."
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: Not restarting task: api3-develop for alloc: d9191f99-5ebc-975c-be1f-b0263fb255bf
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [ERR] driver.docker: failed pulling container gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7: Error: Status 403 trying to pull repository kindrid-stage/api3: "Unable to access the repository: kindrid-stage/api3; please verify that it exists and you have permission to access it (no valid credential was supplied)."
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [ERR] client: failed to start task 'api3-develop' for alloc 'd9191f99-5ebc-975c-be1f-b0263fb255bf': failed to create image: Failed to pull `gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7`: Error: Status 403 trying to pull repository kindrid-stage/api3: "Unable to access the repository: kindrid-stage/api3; please verify that it exists and you have permission to access it (no valid credential was supplied)."
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [INFO] client: Not restarting task: api3-develop for alloc: d9191f99-5ebc-975c-be1f-b0263fb255bf
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: updated allocations at index 96913 (pulled 0) (filtered 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: client: allocs: (added 0) (removed 0) (updated 0) (ignore 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: updated allocations at index 96913 (pulled 0) (filtered 8)
Apr 22 17:54:24 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:24 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 8)
Apr 22 17:54:26 worker-2-1-psir nomad[1104]: client: state changed, updating node.
Apr 22 17:54:26 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:26 [DEBUG] client: state changed, updating node.
Apr 22 17:54:26 worker-2-1-psir nomad[1104]: client: node registration complete
Apr 22 17:54:26 worker-2-1-psir nomad[1104]: 2016/04/22 17:54:26 [DEBUG] client: node registration complete

Nomad Client Machine Command Line (after failure).

We believe this rules our a repository or docker daemon problem, making us suspect the Nomad driver or the docker client library it uses.

# docker pull gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7
3.0.17-alpha.7: Pulling from kindrid-stage/api3
fdd5d7827f33: Already exists
....
b3a3d58733e6: Pull complete
Digest: sha256:164c8b32b811a6ca7c67399319993355dc4479b02911e9d791d67cec9ca78f1c
Status: Downloaded newer image for gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7

Job file (if appropriate)

job "api3-develop" {
    region = "global"
    datacenters = ["staging"]
    type = "service"
    priority = 50

    update {
        stagger = "10s"
        max_parallel = 1
    }

    group "api3-develop" {
        count = 1

        restart {
            interval = "5m"
            attempts = 10
            delay = "25s"
            mode = "delay"
        }

        task "api3-develop" {
            driver = "docker"

            config {
                image = "gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7"
                auth {server_address = "https://gcr.io"}
                port_map {main = 8091}
            }


            service {
                name = "api3-develop"
                port = "main"
                check {
                    type = "http"
                    path = "/status"
                    interval = "10s"
                    timeout = "1s"
                }
            }

            resources {
                cpu = 100       # Mhz
                memory = 200    # MB
                network {
                    mbits = 10
                    port "main" {}
                }
            }
        }
    }
}
@bradobro
Copy link
Author

@cbednarski We suspect the client library might be problem, perhaps an incompatibility with the version of Docker. Any advice on debugging? We're not scared to dig into the Go.

@diptanu
Copy link
Contributor

diptanu commented Apr 22, 2016

@bradobro So I don't have a lot of experience with GCE, I have a few questions - Does GCE provide with an auth config file? We allow users to specify the auth config file in the client options with docker.auth.config. The Docker daemon doesn't understand any cloud platform as such and usually, the IAAS providers passes in the configuration file somewhere in the environment.

@bradobro
Copy link
Author

@diptanu Yes, it does supply an auth config, usually in ~/.docker/config.json, which has to be renewed regularly.

Here's ours:

{
        "auths": {
                "https://gcr.io": {
                        "auth": "<REDACTED BASE 64 CHARS>=",
                        "email": "[email protected]"
                }
        }
}

@bradobro
Copy link
Author

@diptanu We found the problem in our jobfile. The task config needs to have ssl = true (as yet undocumented--thanks for open source!).

            config {
                image = "gcr.io/kindrid-nomad-feasibility-study/api3:3.0.17-alpha.7"
                ssl = true  # THE MISSING PIECE
                auth {server_address = "https://gcr.io"}
                port_map {main = 8091}
            }

Then the pull from GCE's provided container repository works.

@diptanu
Copy link
Contributor

diptanu commented Apr 22, 2016

@bradobro Sorry we missed documenting that, I have added that in master now.

@diptanu diptanu closed this as completed Apr 22, 2016
@bradobro
Copy link
Author

Great, thanks @diptanu!

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants