Upgrading from 1.4.5 - 1.4.6 and 1.5.0 - 1.5.1 breaks docker driver with non default endpoint set #16709

Closed
drofloh opened this issue Mar 29, 2023 · 6 comments · Fixed by #16713


drofloh commented Mar 29, 2023

Nomad version

$ nomad version
Nomad v1.5.1
BuildDate 2023-03-10T22:05:57Z
Revision 6c118dd

Operating system and Environment details

CentOS Linux release 7.9.2009 (Core)

Issue

When upgrading Nomad clients from 1.4.5 to 1.4.6 or higher, or from 1.5.0 to 1.5.1 or 1.5.2, the Docker plugin stops working if a non-default endpoint is set in the Docker plugin's client config.

Reproduction steps

Configure Docker with a non-standard location for its Unix socket (required when running Docker as a non-root user), then configure the Nomad client's Docker plugin to use that endpoint, e.g.:

plugin "docker" {
  config {
    endpoint = "unix:///export/home/dockerrootless/.docker/xrd/docker.sock"
  }
}
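Before involving Nomad, it can help to confirm that the configured socket is reachable by the user running the client. The following is a minimal sketch of such a check, not part of Nomad: `socketPath` and `checkSocket` are hypothetical helper names, and the code only assumes that the endpoint uses the `unix://` scheme as in the config above.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// socketPath extracts the filesystem path from a "unix://" endpoint string.
// Hypothetical helper for illustration; not taken from Nomad.
func socketPath(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Scheme != "unix" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Path == "" {
		return "", fmt.Errorf("endpoint %q has no socket path", endpoint)
	}
	return u.Path, nil
}

// checkSocket tries to open and immediately close a connection to the socket.
func checkSocket(endpoint string) error {
	path, err := socketPath(endpoint)
	if err != nil {
		return err
	}
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Endpoint copied from the plugin config in this report.
	endpoint := "unix:///export/home/dockerrootless/.docker/xrd/docker.sock"
	if err := checkSocket(endpoint); err != nil {
		fmt.Println("socket not reachable:", err)
		return
	}
	fmt.Println("socket reachable:", endpoint)
}
```

If this check succeeds while Nomad still reports `dial unix /var/run/docker.sock`, the problem is in how the client applies the configured endpoint rather than in the Docker setup.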

Expected Result

Nomad starts and the Docker driver is available; jobs that use the Docker driver deploy fine.

Actual Result

Nomad starts but the Docker driver is not available, and jobs start to fail. For example:

Mar 29, '23 12:19:59 +0100 | Driver Failure | Failed to pull `<docker registry hostname>/nginx:latest`: dial unix /var/run/docker.sock: connect: no such file or directory

Nomad Client logs (if appropriate)

2023-03-29T12:19:59.309+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4955d303-1780-4a24-8b0f-0fa824cc08e1 task=web-server type=Driver msg="Downloading image" failed=false
2023-03-29T12:19:59.309+0100 [ERROR] client.driver_mgr.docker: failed pulling container: driver=docker image_ref=<docker registry hostname>/nginx:latest error="dial unix /var/run/docker.sock: connect: no such file or directory"
2023-03-29T12:19:59.311+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4955d303-1780-4a24-8b0f-0fa824cc08e1 task=web-server type="Driver Failure" msg="Failed to pull `<docker registry hostname>/nginx:latest`: dial unix /var/run/docker.sock: connect: no such file or directory" failed=false
2023-03-29T12:19:59.312+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=4955d303-1780-4a24-8b0f-0fa824cc08e1 task=web-server error="Failed to pull `<docker registry hostname>/nginx:latest`: dial unix /var/run/docker.sock: connect: no such file or directory"
2023-03-29T12:19:59.312+0100 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=4955d303-1780-4a24-8b0f-0fa824cc08e1 task=web-server reason="Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\""

If you need any further details, please let me know.

Member

tgross commented Mar 29, 2023

Hi @drofloh! In the versions you're reporting, we introduced #16352 which ensures we're tracking pause containers. I'm not sure how it could impact the socket path but I'm realizing there's a bug here where we're starting the pause tracker before we've called SetConfig from the client. So at best the pause tracker isn't using the right config, but it's entirely possible that it's setting a global in the Docker library we're using and that's breaking other use cases. I'll investigate further and report back.
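To illustrate how starting a background tracker before `SetConfig` could leave later calls on the default socket, here is a minimal sketch. It is not Nomad's code: the `driver` and `client` types, `getClient`, `startTracker`, and `pullEndpoint` are hypothetical, and the lazily created, cached client is an assumption made for the example, not a confirmed description of the driver or the Docker library.

```go
package main

import "fmt"

// defaultEndpoint matches the socket path seen in the error messages above.
const defaultEndpoint = "unix:///var/run/docker.sock"

// client is a hypothetical stand-in for a Docker API client.
type client struct{ endpoint string }

// driver is a hypothetical stand-in for a task driver plugin.
type driver struct {
	endpoint string  // set by SetConfig; empty until then
	cached   *client // assumption: the client is created once and reused
}

// getClient lazily creates the client from whatever config is present
// at the time of the first call, then returns the cached instance.
func (d *driver) getClient() *client {
	if d.cached == nil {
		ep := d.endpoint
		if ep == "" {
			ep = defaultEndpoint
		}
		d.cached = &client{endpoint: ep}
	}
	return d.cached
}

// SetConfig records the operator-supplied endpoint.
func (d *driver) SetConfig(endpoint string) { d.endpoint = endpoint }

// startTracker stands in for a background task that needs a client.
func (d *driver) startTracker() { _ = d.getClient() }

// pullEndpoint returns the endpoint a later image pull would use,
// depending on whether the tracker started before or after SetConfig.
func pullEndpoint(trackerFirst bool, configured string) string {
	d := &driver{}
	if trackerFirst {
		d.startTracker()
		d.SetConfig(configured)
	} else {
		d.SetConfig(configured)
		d.startTracker()
	}
	return d.getClient().endpoint
}

func main() {
	const configured = "unix:///export/home/dockerrootless/.docker/xrd/docker.sock"
	fmt.Println("tracker before SetConfig:", pullEndpoint(true, configured))
	fmt.Println("SetConfig before tracker:", pullEndpoint(false, configured))
}
```

Under these assumptions, the first ordering keeps using the default socket even though the configured endpoint was applied afterwards, which would match the `dial unix /var/run/docker.sock` error in the report.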

tgross self-assigned this Mar 29, 2023
Author

drofloh commented Mar 29, 2023

@tgross I appreciate the fast response. I did take a look at #16352, but tbh I'm pretty new to Nomad and Go and wasn't able to see anything obvious. For now we will hold back the clients on the clusters we have already updated. If you need any more information, I'd be happy to help if I can.

Member

tgross commented Mar 29, 2023

I've got #16713 open with the fix, and that'll ship in the next patch release of Nomad 1.5.x (plus backports to 1.4.x and 1.3.x).

Author

drofloh commented Mar 29, 2023

@tgross thanks very much for this. Any idea (roughly) when an official release will include this?

Member

tgross commented Mar 29, 2023

1.5.2 shipped just last week, so probably not for another couple of weeks at least.

If it's a total blocker for you, running from main is relatively safe, or you could backport the patch from #16713 (backporting the patch is probably the safer move for production clusters). The patch is entirely in the client, so there's no risk to servers and no need to update them to the patched version if you go that route.

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 12, 2025