Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker: use official client instead of fsouza/go-dockerclient #23966

Merged
merged 29 commits into from
Sep 26, 2024

Conversation

pkazmierczak
Copy link
Contributor

@pkazmierczak pkazmierczak commented Sep 16, 2024

This PR replaces fsouza/go-dockerclient 3rd party docker client library with
docker's official SDK.

Notes to reviewers:

  • this PR is split into 3 commits, one containing driver logic, one for the
    unit tests, and one for go mod/sum
  • the vast majority of the refactoring is pretty straightforward, replacing
    methods, changing argument orders and types. There are a few exceptions that
    require attention, namely:
    • any methods that utilize capturing stdio/stdout/stderr had to be re-written,
      e.g., dockerLogger.Start, driver.ExecTaskStreaming
    • signal handling in the task handler,
    • stats collector

@pkazmierczak pkazmierczak force-pushed the f-docker-official-client branch from 9ca639d to a1dd764 Compare September 17, 2024 19:10
@pkazmierczak pkazmierczak force-pushed the f-docker-official-client branch 2 times, most recently from 98d1578 to 7773315 Compare September 19, 2024 19:50
tgross added a commit that referenced this pull request Sep 27, 2024
In ##23966 when we switched to using the official Docker SDK client, we had more
contexts to add because most of the library methods take one. But for some APIs
like waiting for a container to exit after we've started it, we never want to
close this context, because the operation can outlive the Nomad agent itself.
tgross added a commit that referenced this pull request Sep 30, 2024
In ##23966 when we switched to using the official Docker SDK client, we had more
contexts to add because most of the library methods take one. But for some APIs
like waiting for a container to exit after we've started it, we never want to
close this context, because the operation can outlive the Nomad agent itself.
tgross added a commit that referenced this pull request Sep 30, 2024
In ##23966 when we switched to using the official Docker SDK client, we had to
rework the stats collection loop for the new client. But we missed resetting the
timer on the collection loop, which meant that we'd only collect stats once and
then never again.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=550814)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
tgross added a commit that referenced this pull request Oct 1, 2024
In ##23966 when we switched to using the official Docker SDK client, we had to
rework the stats collection loop for the new client. But we missed resetting the
timer on the collection loop, which meant that we'd only collect stats once and
then never again.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=550814)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
tgross added a commit that referenced this pull request Oct 1, 2024
In ##23966 when we switched to using the official Docker SDK client, this
included new API calls for attaching to the "exec objects" created for running
processes inside a running Docker task. When we updated the API for the
non-streaming cases (script health checks, and `change_mode = "script"`), we
used the container ID and not the exec object ID. These IDs aren't identical
because you can have multiple exec objects for a given container. This results
in errors like "unable to upgrade to tcp, received 404" because the Docker API
can't find the exec object with the container ID.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=551618)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
tgross added a commit that referenced this pull request Oct 1, 2024
In ##23966 when we switched to using the official Docker SDK client, this
included new API calls for attaching to the "exec objects" created for running
processes inside a running Docker task. When we updated the API for the
non-streaming cases (script health checks, and `change_mode = "script"`), we
used the container ID and not the exec object ID. These IDs aren't identical
because you can have multiple exec objects for a given container. This results
in errors like "unable to upgrade to tcp, received 404" because the Docker API
can't find the exec object with the container ID.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=551618)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
@ahjohannessen
Copy link

@tgross Seems like this has a breaking change with regards to:

plugin "docker" {
    config {
        auth {
            config = "/etc/nomad.d/.docker-creds"

        }
    }
}

Nomad 1.8.4 the above works, bumping to 1.9.0 results in:

Failed to pull `quay.io/our-repo/app:0.0.1`: Error response from daemon: unauthorized: access to the requested resource is not authorized

@tgross
Copy link
Member

tgross commented Oct 11, 2024

@ahjohannessen please open a new issue

tgross added a commit that referenced this pull request Oct 16, 2024
In #23966 we switched to the official Docker SDK for the `docker` driver. In the
process we refactored code around stats collection to use the "one shot" version
of stats. Unfortunately this "one shot" stats collection does not include the
`PreCPU` stats, which are the stats from the previous read. This breaks the
calculation we use to determine CPU ticks, because now we're subtracting 0 from
the current value to get the delta.

Switch back to using the streaming stats collection. Add a test that fully
exercises the `TaskStats` API.

Fixes: #24224
Ref: https://hashicorp.atlassian.net/browse/NET-11348
pkazmierczak added a commit that referenced this pull request Oct 16, 2024
…rectly (#24215)

In #23966 we introduced an official Docker client and did not notice that in
contrast to our previous 3rd party client, the official SDK PullOptions object
expects a base64 encoded JSON with username and password, instead of username/
password pair.
tgross added a commit that referenced this pull request Oct 17, 2024
In #23966 we switched to the official Docker SDK for the `docker` driver. In the
process we refactored code around stats collection to use the "one shot" version
of stats. Unfortunately this "one shot" stats collection does not include the
`PreCPU` stats, which are the stats from the previous read. This breaks the
calculation we use to determine CPU ticks, because now we're subtracting 0 from
the current value to get the delta.

Switch back to using the streaming stats collection. Add a test that fully
exercises the `TaskStats` API.

Fixes: #24224
Ref: https://hashicorp.atlassian.net/browse/NET-11348
pkazmierczak added a commit that referenced this pull request Oct 17, 2024
…24237)

During a refactoring of the docker driver in #23966 we introduced a bug: API
version negotiation option was not passed to every new client call.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants