Failed to pull images from AWS ECR with docker driver in on-prem instances #10722

Closed
AlexITC opened this issue Jun 8, 2021 · 13 comments

AlexITC commented Jun 8, 2021

Nomad version

Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)

Operating system and Environment details

On premise Ubuntu 20.04 AMD x64.

Issue

My jobs are unable to pull images from ECR. Even after following the steps to configure the AWS CLI, Nomad does not seem to pick up the credentials when pulling the image.

For example, I'm running the nomad client as root user, which has the aws credentials stored at /root/.aws/credentials (HOME being /root).

After reading the HashiCorp forum, GitHub issues, and Nomad's Gitter, I have tried many different approaches, but they all lead to the same problem: Nomad does not seem to use the AWS authentication mechanism.

I haven't checked the Nomad source to find out why this fails, but I have listed the approaches I explored below.

Approach 1

Use docker login with ECR credentials.

Running this command as root gets a token, which allows docker pull to work, but Nomad still fails due to missing credentials. It seems that Nomad is not using the settings in /root/.docker/config.json, which include the token:

  • aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin [account].dkr.ecr.us-east-2.amazonaws.com

Approach 2

Set environment variables that hold the AWS credentials in the systemd service that runs Nomad.

Update nomad.service to include EnvironmentFile=/root/.env. I found this in comments from other users; unfortunately, it hasn't worked for me.

.env:

AWS_ACCESS_KEY_ID="replace_me"
AWS_SECRET_ACCESS_KEY="replace_me"
AWS_DEFAULT_REGION="us-east-2"

Approach 3

Set the HOME=/root environment variable in the systemd service that runs Nomad.

Update nomad.service to include Environment=HOME=/root. I found this in comments from other users; unfortunately, it hasn't worked in my case.

Approach 4

Use docker-credential-ecr-login (from amazon-ecr-credential-helper) while storing the SDK credentials at /root/.aws/credentials, installing it by either of these:

  • sudo apt install amazon-ecr-credential-helper, printing 0.6.0 as the version.
  • Pick it up from https://github.com/awslabs/amazon-ecr-credential-helper/releases/tag/v0.5.0, printing 0.6.3 as the version.

This requires the nomad client config (like client.hcl) to include this snippet:

plugin "docker" {
  config {
    auth {
      config = "/etc/docker-auth.json"
    }
  }
}

While /etc/docker-auth.json has:

{
  "credHelpers": {
    "[account].dkr.ecr.us-east-2.amazonaws.com": "ecr-login"
  }
}

While echo "[account].dkr.ecr.us-east-2.amazonaws.com/[image]" | docker-credential-ecr-login get works, when nomad tries pulling the image it fails with:

time="2021-06-07T22:12:36Z" level=error msg="Error retrieving credentials" error="ecr: Failed to get authorization token: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"

Reproduction steps

It is difficult to provide these steps because a private registry is involved.

Expected Result

A job that uses images from a private ECR registry would run.

Actual Result

The job always fails due to the images not being pulled from ECR.

Job file (if appropriate)

job "test-job" {
  datacenters = ["dc1"]
  type = "service"

  group "the-api" {
    count = 1

    network { mode = "bridge" }

    task "the-grpc-task" {
      driver = "docker"
      
      config {
        image = "[account].dkr.ecr.us-east-2.amazonaws.com/[image]:[version]"
      }
    }
  }
}
Member

tgross commented Jun 8, 2021

Hi @AlexITC!

"Approach 4" is close, but plugin.config.auth.config isn't a supported key here... unfortunately because of the way the task drivers work we can't fail that on job submission (see also #9287 (comment)), but I'd have expected that to throw an error somewhere in the client logs at least.

What you'll need in the client configuration is the configuration of the credential helper. See the docker driver's authentication docs. It should look something like this:

plugin "docker" {
  config {
    auth {
      # Nomad will prepend "docker-credential-" to the helper value and call
      # that script name.
      helper = "ecr-login"
    }
  }
}

Author

AlexITC commented Jun 8, 2021

OK, that's another approach I forgot to mention (it is documented in the official docs). With that configuration, pulling public images fails.

nomad client log:

Jun 08 13:23:21 ip-172-31-47-220 nomad[100495]:     2021-06-08T13:23:21.102Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=83b33fa6-a335-77a8-b341-77a1ab09aad1 task=connect-proxy-postgres error="Failed to find docker auth for repo "envoyproxy/envoy": docker-credential-ecr-login with input "envoyproxy/envoy" failed with stderr: "

Member

tgross commented Jun 8, 2021

Can you try that with auth_soft_fail = true?
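
For anyone following along, here is a minimal sketch of where that option would go. It assumes, per the docker driver docs, that auth_soft_fail is a task-level option set in the job's config block rather than in the client's plugin block. The function name print_task_config and the image name are placeholders for illustration only.

```shell
#!/bin/sh
# Print a task-level docker "config" block with auth_soft_fail enabled.
# $1 is the image reference (the one used below is only a placeholder).
print_task_config() {
  cat <<EOF
config {
  image = "$1"

  # Continue with an unauthenticated pull if the credential helper
  # has no credentials for this image's registry.
  auth_soft_fail = true
}
EOF
}

print_task_config "envoyproxy/envoy"
```

The printed block would replace the config block of whichever task pulls a public image while the client-wide ecr-login helper is configured.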

Author

AlexITC commented Jun 8, 2021

That seems to have done the trick, thanks!

By this point it was hard to keep track of what I hadn't tried. Apart from the way this got solved, shouldn't $HOME/.aws/credentials and $HOME/.docker/config.json be picked up when using a different approach, like the 1st one (docker login)?

Member

tgross commented Jun 8, 2021

shouldn't $HOME/.aws/credentials and $HOME/.docker/config.json be picked up when using a different approach, like the 1st one (docker login)?

It probably should, but keep in mind we're not running docker CLI commands but using the Docker API. I suspect what's happening here is that we're passing empty-but-not-nil auth options, which override the default behavior of the server. That'd be worth looking into as a papercut for sure.

Member

tgross commented Jun 8, 2021

I've opened #10726 to improve that situation. Going to close this issue out as resolved though. Thanks for your patience, @AlexITC!

tgross closed this as completed Jun 8, 2021
Author

AlexITC commented Jun 8, 2021

Thanks!

Author

AlexITC commented Jun 8, 2021

Unfortunately, I tried replicating the setup in a new VM using the comments from this thread, and it doesn't work. The problem seems to be the same: Nomad doesn't invoke docker-credential-ecr-login with the necessary access to the AWS credentials, which live at /root/.aws.

Author

AlexITC commented Jun 8, 2021

Set HOME=/root env variable in the systemd service that runs nomad

Actually, following the comments and also using Approach 3 seems to do the trick. Do you think this should be necessary?
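
For reference, a rough sketch of one way to apply Approach 3 without editing the unit file itself: a systemd drop-in. It assumes the unit is named nomad.service and that /root holds the .aws directory. The function name write_home_dropin and the file name 10-home.conf are made up for illustration.

```shell
#!/bin/sh
# Write a systemd drop-in that sets HOME for the Nomad service.
# $1: drop-in directory, $2: home directory (defaults to /root).
write_home_dropin() {
  dropin_dir="$1"
  home_dir="${2:-/root}"
  mkdir -p "$dropin_dir" || return 1
  printf '[Service]\nEnvironment=HOME=%s\n' "$home_dir" > "$dropin_dir/10-home.conf"
}

# Usage (as root); the unit name nomad.service is an assumption:
#   write_home_dropin /etc/systemd/system/nomad.service.d /root
#   systemctl daemon-reload
#   systemctl restart nomad
```

The daemon-reload and restart are needed because systemd only reads drop-ins when unit files are reloaded, and the new environment only reaches the process on its next start.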

Member

tgross commented Jun 9, 2021

Yeah, it looks like systemd doesn't implicitly set $HOME, so if the AWS SDK is looking at that value you'll need to set it explicitly:

Unit file:

[Unit]
Description=TESTENV

[Service]
Type=oneshot
ExecStart=/opt/testenv.sh
RemainAfterExit=true
ExecStop=/opt/testenv.sh
StandardOutput=journal

[Install]
WantedBy=multi-user.target

Test script:

#!/usr/bin/env bash

env

Output:

$ journalctl -u testenv
-- Logs begin at Wed 2021-05-19 15:18:20 UTC, end at Wed 2021-06-09 12:44:20 UTC. --
Jun 09 12:44:20 linux systemd[1]: Starting TESTENV...
Jun 09 12:44:20 linux testenv.sh[2057]: LANG=en_US.UTF-8
Jun 09 12:44:20 linux testenv.sh[2057]: INVOCATION_ID=f82925288e6249bfb1ab9e09e35099fc
Jun 09 12:44:20 linux testenv.sh[2057]: PWD=/
Jun 09 12:44:20 linux testenv.sh[2057]: JOURNAL_STREAM=9:24351
Jun 09 12:44:20 linux testenv.sh[2057]: SHLVL=1
Jun 09 12:44:20 linux testenv.sh[2057]: LANGUAGE=en_US:
Jun 09 12:44:20 linux testenv.sh[2057]: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
Jun 09 12:44:20 linux testenv.sh[2057]: _=/usr/bin/env
Jun 09 12:44:20 linux systemd[1]: Started TESTENV.
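
A quicker way to approximate this without a unit file is to run the helper under a stripped environment. The sketch below only carries PATH over, which is a simplification of what systemd actually passes. The function names are hypothetical and the registry in the usage comments is the placeholder from this thread.

```shell
#!/bin/sh
# Run a command with an almost-empty environment, roughly like a systemd
# system service that has no Environment= lines. Only PATH is carried over.
run_like_systemd() {
  env -i PATH="$PATH" "$@"
}

# Same as above, but with HOME set explicitly (what Approach 3 does).
# $1: home directory, remaining arguments: the command to run.
run_with_home() {
  home_dir="$1"
  shift
  env -i PATH="$PATH" HOME="$home_dir" "$@"
}

# Hypothetical usage against the ECR helper:
#   echo "[account].dkr.ecr.us-east-2.amazonaws.com" | run_like_systemd docker-credential-ecr-login get
#   echo "[account].dkr.ecr.us-east-2.amazonaws.com" | run_with_home /root docker-credential-ecr-login get
```

If the first call fails with NoCredentialProviders and the second succeeds, the missing $HOME is the likely cause.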

Author

AlexITC commented Jun 9, 2021

Makes sense. I think it is worth documenting this behavior, since I'm definitely not the only one being bitten by it. Do you think that this part is an adequate place to highlight the issue? If so, I can submit a PR for it.

Member

tgross commented Jun 9, 2021

Makes sense. I think it is worth documenting this behavior, since I'm definitely not the only one being bitten by it. Do you think that this part is an adequate place to highlight the issue? If so, I can submit a PR for it.

Yeah I think the right place for it would be where we have the specific example of the ECR helper:

Example agent configuration, using a helper script "docker-credential-ecr-login" in $PATH

Maybe leave a parenthetical there saying something like "you may need to set $HOME in your Nomad environment, see awslabs/amazon-ecr-credential-helper#161"

Also, weirdly I somehow missed that what I said here is wrong:

but plugin.config.auth.config isn't a supported key here...

It's right at https://www.nomadproject.io/docs/drivers/docker#config


github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022