Vault address is ignored in nomad clients on v1.7.0 #19380

Closed
L-P opened this issue Dec 8, 2023 · 7 comments · Fixed by #19439

L-P commented Dec 8, 2023

Nomad version

# nomad version
Nomad v1.7.0
BuildDate 2023-12-07T08:28:54Z
Revision e4150e9703f3be6ee2339f0e45ff0801186e022b

Operating system and Environment details

# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm
# uname -a
Linux worker-plane 6.1.0-13-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux

Running on AWS EC2 t3.small.

Issue

The Nomad client attempts to reach the local Vault agent at 127.0.0.1:8200,
ignoring its configured address.

Downgrading the client to v1.6.4 fixes the issue.

Reproduction steps

  1. Follow what is now the "legacy" Vault setup, using a local Vault agent
    configured to listen on a unix socket (a sketch of the agent side follows below):
    vault {
      enabled = true
      address = "unix:///var/run/vault/agent.socket"
      create_from_role = "nomad-cluster"
    }
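
For reference, the Vault agent side of this setup might look roughly like the sketch below; only the unix listener path is taken from the Nomad config above, while the upstream address and the auto_auth details are assumptions.

# Sketch of a local Vault agent config for this setup; everything except the
# unix listener path is hypothetical.
vault {
  address = "https://vault.example.internal:8200" # assumed upstream Vault address
}

auto_auth {
  method "aws" {
    config = {
      type = "ec2"             # the agent authenticates via EC2 instance roles
      role = "<aws-auth-role>" # hypothetical AWS auth role name
    }
  }
}

api_proxy {
  use_auto_auth_token = true # forward requests from Nomad with the agent's own token
}

listener "unix" {
  address = "/var/run/vault/agent.socket" # matches the Nomad client's vault.address
}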
    

Expected Result

The value of address is used to connect to the Vault agent.

Actual Result

A different address (the default 127.0.0.1:8200) is used, so the agent is never reached.

Job file (if appropriate)

Uses templates with {{with secret "<path>"}}.
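
For context, such a template block looks roughly like the sketch below; the secret path comes from the client logs, while the issuing parameters and destination are assumptions.

template {
  # Sketch only: common_name and destination are hypothetical; the write to
  # pki_int/issue/consul matches the path seen in the client logs.
  data        = <<-EOT
  {{ with secret "pki_int/issue/consul" "common_name=consul.service.example" }}
  {{ .Data.certificate }}
  {{ end }}
  EOT
  destination = "local/consul.crt"
}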

Nomad Server logs (if appropriate)

n/a

Nomad Client logs (if appropriate)

2023-12-08T10:08:32.525Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=e3289405-b9fb-3498-290a-0d1a5a44b59a task=fabio type=Killing msg="Template failed: vault.write(pki_int/issue/consul -> 1656cb6c): vault.write(pki_int/issue/consul -> 1656cb6c): Put \"https://127.0.0.1:8200/v1/pki_int/issue/consul\": dial tcp 127.0.0.1:8200: connect: connection refused" failed=true
L-P added the type/bug label Dec 8, 2023

tgross commented Dec 8, 2023

Hi @L-P! Unfortunately I wasn't able to reproduce this. I used the following Vault listener config:

listener "unix" {
  address = "/home/tim/tmp/vault.sock"
}

The following Nomad vault config block:

vault {
  address = "unix:///home/tim/tmp/vault.sock"
  token = "<redacted>"
  enabled = true
}

And just for good measure, I ensured that no traffic could be getting to the TCP listener with sudo iptables -A INPUT -i lo -p tcp --dport 8200 -j DROP. I can see a successful fingerprint via nomad node status -self -verbose | grep vault, and a job I ran with a template seems to be working.

As an aside, I noticed that you don't have a token field in your configuration, which suggests to me that you might be using the VAULT_TOKEN environment variable. This was accidentally broken in 1.7.0 and fixed in #19349, which will go out shortly in Nomad 1.7.1.

Another thing that you can check for me is the /agent/self API that shows the parsed configuration. Here's what I get from the config above:

$ nomad operator api "/v1/agent/self" | jq .config.Vaults
[
  {
    "Addr": "unix:///home/tim/tmp/vault.sock",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": "<redacted>"
  }
]

You can also look for a log line like this:

2023-12-08T10:34:25.709-0500 [DEBUG] agent: (runner) final config:

And that'll have a big ol' blob of JSON that's given to the template runner. In there should be a Vault.Address field that'll show what the template runner is using for its address.

$ cat ~/tmp/blob.json | jq .Vault.Address
"unix:///home/tim/tmp/vault.sock"

tgross added this to the 1.7.x milestone Dec 8, 2023
tgross self-assigned this Dec 8, 2023

L-P commented Dec 11, 2023

There's no VAULT_TOKEN; the local Vault agent is open and tokenless (it
authenticates itself via AWS EC2 instance roles and only serves as an
authenticating proxy).

Here is the requested info plus task logs, using the exact same Nomad config
and creating the instance from scratch every time:

Nomad v1.6.4

$ nomad version && nomad operator api -tls-skip-verify -address="https://127.0.0.1:4646" "/v1/agent/self" | jq .config.Vault
Nomad v1.6.4
BuildDate 2023-12-07T08:27:54Z
Revision dbd5f36a24a924e2ba4dd6195af6a45c922ac8c6
{
  "Addr": "unix:///var/run/vault/agent.socket",
  "AllowUnauthenticated": true,
  "ConnectionRetryIntv": 30000000000,
  "Enabled": true,
  "Namespace": "",
  "Role": "nomad-cluster",
  "TLSCaFile": "",
  "TLSCaPath": "",
  "TLSCertFile": "",
  "TLSKeyFile": "",
  "TLSServerName": "",
  "TLSSkipVerify": null,
  "TaskTokenTTL": "",
  "Token": ""
}

$ nomad node status -self -verbose -address=https://127.0.0.1:4646 -tls-skip-verify | grep vault
# (no output)

Task logs:

Dec 11, '23 09:07:31 +0100	Started	Task started by client
Dec 11, '23 09:07:25 +0100	Driver	Downloading image
Dec 11, '23 09:07:23 +0100	Task Setup	Building Task Directory
Dec 11, '23 09:07:23 +0100	Received	Task received by client

Nomad v1.7.0

$ nomad version && nomad operator api -tls-skip-verify -address="https://127.0.0.1:4646" "/v1/agent/self" | jq .config.Vaults
Nomad v1.7.0
BuildDate 2023-12-07T08:28:54Z
Revision e4150e9703f3be6ee2339f0e45ff0801186e022b
[
  {
    "Addr": "unix:///var/run/vault/agent.socket",
    "AllowUnauthenticated": true,
    "ConnectionRetryIntv": 30000000000,
    "DefaultIdentity": null,
    "Enabled": true,
    "JWTAuthBackendPath": "jwt-nomad",
    "Name": "default",
    "Namespace": "",
    "Role": "nomad-cluster",
    "TLSCaFile": "",
    "TLSCaPath": "",
    "TLSCertFile": "",
    "TLSKeyFile": "",
    "TLSServerName": "",
    "TLSSkipVerify": null,
    "TaskTokenTTL": "",
    "Token": ""
  }
]

$ nomad node status -self -verbose -address=https://127.0.0.1:4646 -tls-skip-verify | grep vault
# (no output)

Task logs:

Dec 11, '23 08:19:41 +0100	Killing	Template failed: vault.write(pki_int/issue/consul -> 1656cb6c): vault.write(pki_int/issue/consul -> 1656cb6c): Put "https://127.0.0.1:8200/v1/pki_int/issue/consul": dial tcp 127.0.0.1:8200: connect: connection refused
Dec 11, '23 08:13:03 +0100	Template	Missing: vault.write(pki_int/issue/consul -> 1656cb6c)
Dec 11, '23 08:13:00 +0100	Task Setup	Building Task Directory
Dec 11, '23 08:13:00 +0100	Received	Task received by client

Nomad v1.7.1 behavior is slightly different (but /v1/agent/self output is identical):

Dec 11, '23 08:33:19 +0100	Killing	Template failed: vault.write(pki_int/issue/consul -> 1656cb6c): vault.write(pki_int/issue/consul -> 1656cb6c): Put "https://127.0.0.1:8200/v1/pki_int/issue/consul": dial tcp 127.0.0.1:8200: connect: connection refused
Dec 11, '23 08:26:42 +0100	Template	Missing: vault.write(pki_int/issue/consul -> 1656cb6c)
Dec 11, '23 08:26:28 +0100	Template	Missing: vault.write(pki_int/issue/consul -> 1656cb6c)
Dec 11, '23 08:23:20 +0100	Template	Missing: vault.write(pki_int/issue/consul -> 1656cb6c)
Dec 11, '23 08:23:17 +0100	Task Setup	Building Task Directory
Dec 11, '23 08:23:17 +0100	Received	Task received by client


tgross commented Dec 11, 2023

Thanks @L-P! @lgfa29 and I are still working on a reproduction. Just one question for clarity: your job has a vault block in the group or task, with the appropriate policies, right?
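
For reference, the kind of block I mean looks roughly like this inside the group or task (the policy name is only an example):

vault {
  policies = ["example-policy"] # hypothetical policy covering the pki_int path
}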

Another item that jumps out at me from your last post is this:

$ nomad node status -self -verbose -address=https://127.0.0.1:4646 -tls-skip-verify | grep vault
# (no output)

And you're getting that for both 1.6.4 and 1.7.0, which means Nomad isn't fingerprinting Vault at all. Is the -address flag you're providing the address of the client node where the job is supposed to be running? I wouldn't expect the Nomad scheduler to place the job at all in that case.

Any Vault agent logs you can provide from the Nomad agent startup, as well as Vault agent logs from the job starting up, would be very helpful here.


tgross commented Dec 11, 2023

Hi @L-P I think I've got a reproduction here and it's at the intersection of a missing vault block in the job and the peculiarities of how the Vault agent proxies access to Vault.

I ran Vault in dev mode with the following configuration:

listener "tcp" {
  address = "127.0.0.1:8202"
  cluster_address = "127.0.0.1:8203"
  tls_disable = true
}

I ran a Vault Agent with the following configuration:

pid_file = "/home/tim/tmp/.vault-agent-pidfile"

vault {
  address = "http://127.0.0.1:8202"
  retry {
    num_retries = 5
  }
}

auto_auth {
  method {
    type = "token_file"

    config = {
      token_file_path = "/home/tim/.vault-token"
    }
  }

}

cache {}

api_proxy {
  use_auto_auth_token = true
}

listener "unix" {
  address = "/home/tim/tmp/vault.sock"
  tls_disable = true

  agent_api {
    enable_quit = true
  }
}

My Nomad agent's vault config block is:

vault {
  address = "unix:///home/tim/tmp/vault.sock"
  token = "hvs.<redacted>"
  create_from_role = "nomad-cluster"
  enabled = true
}

Last, I configured an iptables rule to ensure that it was impossible for traffic on the default port to work: sudo iptables -A INPUT -i lo -p tcp --dport 8200 -j DROP.

Then I ran through the Vault integration and retrieving dynamic secrets tutorial:

  • First I ran through with Nomad 1.6.4 to make sure I had a working setup.
  • Then I ran through with Nomad 1.7.0, and it worked just fine.
  • Then I ran through with Nomad 1.7.0 again, but this time I removed the vault block from the job that needs secrets. This time it failed! The Consul Template configuration in the logs shows an empty string for the address, rather than the unix domain socket. So CT picks up the default address, which won't work.

I've got a preliminary patch that seems to fix the problem by not requiring the vault block in order to get a valid template runner configuration:

diff --git a/client/allocrunner/taskrunner/template_hook.go b/client/allocrunner/taskrunner/template_hook.go
index 3824dd02a8..ea01687ad2 100644
--- a/client/allocrunner/taskrunner/template_hook.go
+++ b/client/allocrunner/taskrunner/template_hook.go
@@ -16,7 +16,6 @@ import (
        cstructs "github.com/hashicorp/nomad/client/structs"
        "github.com/hashicorp/nomad/client/taskenv"
        "github.com/hashicorp/nomad/nomad/structs"
-       structsc "github.com/hashicorp/nomad/nomad/structs/config"
 )

 const (
@@ -212,14 +211,11 @@ func (h *templateHook) Poststart(ctx context.Context, req *interfaces.TaskPostst
 func (h *templateHook) newManager() (unblock chan struct{}, err error) {
        unblock = make(chan struct{})

-       var vaultConfig *structsc.VaultConfig
-       if h.task.Vault != nil {
-               vaultCluster := h.task.GetVaultClusterName()
-               vaultConfig = h.config.clientConfig.GetVaultConfigs(h.logger)[vaultCluster]
+       vaultCluster := h.task.GetVaultClusterName()
+       vaultConfig := h.config.clientConfig.GetVaultConfigs(h.logger)[vaultCluster]

-               if vaultConfig == nil {
-                       return nil, fmt.Errorf("Vault cluster %q is disabled or not configured", vaultCluster)
-               }
+       if h.task.Vault != nil && vaultConfig == nil {
+               return nil, fmt.Errorf("Vault cluster %q is disabled or not configured", vaultCluster)
        }

        tg := h.config.alloc.Job.LookupTaskGroup(h.config.alloc.TaskGroup)

Ordinarily leaving out the vault block won't work at all, but it works with the Vault agent because the agent passes the request through with its own token.


L-P commented Dec 12, 2023

I can confirm I don't have a vault block in the job.

I tried adding one, but it failed with "Task x has a Vault block with an empty list of policies", which is odd because per the docs policies are deprecated and I had role set.
If I set the policies instead, nothing happens. The job is created with 0 allocations. (Not unplaced allocations, no allocations at all.)

So I don't seem to have a workaround.


tgross commented Dec 12, 2023

Ah, the docs could definitely be much clearer on this. The new vault.role field only works with the new Workload Identity workflow. So if you have a vault block it'll be expecting policies.
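
For contrast, the Workload Identity form of that jobspec block would look roughly like this (the role name is only an example):

vault {
  role = "nomad-workloads" # hypothetical Vault role created for the Workload Identity workflow
}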

If I set the policies instead, nothing happens. The job is created with 0 allocations. (Not unplaced allocations, no allocations at all.)

It'd be worth taking a look at the eval status of the eval created when you submit the job, as well as the debug-level logs on the scheduler for that eval.


github-actions bot commented Jan 2, 2025

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 2, 2025