
Vault 'default' name is not set on server, #19901

Open
rwenz3l opened this issue Feb 7, 2024 · 8 comments
Comments

@rwenz3l

rwenz3l commented Feb 7, 2024

Nomad version

Output from nomad version

client+server:

$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328f97d19d2d2d9761251b895b06d82eb1a1

Operating system and Environment details

3 VMs for the servers, many client nodes for the jobs. All running Rocky Linux release 9.3

Issue

I recently started looking into the Vault integration again. While this worked in the past, I noticed in a recent test on the newer version that I get an error when scheduling jobs:

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)

The error comes from here:

return nil, fmt.Errorf("Vault %q not enabled but used in the job",

The name option is documented here, and the docs mention that it should be omitted for non-Enterprise setups:
https://developer.hashicorp.com/nomad/docs/configuration/vault#parameters-for-nomad-clients-and-servers

The job spec mentions it here:
https://developer.hashicorp.com/nomad/docs/job-specification/vault#cluster

My vault config on the servers looks like this:

vault {
  enabled          = true
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}

config of the job/task:

      vault {
        # Attach our default policies to the task,
        # so it is able to retrieve secrets from vault.
        policies = ["nomad-cluster-access-kv"]
      }

I noticed there has been some work done on this, e.g. here:
1ef99f0

and I think there might be a bug with the initialization of the "default" value: it's either not set or not read.

Reproduction steps

I think one might be able to reproduce this by setting up a 1.7.3 cluster and simply integrating Vault.
If I add name = "default" to the vault block on both server and client, it works (see the sketch below); if I don't, I get the error message above.
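
For reference, this is the server-side block from above with the name set explicitly:

vault {
  enabled          = true
  name             = "default"
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}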

Expected Result

The "default" cluster is available by default.

Actual Result

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)
@lgfa29
Contributor

lgfa29 commented Feb 7, 2024

Hi @rwenz3l 👋

I have not been able to reproduce this problem 🤔

Would you be able to share the Vault configuration as returned by the /v1/agent/self API endpoint? You will need to run this on each of your servers.
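
For example, assuming the agent is listening on the default local address and jq is available, something like this should print just the Vault blocks (the exact key path may vary by version):

$ curl -s http://127.0.0.1:4646/v1/agent/self | jq '.config.Vaults'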

Could you also make sure all three servers are running Nomad v1.7.3?

Thanks!

@rwenz3l
Author

rwenz3l commented Feb 8, 2024

Sure:

ctrl1
    "Vaults": [
      {
        "Addr": "https://vault.*******.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],
ctrl2
    "Vaults": [
      {
        "Addr": "https://vault.********.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],
ctrl3
    "Vaults": [
      {
        "Addr": "https://vault.**********.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],

I will continue my work on the Vault integration and will gather more info as I go.

@lgfa29
Contributor

lgfa29 commented Feb 8, 2024

Thanks for the extra information @rwenz3l.

All three configurations look right: "Name": "default" and "Enabled": true. Is the cluster a fresh install, or have you upgraded the servers from a previous version of Nomad?

As an aside, you mentioned you're just starting to look into the Vault integration, so I would suggest following the new workload identity workflow released in Nomad 1.7, as it will become the only supported option in the future. Here's a tutorial that covers it: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-acl
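
For reference, a minimal sketch of a workload-identity-style server vault block based on that tutorial (the address, audience value, and TTL are illustrative):

vault {
  enabled = true
  address = "https://vault.****.com"

  # With workload identities there is no need to distribute a Vault token
  # to the Nomad servers.
  default_identity {
    aud = ["vault.io"]
    ttl = "1h"
  }
}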

@rwenz3l
Author

rwenz3l commented Feb 8, 2024

We've been running this Nomad cluster since 1.3 or so, IIRC; we usually update to the latest major/minor shortly after release.

We definitely plan to use the new workload identities with this. I initially configured the Vault integration before workload identity existed, and it was working fine back then, so I guess something before 1.7.x changed how this key/value is handled. From my limited view, it feels like the default value is not read properly if the key is missing from the nomad.hcl configuration. No need to invest too much time; I would advise setting name = "default" in the Nomad config in case someone else sees this error. If I find more info, I will update here.

@Tirieru

Tirieru commented Mar 6, 2024

I had the same error after configuring the Vault integration following the new 1.7 workflow.

After a few tries, I realized this was caused by a syntax error: I was missing a comma inside the vault block of the Nomad config file (which, in my case, is written in JSON).

I would expect Nomad not to start at all with a syntax error in the JSON config file, but apparently it only caused the Vault integration to stop working. It might be something similar in your case.

@lgfa29
Contributor

lgfa29 commented Mar 20, 2024

Thanks for the extra info @Tirieru. Improving agent configuration validation is something that's been on our plate for a bit now (#11819).

Would you be able to share the exact invalid configuration that caused this error? I have not been able to reproduce it yet.

Thanks!

lgfa29 removed their assignment Mar 20, 2024
@Tirieru

Tirieru commented Mar 21, 2024

This is how the Nomad server configuration looked while the error was happening:

{
  "name": "nomad-1",
  "data_dir": "/opt/nomad/data",
  "bind_addr": "<HOST_ADDRESS>",
  "datacenter": "dc1",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "addresses": {
    "http": "0.0.0.0",
    "rpc": "0.0.0.0",
    "serf": "0.0.0.0"
  },
  "advertise": {
    "http": "<HOST_ADDRESS>",
    "rpc": "<HOST_ADDRESS>",
    "serf": "<HOST_ADDRESS>"
  },
  "acl": {
    "enabled": true
  },
  "server": {
    "enabled": true,
    "rejoin_after_leave": true,
    "raft_protocol": 3,
    "encrypt": "<ENCRYPT_KEY>",
    "bootstrap_expect": 1,
    "job_gc_interval": "1h",
    "job_gc_threshold": "24h",
    "deployment_gc_threshold": "120h",
    "heartbeat_grace": "60s"
  },
  "limits": {
    "http_max_conns_per_client": 300,
    "rpc_max_conns_per_client": 300
  },
  "vault": {
    "token": "<VAULT_TOKEN>",
    "create_from_role": "nomad-cluster",
    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    }
    "address": "<VAULT_ADDRESS>",
    "enabled": true  
  },
  "log_level": "INFO"
}

Adding the missing comma on line 45 of the config above (after the closing brace of the default_identity block, before "address") fixed the issue.
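
For clarity, the corrected fragment looks like this, with the comma added after the default_identity block:

    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    },
    "address": "<VAULT_ADDRESS>",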

@lgfa29
Contributor

lgfa29 commented Mar 21, 2024

Thank you @Tirieru!

Yes, I can verify that the invalid JSON does cause the same error message, but unlike in @rwenz3l's case, the /v1/agent/self API returns the default Vault configuration as disabled ("Enabled": null):

    "Vaults": [
      {
        "Addr": "https://vault.service.consul:8200",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": {
          "Audience": [
            "vault.io"
          ],
          "Env": null,
          "File": null,
          "TTL": null
        },
        "Enabled": null,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],

I'm not sure why this configuration is accepted though. I think the root cause is that Nomad agent configuration is still parsed with the old HCLv1 syntax, which has a less strict JSON parser.
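
A minimal sketch to check that theory, assuming the old github.com/hashicorp/hcl (v1) package; it feeds a JSON snippet with the missing comma into hcl.Parse and simply reports whether the parser accepts it:

package main

import (
	"fmt"

	"github.com/hashicorp/hcl"
)

func main() {
	// JSON with the comma missing between the default_identity block and
	// "address", mirroring the config above.
	input := `{
  "vault": {
    "default_identity": {
      "aud": ["vault.io"],
      "ttl": ""
    }
    "address": "https://vault.example.com",
    "enabled": true
  }
}`

	// hcl.Parse auto-detects JSON input and uses HCLv1's own JSON parser.
	if _, err := hcl.Parse(input); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted")
}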
