Nomad Jobs unable to update Consul Services after undetermined amount of time. #20159
To add some more information, we are seeing these logs on the consul clients when the issue is happening.
These logs were taken from one of the clusters where we are experiencing the issue. I've not been able to reproduce the issue since restarting the Nomad server and client yet.
Hi @JoaoPPinto 👋 I was kind of able to reproduce this issue by manually deleting the Consul ACL token created for the Nomad service. By any chance, do you have some kind of background process that may have deleted the token? Would you be able to check the Consul logs for messages such as this?
And, if possible, would you be able to share more lines around the error log line? Having the full log could be useful as well. You can email them to [email protected] with the GitHub issue ID in the subject to avoid leaking sensitive data. Thanks!
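If you want to script that check, here is a minimal sketch using the official Consul Go API client (the accessor ID is a placeholder and the program is illustrative only, not part of Nomad's code):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent using the default configuration;
	// the CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN environment variables apply.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder accessor ID of the token Nomad derived for the workload.
	accessorID := "00000000-0000-0000-0000-000000000000"

	// If this returns an ACL "not found" style error, the token was most
	// likely deleted out-of-band.
	token, _, err := client.ACL().TokenRead(accessorID, nil)
	if err != nil {
		log.Fatalf("token read failed: %v", err)
	}
	fmt.Printf("token %s exists, description: %q\n", token.AccessorID, token.Description)
}
```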
Thank you for your reply @lgfa29 ! I can confirm that manually deleting the service token in consul reproduces the problem in the test environment I created.
We also noticed, after opening the issue, that in that cluster, while the problem is manifesting, the Nomad clients themselves are unable to register the Nomad service in Consul.
The token the agents are using does exist and has a pretty permissive policy in this environment. What doesn't make sense to me is that I'm using the exact same policy (without the bootstrap token) in the separate test cluster and haven't seen the issue there yet. In the meantime, I'll try to collect additional logs.
Thanks for the extra info. I suspect this may be related to #20184, which I noticed while investigating this issue. Do you know if clients have been restarted?
With the new workload identity workflow, the tokens used to call most Consul API endpoints are the ones derived for each task, not the agent token. The integration docs have a diagram that kind of illustrates this, so I think the Consul token given to the Nomad agent shouldn't affect this 🤔 My biggest hunch so far is that there seem to be a few places in the Consul integration that rely on ephemeral storage, which may be lost if a Nomad agent restarts or crashes. With the information we have so far I find the
Looking through Consul's code base, we find that this message is supposed to have the accessor ID of the secret ID used in the API call. But the

From the other logs you've shared, I think Nomad is trying to register a service that is missing from the Consul catalog.

nomad/command/agent/consul/service_client.go Lines 1071 to 1080 in 23e4b7c
We can see that the token used is fetched from a local map.

nomad/command/agent/consul/service_client.go Lines 1912 to 1917 in 23e4b7c
And this map is updated every time we register a new workload or update it.

nomad/command/agent/consul/service_client.go Lines 1509 to 1511 in 23e4b7c
nomad/command/agent/consul/service_client.go Lines 1647 to 1649 in 23e4b7c
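For readers following along, the bookkeeping described above boils down to roughly this pattern (a simplified sketch with illustrative names, not the actual service_client.go code):

```go
package sketch

import "sync"

// tokenTracker mimics the in-memory bookkeeping described above: workload
// registration records the per-service Consul token, and the periodic sync
// later looks it up by service ID. Because the map lives only in memory,
// an agent restart or a missed update leaves the sync with an empty token.
type tokenTracker struct {
	mu     sync.RWMutex
	tokens map[string]string // service ID -> Consul ACL token secret
}

// setToken is called when a workload is registered or updated.
func (t *tokenTracker) setToken(serviceID, token string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.tokens == nil {
		t.tokens = make(map[string]string)
	}
	t.tokens[serviceID] = token
}

// tokenFor is called during sync; an empty result means the request falls
// back to the agent's default token, which may lack permission.
func (t *tokenTracker) tokenFor(serviceID string) string {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.tokens[serviceID]
}
```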
This points to either of two scenarios:
So it seems like there's some data consistency issue inside Nomad, but I have not been able to pinpoint exactly where, and it may take us a while to find it... I have placed this on our board so we can properly roadmap it and find some time for a more in-depth investigation. It's not ideal, but if this is something that is happening frequently you may want to keep an eye on the metric

And if you have more logs, that could be helpful as well; the more the better. If you could look for the allocation ID or the service name so we can understand their entire lifecycle, that would help us.
Thanks again for the reply.

| Do you know if clients have been restarted?

Yes. We noticed that restarting the Nomad Client would allow services to be registered again, so our workaround for the moment is restarting the Nomad Clients when we detect the problem has manifested itself.

I have checked our metrics and it seems we started seeing an increase when we changed the cluster configuration to workload identity. I believe the problem has happened in the test cluster I stood up; I'll collect all the information I can and hopefully submit it later today.
I'll send the full log files via the nomad-oss-debug address as they are quite large, but I'll post what I believe are the most relevant bits here. The error was noticed on 21 Mar at ~09:18, when creating a new allocation for a job that was already running (by stopping the running allocation via the UI).

Nomad Version
Operating System
Neither Client nor Server were restarted in the couple days prior to the error.
Nomad Server Logs
I took the liberty of removing some debug HTTP request logs to improve readability.
Nomad Client Log
Again, I removed some HTTP request logs to improve readability.
Consul Client Logs
For completeness' sake, here are the Consul client logs from the agent on the same machine as the Nomad client.
It intrigues me that the Consul Client log states that the check
Thanks for the extra info. Another thing to check is to make sure the tokens created with the workload identity really don't have an expiration date. You can use the
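As an illustration, the expiration can also be checked programmatically; here is a sketch using the Consul Go API client (the function name and setup are hypothetical):

```go
package sketch

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// checkExpiration reads a token by accessor ID and reports whether it has
// an expiration set. Tokens minted for Nomad workload identities are not
// expected to carry one when the auth method has no TTL configured.
func checkExpiration(client *api.Client, accessorID string) error {
	token, _, err := client.ACL().TokenRead(accessorID, nil)
	if err != nil {
		return err
	}
	if token.ExpirationTime != nil && !token.ExpirationTime.IsZero() {
		return fmt.Errorf("token %s expires at %s", token.AccessorID, token.ExpirationTime)
	}
	return nil
}
```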
Seeing similar behavior, except:
Also seeing logs like this:
The services and other Consul interaction for the Nomad client itself work fine, and the configured tokens (nomad, consul client default, consul client agent) do have sufficient ACLs. Service registration for individual Nomad jobs will not work at all, however, and results in the above error messages. Also noting the

We have had a long-standing (years) issue of similar behavior, where this would start to happen after a while. It was never resolved properly but was addressed by scanning logs for the error and restarting the nomad process when it happens (this restores service registration; a SIGHUP or the other measures we tried would not suffice, and since we're templating in certs and tokens from Vault, solving this reliably requires #4593 and #3885).

What is new as of last week (an existing job definition that worked fine previously, on newly provisioned clients with config that is identical to, and working fine on, other clients in the same cluster) is that we now have jobs/clients which persistently fail like above from the get-go. This does not get resolved by restarts of either the Consul or Nomad clients, and we never seem to get those failing instances working at all.

These issues look like they could be related:
| Another thing to check is to make sure the tokens created with the workload identity really don't have an expiration date.
I noticed today, when the issue manifested itself, that while Nomad was trying (and failing) to deregister a service, the health check for that service was still showing as healthy several hours after the allocation was replaced (and thus could not actually be healthy). The Consul client does log a warning that the check is critical, but in the UI there is no error indication. I suppose this is, if anything, a separate issue for the Consul repo, but I figured I should add this info here.
We are experiencing a similar problem: a single Nomad client + Consul client agent consistently loses "sync" with the rest of the cluster somehow, and as a result Nomad stops being able to write to the Consul agent. The node this is running on has above-average latency and occasional packet loss to the rest of the cluster, for what it's worth. This is what we're seeing:
Only a restart of consul + nomad allows the nomad client to once again write to the consul client. Any help would be appreciated.
Hi, we're testing workload identities and have encountered this as well:
We have set the Consul auth method TTL to 24h. We started to see problems syncing with Consul from Nomad in the client logs. Going to the point where it starts to fail, we can see this on the client:
It matches with this event on the consul server:
I guess that after the 24h mark from when the allocation started (and the token was derived using the JWT), Consul deletes the token and Nomad starts complaining about the Consul sync, as it is now using that token instead of the configuration token to perform registrations. Looking at the code, I saw that Consul tokens are only derived in the alloc Prerun() (if I'm not mistaken). So I guess that during the lifetime of the allocation a new token is not derived.
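A rough sketch of the lifecycle being described, with illustrative names only (not Nomad's actual hook code): the token is derived once when the allocation starts and is then reused for every sync, so a TTL on the auth method invalidates it mid-lifecycle.

```go
package sketch

// consulTokenHook illustrates the lifecycle described above: the Consul
// token is derived from the workload identity JWT once, in Prerun, cached,
// and then reused for every later service sync. Nothing re-derives it, so
// if the auth method has a TTL the cached token eventually expires and the
// sync starts failing until the allocation or client is restarted.
type consulTokenHook struct {
	deriveToken func(workloadJWT string) (string, error) // e.g. a Consul login using the WI JWT
	cachedToken string
}

// Prerun derives the token exactly once per allocation.
func (h *consulTokenHook) Prerun(workloadJWT string) error {
	tok, err := h.deriveToken(workloadJWT)
	if err != nil {
		return err
	}
	h.cachedToken = tok
	return nil
}

// syncServices reuses the cached token; once it has expired server-side,
// every sync fails with ACL errors like the ones shown above.
func (h *consulTokenHook) syncServices(register func(token string) error) error {
	return register(h.cachedToken)
}
```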
Hey folks, just a heads up that I'm picking this back up. I wanted to point out that if your Consul Auth Method has a TTL, then that is not the issue described here but rather #20185. We do not currently expect that the tokens being issued for Workload Identities have a TTL (but I will fix that as well). This issue is for the case where you definitely don't have a TTL on the Auth Method and aren't otherwise deleting the token out-of-band somehow, and we don't yet know why that is. I'll report back soon on the results of more investigation.
I'm digging through the logs posted and the logs on the client illustrate the problem nicely. We see the allocation
Then only 37s later we see the client gets an updated version of the alloc from the server and stops it. At that point we get the failed
If we look at the Consul agent logs from the same time period, we can see we're deregistering the service for
I dug in from there and found that the client gets "stuck" like this thereafter because

That still doesn't explain where the token went in the first place! I suspect that at least some of the folks reporting here don't have the Consul agent token set with appropriate permissions, as described in #23800. But I'm sure there are other cases as well. Still investigating...
Because I just got an internal message on this issue that was confused on the subject again: if you think you're hitting this, check the auth method you've configured in Consul. It should look something like this and not have a field like
Nomad cannot currently work with
After an extended discussion with the Consul folks to make sure what we were doing was reasonable, I've identified two core problems with the current workflow:
Both of these fixes are in #24166. We missed the boat on the 1.9.0 release train, but this will ship in 1.9.1 (with Enterprise backports). I've also added documentation around the auth method TTL in #24167.
I would like to clarify one point. When using the Workload Identity workflow for Consul, does the Nomad server policy still require
@philek we've left that in there until 1.10 LTS ships because anyone doing an upgrade to WI will still need
When the local Consul agent receives a deregister request, it performs a pre-flight check using the locally cached ACL token. The agent then sends the request upstream to the Consul servers as part of anti-entropy, using its own token. This requires that the token we use for deregistration is valid even though that's not the token used to write to the Consul server. There are several cases where the service identity token might no longer exist at the time of deregistration: * A race condition between the sync and destroying the allocation. * Misconfiguration of the Consul auth method with a TTL. * Out-of-band destruction of the token. Additionally, Nomad's sync with Consul returns early if there are any errors, which means that a single broken token can prevent any other service on the Nomad agent from being registered or deregistered. Update Nomad's sync with Consul to use the Nomad agent's own Consul token for deregistration, regardless of which token the service was registered with. Accumulate errors from the sync so that they no longer block deregistration of other services. Fixes: #20159
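The shape of that change, as a simplified sketch (illustrative only, not the actual patch in #24166): deregister with the agent's own token and accumulate per-service errors instead of returning early.

```go
package sketch

import (
	"github.com/hashicorp/consul/api"
	"github.com/hashicorp/go-multierror"
)

// syncDeregister sketches the fixed behavior described above: stale
// services are deregistered through a client configured with the Nomad
// agent's own Consul token rather than each service's identity token, and
// failures are accumulated so one broken token cannot block the rest of
// the sync.
func syncDeregister(agentToken string, staleServiceIDs []string) error {
	cfg := api.DefaultConfig()
	cfg.Token = agentToken // always deregister with the agent's token

	client, err := api.NewClient(cfg)
	if err != nil {
		return err
	}

	var mErr *multierror.Error
	for _, id := range staleServiceIDs {
		if err := client.Agent().ServiceDeregister(id); err != nil {
			// Record the failure but keep going so other services still sync.
			mErr = multierror.Append(mErr, err)
		}
	}
	return mErr.ErrorOrNil()
}
```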
Hello,
We recently upgraded one of our Nomad clusters to 1.7.5 and enabled Workload Identities for both Consul and Vault.
We noticed a behavior where Nomad jobs apparently cannot update/register their services in Consul after an undetermined amount of time (from our tests it could range from minutes to hours).
Restarting the Nomad Client Agents appears to mitigate the issue temporarily.
We managed to reproduce the problem in a newly created Nomad/Consul/Vault cluster, with 1 server agent and 1 client agent for each.
The workload identity configuration was done with the help of nomad setup consul, and the default values were used for policies/binding-rules/roles. I have since changed the Nomad log level to DEBUG, but haven't managed to reproduce the issue again after restarting the agents.
Nomad version
Operating system and Environment details
Issue
After an undetermined amount of time, Nomad jobs no longer can correctly update Services in Consul. (eg. if a new allocation of a service replaces an existing, no health checks are created for the new allocation, and the health checks for old allocation remain in place)
Reproduction steps
Stand up new Nomad cluster with Consul Workload Integration, Deploy test job, and trigger a new allocation placement after initial service registration.
Expected Result
Services are properly updated
Actual Result
Services are not updated at all: new services are not registered, old services are not deregistered, and existing services are not updated.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
This is the only error in the log. I have since changed the log level to DEBUG in both Server and Client agents in an attempt to get more logs, but haven't been able to reproduce the problem since restarting the agents.