Services not unregistered #16616
Comments
Hi @dani, do you have any logs from the clients that were running the allocations whose services should have been deregistered? If you can pass them along, I can take a look through them and see if I can identify anything useful. Any other information that might help us reproduce this would also be great.
1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files, which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag that the services have been deregistered after the `PostRun` hook has run. Thinking about it now and reading our own docs, it is unclear whether `PostRun` implies an alloc is terminal... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.
I was just able to trigger it on my Prometheus job:
Here are my system logs during this rolling update:
Hi @dani, I've not been able to reproduce this locally yet; are you able to share the jobspec (or a redacted version of it) and what exactly is being changed before you register the new version? Thanks.
OK, this particular job file was quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).
I have seen the same issue, and I've even reproduced it using the counter demo app: deploy the demo app, add an additional tag, re-deploy, and you now have two instances registered in Consul. Hope this helps. I have also reverted my lab to 1.5.1-1.
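For anyone trying to reproduce it the same way, a rough sketch of that sequence; the jobspec file name and service name are placeholders for whichever demo job you use:

nomad job run counter-dashboard.nomad      # initial deployment
# edit the jobspec and add an extra tag (e.g. "v2") to the service block, then:
nomad job run counter-dashboard.nomad      # rolling update
# both the old and the new instance now show up in Consul:
consul catalog services -tags
curl -s "http://127.0.0.1:8500/v1/health/service/counter-dashboard" | jq '.[].Service | {ID, Tags}'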
I had the same problem, and NSD (Nomad's native service discovery) had the same problem as well.
Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes. Reverted to Nomad 1.4.5. Regards.
Hi everyone, and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so we will update here once we have anything more.
Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: #16739
Hi everyone, we are continuing to look into this, and while we were able to reproduce it in a manner, I wanted to gather some more information. For those that have experienced this: are you setting the Consul agent ACL token via the agent configuration file? It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
In my case, I set the token in the config file, like this:
Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)
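For reference, a minimal sketch of what setting the agent token in the Consul agent configuration looks like (placeholder token value), as opposed to setting it at runtime with `consul acl set-agent-token agent <token>`:

cat > /etc/consul.d/acl.hcl <<'EOF'
acl {
  enabled = true
  tokens {
    agent = "00000000-0000-0000-0000-000000000000"   # placeholder agent token
  }
}
EOF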
We are also setting it via the Consul config:
This issue is still present in Nomad v1.5.3.
An observation from my side: I created this bash script to clean up the services that were not deregistered in Consul, as an interim solution.
#!/bin/bash
CONSUL_HTTP_ADDR="http://consul.service.consul:8500"   # no trailing slash; the paths below start with "/v1/..."
CONSUL_TOKEN=XXXX
# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')
# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
service_id=$(echo "$check" | jq -r '.ServiceID')
node=$(echo "$check" | jq -r '.Node')
if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
else
echo "Skipping check with no associated service instance or node"
fi
done

This works, and the service instances that are "dead" are removed from the Consul UI, but only for a few seconds; they reappear moments later. I cannot confirm this, but I suspect Nomad is re-registering them. This is still on Nomad 1.5.2.
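A quick way to check whether the re-registration is coming from the agent side (i.e. the Nomad client talking to its local Consul agent) rather than from the catalog itself is to look at the local agent's service list right after running the cleanup; this is only a sketch, using the same placeholder address and token as above:

# Service IDs registered with the local Consul agent; a "dead" instance that
# reappears here moments after being deregistered is being re-registered locally.
curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "http://127.0.0.1:8500/v1/agent/services" | jq 'keys'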
Indeed. The issue does not belong to Consul: if you restart the Nomad service, the dead services disappear from Nomad and Consul.
1.5.3 has the same bug.
We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.
I don't have ACLs enabled. I don't think this issue is related to the ACL system.
Not sure if it's related, but I keep experiencing this without ACLs turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. The combo is: Consul
I had this issue with the Nomad service provider without ACLs; the context details are in #16890. Ubuntu 22.04.2 LTS.
Encountered this as well; I am able to reproduce it quite reliably with the following sequence:
Using Nomad 1.4.7. The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed. Suspecting it could be related to #16289, but I haven't confirmed. Update: downgraded clients to 1.4.6 and I do not (seem to) see this issue anymore using the above steps.
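For anyone who needs the restart as a stopgap, a sketch of that workaround, assuming the client runs under systemd and using a placeholder service name:

# On the client node that ran the allocation; restarting the Nomad agent makes
# it reconcile its registrations and drop the stale entries:
sudo systemctl restart nomad
# then confirm the stale instances are gone:
curl -s "http://127.0.0.1:8500/v1/health/service/my-service" | jq '.[].Checks[] | {CheckID, Status}'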
I've broken out the data integrity fixes to #20590, and I'll do the client-side work as a separate PR.
This changeset fixes three potential data integrity issues between allocations and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services. This makes it possible for the leader to successfully write the node update to Raft without removing services. To address this, move the delete into the same Raft transaction. One minor caveat with this approach is the upgrade path: if the leader is upgraded first and a node is marked down during this window, older followers will have stale information until they are also upgraded. This is unavoidable without requiring the leader to unconditionally make an extra Raft write for every down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in data integrity for stale reads seems like a reasonable tradeoff.
* When an allocation is marked client-terminal from the client in `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by deregistering services for that allocation.
* When an allocation is deleted during eval garbage collection, we have an opportunity to ensure data integrity by deregistering services for that allocation. This is a cheap no-op if the allocation has been previously marked client-terminal.

This changeset does not address client-side retries for the originally reported issue, which will be done in a separate PR.

Ref: #16616
Client-side work is up in #20596.
When the allocation is stopped, we deregister the service in the alloc runner's `PreKill` hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload only logs errors and never retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so that transient errors give us an extra opportunity to deregister the service before the tasks are stopped and before we need to fall back to the data integrity improvements implemented in #20590.

Ref: #16616
OK folks, #20596 and #20590 have both been merged and will ship in the upcoming Nomad 1.8.0 (and supported backport versions), so I'm going to close this issue.

"But I saw it again!"

If you see services left behind after an allocation has stopped on Nomad 1.8.0 or beyond, please let us know. We may move reports to a new issue in order to properly triage. Make sure to include the following:
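A rough sketch of commands that gather the kind of information that's useful for a report like this (job name and allocation ID are placeholders; the exact list asked for above may differ):

nomad version
nomad job status -verbose my-job            # job, deployment, and allocation summary
nomad alloc status -verbose <alloc-id>      # the allocation whose service was left behind
consul catalog services -tags               # what is still registered in Consul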
@tgross Looks like we have an indication of this again in Nomad v1.8.1
Will work on getting a working reproduction on the side to deliver if needed; just let me know in Slack. Wanted to post here to keep track.
Hi @tgross, we also still have this issue in our CI environment (1 Nomad server + Consul + Vault + 3 clients).
Thanks.
@ngcmac I'm going to continue the investigation into the case where we're getting "ACL not found" with Workload Identities in #23494. @natemollica-nm is going to do some follow-up on his report, and we may or may not adjust #23494 to cover that case as well, depending on the outcome.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Just upgraded to Nomad 1.5.2. Since then, services are not always deregistered from the Consul service catalog when they are shut down / upgraded, so old service versions appear as failed, e.g.:
Environment:
I haven't yet found a pattern to reproduce it 100% of the time.
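In the meantime, a quick way to spot the leftover instances is to ask Consul for checks in the critical state; a minimal sketch, assuming the default local agent address:

# Stale instances left behind by stopped allocations show up with critical checks:
curl -s "http://127.0.0.1:8500/v1/health/state/critical" | jq '.[] | {ServiceID, ServiceName, Node}'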