0.8.3 upgrade seems to make consul checks unstable. #4256
Comments
Something that may be related: https://gist.github.com/apenney/f09f90177c7ba168b8f120b299729961 The first example (-production) is working fine. I notice the address/taggedaddress/serviceaddress all match, and are of the nomad instances themselves. In the second example (-uat) the address/taggedaddress is of one of the consul servers itself, and then the serviceAddress is set correctly. Could this be related? I don't know how nomad would be able to make a check that had the address of a consul server in the first place. |
I think my last comment was a red herring. I do notice:
(wait a bit)
So it's definitely deregistering and reregistering the service over and over for no reason. |
In debug mode one of the clients exhibiting this issue is doing this -constantly-:
I feel like it's weird that it would need to re-register constantly, but this may be normal behavior. |
I tried a bunch of other stuff tonight:
Not sure what else to try anymore. |
I did notice from staring at sysdig traces:
10.30.3.123 is a UAT box (so node.class = uat).
But it's deregistering a service that was registered by a totally different staging box. It's like they are all fighting with each other. |
I just downgraded the entire cluster to 0.8.1, hoping #4170 was involved. Unfortunately, while at first it seemed to help, it then fell apart and started deregistering services again. |
I downgraded all nodes to 0.7.1 which stopped the service churn in Consul. The masters are still 0.8.3. This seems to be a nomad bug. |
From someone on slack, who's been trying to help diagnose: Aha yep, these are correlated to the second on our deploys; we do canary + promote, and it's at the promotion step where we see the rpc errors
Hmm I wonder if there's a problem with the order of operations? This strikes me as a potentially benign error
Does de-registering the service de-register the check at the same time, making the second step redundant? Is it complaining that it can't de-register a check that has already gone away due to the first step? |
@apenney Thanks for reporting this and attempting to debug. Wanted to note a few things:
The comment trail above is pretty long, I will try to reproduce this locally and post an update by tomorrow. Don't want to rule out #4170 too early, we are syncing services and checks with Consul regularly in Nomad now and there might be a race condition with that. |
I did suspect #4170 but (buried in the giant wall of text) I mentioned we rolled the entire cluster to 0.8.1 to try and rule that out. |
Quick update: Using the same consul cluster I made a new nomad cluster, with new nodes/servers (completely separate from the other cluster) and deployed a single service. It immediately started deleting/recreating services over and over. It may just be interference from the other nodes, as they talk to the same consul (I used a different "datacenter" name for the new cluster, hoping this would rule out cross talk). Just a data point. |
Further woes:
I went to bed, woke up, everything was broken and no consul services existed in two of the three environments we run in the 0.7.1 cluster. All the containers were working. I restarted nomad on one of the three production boxes and all the prod services registered themselves again. Same for UAT. I don't know anymore. |
I... may have it. We had our consul {} section with an address of consul.service.consul instead of 127.0.0.1:8500. Since rebuilding the cluster (again) with this changed to localhost, the fighting over services I was seeing seems to have stopped. I'll wait the rest of the day to confirm we've actually helped it, but this does seem to be involved. Talking directly to the masters instead of the local agent seemed to cause problems. |
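For anyone landing here later, the change described above corresponds to the `consul` stanza in the Nomad agent configuration. A minimal sketch (file path and layout assumed; the two address values are the ones mentioned in this thread):

```hcl
# Nomad agent configuration (e.g. /etc/nomad.d/consul.hcl -- path assumed).

# Problematic: every Nomad agent resolves the same remote Consul servers,
# so their periodic syncs fight over service registrations.
# consul {
#   address = "consul.service.consul:8500"
# }

# Working: each Nomad agent talks to the Consul agent on its own host.
consul {
  address = "127.0.0.1:8500"
}
```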
@apenney Your latest comment seems very likely to be the issue. You should have the agents pointing to the local consul agent. As you said, give it a day, but if this is the root cause, which is likely, would you mind closing this issue? |
We haven't had any issues since. We should make some serious documentation additions about the requirement to point at a local consul node and not a remote one, as the docs only reference localhost as a default, not as a mandatory requirement. |
@apenney Would appreciate either a PR or filing an issue for this. Obviously would prefer a PR |
I'm experiencing this same issue trying to experiment with consul and nomad locally. I'm just running a consul node in docker in dev mode and can confirm the constant churn of the nomad services in consul even though the nomad job containers seem to be stable and all the health checks should be passing. What was your resolution here, @apenney? |
Ah, forgive my ignorance, I had only run the single server node in dev mode and did not run the agents alongside it, my mistake. |
Ok, that didn't resolve it either. Even with a 3 node consul cluster locally the services keep registering and deregistering. |
@karlgrz You should have your Nomad clients talking to a local Consul agent. You want to avoid the Nomad clients all talking to the same agent. Each Nomad client does a diff between what it wants registered and what the consul agent has, so if you have multiple nodes all pointing at the same Consul agent you will see this flapping. |
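To make that failure mode concrete, here is an illustrative sketch of the diff-and-reconcile pattern described above. This is not Nomad's actual sync code, just a toy Go model showing why two agents that both sync against the same Consul agent keep deregistering each other's services:

```go
package main

import "fmt"

// reconcile models one sync pass: register anything this agent wants that
// Consul does not have, then deregister anything in Consul that this agent
// does not recognize as its own. When several agents share one Consul agent,
// the second step removes the other agents' services -- the flapping seen
// in this issue.
func reconcile(agentName string, want map[string]bool, consul map[string]string) {
	for svc := range want {
		if _, ok := consul[svc]; !ok {
			consul[svc] = agentName
			fmt.Printf("%s: registered %s\n", agentName, svc)
		}
	}
	for svc := range consul {
		if !want[svc] {
			delete(consul, svc)
			fmt.Printf("%s: deregistered %s\n", agentName, svc)
		}
	}
}

func main() {
	// One shared Consul agent catalog, two Nomad clients with different jobs.
	consul := map[string]string{}
	clientA := map[string]bool{"web-a": true}
	clientB := map[string]bool{"web-b": true}

	// Each client's sync pass undoes the other client's registration.
	for i := 0; i < 2; i++ {
		reconcile("client-a", clientA, consul)
		reconcile("client-b", clientB, consul)
	}
}
```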
@dadgar thanks for the comments. That's interesting, as I can replicate this consistently with a single consul node, a single Nomad server, and a single Nomad client. This is all running locally: consul running in a docker container, the Nomad server and client running on my host. The health checks all look to be passing according to consul, but then the Nomad client deregisters all of them and they disappear from consul completely for a time, which basically renders any kind of reliable consul-template service discovery useless. I'd be happy to supply the configurations I'm using if it would be helpful; it could be PEBKAC with my configurations or something, but it seems to be wired up properly. Maybe I'm just missing something obvious. |
@dadgar I can reproduce this just by spinning up the redis job generated by:
|
@karlgrz Ah you are still pointing two agents at the same consul. The Nomad Server and the Client. I am assuming what is flapping is the Nomad agent health checks. If you want to run a server/client on the same host and have it talk to the same consul, run |
@dadgar ah ha! Success! Thank you so much! That seems to be holding up well. |
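For readers reproducing the single-host case above, one layout that avoids two Nomad agents syncing against the same local Consul is a single combined server-and-client agent. A minimal sketch, assuming a one-node setup (all values illustrative, not the configuration used in this thread):

```hcl
# nomad.hcl -- one agent acting as both server and client (illustrative).
data_dir = "/var/lib/nomad"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

# Point at the Consul agent on this host.
consul {
  address = "127.0.0.1:8500"
}
```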
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.8.3 (c85483d)
Operating system and Environment details
Ubuntu 16.04
Issue
I'm in the process of doing a 0.7 -> 0.8.3 upgrade and I've run into a weird situation. We have some services that seem stable from the point of view of nomad (they've been running in docker for an hour now) but the service check inside consul seems to get created and deregistered over and over.
Example of the check:
When I watch the consul UI I can see the check appear, eventually go green, then just disappear. I'll wait a few seconds and see it come back. From the nomad client side running the successful job I see:
Which kind of lines up with it disappearing unexpectedly.
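(The check output itself is not captured above; for context, the registrations in question come from a `service`/`check` block inside a task in the Nomad job file, roughly of this shape. Placeholder names, not the actual job:)

```hcl
# Generic Nomad job service/check block -- the kind of registration
# that was flapping in Consul. Names and values are placeholders.
service {
  name = "example-web"
  port = "http"

  check {
    name     = "alive"
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```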
Reproduction steps
This is tough. It's consul 1.0.7 with nomad 0.8.3. I'm not sure it's easy to reproduce, as some services on 0.8.3 are working correctly. The 0.7 nodes are also still working fine. At this point I'm looking for help figuring out: